
Build AI, push the limits
AI Approaches, Evals & Startups vs Incumbents | Kushal Prakash & Amit Goel
[Amit]: Hello everyone, welcome to the third episode of Build AI. Today we are going to talk about various AI approaches to building products. And I have with us Kushal, who is the CTO and co-founder of Guy Ventures. Hi Kushal.
[Kushal]: Hi Amit.
[Amit]: So today we are going to talk about various approaches and methodology which are adopted for different types of products. We'll talk about various frameworks, design, evals, and what are the other issues which arise while building such products. And then towards the end, we'll also talk about business models, competition, and so on and so forth. So we'll keep it sweet and short, and hopefully, you derive a lot of value as a builder in the AI space.
All right, so maybe I can go first and ask you one question. By now, you have looked at three different products in AI, and you also looked at a lot of other AI products that people are building in the space. I wanted to understand that people have done prompt engineering on top of the foundational models, people have done RAG, adaptive RAG, people have done small language models, trained their own models. I'm sure there are so many different approaches based on all the research papers you have read, the three products you have built, and all that you have seen. What are the various types of AI approaches and methodology? And can you also specify what types of products are well-suited for a certain approach—approach one, approach two, approach three, and so on?
[Kushal]: Right, right. That's a great question. So, let's say we have companies or entities with lots of data which is all static. As a precursor, the way artificial intelligence started, the most basic layer was a neural network, an ANN, right? That would basically divert attention to something specific based on the data it has seen. Similarly, an LLM is a composition of many such layers; it's a much larger model in which we are trying to create many different attentions.
So for companies where they have static data, and a lot of times they would require the models to be in-house, deployed on-premises, here training their own model or taking an SLM (a small language model) and fine-tuning it with the data they have is a good approach. Essentially, they are trying to create more attention which is diverted towards their data by using an SLM. An SLM already has a lot of parameters, but it has attentions on different parts. By fine-tuning, you can ensure that it only knows what your data is about, learns more about it, and can answer anything which is about that data you have fine-tuned it on.
Small language models are easier to deploy, and now there are multiple even smaller models; Cohere, for example, has been launching some great models which can even run on a MacBook. So it's not as cost-intensive as it used to be. So for companies which have proprietary data and want to do on-premises deployment, it's better to fine-tune their own model. This assumes the data is not growing too much in the future. If it's growing a lot, or the data is changing, then fine-tuning might not be the better approach. Instead, just create a layer on top wherein the data is present and it's getting added as additional context to the query which is being passed on. That's RAG, or retrieval-augmented generation.
Essentially, we are creating a layer on top where all this data is stored. Once a query comes in, we try and retrieve some data which is specific to that particular company or an entity, and add that so that when the LLM is getting the query, there's additional data about the company or relevant context which is also present. Again, you're trying to create attention towards where you want it to be, but not exactly a fine-tuned model where you are disregarding a lot of parameters.
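The retrieve-then-augment flow described here can be sketched in a few lines. This is a toy illustration, not a production setup: it uses a bag-of-words cosine similarity in place of a real embedding model and vector database, and the document list and function names are invented for the example.

```python
# Minimal RAG sketch (illustrative only): toy bag-of-words retrieval
# plus prompt assembly. Real systems use embedding models and a vector
# database; everything below is a hypothetical stand-in.
import math
from collections import Counter

DOCS = [
    "Refund policy: customers may return items within 30 days of purchase.",
    "Shipping policy: orders over $50 ship free within the continental US.",
    "Privacy policy: we never sell customer email addresses to third parties.",
]

def vectorize(text):
    """Turn text into a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k docs most similar to the query."""
    qv = vectorize(query)
    scored = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Augment the user query with retrieved context before the LLM call."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

prompt = build_prompt("Can I return an item after 30 days?", DOCS)
```

The key point is structural: the model's weights never change; the relevant data rides along in the prompt at query time.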
Fine-tuning is a better approach when the data is static. But if the data itself is growing and very dynamic, fine-tuning is very cost-intensive. On top of that, you can't keep fine-tuning a fine-tuned model again and again. If the same data is passed in again in a different way, that messes up the whole model. For example, let's say we are training a model on FINRA guidelines. Those are evolving guidelines; they keep changing every quarter. If I specify something about the AML procedure and that has been fine-tuned in, and it changes after 3 months, I can't just fine-tune the new version in on top. Now the model is going to get confused because it was told two different things about the same topic. That is going to create issues when it's answering queries, because an LLM is not going to see it as "first this was told, and this was an update." To the model, both of those data points are supposed to be equally true.
There are also methods where you can create attention which is based on time relevancy. That kind of fine-tuning can be done, but then that's going to create other issues like you will have to keep fine-tuning the model and deploying a newer model rather than having the same model deployed always. In my opinion, a RAG system is better when the data is evolving a lot and if it's a huge database as well. But if the data evolution is not as much and it's just going to be smaller tweaks, then we can handle the time relevancy on a fine-tuned model as well. So yeah, that's how I would look at it.
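One hedged way to handle the FINRA-style versioning problem in a RAG layer is to tag each chunk with an effective date and filter before retrieval, so a superseded rule never competes with the current one in the prompt. The topics, dates, and rule texts below are invented for illustration.

```python
# Sketch of time-aware retrieval (illustrative): each chunk carries an
# effective date, and only the newest version of a guideline as of a
# given date is allowed through to the LLM's context.
from datetime import date

GUIDELINES = [
    {"topic": "AML", "effective": date(2023, 1, 1),
     "text": "AML checks must be run within 48 hours of onboarding."},
    {"topic": "AML", "effective": date(2023, 4, 1),
     "text": "AML checks must be run within 24 hours of onboarding."},
    {"topic": "KYC", "effective": date(2023, 1, 1),
     "text": "KYC documents must be refreshed every 12 months."},
]

def current_versions(chunks, as_of):
    """Keep only the latest effective version of each topic as of a date."""
    latest = {}
    for c in chunks:
        if c["effective"] <= as_of:
            prev = latest.get(c["topic"])
            if prev is None or c["effective"] > prev["effective"]:
                latest[c["topic"]] = c
    return list(latest.values())

active = current_versions(GUIDELINES, as_of=date(2023, 6, 1))
```

This sidesteps the "two conflicting facts" problem entirely at the data layer, which is why evolving guidelines tend to favor RAG over repeated fine-tuning.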
And even in RAG, there are multiple types of RAGs where you can ensure that you're focusing more on the accuracy, or on the output latency, or even how relevant and how comprehensive you want the outputs to be.
Now, if there's no data, prompt engineering is just a way to create some attention. You're trying to disregard a lot of things and telling the AI to only do the specific thing you want it to.
On top of that, there are also prompting methodologies now. It's more like reinforcement learning, right? In AI, reinforcement learning basically means you're telling the model about mistakes it might have made. Let's say it answers a query which Amit has asked, and we know that there's some mistake in it. We can do some reinforcement and tell it that no, that was not it; this is how it should have been. So the next time the same query or something similar is asked, it's going to improve the answer based on the reinforcement we have done.
Reinforcing and ensuring that the model is getting better can happen at the model itself, at the RAG layer—the data layer—or at the prompting itself. Prompting is where reinforcement learning through human feedback, RLHF, is being used a lot right now. This is more for companies who don't have much data, or who are working on generic data which LLMs are already trained on. In such cases, with just a prompt, refined iteratively, you can expect better responses.
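The feedback loop described above can be approximated without touching model weights at all: store the human correction and prepend it to future prompts for similar queries. This is a simplified sketch of feedback-augmented prompting, a lighter-weight cousin of true RLHF (which actually updates the model); all names and strings below are hypothetical.

```python
# Illustrative sketch of a lightweight human-feedback loop: corrections
# are stored against normalized queries and prepended to the prompt
# when a matching query appears again.
corrections = {}  # normalized query -> human correction

def normalize(q):
    """Collapse case and whitespace so near-identical queries match."""
    return " ".join(q.lower().split())

def record_feedback(query, correction):
    corrections[normalize(query)] = correction

def build_prompt(query):
    """Prepend any stored reviewer note to the prompt for this query."""
    note = corrections.get(normalize(query))
    prefix = f"Reviewer note: {note}\n" if note else ""
    return prefix + f"User question: {query}"

record_feedback("What is our refund window?",
                "The window is 30 days, not 14 as previously answered.")
prompt = build_prompt("what is our refund  window?")
```

A production system would match semantically similar queries, not just normalized strings, but the loop shape is the same.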
[Amit]: Got it. So just to make it very easy for the audience, especially the non-tech audience, I want to break it down even further. Let's say I will basically name a type of application, and then maybe we can say that usually the most general approach to building such an application is this particular method, right? Based on what you have explained. I think you have explained four things: there's normal prompt engineering, there's RAG, there are different types of RAG like adaptive RAG, there's fine-tuning your own models, and then there's reinforcement learning, which is a different sort of orthogonal approach to AGI, I believe.
So let me name a few different types of applications, and this will make it very interesting and easy to absorb for everyone. Let's say a simple AI agentic system which can carry out some actions like create emails, schedule meetings, and so on, like Howlyr AI does. Which approach is the best here?
[Kushal]: The data itself is going to be dynamic, right? We are trying to retrieve data from the email, and email content can change every minute based on how busy a person is. Here, there's no way we can fine-tune a model, or even if there is, it has to be continuous fine-tuning which is not feasible today. So the better approach is to have a retrieving layer. But the retrieval itself is live. We are retrieving the emails live when the query comes in, fetching some things, and based on that we are creating a response.
Let's say on Howlyr, you want to create an email or respond to someone. First, we have to see what was the email received from that person, and then based on that, create the response. So we first retrieve the data from the email, try and understand what exactly it is, pass that to an LLM, and the LLM is going to use additional context and the history.
Now, the data layer again can be in multiple ways. One is retrieving from the data sources like email or database. Another is the history itself. Since Howlyr is a consumer app and we store the chat history, we utilize that as well—the last few things which the user has asked—to see if there's something additional which we need to do while creating a response. Because, for instance, you have a specific way of writing emails and you usually tell it "no" to a lot of things which are over email, while I am a little more different. So based on how your chat history is, it's going to tailor the responses accordingly.
So yeah, two layers there: the chat history and the retrieval is live. It's a RAG system here. But we don't store any data, no emails, nothing stored. It's retrieved live as the query is asked, and that is passed on to the LLM which looks at it, uses the historical chats, and then figures out the response and creates it.
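The two context layers mentioned here, live email retrieval plus recent chat history, can be sketched roughly as below. The inbox structure and function names are assumptions made for illustration; Howlyr's actual implementation is not described beyond what's in the conversation.

```python
# Sketch of live-retrieval context assembly (names hypothetical):
# nothing is stored server-side; the relevant email is fetched at query
# time and combined with recent chat history before the LLM call.
def fetch_latest_email(sender, inbox):
    """Live retrieval: pull the most recent email from a given sender."""
    matches = [m for m in inbox if m["from"] == sender]
    return matches[-1] if matches else None

def assemble_context(query, sender, inbox, chat_history, max_turns=3):
    """Combine the live email and the last few chat turns with the query."""
    email = fetch_latest_email(sender, inbox)
    parts = []
    if email:
        parts.append(f"Latest email from {sender}: {email['body']}")
    parts.extend(f"Past chat: {t}" for t in chat_history[-max_turns:])
    parts.append(f"Request: {query}")
    return "\n".join(parts)

inbox = [
    {"from": "bob@example.com", "body": "Can we move our call to Friday?"},
]
history = ["User prefers short, direct replies",
           "User declined two meetings last week"]
ctx = assemble_context("Draft a reply", "bob@example.com", inbox, history)
```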
[Amit]: Okay. What about another type of AI agentic system which does a simple action of maybe just updating social media based on some content? So basically, it takes a query, creates some content, and then posts it on Twitter as an example. What approach would be taken there?
[Kushal]: Here, since the LLM has to deal with external applications, either you do something like RPA (robotic process automation) or you require APIs where you can push data into another app or system. Here, let's say Twitter or Notion, any of these. These are external apps; LLMs don't have direct access to them out of the box. So what we do is we create tools or layers. And now there's MCP as well, which lets you use them from the native ChatGPT or Claude apps. But on the API side, we require some access so that the LLM creates the basic structure in which the responses need to be pushed to these external apps.
In the end, APIs require some structured input, right? If I need to create a tweet, then the Twitter API would require me to specify some format of the content, like what is the subject, what is the tweet, and should I add any things in the thread, something like that. In the same way, here the LLM will be creating that structure. And APIs are anyway static. So once the structure is created, you just call the tool and that's going to do the job. It's going to push it to either Twitter, Notion, wherever you would expect it to.
[Amit]: So the short answer is that this looks like a simple prompt engineering from an AI builder perspective. Tools and MCP and all that is a separate thing, but basically the core AI approach here is simple prompt engineering.
[Kushal]: Yes, definitely. Prompt engineering to create structures, like a JSON structure, or just telling it "give me this specific format so that I can send it to the API or perform some action based on it."
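The "prompt for a structure, validate, then call the tool" pattern might look like this in miniature. `post_tweet` is a stub standing in for a real API client, and the required keys are invented; the point is only that the LLM's job ends at producing valid JSON.

```python
# Sketch of prompt-engineered structured output: the LLM is asked for a
# strict JSON shape, the reply is parsed and validated, and only then
# does the (stubbed) external API call happen.
import json

REQUIRED_KEYS = {"text", "thread"}

def parse_llm_reply(reply):
    """Parse and validate the JSON the LLM was instructed to produce."""
    payload = json.loads(reply)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"LLM reply missing keys: {missing}")
    return payload

sent = []
def post_tweet(payload):
    """Stub for the external API; a real client would POST here."""
    sent.append(payload)
    return {"status": "ok"}

llm_reply = '{"text": "Shipping our new release today!", "thread": []}'
result = post_tweet(parse_llm_reply(llm_reply))
```

The validation step matters in practice: LLMs occasionally produce malformed or incomplete JSON, and catching that before the API call is what keeps the agent reliable.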
[Amit]: Okay, so this is prompt engineering, simple. But for Howlyr, because there was much more nuance, you said it's a RAG essentially.
Let's go a little bit more advanced and complex. Let's say applications like Harvey, which deal with legal docs, creation of legal docs, or let's say Audit AI, which is dealing with compliance and all those kind of documents. What type of approach usually or generally is taken there?
[Kushal]: When the data is huge, it's better to go with a fine-tuning approach. The reason being, no matter how many data layers you create on top, a vector database or a structured database might not be sufficient for the data retrieval we are doing. In the end, when we retrieve some data from a database, we assign similarity scores. We take topics or some other metrics we define, and based on those metrics, we pick some documents and add them as additional context in a RAG system. This can miss a lot of relevant things, because data is skewed in most cases. In some corner of some huge document, the most important piece of information might be present, while a retrieval layer might find it difficult to surface such deeply buried things. It's a lot more difficult for the retrieval process itself to figure out and handle these data skews. There are methods for it, but in the legal and audit spaces, there are huge law books of guidelines and regulations which the AI needs to be aware of.
Fine-tuning will ensure that the parameters are tuned in such a way that they know what these are exactly, and based on that, the responses will be better. But in a RAG system, it's much more difficult.
Since we have been working on Audit AI and for now we don't have as much data, we are currently using a RAG system. But the plan is to eventually get into a fine-tuned setup, because that's where we would be able to handle all the data relevancies and the time precision as well. In legal, what would have happened in 1997, the law might have been much different than what it is today. But the AI needs to be aware of what it was before and now as well, because that might be required for generating some responses. Fine-tuning helps in ensuring all of these are captured very correctly. In a RAG system, we might just miss out on a lot of information because it all depends on how good the retriever layer is. And to make that really good, what metrics to define is another challenging thing. Right now, most of them work around cosine similarity or similarity matches, which seems to be the best approach, but that itself doesn't answer a lot of these skewed data or latency questions very well.
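The skew problem raised above can be made concrete: under a plain lexical cosine similarity, a chunk that actually contains the answer but shares no vocabulary with the query scores zero, while a superficially similar chunk ranks first. The query and chunks below are invented; real systems use semantic embeddings, which soften but don't eliminate this effect.

```python
# Illustration of the retrieval-skew problem: the chunk that answers the
# question is phrased differently from the query, so a surface-level
# similarity score ranks it below a chunk that merely echoes the query.
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity over bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "what is the statute of limitations for fraud claims"
chunk_surface = "the statute of limitations is discussed in many fraud claims"
chunk_answer = "actions alleging deceit must commence within six years"

surface_score = cosine_sim(query, chunk_surface)
answer_score = cosine_sim(query, chunk_answer)
```

A fine-tuned model has already absorbed the buried fact into its weights, which is the advantage being argued for here.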
[Amit]: Got it. And maybe in that same thing, if we take it to another level. I've heard that one of the great use cases of AI is in drug discovery. So there's huge amount of data coming from clinical trials and various other things that they do in order to discover a new drug or create a new drug. In those kinds of things, I believe there's a lot of data which has to be processed and then some sort of modeling has to be done on top of it. What kind of approach might be taken in something like that?
[Kushal]: Those are much more complex use cases. While there are a lot of methods, I can answer maybe right now as an outsider because we have not been working on such hard deep tech problems yet. But here, a hybrid system of everything would make more sense. Because drug discovery is something where you need to continuously learn from what you're suggesting, and the AI needs to be aware of what has happened in the past as well. So it's a mix of having to know everything very precisely—it can't afford to lose out on anything—and also needs to be aware of what it itself is doing.
Like I mentioned for Howlyr, we maintain the history. Here, just having the history might not be sufficient. You might want to learn or fine-tune based on the history itself. Something very interesting I was reading about last week was that all of these LLMs, in the end, are neural networks, right? At the most granular level. In a neural network, there's an activation function which is usually static. That's what makes the LLM static. There are now newer methods where the activation functions themselves are adaptive or learnable. Maybe those are the methods in the future wherein we would be able to have just a single model which is self-learning based on whatever it's giving, and we don't need human feedback to be reinforced for any reinforcement learning. And it can be much more useful for these complicated use cases.
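For concreteness, one long-established example of a learnable activation is PReLU, where the negative-side slope is a trained parameter rather than a fixed constant. The fully self-learning setups alluded to above go much further; this toy gradient step, with made-up numbers, just shows the basic mechanism of an activation whose shape is updated by training.

```python
# Toy example of a learnable activation (PReLU): the slope `alpha` on
# negative inputs is itself updated by gradient descent, unlike a fixed
# ReLU. One squared-error gradient step on a single (input, target) pair.
def prelu(x, alpha):
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    """d(prelu)/d(alpha): nonzero only for negative inputs."""
    return x if x < 0 else 0.0

alpha, lr = 0.1, 0.05
x, target = -2.0, -1.0
pred = prelu(x, alpha)            # 0.1 * -2.0 = -0.2
loss_grad = 2 * (pred - target)   # derivative of squared error w.r.t. pred
alpha -= lr * loss_grad * prelu_grad_alpha(x)  # chain rule into alpha
```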
[Amit]: I'm just curious, like you said, if there is a lot of data which is not moving on a daily or weekly basis and there's a huge amount of data, wouldn't fine-tuning of the model work in the drug discovery case? Because I would imagine the data is not changing every day or week. What's your thought on that?
[Kushal]: Yeah, definitely. But also here, the suggestions made by the AI might be very sensitive, and that might need a feedback loop directly into the LLM or the AI system so that it is learning on its own outputs. That's where fine-tuning—you can't have a continuous fine-tuning setup, right? At least so far that hasn't come out. If there is, then that would be the best approach. So with learnable weights always based on the outputs, otherwise, maybe RAG plus fine-tune setup. Because like you mentioned, there is a lot of historic data, but whatever the outputs the AI is giving, that should also be learned, because drug discovery is an evolving process. One single response might not be sufficient for anything. It has to learn based on it and then create a complete chain of responses.
[Amit]: I see. And sorry, I forgot about one more use case, which is something like FastTracker AI, where it joins calls, takes meeting notes, stuff like that, and then there are tasks and actions that it takes in the CRM. What kind of an approach would be taken for that?
[Kushal]: Here, the challenge is that these are long meetings. For a meeting bot, a 1-hour meeting is going to produce a lot of text once transcribed, so handling that is the challenge. Generating the notes themselves is relatively easy—we just need some prompt engineering with added context, and that is sufficient to generate very structured notes. The challenging part is ensuring that the transcript is chunked properly and that all the pieces are combined at the end. So in the end, it's just prompt engineering which creates the notes.
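The chunk-then-merge step for long transcripts can be sketched as below. The summarizer is stubbed where a real system would call an LLM per chunk, and the window sizes are arbitrary; overlapping windows keep sentences at chunk boundaries from being lost.

```python
# Sketch of transcript chunking (illustrative): split a long transcript
# into overlapping word-windows, summarize each chunk (stubbed), and
# join the per-chunk results into the final notes.
def chunk_words(words, size, overlap):
    """Overlapping windows of `size` words, stepping by size - overlap."""
    step = size - overlap
    return [words[i:i + size]
            for i in range(0, max(len(words) - overlap, 1), step)]

def summarize(chunk):
    """Stub: a real system would call an LLM on each chunk here."""
    return f"[summary of {len(chunk)} words]"

transcript = ("speaker one raised the budget question " * 50).split()
chunks = chunk_words(transcript, size=80, overlap=20)
notes = " ".join(summarize(c) for c in chunks)
```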
[Amit]: Right. And obviously, for the benefit of the audience, one of the things that we have learned over the last 12 months is that there are things that could be done with just the simple automation of workflows. You don't need AI for everything. You need AI for places where there are things that were not possible before, and now you can do them 10x better, faster, cheaper. The best example that I can think of is speech to text. I have been playing around with Nuance speech-to-text models for like 10 years now. It's been a decade. Even when they were 60-70% accurate, you could not build a commercial application. Then with Google machine learning models for speech-to-text, you reach like 95-97% accuracy—like the one that you use when you are doing Google search, there's a voice button. But you can do searches, but you can't do commercial applications like an AI for doctor's clinics. Now, especially in the last 12 months, the accuracy has gone to like 99.9%, and that's why commercial applications are available.
So the way that builders have to think about AI is: does it solve something? Is there a delta? Is it a 4x better, cheaper, faster way of solving a problem than ever before? Only there should you apply AI. Otherwise, I have seen in the market that people now have a hammer, so everything looks like a nail.
I had another question about AI systems, something that we have learned. In my discussions with a lot of AI companies, I have realized that one of the things AI company founders say is "evals, evals, evals." I know that you have been pushing for this internally as well. When we develop something in the software world, there used to be testing—a very important function. It started with manual testing, then automated testing came in, and so on. But these evaluations, evals, are different. To use very crude terms: if AI is something you developed, most of the time it will be a hit, but sometimes it may be a miss for various reasons. You need to do evals to check the output of whatever you have created, whichever approach you have taken—check the output again and again to see how many hits and how many misses. You want to reduce the misses to the extent that it can become a production-ready commercial application.
My question to you is: how should CTOs and founders decide how to design these evals to remove hallucinations, and how to design the entire eval system to get the desired outcomes?
[Kushal]: Right. So interestingly, the new chain-of-thought reasoning models—this was an approach taken to reduce hallucination. Because the best way to solve a problem is to break it down into smaller sub-problems. Just like how we do it as humans, for the AI as well, that's the best way. It will understand the problem better. So chain-of-thought reasoning basically means it cuts it down into smaller sub-problems and answers one after another and then chains it. This is a new approach where hallucination is getting reduced, not just giving everything at once to the LLM, because that's when it confuses itself and ends up hallucinating a lot of responses.
While designing an eval system, it's a lot dependent on how the problem statement you're solving is. But there are some general methods which are usually adopted and are being done so far. There are a lot of metrics which can be defined for evaluation. This is based on how the AI is giving the output. You have another evaluator LLM which looks at the data input and response, and it requires a huge dataset which it can correlate with. It comes up with a coherence score based on what the response is, what the question was, and the data it already has.
Similarly, there are many metrics which can be defined which LLMs themselves are used to come up with the numbers, because that's the only way you can have an evaluation model which is continuous. Otherwise, it's all going to be via human. And with LLMs, it's very difficult to have human feedback going on every time for the evaluation scores. So the new approaches involve LLM systems which basically evaluate the responses based on the metrics they are required to look at. Based on these metrics, there can be another feedback which tells the LLM to improve, or just have it sent to the company and we go and rework on some parts of the AI again.
So essentially, going back to the attention mechanism which we discussed, that's what is required to ensure that the AI is very accurate. And breaking problems down into sub-problems is something which is working out well right now, because over time, more than latency, people want good accuracy from LLMs. So breaking down into sub-problems and chaining them is giving better responses. And we can have additional LLMs which hold a huge database of sample questions and how the responses should look. Based on that, they can come up with scores for the metrics we are defining. Those metrics are the evaluation scores, right? We can look at those and go back and improve the AI.
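The evaluator-model loop can be shown in shape: score each (question, response, reference) triple and flag the misses. Here a simple word-overlap score stands in for the judge LLM described above, and the threshold and sample data are invented.

```python
# Sketch of an eval loop (illustrative): a real setup would use a judge
# LLM; a word-overlap "coherence" score stands in so the loop shape is
# visible without any API calls.
def coherence_score(response, reference):
    """Fraction of reference words present in the response (0.0-1.0)."""
    ref = set(reference.lower().split())
    got = set(response.lower().split())
    return len(ref & got) / len(ref) if ref else 0.0

def evaluate(samples, threshold=0.6):
    """Score each (question, response, reference) and flag failures."""
    results = []
    for question, response, reference in samples:
        score = coherence_score(response, reference)
        results.append({"question": question, "score": score,
                        "passed": score >= threshold})
    return results

samples = [
    ("refund window?", "returns allowed within 30 days",
     "returns allowed within 30 days"),
    ("refund window?", "please contact support",
     "returns allowed within 30 days"),
]
report = evaluate(samples)
```

Swapping `coherence_score` for a judge-LLM call is the only change needed to turn this into the LLM-as-evaluator setup being described.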
[Amit]: Right, right. Well, thanks. I think this will be helpful for founders and CTOs. Related question: AI systems are not the same as the software we have seen in the last 20-30 years. For example, I have seen and I have heard this from many other AI products, including one of ours, where the AI product was built, it was performing very well, everything was fine, no code was changed, and just 2 months down the line it started behaving differently. So now there's another challenge with such AI systems that new issues might arise. So how do you maintain these AI applications so that it does not happen? Do you now do continuous evals, or do you do something else?
[Kushal]: Yeah, all the evals need to be continuous, especially in AI. In the earlier traditional setup, writing unit tests and running them once when you create an API or any backend or frontend application doesn't work in an AI setup. Because here, there's no fixed input and output, right? Everything is dynamic, and you don't know what it can be over time. The models themselves are static, but the way most systems are designed, you start a chat and there's already history associated with it. All of this ends up creating issues because the input is not static, and even the same input will not give the same response every time. So yeah, over time you can expect different results.
So the eval needs to be continuous. For every input and output, you need to have another evaluator model which can keep scoring the responses every time. The entire AI framework you have created is going through a query and giving a response, so it has to be continuous. And I think that's where engineers need to keep looking at these metric scores, because that's what will allow us to know how well the models are performing.
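Continuous evaluation then becomes a monitoring problem: score every live interaction, keep a rolling window, and alert when the average degrades. A minimal sketch, with an arbitrary window size and threshold:

```python
# Sketch of continuous eval monitoring (illustrative): every live
# query/response pair is scored, and a drift alert fires when the
# rolling average of scores drops below a threshold.
from collections import deque

class DriftMonitor:
    def __init__(self, window=5, threshold=0.7):
        self.scores = deque(maxlen=window)  # keeps only the last `window`
        self.threshold = threshold

    def record(self, score):
        """Record one eval score; return True if quality has drifted."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.threshold

monitor = DriftMonitor(window=3, threshold=0.7)
alerts = [monitor.record(s) for s in [0.9, 0.8, 0.85, 0.4, 0.3]]
```

This is the kind of metric dashboard the engineers mentioned here would watch: not a one-time test pass, but a score per interaction over time.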
So I have a question for you, Amit. You have been talking to a lot of company founders lately, and a lot of senior folks at larger companies as well. How has AI adoption been so far? I think all of us know that some large companies like Amazon, Salesforce, ServiceNow—all of these incumbents—are working on AI systems. How do you see it? You come from both worlds, you have connections everywhere. Who do you think will win? Does anyone have an edge?
[Amit]: Yeah, I think it's not a very straightforward winner-takes-all scenario, and I don't even think it's a zero-sum game. I think there's space for both. But it's a very relevant question. If Box already has your HR documents, contracts, and all the other legal and other documents with them, and they have also been building AI agents to solve problems for customers, where is the scope for startups? Same thing with Salesforce and ServiceNow. They're all very aggressive. They have data, they have the customers. So the startup versus tech incumbent, who will win, is the gazillion-dollar question.
I'll start with something which is a fact. If you look at Salesforce Ventures, they have been investing in AI startups. I think they have invested some 300-500 million. There's a reason for that. Even players like Box and Salesforce know that they cannot capture all the opportunities.
Let me take an example of a company which is not exactly an AI company, but from a SaaS perspective a few years back: Gong.io. You must be familiar with Gong.io. What Gong.io did was they figured out that while Salesforce provides a lot of great tools for sales folks, the sales call intelligence is something that is still very primitive. So what they did was they deeply specialized in sales call intelligence. It would help them improve those sales calls by giving a lot of analysis and tools to take different types of actions and improve their performance on these sales calls. That has now become a very big company. The reason for that was that Salesforce overlooked that particular problem for a very long time. Even though they provided the entire journey, they did not go very, very deep in that one area.
So one of the things that I feel that startups would be able to do very well is to focus on industry-first and problem-first rather than technology-first. If you are planning as a startup to compete in some horizontal area, like creating images from text or creating sounds and so on and so forth, this is a very horizontal use case. I feel all these companies like OpenAI and DeepSeek, they are no longer foundational companies only. They are product companies. They're creating products in each of these areas. We have just seen in the last 3-4 days, if you go to Instagram or Twitter today, it's full of Studio Ghibli images. That shows how serious and quickly they are getting into these horizontal use cases of creating images.
But if you have an industry-first and a problem-first approach, let's say you go into legal and then you start looking at how do you create legal contracts, how do you review legal contracts. You do it for law firms, you do it for corporations, and you go so deep into it that your model, your approach, your domain expertise in that area is so much more that even if the incumbent comes there, it will take them a hell of a lot of time to hire those domain experts, to understand intricate processes and issues and nuances, and then develop something. That's where I think one of the biggest startup opportunities lies.
And then we have also seen in the past—I think the past tells us a lot of things—if you think about certain business model innovation, I think that is also very interesting. For example, Vanta. There has been some technology in SOC2 and ISO and these audits. But Vanta figured out that there's a huge opportunity in mixing product plus services to prepare companies to get the SOC2 audit done. They don't do the audit, but they prepare them over a period of time by making sure that they have all the bells and whistles to pass the audit. They go step by step, they created a process, they have human involvement, they have product taking care of the workflow. That's a very interesting use case of what an incumbent may not do because there's too much headache and management and services involved in that.
I feel like AI tailored for industries instead of generalized solutions—I think it's very hard to compete in horizontal use cases or generalized solutions. But wherever you require depth and domain expertise, specifically in industries like financial services, healthcare, manufacturing, there I feel like there is a lot of moat. And then there might be some very interesting tooling, infrastructure, API-related things. For example, Scale AI is just unbelievable. That company is a $4 billion company because they figured out that tech incumbents are good in structured data, but when it came to unstructured data, how do you handle it and how do you make it useful for AI? I think they built a massive business around unstructured data challenges. I think that is also very, very interesting.
So that's how I'm looking at it. But the market is moving so fast, Kushal, that I think in three months, you don't know what some of these incumbents might end up doing. I feel like a lot of financial services companies and other industry companies might do stuff in-house. So there is competition from there. Then there are tech incumbents like Salesforce and Box and Amazon. And then there are thousands of startups, and then there are earlier ERPs and software companies getting into it. So it's very, very competitive. Things can change in a couple of months.
[Kushal]: Very interesting. Do you see the amount of capital which incumbents have and the data which they sit with, would that give them any edge in the build process?
[Amit]: I think absolutely, right. Let's talk about data first. I think we have also seen that when you are developing these models, it basically comes down to how you have access to data. For example, in Audit AI, because we are very focused and we are building it with a domain expert who has 18 years of experience, just the sheer understanding of the data that we had initially and the work that we did with some of the initial audits itself is helping us to develop that product much faster.
Similarly, I would imagine if Box is managing so many different types of documents, from HR to legal and contractual data, their ability to come up with an AI for contract lifecycle management makes them very powerful. And already the companies trust them with the data, right? That is the other thing. Data security, data privacy is going to come up again and again in B2B. And if there is Salesforce or Box or ServiceNow where the client already trusts them with their data, and now they are also offering how can you make best use of this unstructured data and solve the problems, I think that's a very powerful combination.
But again, classic example: if you take an industry where there are a lot of small businesses, there is a cost of acquiring them and a cost of serving them. For example, in the fintech payments space, there was this company that came up probably 10 years back, Square. They said, "Okay, payments is all occupied by large companies. But what about the small merchants? I will build a device and a whole payment ecosystem for these small guys." And because it's very difficult for a JPMorgan to go down to that level and have a cost structure to acquire those small merchants and serve them with a payment instrument, I think there's still a lot of scope for startups.
[Kushal]: Makes sense. That's a very interesting view.
[Amit]: Yeah.
[Kushal]: Cool. I think this was a great chat, and thank you so much for laying out all the different approaches. I am sure the audience will love it. It has a mix of inputs on the various AI approaches and the evals that have to be done while building AI products, and also a bit on the opportunities, business models, and what is in it for startups versus tech incumbents and how to look at it. So thank you, and thanks everyone for watching us.
[Amit]: Thanks, Kushal.



