
Build AI, push the limits
True Impact of AI on Software Engineering | Kushal Prakash & Amit Goel
[Amit]: Welcome to another episode of Build AI by GI Ventures. Today we have our CTO, Kushal Prakash, with us. Hi Kushal.
[Kushal]: Hi Amit.
[Amit]: So, we'll keep it free-flowing and very interactive. I will be asking most of the questions. We have discussed some of these topics one-on-one, where it will be great to educate the audience and discuss some of these things from your perspective.
So, I'll jump right in. One of the questions that is there on everybody's mind today is that with AI tools, coding assistants, coding tools, coding IDEs like Cursor, Replit, Lovable—which is turning out to be the biggest use case of generative AI right now—with all these automating code generation, how will the engineering teams look like? How will they be structured, and how will they operate?
[Kushal]: Right. There are a lot of tools right now which ease the job, things that would earlier take hours of debugging or fixing smaller issues. For instance, since I was running a fintech before, we used to create a lot of APIs. Each API would be a ticket for an engineer, and that would take a couple of days to build. Not that it's that difficult, but you have to ensure all the smaller pieces are handled, because in the end, a small piece of code isn't just about the functionality. There are a lot of internal aspects to it. One is atomicity: we need to ensure every operation is atomic, so that data doesn't get corrupted when multiple requests interact simultaneously. Another is consistency. Say 500 requests hit the server at the same point in time; all those data interactions can create a lot of issues, and there can be a lot of conflicting writes on the database side as well. These are the smaller intricacies which also need to be handled.
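(The atomicity concern above can be sketched in a few lines. This is a minimal illustration, not code from the speaker's fintech: a toy in-memory ledger where a lock makes each read-modify-write atomic, so 500 concurrent "requests" cannot lose an update to a race.)

```python
import sqlite3
import threading

# Toy in-memory ledger; the lock serializes writes so concurrent
# requests cannot interleave a read-modify-write on the same row.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 0)")
conn.commit()
lock = threading.Lock()

def credit(amount: int) -> None:
    # The whole read-modify-write runs as one atomic unit.
    with lock:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1"
        ).fetchone()
        conn.execute(
            "UPDATE accounts SET balance = ? WHERE id = 1", (balance + amount,)
        )
        conn.commit()

threads = [threading.Thread(target=credit, args=(1,)) for _ in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()

(final,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
print(final)  # 500: no update was lost to a race
```

Without the lock, two threads can read the same balance and each write back balance + 1, silently dropping a credit; that is exactly the class of bug an engineer has to guard against even in a "simple" API.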
This is where these AI tools are automating a lot of stuff, or rather, easing the whole process of building. How I see it is: what would earlier be "jump in, get directly into the code, and make the changes" has now shifted towards "take some time, make sure we know what to do and where, and then use the AI tools to build it."
I've been talking to a lot of engineers, and my common observation is that a lot of them just directly ask an AI, say Cursor, to build some API. It does it, but then they don't know how it works, and understanding that becomes an overhead for them. Used well, though, it simplifies a lot: the time-consuming work of creating smaller APIs, handling the functionality aspects, or wiring up interactions between functions within the codebase gets a lot simpler, because we can tell the AI what each function does and define the interactions, and it'll write the code for us.
But it complicates when you're just trying to give it a whole huge problem statement and asking it to create like 15 APIs to do 15 things, or maybe 30 different things. There it's a big mess. Maybe it eventually does it as well, but then you will have to keep telling the AI multiple times, "this is not working, change this, change this." I think that's probably not how an engineer should be building or using any coding tools. Better to know and lay out the architecture right, understand things, and then use them.
Let's say for instance, I've been using Cursor a lot, especially for backend coding. It helps me in simplifying it. What would earlier require maybe two, three developers—for me to explain them, create these tickets, and tell them, "Okay, these are the APIs, help me with this"—I can just tell Cursor, and it does a lot of it. Obviously, it's not perfect all the time, but it's almost there. And I just need to relook at it, understand a little more, and fix some things. A lot of times, most of the times, it's very right as well.
The way we define the prompt for these coding tools also matters a lot. The way I do it is to give a sample code right there, because these models are all trained on public data, so they tend to reach for older libraries. Some of those might be deprecated; some might not even exist anymore, and that code is very difficult to fix. You can sometimes force an outdated library to install and make it work, but that's not going to be sustainable. So I end up giving it a sample code, telling it "these are the functionalities," and then defining the prompt. I've found this works better than just telling Cursor or Replit to generate a piece of code cold; that output is often buggy and difficult for me to understand, and that time can be utilized elsewhere.
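(The sample-code-in-the-prompt approach above can be sketched as a small prompt builder. This is a hypothetical illustration; the FastAPI snippet and the wording are made up, not the speaker's actual prompts.)

```python
# Hypothetical prompt builder: anchor the model on a known-good,
# current code sample so it imitates these APIs instead of reaching
# for deprecated libraries from its training data.
SAMPLE = '''\
from fastapi import FastAPI   # sample uses the API style we actually run

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}
'''

def build_prompt(task: str, sample: str = SAMPLE) -> str:
    return (
        "You are writing backend code for an existing project.\n"
        "Match the style, libraries, and versions of this sample exactly; "
        "do not introduce other frameworks:\n\n"
        f"```python\n{sample}```\n\n"
        f"Task: {task}\n"
    )

prompt = build_prompt("Add a POST /transactions endpoint with input validation.")
print(prompt)
```

The point is that the sample pins the model to libraries you know are installed and current, instead of whatever was most common in its training data.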
So, in my opinion, it helps us reduce the time that was taken earlier, but we need to ensure the way we're using it is optimal, so that it's actually making us efficient, rather than replacing "trying to understand the code" with "trying to understand an AI tool."
[Amit]: Yeah, that's a refreshing take given all the hype around coding assistants. But I also want to double-click on it and understand how the developer role itself will change, and other things like testing frameworks and so on. How will all of this change?
[Kushal]: From the developer's point of view, like I mentioned, there used to be a lot of grunt work. I mentioned backend already; on the frontend side, smaller design fixes like moving a button here or making something responsive were all time-consuming. And to be honest, no developer likes to work on CSS; it's considered boring. These are places where AI helps a lot: it understands what you have written and fixes the smaller glitches. That's one.
On a much broader level, we don't have to spend as much time sitting and coding as it would have taken me earlier. I need to put my time into understanding what we're trying to build and how we're laying out the whole system. The biggest problem with these AI coding tools is that once the codebase starts getting bigger, it's very difficult even for the AI to understand and fix things for you, and the developer wouldn't know anything either, because the AI wrote most of it, right? So understanding how the whole architecture fits together and then taking the help of an AI tool, in my opinion, makes more sense.
On the developer level, that means spending more time on understanding how to make this work, and also on the modularity of the code. Because if I ask Cursor to write something, it'll give me a single 2,000-line file, right? Nobody would be able to understand that. Within it there'll be a lot of interactions, which is again going to take a lot of developer hours just to understand before building on top of it. And you can't give the whole description of an app and ask it to build everything at once; you have to make it modular. So it's better that the developer takes more time and figures out, "Okay, what are the modules I want to build? How do I want the interactions to be?" and then takes the help of an AI coding assistant to create the smaller modules.
So one, the understanding will be a little deeper, so debugging will be much quicker after that. And two, when new things have to be built on top, that's going to be way quicker; it won't take as much time again, and there too the developer can make use of an AI tool.
This is something I have seen with larger orgs as well. I've been interacting with engineers elsewhere, and many of them have this issue: their codebase is already huge, built up over many years, and now they find it very difficult to use any AI to understand it or change things. So they're still building the traditional way. If the strategy is a little different from what used to be done earlier, things can be much quicker and simpler.
[Amit]: On the testing side?
[Kushal]: Sorry, one thing before that. With these tools sometimes generating code that is non-optimal, not the best, with far too many lines written, I believe that's going to create a lot of problems for people on the product and business side, for people like you.
[Amit]: Yes, yeah, definitely.
[Kushal]: If it's just about making something work, then it's very good. But when you're adding things and building on top, or later trying to ensure everything is optimal, or you want to build something a little bit state-of-the-art, that's where an engineer's time is more required. And I think it's time engineers become a little smarter than just "okay, I know React, I know Python." Maybe that's not enough today.
For the second part of the question, on the testing side, things have evolved a lot. What was earlier just unit testing, API testing, or manual testing has now evolved into LLMs being used for testing as well. Start with API testing: earlier you had to define things manually, hardcoding "if these are the inputs, these are the outputs, so test this." Now an LLM can generate the complete test cases itself. And that's not all: the LLM as a testing model is something which is growing very quickly now, including using one LLM to test another, that is, to perform evals on an AI or a RAG system which has been built.
There, things are not exactly on point today, at least with how much I have tried it out. But on the language side, things are growing, and we are able to use AI to test another AI: to check "this was the tone we wanted, this is the type of response we wanted," with the judging LLM seeing whether the output aligns with what was expected.
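(The "AI testing AI" idea above looks roughly like the sketch below. Everything here is hypothetical: `call_llm` is a stand-in for a real hosted-model call, faked with a trivial keyword check so the harness structure runs end to end.)

```python
# Sketch of an LLM-as-judge eval: a judge model checks whether a
# system's reply matches the expected tone. `call_llm` is a
# hypothetical stand-in for a real model call.
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a hosted model.
    # Here we fake a judge that flags obviously informal wording.
    reply = prompt.split("REPLY:\n", 1)[1]
    return "FAIL" if "hey" in reply.lower() else "PASS"

def judge(expected_tone: str, reply: str) -> bool:
    verdict = call_llm(
        f"Does the reply below match a {expected_tone} tone? "
        "Answer PASS or FAIL.\n"
        f"REPLY:\n{reply}"
    )
    return verdict.strip().upper() == "PASS"

ok = judge("formal, client-facing", "Dear Ms. Rao, your portfolio review is attached.")
bad = judge("formal, client-facing", "hey! portfolio stuff attached")
print(ok, bad)  # True False
```

The value of the harness is the strict PASS/FAIL contract: the judge's free-form reasoning is collapsed into a verdict that a test suite can assert on.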
But on the actionable side, I think we still haven't gotten there. We have been building a few agentic products as well, one of which is just to do audits, where we need to see the complete workflow as it happens. The problem is that the workflow itself isn't static; it's very dynamic. It needs to adapt based on what type of data is coming, where it's going, and what inferences are made at every layer of the workflow that has been built. So far, with my research, I haven't found a definitive way of making these tests work with AIs. But maybe that's something which will be solved soon.
[Amit]: Right. Very interesting. As we very well know, I have also written a blog on this recently that after DeepSeek and all these Chinese models and other players also bringing models, all the ChatGPT foundational layer stuff which was pretty much the point of discussion in '23, '24, now suddenly it looks like the foundational layer is getting commoditized. More and more models are coming out. Again, the focus is back on the application layer, which happens every single time there's a new piece of technology which comes out, whether it was blockchain, fintech, whatever it was. It comes back to industry-first, problem-first kind of approach. And so suddenly the application layer and vertical AI is becoming very important.
Can you walk us through what is your overall approach to building AI in the application layer?
[Kushal]: Right. That's becoming very vast now. What the application layer was a year ago has evolved a lot today. On the application layer, one thing we do is utilize existing AI models to help us with inference or with smaller, specific things we need done.
The broader way of looking at it is this: an LLM is trained on billions of parameters. But now I want to make it more specific for my smaller use case. For this, I might add additional context, some data, user data, any of it. Essentially, I'm constraining it to something and giving it explicit instructions that "this is the only thing you have to do." In effect, I'm narrowing what the model pays attention to, out of everything it has been trained on, and telling it, "I want you to focus on only this much."
So on the application layer, while building, essentially we are coupling multiple of these constrained models together to arrive at some conclusions. Just like I spoke about Audit, we have been building another wealth product as well, called FastTracker, where we have multiple such modules. One just creates an email based on the transcript of a complete conversation. There, the way the module is set up, we have ensured that it understands how the email has to be written, from a wealth manager's point of view to a client. Based on that, it looks at the transcript and does that one specific task, right?
While this might seem rule-based, there are more complexities elsewhere. Say I want to summarize and understand the whole conversation, create tasks out of it, and maybe extract the family tree or other things which are essential from a wealth manager's point of view. That's where an AI can help ensure things are done properly. And doing it on the application layer, rather than just telling a bigger model to do it, helps ensure the syntax is maintained. If I ask our constrained model for a particular JSON, it gets it right every time with barely any deviations; but if you ask plain Claude or even GPT to do it, it's not going to do it consistently. It might do it once or twice, but that's not enough to build a product on top of.
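(The "always the right JSON" requirement above is usually enforced with validation around the model call. A minimal sketch, assuming a hypothetical `fake_model` stand-in and made-up field names; a real system would call an actual LLM and retry on deviation.)

```python
import json

# Never trust raw model text: validate the JSON against the fields the
# product needs, and retry or reject on deviation.
REQUIRED_FIELDS = {"client_name", "tasks", "follow_up_date"}

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a constrained LLM call.
    return ('{"client_name": "A. Shah", "tasks": ["send KYC form"], '
            '"follow_up_date": "2025-04-02"}')

def extract_structured(prompt: str, retries: int = 2) -> dict:
    for _ in range(retries + 1):
        raw = fake_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if REQUIRED_FIELDS <= data.keys():
            return data  # exactly the structure downstream code expects
    raise ValueError("model never produced the required JSON shape")

result = extract_structured("Summarize the call transcript as JSON.")
print(result["tasks"])
```

The retry loop is what turns a model that is right "once or twice" into something a product can actually depend on: deviations are caught before they reach downstream code.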
So on the application layer, ensuring we have these constrained models—and here we can use either an LLM or an SLM as well, because we don't really need too many things for smaller tasks or smaller things to be inferred about—all added up together is what creates a complete product on the application layer, in my opinion. And that's what we have been doing as well at GI Ventures.
[Amit]: Right, right. Super interesting. And talking about the application layer, many AI systems that we see today, products that we see today, they rely on a lot of data wrappers and middleware layers such as LangChain, etc., etc. And I know that in the first few weeks of developing some of this stuff last year, you said we are not going to use any of these; we'll develop our own. I wanted to ask you, first of all, what are the challenges with some of these data wrappers and middleware layers, and why and how should we develop our own?
[Kushal]: Right. Well, I'll give my opinion, but there are a lot of people who are still using these frameworks because it helps them build quickly, and I think that is essential as well. But when I started using them and building with a couple of AI frameworks, I soon realized that it comes with a few more complications as well.
One is that the libraries used within a lot of them get replicated, and it's difficult to expect each of these frameworks to be maintained, because most of them are open source. Within them, say there's a wrapper to create an agent; it's built to handle a wide range of tools. But from our perspective, we don't need that many. I might only want three or four tools: one to access emails, another to read calendars, maybe Google Drive, something like that. For these use cases where I know exactly what I want to do, the frameworks mess up a lot, because they're not built to do just these things; they're built for a general user who should be able to build anything with them.
That's where I realized very soon that frameworks are good for building something quickly, but you can't expect them to be very reliable or all the outputs to be really great. We did a lot of evals while building and saw a lot of issues; things were not working out. Soon it made sense to move away. While it took a little time to create our own modules for everything and write the libraries ourselves, I think that was a good decision, because we haven't seen nearly as many issues as we used to.
Plus, all the libraries we're using are in our control. We decide what to use, and we know how each of them is progressing. So the control is in our hands. And to be honest, in my opinion, these frameworks are not that complicated either. In the end, they're just an abstraction layer, and that abstraction is not hard to understand: there's an LLM and some workflows you're trying to build. An agent is literally that: you give it some tools and ask it to make some decisions on top of them. So yeah, that's why we moved away, and it made sense for us to build our own.
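(The framework-free agent described above can be as small as the sketch below. This is a hypothetical illustration, not GI Ventures' code: `plan_next_step` stands in for the LLM's tool-selection call, replaced here by a trivial routing rule, and the two tools return canned strings.)

```python
# Minimal framework-free agent loop: a few known tools and a model
# deciding which one to call, dispatched through our own code.
def read_email(query: str) -> str:
    return f"3 unread emails matching '{query}'"

def read_calendar(query: str) -> str:
    return f"next event for '{query}': Tue 10:00"

TOOLS = {"email": read_email, "calendar": read_calendar}

def plan_next_step(goal: str) -> tuple[str, str]:
    # A real agent would ask the LLM to pick a tool and arguments;
    # we hardcode a trivial routing rule for illustration.
    return ("calendar", goal) if "meeting" in goal else ("email", goal)

def run_agent(goal: str) -> str:
    tool_name, arg = plan_next_step(goal)
    return TOOLS[tool_name](arg)  # dispatch through code we fully control

answer = run_agent("find the client meeting")
print(answer)
```

Because the tool registry and dispatch are a dozen lines of your own code, every call path is auditable, which is the control the speaker is describing.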
[Amit]: Yeah, unique perspective, I would say. Not many people talk about it. And yes, evals, evals, evals—one thing we have learned now.
Shifting the discussion: the big talk of the town is obviously chatbots, agents, and autonomous agents. I've always felt that what you see all around today is semi-autonomous agents, or copilots, where a human in the loop is required to continuously work with the AI, like Perplexity, or even FastTracker or Howlyr. But people have started talking about autonomous agents that can take actions pretty much on their own: you just give a goal, and they complete, like, 17 tasks and actions. And I actually tried a few of them, I fell into the trap, and none of them actually works autonomously, to be honest. They fail most of the time. And now there's Manus; the Manus hype seems real. In your opinion, are we going to see autonomous agents execute very complex tasks on their own, or are there limitations and challenges?
[Kushal]: Maybe in the future we will see those days, but today, at least with how much I have understood so far, it's still difficult to have an AI which does a very wide range of tasks perfectly all the time. For smaller workflows, it's great; we can still use them. That's why AIs like Manus come into the picture. Although I haven't tried it firsthand, from what I have seen so far, it is able to do some pretty decent complex tasks as well.
But I don't expect it to be accurate across every domain and everything. That's where some specificity is needed. And since you mentioned copilots, it makes sense, at least for today, to have a human in the loop for products where we can't afford any fallacies, any wrong outputs. An AI will come up with responses for sure, and maybe it doesn't mess up that often. But the way current testing frameworks are, and since AI mostly works on natural language, it's very difficult to get to 100% accuracy. Achieving that kind of reliability on AI systems for complex workflows is still very difficult, even for an application layer product handling very big, complex tasks.
[Amit]: Right, yeah. This is something to watch out for. I think it will be very interesting to see what happens.
Switching gears to RAG. RAG is being widely adopted to enhance LLMs by grounding responses in external knowledge sources specific to the task. Everybody has played around with RAG; I even heard the founder of one of these big AI companies say they were the first ones to work on it. Sometimes there are challenges, and I've heard you say there are many different types of RAG to make it more efficient; for example, adaptive RAG is something you're actually using in one of the products. Without going into specifics or anything confidential, can you explain where we are with RAG? Are there different types, and are some more useful than others?
[Kushal]: Yeah, definitely. To start with, RAG was a way to make an LLM specific, similar to what we were already discussing about the application layer. The way it works is: there's a question or query we're inputting to an LLM, but we want the answer to be a little more specific. So we keep a database and do a similarity match, or some other way of retrieving additional data that is relevant to the query. We append that as additional context before feeding it to the LLM. That's how any RAG works.
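(The retrieve-then-append flow just described can be sketched end to end. This is a toy illustration: real systems use an embedding model and a vector database, while here the "embeddings" are hand-written 3-dimensional vectors.)

```python
import math

# Toy RAG retrieval: score stored chunks by cosine similarity to the
# query vector, then prepend the best match to the prompt as context.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "fee schedule":  [0.1, 0.9, 0.1],
    "office hours":  [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec):
    # Return the stored chunk most similar to the query.
    return max(DOCS, key=lambda name: cosine(DOCS[name], query_vec))

def build_rag_prompt(query: str, query_vec) -> str:
    context = retrieve(query_vec)
    return f"Context: {context}\n\nQuestion: {query}"

# A query about refunds embeds near the first document's vector.
prompt = build_rag_prompt("How do refunds work?", [0.8, 0.2, 0.1])
print(prompt)
```

The LLM itself is untouched; all the "specificity" comes from what gets stuffed into the context before the call.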
RAG is essentially a way of making LLMs more specific by giving them additional context, but over time, multiple variants have come out for different use cases. Some are for conversations, where we want the output quickly and people won't wait long; there's a specific RAG for that. In other places we want results to be very, very accurate; there's something called Golden Retriever RAG, which is about ensuring the output is more accurate when time is not a constraint.
Similarly, there's adaptive RAG. There are multiple such RAGs, over 15 now. Adaptive RAG is where we make the system adaptive. Say today there's some output given by the RAG system, and there's feedback associated with it. Based on that, the database needs to get updated, so that the additional context we append to incoming queries, which comes out of a database, whether a vector database, a SQL database, anything, reflects that feedback. The data sitting there is what we make adaptive in an adaptive RAG.
Based on that, there are multiple types of feedback, like RLHF, reinforcement learning from human feedback, where a human looks at the output and gives feedback: "No, there were some discrepancies; this is how you need to do it going forward." Based on whatever feedback comes in, we use another LLM to find the relevant entries in our database and tweak or update them according to the feedback the human has given.
Additionally, there's something called prompt breeding, which is now getting very popular. Earlier it was a single huge prompt given to a RAG system, or just to an LLM, telling it to do only these things. Now, based on the feedback, the prompts can be updated as well; the whole prompt keeps evolving with whatever feedback comes in.
So over time, essentially, the whole RAG system gets better, right? Because in a RAG we are not really training or fine-tuning the model. Fine-tuning is expensive, and we can't be fine-tuning on a daily basis, right? So this seems a better approach when the system needs to keep evolving: there is someone who can give feedback, and we can make the whole database adaptive, so it keeps adapting based on the responses it gives over time.
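(The adaptive loop just described, feedback rewriting the stored context rather than retraining the model, can be sketched as below. `revise_with_llm` is a hypothetical stand-in for the "another LLM" the speaker mentions, faked here by appending the correction so the update is visible.)

```python
# Sketch of adaptive RAG: human feedback does not retrain the model;
# it rewrites the stored document the retriever will serve next time.
STORE = {"onboarding": "Clients must visit the branch to sign forms."}

def revise_with_llm(old: str, feedback: str) -> str:
    # A real system would ask an LLM to merge the correction into the
    # document; here we just append it so the change is observable.
    return f"{old} [Updated per feedback: {feedback}]"

def apply_feedback(doc_id: str, feedback: str) -> None:
    STORE[doc_id] = revise_with_llm(STORE[doc_id], feedback)

apply_feedback("onboarding", "e-signature is now accepted")
print(STORE["onboarding"])
```

After this update, every future query that retrieves the "onboarding" document gets the corrected context, with no fine-tuning run involved.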
[Amit]: Right, very interesting. Also, one more topic that I want to discuss was Anthropic had come out with this whole MCP protocol. And we had briefly discussed back then, but suddenly in the last 10 days, I'm seeing there's a lot of chatter about what is MCP, and what is this whole client and server configuration that you can access other apps. What is the difference between APIs and MCP, and so on and so forth. When we built Howlyr, there was a lot of need to talk to the other tools like calendar, email, Notion, and you had done it in a certain way. But now there is MCP, so it should only ease out things. I wanted to know what is your view of MCP and how should we be looking at it?
[Kushal]: Right. So MCP is a standard created for these kinds of interactions between an agent or LLM and other tools, other apps, any data connection layer you'd want to integrate. In Howlyr, we integrated Gmail, calendar, and Notion; we had to use the APIs individually and create a layer on top. With MCP, everything becomes modular, more democratized: anyone can create these layers and host them on a server, and anyone can access them directly, without having to go and integrate each of these systems separately.
So it helps in having direct layers created where I can just build on top of my LLM and create these agents directly connected to these layers. And it just saves time essentially, and we don't have to do the individual linking, individual API, or create a layer on top of these APIs separately.
A lot of companies are now creating MCP servers: they host these data connection layers on their servers and expose them as an abstraction for anyone building an agent. From my understanding, it just makes things quicker. And since it's a standard, enough things are handled within it, like security and the data communication protocols. That's why I think we're going to see more such protocols in the future, just for the interaction between LLMs and agents. For now, it simply makes it easier to build on top of LLMs.
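(To make the "standard" concrete: MCP messages ride on JSON-RPC 2.0, and the spec defines methods such as `tools/list` and `tools/call`. The tool name and arguments below are made up for illustration; a real client would send this over an MCP transport rather than just serializing it.)

```python
import json

# Illustrative shape of an MCP tool-invocation request (JSON-RPC 2.0).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_calendar",              # hypothetical tool on some server
        "arguments": {"date": "2025-04-02"},  # tool-specific, made-up schema
    },
}
wire = json.dumps(request)
print(wire)
```

The point of the standard is exactly this uniformity: an agent can call a calendar tool, an email tool, or a Notion tool with the same message shape, instead of a bespoke integration layer per API.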
[Amit]: Right, especially in the B2B vertical AI world where we have to embed AI in the workflows, I think it's very important. Integrations is important.
To wrap this up, Kushal, any parting thoughts on what we should be looking forward to and what will be the sort of highs and lows of AI in the next several quarters?
[Kushal]: Yep. I think it's the reasoning models being talked about a lot now. It all started with chain of thought; in the end, it's chain-of-thought reasoning happening inside them. Earlier, the focus was more on the latency of the LLM; now I think it's shifting towards accuracy, where reasoning models are allowed to take more time and give more structured, better responses. Over time, I feel accuracy is going to matter a lot, especially for anyone building application layer products on top, where there isn't much room for error.
And another is the testing side of things, because that's where I believe things are still much more nascent than they could be. Both of these evolving, plus the reasoning models themselves, give a lot of capability to build agents on top. If we just switch from GPT-3.5 Turbo or GPT-4 to, I don't know, DeepSeek R1, that's already an upgrade for an application layer product. So over time, I think that's where the foundational models are going to go.
And similar to how Manus works on the actionable side of things, ensuring that the actions taken based on the LLM's responses are right will improve as the reasoning models get better. I think that's where the evolution is going to be.
And security is a big thing as well. So that's something which is going to be looked at a lot more, and MCP was, I think, the starting point there.
[Amit]: Right. It's an exciting time in AI. And at GI Ventures, we decided that we should start this podcast, which is a no-BS podcast from the builders who are building AI solutions, and an unfiltered view. So thanks everyone for joining and listening to the podcast. Thanks, Kushal, for sharing your very technical views, some of which I don't understand. Looking forward to discussing more on this Build AI podcast.
[Kushal]: Thanks, Amit.



