
Build AI, push the limits
GTA and Call of Duty use his Voice AI models | Carter Huffman, Modulate AI
A conversation with Carter Huffman, Modulate AI
Amit:
Hello everyone. Welcome to yet another episode of Build AI. I am very excited today as we have Carter Huffman, who is the CTO and co-founder of Modulate AI. I think today we are going to go very deep and very technical, because we also have Kushal, who is the CTO and co-founder of GI Ventures, an ex-AI researcher, ex-CTO of a well funded company, and now building AI solutions at GI Ventures. So welcome Carter. Welcome to the pod.
Carter:
Thank you so much. Super excited to be here. I love talking tech. I love talking AI. So I think we are going to have a really fun conversation.
Amit:
And welcome Kushal.
Kushal:
Thanks. Hi Carter.
Amit:
Let me actually start with something very interesting. I read about your profile when we were prepping. You are an MIT engineer, then you worked at NASA in rocket propulsion, and you are an experimental or theoretical physicist. You know who has a very similar profile? It is Howard Wolowitz from The Big Bang Theory. I wanted to ask you: who copied whom?
Carter:
You have no idea how many times I get that. People are talking to me and they are like do you watch Big Bang Theory or Silicon Valley or these shows and things like that. I actually do not watch those shows because it is too real for me. It is like I basically live that every day. I need my entertainment to be something completely different like Victorian England or sci fi or whatever. I cannot do that thing where it is basically my exact life. I think they came first but I would not quite say I copied because it is too real.
Amit:
I think the episode where he actually goes to the ISS came much later than your time at NASA, so the credit goes to you. We had so much fun reading about your background and the depth of the kind of stuff you have done, from NASA to AI. A good segue would be to talk about your current company, Modulate AI. When did this company get started, and what is the problem you guys are solving?
Carter:
Fantastic question. The research behind Modulate has been going on for about a decade now but the company proper only started in the beginning of 2019. We raised our first seed round. We have always been focused on voice AI and making voice AI accurate, scalable, low latency and especially cost effective.
That being said, we went through a couple different iterations of the company. We started out in generative voice AI before people were talking about generative AI a lot, doing voice conversion. So make my voice sound like Katy Perry or Taylor Swift live as I am talking. We thought this was super cool. It was awesome technology. It worked really well. We tried to sell it. Nobody was really interested.
We were trying to sell it to the video game world and say sound like the character you are playing. Everyone was like that sounds kind of risky. I do not know if people would really like that.
But as we were doing that and talking to these companies, we found that their biggest problems were actually analyzing what people were saying on their platform and doing something about it if it crossed a line.
In the gaming space this was finding toxicity, harassment, radicalization, all the way up to child grooming and child safety stuff. Really serious things. They were having a big problem with that.
And think about Call of Duty or Grand Theft Auto Online or Rec Room. These games can have hundreds of millions of hours of voice conversations happening on their platform each month. If you just want to transcribe that stuff, you are going to pay tens of millions of dollars a month. Nobody is going to pay that. That is such a high cost.
Because of our ability to focus in on these voice AI problems and solve exactly what you are looking for at extremely high scale, we are able to solve those voice analysis problems at 100 to 1000 times cheaper than trying to do that with off the shelf components.
Today we are a company that builds extremely cost efficient and accurate voice understanding models to solve specific problems like toxicity and harassment, fraud in phone calls, whether your voice AI agent is going off the rails and doing something it should not be doing and other things.
Amit:
That is really fascinating. I want to ask something very basic. There was this era of machine learning, then deep learning, then transformers, and now we have large language models. When you talk about the evolution of Modulate AI, how have you seen the technology change?
Carter:
There is a big arc of evolution but there are constants that remain the backbone of machine learning and AI. One big one is understand your domain, understand your problem statement and build your priors into your system.
It is very difficult to get repeatable highly accurate deterministic results out of a system if it is something super general and you are just throwing your data at it and hoping it works.
LLMs are magic. They sometimes do that and it is so cool. But if you run an LLM and try to get it to do the same task ten thousand times, nine thousand times it will do a great job and then a thousand times it will run off and do something weird or different or stupid. You will ask why did that happen.
That has always been true in machine learning. Neural nets do some of the weirdest things I have ever seen. You try to get it to do X and it almost figures out a way to adversarially disobey your instructions and go do Y instead.
The more you can bake in your priors, your expert knowledge and inject that into how you build and construct your systems, the better results you are going to get. That has always been true. That is a fundamental reason why our company can do these voice AI tasks very well.
Kushal:
I can relate to that. The first transformer architecture was proposed in 2017, and I see you started right around then. Did you start outright with the transformer architecture? How did you approach building real time voice models?
Carter:
We actually started at the completely opposite end. We started with raw DSP and a couple basic classifiers and predictors.
One of the things about voice AI that I find really cool is how many applications require extremely low latency or extremely low compute resources.
Imagine you are playing a video game or hosting a Zoom call. Your computer is doing a lot of work and we are trying to run a real time voice analysis or synthesis model live on your local machine that needs to never drop frames.
You are processing 20 millisecond buffers. You have 20 milliseconds to process that buffer. If you take one millisecond too long, that will cause a frame drop and then you get glitchy disrupted audio. It is a horrible experience.
Voice AI models tend to have constraints that other AI models do not have. You need to be resource efficient and cost efficient. Imagine you are playing Call of Duty and trying to run a neural net at the same time. The GPU is busy playing Call of Duty. How many resources do you really have?
So a really big expensive model is just completely impossible in many voice applications. You have to start with DSP and shallow networks and then build up carefully.
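The real-time budget Carter describes can be sketched in a few lines. This is an illustrative stub, not Modulate's code: the sample rate, buffer size, and the trivial `analyze` callback are all assumptions, but the arithmetic shows why even one extra millisecond per buffer means a dropped frame.

```python
import time

SAMPLE_RATE = 48_000           # samples per second (illustrative)
BUFFER_MS = 20                 # each audio buffer covers 20 ms
BUFFER_SAMPLES = SAMPLE_RATE * BUFFER_MS // 1000  # 960 samples per buffer
DEADLINE_S = BUFFER_MS / 1000  # the model must finish within one buffer period

def process_buffer(buffer, analyze):
    """Run an analysis callback on one buffer and flag deadline misses."""
    start = time.perf_counter()
    result = analyze(buffer)
    elapsed = time.perf_counter() - start
    missed = elapsed > DEADLINE_S  # even 1 ms over the budget drops a frame
    return result, missed

# A trivially cheap "model" easily meets the 20 ms budget; a large neural
# net sharing the GPU with a game often cannot.
buffer = [0.0] * BUFFER_SAMPLES
result, missed = process_buffer(buffer, analyze=lambda b: max(b))
```

In a real audio callback the deadline is enforced by the driver, not checked after the fact, which is why the model has to be small enough to meet the budget on every single buffer.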
Kushal:
So is the first layer within the device itself? Are you trying to infer something from the voice and then deciding whether to send it to the cloud or down to deeper networks? How does it work exactly?
Carter:
Right now, with our primary products around our voice analysis platform and our APIs, we are doing everything on our cloud side, because once you start saying we need you to install this SDK, or we need you to have certain compute resources, or, God forbid, you need a GPU, that adds friction.
So it makes more sense for us to do everything in the cloud for our current products. But we actually have that same layered architecture under the hood. It still achieves cost efficiency, low latency and high scale.
As soon as audio comes into our system, it hits extremely efficient, extremely fast, super scalable networks that can process tens of thousands of hours of audio in parallel. Then you triage it.
You say okay, what did we find out with this really tiny machine learning model? It is not always right, but it provides interesting information. What do we do with that information? What models do we run next? You triage it through an ensemble of models that only consume resources when you need to know something new.
That is how we do it.
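The triage pattern Carter outlines can be sketched as a cascade: a tiny screening model runs on everything, and a larger model runs only on clips the screen flags. The two model functions here are hypothetical stand-ins, and the threshold is an assumption; the point is the control flow, which is where the cost savings come from.

```python
def cheap_screen(audio: str) -> float:
    """Tiny, fast model: returns a rough risk score in [0, 1] (stub)."""
    return 0.7 if "heated" in audio else 0.1

def expensive_analysis(audio: str) -> dict:
    """Large, slower model: invoked only when the screen flags the clip (stub)."""
    return {"label": "toxic", "confidence": 0.92}

def triage(audio: str, threshold: float = 0.5) -> dict:
    """Run the cheap screen on everything; escalate only flagged audio."""
    score = cheap_screen(audio)
    if score < threshold:
        # Most audio exits here, having consumed almost no compute.
        return {"label": "clean", "escalated": False}
    deep = expensive_analysis(audio)
    return {**deep, "escalated": True}
```

If, say, 95 percent of audio exits at the cheap screen, the expensive model's cost applies to only the remaining 5 percent, which is how orders-of-magnitude savings over transcribing everything become plausible.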
Kushal:
I have a more technical question. With model ensembling, where multiple models interact, latency usually increases. How do you handle that?
Carter:
Fantastic question. The idea is to optimize for a feed forward acyclic processing pipeline to get results as quickly as possible, be fault tolerant, and then feed results back so that as new data comes in you are doing a better job.
We fan data out to clusters proactively. Start sending data as soon as you get it. Bandwidth is cheaper than GPUs. We collect results in a best effort way.
A common mistake is requiring all models in an ensemble to return before responding. That works most of the time but occasionally it will take much longer because one machine hiccups, or it fails entirely.
You need to be okay if some models fail or take longer. Return results as quickly as possible and then improve in the background.
If I am streaming transcription to you, I will give you the best effort transcription immediately. But you will notice it gets better over time because I am feeding context back in.
You need distributed systems expertise combined with ML knowledge. An ensemble is a distributed system.
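The best-effort collection Carter describes can be sketched with asyncio: fan out to every model, wait only up to a fixed budget, and return whatever came back rather than blocking on the slowest machine. The model calls and their delays here are simulated assumptions; the key move is cancelling stragglers instead of waiting for them.

```python
import asyncio

async def run_model(name: str, delay: float, verdict: str):
    """Stand-in for a remote model call; `delay` simulates compute/network time."""
    await asyncio.sleep(delay)
    return name, verdict

async def best_effort_ensemble(calls, budget_s: float) -> dict:
    """Fan out to all models; keep whatever returns within the budget."""
    tasks = [asyncio.create_task(run_model(*c)) for c in calls]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for t in pending:
        t.cancel()  # a slow or hung model must not block the response
    return dict(t.result() for t in done)

calls = [
    ("fast_screen", 0.01, "clean"),
    ("transcriber", 0.02, "clean"),
    ("slow_model", 5.0, "toxic"),  # simulates one machine hiccuping
]
results = asyncio.run(best_effort_ensemble(calls, budget_s=0.2))
```

A follow-up pass can then merge late or retried results in the background, which is the feedback loop that makes streaming transcription improve over time.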
Amit:
That is fascinating. Let us shift to founders building voice agents. I tried sales voice agents from six or seven companies and most of them failed. Some got stuck at voicemail. There were latency issues. There is also no motivation to talk to an AI salesperson. What is your advice?
Carter:
It is a really nascent field and people are still figuring it out. I would offer two pieces of advice.
First, voice AI must feel natural. It is still impossible today to have a completely natural conversation with even the most advanced voice AI.
Many founders focus on cost or capability or integrations. Right now voice AI is immature enough that you must focus on quality and accuracy first. That includes latency and naturalness.
When a voice AI agent misunderstands something, you do not get a second shot. The moment it makes a mistake, the veil is pierced. The user immediately thinks they are talking to a bot and the conversation changes.
Humans are very tuned to detecting weird sounding voices. There is even an evolutionary argument that if a voice sounds off, it could signal danger or deception. So unnatural voice interaction triggers something deep in the brain.
Focus on quality first. Try talking to your system. Does it feel natural? Everything else you can build. But if it does not feel natural, the experience collapses.
Amit:
That makes sense. On the non voice side, there is a debate between using large language models versus training proprietary smaller models. How should founders think about that?
Carter:
Understand your problem space. If you are reviewing legal documents, that is language modeling. It should not shock anyone that the most sophisticated language models are good at that.
You could train a legal only model, but you are competing with massive foundation models at language modeling.
However, if you are doing a restricted task like classifying documents into five categories, that is a much easier task. You probably do not need extreme nuance. You can solve that with a smaller model.
If you use a giant foundation model for a narrow repeatable task, you are spending latency and cost without buying accuracy. You are buying wasted capacity.
If you are building a coding agent where you do not know what people will try to code, then you need extraordinary generality. Use the best large model.
But in many cases, you can break a hard problem into easier subproblems and solve those with smaller models.
Do not try to compete with OpenAI at building the most general language model. Find places where extra capacity is wasted and build better systems there.
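Carter's five-category document example can be made concrete with a deliberately tiny classifier. The categories and keyword lists below are hypothetical, and keyword scoring stands in for whatever small trained model you would actually use; the point is that a narrow, repeatable routing task does not need a frontier LLM's latency or cost.

```python
# Hypothetical five-way document router. Keyword scoring is a stand-in for
# a small trained classifier; the categories are illustrative only.
CATEGORIES = {
    "invoice":  {"invoice", "amount due", "payment"},
    "contract": {"agreement", "party", "hereby"},
    "resume":   {"experience", "education", "skills"},
    "support":  {"issue", "help", "error"},
    "other":    set(),
}

def classify(text: str) -> str:
    """Score each category by keyword hits; fall back to 'other'."""
    lowered = text.lower()
    scores = {cat: sum(kw in lowered for kw in kws)
              for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

A model like this runs in microseconds on a CPU; a foundation model answering the same question spends most of its capacity, latency, and cost on generality the task never uses.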
Amit:
Can you talk about your customers and growth trajectory?
Carter:
We are about 45 people. We have raised roughly 30 to 40 million dollars since 2019.
We started in gaming and social platforms focusing on anti toxicity and child safety. Then we expanded into other use cases.
For example, preventing return fraud, preventing security breaches when someone manipulates an IT desk agent into resetting a password or adding an MFA device, and finding if an AI agent is going off the rails.
The common theme is analyzing and understanding conversations and extracting insights that can be acted on in real time.
Amit:
I read that you have read The Lord of the Rings 54 times, and you keep count. Why?
Carter:
It is a comfort place for me. There was a time when I could read a sentence from any chapter and tell you exactly where it was and what was happening.
Every time I read through it, I find a sentence or paragraph I do not remember. It feels new. It is over a thousand pages, so there is always something I missed.
Tolkien’s writing is melodic. The mastery of language is beautiful. Reading it in my head feels like listening to music. It resets my brain.
Amit:
Finally, I believe you have some big news you would like to announce.
Carter:
Yes. We are publishing our first broad strokes conversation understanding model. It is called Velma.
It performs 30 percent better and 100 to 1000 times more cost effectively at consuming, transcribing and understanding voice conversations than leading foundation models.
It is built on the ensemble listening model architecture I have been describing. It efficiently consumes a broad array of conversations and achieves the best understanding with no wasted capacity.
That is what I am most excited about.
Amit:
That is great. Congratulations. You heard it first on the Build AI podcast.
Kushal, any last questions?
Kushal:
I think I am good for now. This was great, Carter and Amit. It was so much fun and I would love to keep talking.
Carter:
I had a wonderful time as well. Thank you so much for having me. I will talk tech for hours. Anytime you want to chat, just hit me up.
Amit:
Thank you so much Carter. We really enjoyed this conversation. With that, we will end this podcast. Thank you.



