
Build AI, push the limits
GTA and Call of Duty use his Voice AI models | Carter Huffman, Modulate AI
A conversation with Carter Huffman, Modulate AI
Amit:
Hello everyone. Welcome to yet another episode of Build AI. I am very excited today as we have Carter Huffman, who is the CTO and co-founder of Modulate AI. I think today we are going to go very deep and very technical, because we also have Kushal, who is the CTO and co-founder of GI Ventures, an ex-AI researcher, ex-CTO of a well funded company, and now building AI solutions at GI Ventures. So welcome Carter. Welcome to the pod.
Carter:
Thank you so much. Super excited to be here. I love talking tech. I love talking AI. So I think we are going to have a really fun conversation.
Amit:
And welcome Kushal.
Kushal:
Thanks. Hi Carter.
Amit:
Let me actually start with something very interesting. I read about your profile when we were prepping. You are an MIT engineer, then you worked at NASA in rocket propulsion, and you are an experimental or theoretical physicist. You know who has a very similar profile? It is Howard Wolowitz from The Big Bang Theory. I wanted to ask you: who copied whom?
Carter:
You have no idea how many times I get that. People are talking to me and they are like do you watch Big Bang Theory or Silicon Valley or these shows and things like that. I actually do not watch those shows because it is too real for me. It is like I basically live that every day. I need my entertainment to be something completely different like Victorian England or sci fi or whatever. I cannot do that thing where it is basically my exact life. I think they came first but I would not quite say I copied because it is too real.
Amit:
I think the episode where he actually goes to the ISS came much later than your time at NASA, so the credit goes to you. We had so much fun reading about your background and the depth of the kind of stuff you have done, from NASA to AI. A good segue would be to talk about your current company, Modulate AI. When did this company get started, and what is the problem you guys are solving?
Carter:
Fantastic question. The research behind Modulate has been going on for about a decade now but the company proper only started in the beginning of 2019. We raised our first seed round. We have always been focused on voice AI and making voice AI accurate, scalable, low latency and especially cost effective.
That being said, we went through a couple different iterations of the company. We started out in generative voice AI before people were talking about generative AI a lot, doing voice conversion. So make my voice sound like Katy Perry or Taylor Swift live as I am talking. We thought this was super cool. It was awesome technology. It worked really well. We tried to sell it. Nobody was really interested.
We were trying to sell it to the video game world and say sound like the character you are playing. Everyone was like that sounds kind of risky. I do not know if people would really like that.
But as we were doing that and talking to these companies, we found that their biggest problems were actually analyzing what people were saying on their platform and doing something about it if it crossed a line.
In the gaming space this was finding toxicity, harassment, radicalization, all the way up to child grooming and child safety stuff. Really serious things. They were having a big problem with that.
And think about Call of Duty or Grand Theft Auto Online or Rec Room. These games can have hundreds of millions of hours of voice conversations happening on their platform each month. If you just want to transcribe that stuff, you are going to pay tens of millions of dollars a month. Nobody is going to pay that. That is such a high cost.
Because of our ability to focus in on these voice AI problems and solve exactly what you are looking for at extremely high scale, we are able to solve those voice analysis problems at 100 to 1000 times cheaper than trying to do that with off the shelf components.
Today we are a company that builds extremely cost efficient and accurate voice understanding models to solve specific problems like toxicity and harassment, fraud in phone calls, whether your voice AI agent is going off the rails and doing something it should not be doing and other things.
Amit:
That is really fascinating. I want to ask something very basic. There was this era of machine learning, then deep learning, then transformers, and now we have large language models. When you talk about the evolution of Modulate AI, how have you seen the technology change?
Carter:
There is a big arc of evolution but there are constants that remain the backbone of machine learning and AI. One big one is understand your domain, understand your problem statement and build your priors into your system.
It is very difficult to get repeatable highly accurate deterministic results out of a system if it is something super general and you are just throwing your data at it and hoping it works.
LLMs are magic. They sometimes do that and it is so cool. But if you run an LLM and try to get it to do the same task ten thousand times, nine thousand times it will do a great job and then a thousand times it will run off and do something weird or different or stupid. You will ask why did that happen.
That has always been true in machine learning. Neural nets do some of the weirdest things I have ever seen. You try to get it to do X and it almost figures out a way to adversarially disobey your instructions and go do Y instead.
The more you can bake in your priors, your expert knowledge and inject that into how you build and construct your systems, the better results you are going to get. That has always been true. That is a fundamental reason why our company can do these voice AI tasks very well.
Kushal:
I can relate to that. The first transformer architecture was proposed in 2017, and I see you started right around then. Did you start outright with the transformer architecture? How did you approach building real time voice models?
Carter:
We actually started at the completely opposite end. We started with raw DSP and a couple basic classifiers and predictors.
One of the things about voice AI that I find really cool is how many applications require extremely low latency or extremely low compute resources.
Imagine you are playing a video game or hosting a Zoom call. Your computer is doing a lot of work and we are trying to run a real time voice analysis or synthesis model live on your local machine that needs to never drop frames.
You are processing 20 millisecond buffers. You have 20 milliseconds to process that buffer. If you take one millisecond too long, that will cause a frame drop and then you get glitchy disrupted audio. It is a horrible experience.
Voice AI models tend to have constraints that other AI models do not have. You need to be resource efficient and cost efficient. Imagine you are playing Call of Duty and trying to run a neural net at the same time. The GPU is busy playing Call of Duty. How many resources do you really have?
So a really big expensive model is just completely impossible in many voice applications. You have to start with DSP and shallow networks and then build up carefully.
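The real-time budget Carter describes can be sketched in a few lines. This is an illustrative stub, not Modulate's code: the sample rate, buffer size, and the trivial `analyze` callback are all assumptions, but the arithmetic shows why even one extra millisecond per buffer means a dropped frame.

```python
import time

SAMPLE_RATE = 48_000           # samples per second (illustrative)
BUFFER_MS = 20                 # each audio buffer covers 20 ms
BUFFER_SAMPLES = SAMPLE_RATE * BUFFER_MS // 1000  # 960 samples per buffer
DEADLINE_S = BUFFER_MS / 1000  # the model must finish within one buffer period

def process_buffer(buffer, analyze):
    """Run an analysis callback on one buffer and flag deadline misses."""
    start = time.perf_counter()
    result = analyze(buffer)
    elapsed = time.perf_counter() - start
    missed = elapsed > DEADLINE_S  # even 1 ms over the budget drops a frame
    return result, missed

# A trivially cheap "model" easily meets the 20 ms budget; a large neural
# net sharing the GPU with a game often cannot.
buffer = [0.0] * BUFFER_SAMPLES
result, missed = process_buffer(buffer, analyze=lambda b: max(b))
```

In a real audio callback the deadline is enforced by the driver, not checked after the fact, which is why the model has to be small enough to meet the budget on every single buffer.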
Kushal:
So is the first layer within the device itself? Are you trying to infer something from the voice and then deciding whether to send it to the cloud or down to deeper networks? How does it work exactly?
Carter:
Right now, with our primary products around our voice analysis platform and our APIs, we are doing everything on our cloud side, because once you start saying we need you to install this SDK, or we need you to have certain compute resources, or, God forbid, you need a GPU, that adds friction.
So it makes more sense for us to do everything in the cloud for our current products. But we actually have that same layered architecture under the hood. It still achieves cost efficiency, low latency and high scale.
As soon as audio comes into our system, it hits extremely efficient, extremely fast, super scalable networks that can process tens of thousands of hours of audio in parallel. Then you triage it.
You say okay, what did we find out with this really tiny machine learning model? It is not always right, but it provides interesting information. What do we do with that information? What models do we run next? You triage it through an ensemble of models that only consume resources when you need to know something new.
That is how we do it.
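The triage pattern Carter outlines can be sketched as a cascade: a tiny screening model runs on everything, and a larger model runs only on clips the screen flags. The two model functions here are hypothetical stand-ins, and the threshold is an assumption; the point is the control flow, which is where the cost savings come from.

```python
def cheap_screen(audio: str) -> float:
    """Tiny, fast model: returns a rough risk score in [0, 1] (stub)."""
    return 0.7 if "heated" in audio else 0.1

def expensive_analysis(audio: str) -> dict:
    """Large, slower model: invoked only when the screen flags the clip (stub)."""
    return {"label": "toxic", "confidence": 0.92}

def triage(audio: str, threshold: float = 0.5) -> dict:
    """Run the cheap screen on everything; escalate only flagged audio."""
    score = cheap_screen(audio)
    if score < threshold:
        # Most audio exits here, having consumed almost no compute.
        return {"label": "clean", "escalated": False}
    deep = expensive_analysis(audio)
    return {**deep, "escalated": True}
```

If, say, 95 percent of audio exits at the cheap screen, the expensive model's cost applies to only the remaining 5 percent, which is how orders-of-magnitude savings over transcribing everything become plausible.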
Kushal:
I have a more technical question. With model ensembling, where multiple models interact, latency usually increases. How do you handle that?
Carter:
Fantastic question. The idea is to optimize for a feed forward acyclic processing pipeline to get results as quickly as possible, be fault tolerant, and then feed results back so that as new data comes in you are doing a better job.
We fan data out to clusters proactively. Start sending data as soon as you get it. Bandwidth is cheaper than GPUs. We collect results in a best effort way.
A common mistake is requiring all models in an ensemble to return before responding. That works most of the time but occasionally it will take much longer because one machine hiccups, or it fails entirely.
You need to be okay if some models fail or take longer. Return results as quickly as possible and then improve in the background.
If I am streaming transcription to you, I will give you the best effort transcription immediately. But you will notice it gets better over time because I am feeding context back in.
You need distributed systems expertise combined with ML knowledge. An ensemble is a distributed system.
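The best-effort collection Carter describes can be sketched with asyncio: fan out to every model, wait only up to a fixed budget, and return whatever came back rather than blocking on the slowest machine. The model calls and their delays here are simulated assumptions; the key move is cancelling stragglers instead of waiting for them.

```python
import asyncio

async def run_model(name: str, delay: float, verdict: str):
    """Stand-in for a remote model call; `delay` simulates compute/network time."""
    await asyncio.sleep(delay)
    return name, verdict

async def best_effort_ensemble(calls, budget_s: float) -> dict:
    """Fan out to all models; keep whatever returns within the budget."""
    tasks = [asyncio.create_task(run_model(*c)) for c in calls]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for t in pending:
        t.cancel()  # a slow or hung model must not block the response
    return dict(t.result() for t in done)

calls = [
    ("fast_screen", 0.01, "clean"),
    ("transcriber", 0.02, "clean"),
    ("slow_model", 5.0, "toxic"),  # simulates one machine hiccuping
]
results = asyncio.run(best_effort_ensemble(calls, budget_s=0.2))
```

A follow-up pass can then merge late or retried results in the background, which is the feedback loop that makes streaming transcription improve over time.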
Amit:
That is fascinating. Let us shift to founders building voice agents. I tried sales voice agents from six or seven companies and most of them failed. Some got stuck at voicemail. There were latency issues. There is also no motivation to talk to an AI salesperson. What is your advice?
Carter:
It is a really nascent field and people are still figuring it out. I would offer two pieces of advice.
First, voice AI must feel natural. It is still impossible today to have a completely natural conversation with even the most advanced voice AI.
Many founders focus on cost or capability or integrations. Right now voice AI is immature enough that you must focus on quality and accuracy first. That includes latency and naturalness.
When a voice AI agent misunderstands something, you do not get a second shot. The moment it makes a mistake, the veil is pierced. The user immediately thinks they are talking to a bot and the conversation changes.
Humans are very tuned to detecting weird sounding voices. There is even an evolutionary argument that if a voice sounds off, it could signal danger or deception. So unnatural voice interaction triggers something deep in the brain.
Focus on quality first. Try talking to your system. Does it feel natural? Everything else you can build. But if it does not feel natural, the experience collapses.
Amit:
That makes sense. On the non voice side, there is a debate between using large language models versus training proprietary smaller models. How should founders think about that?
Carter:
Understand your problem space. If you are reviewing legal documents, that is language modeling. It should not shock anyone that the most sophisticated language models are good at that.
You could train a legal only model, but you are competing with massive foundation models at language modeling.
However, if you are doing a restricted task like classifying documents into five categories, that is a much easier task. You probably do not need extreme nuance. You can solve that with a smaller model.
If you use a giant foundation model for a narrow repeatable task, you are spending latency and cost without buying accuracy. You are buying wasted capacity.
If you are building a coding agent where you do not know what people will try to code, then you need extraordinary generality. Use the best large model.
But in many cases, you can break a hard problem into easier subproblems and solve those with smaller models.
Do not try to compete with OpenAI at building the most general language model. Find places where extra capacity is wasted and build better systems there.
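Carter's five-category document example can be made concrete with a deliberately tiny classifier. The categories and keyword lists below are hypothetical, and keyword scoring stands in for whatever small trained model you would actually use; the point is that a narrow, repeatable routing task does not need a frontier LLM's latency or cost.

```python
# Hypothetical five-way document router. Keyword scoring is a stand-in for
# a small trained classifier; the categories are illustrative only.
CATEGORIES = {
    "invoice":  {"invoice", "amount due", "payment"},
    "contract": {"agreement", "party", "hereby"},
    "resume":   {"experience", "education", "skills"},
    "support":  {"issue", "help", "error"},
    "other":    set(),
}

def classify(text: str) -> str:
    """Score each category by keyword hits; fall back to 'other'."""
    lowered = text.lower()
    scores = {cat: sum(kw in lowered for kw in kws)
              for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

A model like this runs in microseconds on a CPU; a foundation model answering the same question spends most of its capacity, latency, and cost on generality the task never uses.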
Amit:
Can you talk about your customers and growth trajectory?
Carter:
We are about 45 people. We have raised roughly 30 to 40 million dollars since 2019.
We started in gaming and social platforms focusing on anti toxicity and child safety. Then we expanded into other use cases.
For example, preventing return fraud, preventing security breaches when someone manipulates an IT desk agent into resetting a password or adding an MFA device, and finding if an AI agent is going off the rails.
The common theme is analyzing and understanding conversations and extracting insights that can be acted on in real time.
Amit:
I read that you have read The Lord of the Rings 54 times, and you keep count. Why?
Carter:
It is a comfort place for me. There was a time when I could read a sentence from any chapter and tell you exactly where it was and what was happening.
Every time I read through it, I find a sentence or paragraph I do not remember. It feels new. It is over a thousand pages, so there is always something I missed.
Tolkien’s writing is melodic. The mastery of language is beautiful. Reading it in my head feels like listening to music. It resets my brain.
Amit:
Finally, I believe you have some big news you would like to announce.
Carter:
Yes. We are publishing our first broad strokes conversation understanding model. It is called Velma.
It performs 30 percent better and 100 to 1000 times more cost effectively at consuming, transcribing and understanding voice conversations than leading foundation models.
It is built on the ensemble listening model architecture I have been describing. It efficiently consumes a broad array of conversations and achieves the best understanding with no wasted capacity.
That is what I am most excited about.
Amit:
That is great. Congratulations. You heard it first on the Build AI podcast.
Kushal, any last questions?
Kushal:
I think I am good for now. This was great, Carter and Amit. It was so much fun and I would love to keep talking.
Carter:
I had a wonderful time as well. Thank you so much for having me. I will talk tech for hours. Anytime you want to chat, just hit me up.
Amit:
Thank you so much Carter. We really enjoyed this conversation. With that, we will end this podcast. Thank you.



