How OpenAI delivers low-latency voice AI at scale

(openai.com)

71 points | by Sean-Der 1 hour ago

20 comments

Sean-Der 14 minutes ago
Very grateful that OpenAI published the article/publicized their usage of Pion[0] a library I work on. If you aren't familiar with WebRTC it's a super fun space. I work on a book WebRTC for the Curious [1] that details how it works.
[0] https://github.com/pion/webrtc
[1] https://webrtcforthecurious.com
[-]
- thatxliner 5 minutes ago
  slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult
legohead 15 minutes ago
The low latency is more of a pain point than a good thing, the way they have it implemented. Trying to have a casual conversation with it, as humans we naturally pause, and GPT will take this as you are "done" and start blabbing away.
I also suffer from finding the appropriate word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.
[-]
- cj 0 minutes ago
  [delayed]
- richardw 1 minute ago
  [delayed]
- zamadatix 2 minutes ago
  I think these are 2 very different meanings of "latency" in that the low latency they refer to here is strictly better, even for what you're talking about.
thimabi 1 hour ago
> Voice AI only feels natural if conversation moves at the speed of speech […] At OpenAI’s scale, that translates into three concrete requirements: Global reach for more than 900 million weekly active users
Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?
That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.
[-]
- stuartmemo 46 minutes ago
  Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.
  [-]
jonahs197 1 minute ago
Who cares? Their company is dying.
Aeroi 58 minutes ago
if anyone is looking to get into this. pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat
[-]
- pncnmnp 50 minutes ago
  I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.
  Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3
  [-]
  - AnthOlei 39 minutes ago
    What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency
    [-]
    - Sean-Der 11 minutes ago
      Check out [0]. You can do 'Voice AI' on small/cheap hardware. It's the most fun you can have in the space ATM :) It's been a while, but posted a demo here [1]
      [0] https://github.com/pipecat-ai/pipecat-esp32
      [1] https://www.youtube.com/watch?v=6f0sUEUuruw
- BoxedEmpathy 56 minutes ago
  I've been looking at this! Great project.
didibus 20 minutes ago
I wouldn't mind waiting longer for answers that would go through a better model with more thinking. As long as it has good support for interrupting and also it doesn't start answering as soon as I pause for 1 second and it's smart about knowing I'm done speaking.
qrush 7 minutes ago
Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?
charisma123 18 minutes ago
If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?
[-]
- Sean-Der 13 minutes ago
  It doesn't today, but you could with sometime like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.
  [0] https://github.com/pion/webrtc-zero-downtime-restart
CrzyLngPwd 10 minutes ago
It's bad enough having to speed-read the waffle of its written answers; even when told to be concise, the thought of having to listen to it waffle on in its smarmy, sycohpantic fashion makes me want to reach for the sick bag.
furyofantares 29 minutes ago
> Global reach for more than 900 million weekly active users
lol, definitely didn't need to know there's 900M weekly users for this post. I mean yeah, there's a lot of users and they serve globally, that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminded people that you're almost at a billion users does.
doctorpangloss 44 minutes ago
what i learned from making a webrtc+kubernetes game streaming product:
- openai is wrong. almost of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."
- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, are too immature.
- libwebrtc is the only game in town.
- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of latency issues they are discovering. a correct feature flag can instantly reduce latency for free, compared to pay for twilio network traversal style solutions
- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)
this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.
[-]
- Sean-Der 10 minutes ago
  Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?
  Even for clients you have things like libpeer that libwebrtc can't hit.
- jiggawatts 17 minutes ago
  Something I noticed is that companies that are vibe-coding their products miss out on the intelligence that (still) only humans can bring to bear. Just the knowledge cutoff alone puts AI at a serious disadvantage in any rapidly changing field.
flakiness 31 minutes ago
Should I or shouldn't I be glad to see zero mention on Codex.
[-]
- mock-possum 21 minutes ago
  Shouldn’t, I think - advanced voice is a surprisingly slick feature, and if you’re someone who feels that they can think and speak more naturally than when they think and type, AI voice transcription is kind of huge.
  [-]
  - gyanchawdhary 6 minutes ago
    100% .. as a product designer/developer, i use it heavily for early feature ideation .. i’ll do a loose, exploratory back and forth on a long walk .. then pass the transcript to claude to validate and turn into a spec ..
anzerarkin 1 hour ago
I hate the voice ai though, it's so much dumber
[-]
- NikolaNovak 54 minutes ago
  Fwiw - I found the advanced AI voice feature to be actually detrimental. It's good if you just want a single sentence answer. I've turned it off though when I want a more detailed, structured, considered answer.
  [-]
  - drusepth 48 minutes ago
    Interestingly, that kind of parallels the real world too: if you want a quick and high level answer, talk to someone in person; if you want something detailed and info-dense, get them to write it down.
rvz 6 minutes ago
OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something immature as TypeScript / Node or whatever.
Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.
AIorNot 1 hour ago
so is the answer
WebRTC + Kubernetes
Dorrell 52 minutes ago
[flagged]
testing_auth 11 minutes ago
[dead]
testing_auth 2 minutes ago
[dead]
cdrnsf 1 hour ago
It's missing the part where they explain how they obtained the training data for their voice AI.