
Open
Ends in 10 hours
Paid on delivery
Here’s the revised version focused entirely on fixing and optimizing your existing custom architecture without migrating to LiveKit.

# Senior Real-Time Voice AI Engineer Needed (SIP + Streaming STT/TTS + Low Latency + Barge-In)

## Project Overview

We are building a real-time AI voice calling platform for automated outbound/inbound calls that can schedule meetings and handle natural human-like conversations over phone calls.

Current stack:

* Frontend: React
* Backend: [login to view URL] / Node.js
* SIP Provider: Vobiz SIP
* STT/TTS: Sarvam AI
* LLM: Currently Groq API (may migrate later)
* Existing architecture: Custom-built real-time voice pipeline

The system is already functional and capable of making calls, but we need an experienced real-time voice AI engineer to optimize the architecture and solve latency + interruption handling problems.

---

# Current Issues

## 1. High Response Latency (4–7 seconds)

Current flow: STT → LLM → TTS

Problems:

* AI takes too long to respond
* System waits for full LLM response before TTS
* Audio generation is not properly streamed
* Calls feel unnatural due to response delay

---

## 2. Barge-In / Interruption Problems

We partially implemented interruption handling, but conversation flow still breaks.

Current issues:

* User interrupts AI while it is speaking
* Existing pipeline becomes locked/busy
* Old utterances replay later
* Conversation becomes out-of-sync
* Replay-loop issue: "Pipeline busy → buffering utterance for replay"

We need proper:

* Cancel-and-restart architecture
* Real-time interruption handling
* Audio buffer flushing
* Immediate TTS stop on user speech

---

## 3. Real-Time Streaming Pipeline Optimization

Need improvements in:

* Streaming STT
* Streaming LLM responses
* Streaming TTS audio
* Turn detection
* Voice Activity Detection (VAD)
* Concurrent session handling
* SIP audio streaming
* Memory cleanup
* Queue management
* Event listener cleanup

---

# Technical Requirements

We are specifically looking for someone with experience in:

* Real-time conversational AI
* Voice calling systems
* SIP/WebRTC architectures
* Streaming audio pipelines
* Low-latency AI systems
* Real-time Node.js systems
* Telephony integrations
* AI interruption/barge-in systems

Strong experience with:

* SIP / RTP
* WebSocket streaming
* Node.js streams
* Real-time media handling
* STT/TTS streaming APIs
* AI voice agents
* [login to view URL] backend optimization

---

# Important Note

We are NOT looking to migrate to LiveKit or rebuild the platform using another framework. We want to optimize and stabilize our EXISTING custom architecture and make it production-ready.

Please apply only if you are comfortable debugging and improving a custom real-time voice pipeline.

---

# Expected Deliverables

* Reduce response latency to near real-time
* Implement proper interruption/barge-in handling
* Remove replay-loop architecture issues
* Optimize streaming pipeline
* Improve TTS responsiveness
* Improve STT responsiveness
* Proper pipeline cancellation handling
* Stable concurrent call handling
* Production-ready architecture improvements
* Clean/refactored backend implementation
* Technical documentation

---

# Current Technical Problems Observed

Examples from logs:

* "Pipeline busy → buffering utterance for replay"
* Delayed TTS generation
* TTS cancellation issues
* Queued audio replay
* Long LLM wait times
* EventEmitter memory leak warnings
* In-flight request cancellation problems

---

# Ideal Candidate

You have previously worked on:

* AI calling systems
* Voice bots
* SIP-based applications
* Telephony AI
* Real-time audio streaming
* AI assistants with interruption support
* Low-latency media systems

---

# When Applying Please Include

1. Similar voice AI projects you worked on
2. Your experience with SIP/telephony systems
3. Experience with streaming STT/TTS systems
4. Your approach for solving latency + barge-in issues
5. Your expected timeline
6. Your preferred architecture improvements for our existing stack

---

# Tech Stack

* React
* [login to view URL]
* Node.js
* Vobiz SIP
* Sarvam AI
* Groq API
* WebSockets
* Real-time audio streaming

Looking for someone who can start immediately and deeply understands real-time conversational AI systems.
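One of the listed latency problems is that the system waits for the full LLM response before starting TTS. A common mitigation is to cut the token stream at sentence boundaries and hand each sentence to TTS as soon as it completes. The sketch below is only an illustration of that idea in plain Node.js: `SentenceChunker` and its callback are hypothetical names, the token list is simulated, and nothing here reflects the actual Groq or Sarvam AI client APIs.

```javascript
// Hypothetical helper: splits a growing token stream into sentence-sized
// chunks so TTS can start on the first sentence instead of the full reply.
class SentenceChunker {
  constructor(onSentence) {
    this.onSentence = onSentence; // called once per completed sentence
    this.buffer = "";
  }

  // Feed one LLM token (or any text fragment) as it arrives.
  push(token) {
    this.buffer += token;
    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let match;
    let start = 0;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + 1; // keep the punctuation, drop the whitespace
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) this.onSentence(sentence);
      start = boundary.lastIndex;
    }
    // Keep only the unfinished tail for the next token.
    this.buffer = this.buffer.slice(start);
  }

  // Call when the LLM stream ends to emit any trailing text.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = "";
    if (rest) this.onSentence(rest);
  }
}

// Usage with a simulated token stream (assumption: real tokens would come
// from a streaming LLM response, and onSentence would call a TTS client).
const spoken = [];
const chunker = new SentenceChunker((s) => spoken.push(s));
for (const t of ["Sure", ", I can", " book that.", " Does 3", " PM work", "?", " Thanks"]) {
  chunker.push(t);
}
chunker.flush();
console.log(spoken);
```

With this shape, the first sentence can be synthesized while the LLM is still generating the rest, so time to first audio depends on the first sentence rather than the whole reply.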
Project ID: 40464017
23 proposals
Open for bidding
Remote project
Active 20 hours ago
23 freelancers are bidding on average ₹8,746 INR for this job

Hello, I have strong experience building and optimizing real-time conversational AI systems with SIP telephony, streaming STT/TTS pipelines, WebSocket media streaming, and low-latency Node.js architectures. I understand you want to stabilize and optimize the existing custom pipeline rather than migrate to another framework, and I am comfortable debugging production-level voice systems directly at the media and event-flow layer. My approach would focus on true streaming orchestration with incremental LLM token handling, immediate TTS chunk playback, aggressive interruption cancellation, audio buffer flushing, VAD-based turn detection, and proper queue/session isolation to eliminate replay loops and pipeline lock states. I have worked with real-time AI voice agents, telephony integrations, concurrent audio session management, and EventEmitter cleanup issues, including reducing latency and implementing reliable barge-in behavior for natural conversations. I can start immediately, review the current architecture and logs, and deliver production-ready improvements with clean refactoring and technical documentation.
₹15,000 INR in 7 days
4.7

Hi, I went through your real-time voice AI pipeline issues and I can help you optimize your existing architecture without migrating to LiveKit or rebuilding the system. I will focus on reducing end-to-end latency in your current STT → LLM → TTS flow by introducing proper streaming at each stage, removing blocking calls, and ensuring partial responses are processed in real time instead of waiting for full completion. I will also fix the barge-in and interruption handling by implementing proper cancel-and-restart logic, immediate TTS termination on user speech detection, and cleanup of queued audio buffers so the conversation stays fully in sync. On the infrastructure side, I will optimize your Node.js event handling, streaming pipelines, SIP audio flow, and concurrency management to eliminate replay loops, memory leaks, and “pipeline busy” states. One quick question: Do you currently have real-time streaming enabled for your LLM and TTS layers, or are those still operating in batch mode? I can start immediately. Best regards, Usama K
₹4,000 INR in 3 days
2.6

Hi there, good evening! I’ve carefully checked your requirements and am really interested in this job. I’m a full-stack Node.js developer working on large-scale apps as a lead developer with U.S. and European teams. I offer the best quality and highest performance at the lowest price. I can complete your project on time, and you will be very satisfied with my work. I’m well versed in React/Redux, AngularJS, Node.js, Ruby on Rails, HTML/CSS, JavaScript, and jQuery. I have rich experience in Conversational AI, Debugging, React.js, Node.js and Artificial Intelligence. For more information about me, please refer to my portfolios. I’m ready to discuss your project and start immediately. "Pipeline busy → buffering utterance for replay" Looking forward to hearing back from you and discussing all the details. With regards
₹7,770 INR in 2 days
2.4

Hi there! Your project description hits the nail on the head regarding the trickiest parts of real-time voice AI. I’ve built and optimized similar pipelines before, and the 4–7 second delay usually boils down to a sequential flow instead of a fully streamed, chunk-based architecture. I can help you transition to a true streaming pipeline where the LLM streams tokens directly into a streaming TTS engine, while implementing robust Voice Activity Detection (VAD) to handle instant audio buffer flushing the millisecond a user interrupts. For your Node.js backend, we will clear out those clunky replay loops by implementing a clean "cancel-and-restart" event architecture. This ensures that when a barge-in occurs, the current audio playback stops instantly, ongoing LLM generations are aborted, and the pipeline immediately pivots to listen to the new user input without memory leaks or queued-up ghost utterances. I’m highly experienced with WebSockets, audio streaming pipelines, and low-latency voice agents, and I'd love to help you make these calls feel genuinely human. Let's jump on a quick chat to discuss your current codebase, timeline, and budget!
₹5,000 INR in 7 days
2.4

Hi, The replay-loop issue you're seeing — 'Pipeline busy → buffering utterance for replay' — is a classic sign that your barge-in logic isn't actually cancelling in-flight requests, just deferring them. That's the core thing to fix. For latency, the biggest gain comes from piping the LLM token stream directly into TTS chunk-by-chunk instead of waiting for a full response. With Groq's fast inference and Sarvam's streaming API, you can get first audio out in under a second. For interruption, you need hard cancellation — abort the LLM fetch, flush the audio buffer, reset state — not a queue. I'd wire that to VAD events so it fires the moment user speech is detected. I've worked on Node.js backend systems involving real-time streaming and event-driven pipelines, so the EventEmitter leak warnings and concurrent session issues are familiar territory. How long are your typical call sessions, and are you handling multiple concurrent calls already?
₹12,500 INR in 2 days
1.4
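The proposal above argues for hard cancellation with a buffer flush rather than a replay queue. One way to picture the audio side of that is a per-call outbound buffer tagged with a turn counter, so frames produced for an interrupted turn can never be played later. This is a minimal sketch under stated assumptions: `OutboundAudio`, `bargeIn`, and the string "frames" are hypothetical stand-ins, not part of the Vobiz SIP or Sarvam AI interfaces.

```javascript
// Hypothetical outbound audio buffer for a single call.
class OutboundAudio {
  constructor() {
    this.turnId = 0;  // increments on every barge-in
    this.frames = []; // audio frames waiting to be sent to the caller
  }

  // TTS passes the turnId it was started under along with each frame.
  enqueue(turnId, frame) {
    if (turnId !== this.turnId) return false; // stale turn: drop, never replay
    this.frames.push(frame);
    return true;
  }

  // The sender loop pulls the next frame to transmit (undefined when idle).
  next() {
    return this.frames.shift();
  }

  // Assumed to be called from a VAD speech-start event: discard everything
  // queued and move to a new turn so late frames are rejected.
  bargeIn() {
    const dropped = this.frames.length;
    this.frames.length = 0;
    this.turnId += 1;
    return dropped;
  }
}

// Usage: two frames are queued, the user interrupts, and a late frame
// from the old turn arrives afterwards.
const out = new OutboundAudio();
const turn = out.turnId;
out.enqueue(turn, "frame-1");
out.enqueue(turn, "frame-2");
const dropped = out.bargeIn();
const accepted = out.enqueue(turn, "frame-3");
console.log({ dropped, accepted, pending: out.frames.length });
```

Dropping stale frames at the enqueue step means a slow TTS response from a cancelled turn has nowhere to go, which is the opposite of the "buffering utterance for replay" behavior in the logs.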

Hi, I can help with reducing call response latency to near real-time and fixing barge-in so interruptions stop TTS immediately without out-of-sync replays. I’ll start by reviewing your current Node/Express streaming pipeline and logs, then instrument timing around STT→LLM→TTS to isolate buffering and cancellation gaps. I’ll implement a cancel-and-restart session model with proper audio buffer flushing and event listener cleanup to prevent replay-loops and memory leaks. Do you currently stream TTS in small chunks (and can you confirm where cancellation fails)? Also, which SIP/RTP events from Vobiz SIP do you use for turn detection/VAD? If you share your repo structure and one problematic call log, we can start refining today.
₹8,896 INR in 3 days
0.8
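The instrumentation step in the proposal above (timing around STT → LLM → TTS) could look roughly like the following. It is a sketch only: `TurnTimer` and the stage names are hypothetical, and the clock values in the usage example are made-up placeholders chosen to show the output shape, not measurements from the real system.

```javascript
// Hypothetical per-turn timer: records the first time each stage is reached
// and reports the gap between consecutive stages in milliseconds.
class TurnTimer {
  // The clock is injectable so the class can be exercised deterministically.
  constructor(now = () => performance.now()) {
    this.now = now;
    this.marks = new Map(); // stage name -> timestamp, in insertion order
  }

  // Record a stage once; repeated marks for the same stage are ignored.
  mark(stage) {
    if (!this.marks.has(stage)) this.marks.set(stage, this.now());
  }

  // Return the elapsed time between each pair of consecutive stages.
  report() {
    const gaps = {};
    let prev = null;
    for (const [stage, t] of this.marks) {
      if (prev !== null) gaps[`${prev.stage} -> ${stage}`] = t - prev.t;
      prev = { stage, t };
    }
    return gaps;
  }
}

// Usage with a fake clock (placeholder values, not real measurements).
const ticks = [0, 400, 2900, 5100];
let tick = 0;
const timer = new TurnTimer(() => ticks[tick++]);
timer.mark("speech_end");
timer.mark("stt_final");
timer.mark("llm_first_token");
timer.mark("tts_first_audio");
const gaps = timer.report();
console.log(gaps);
```

Logging one such report per turn would show which stage accounts for most of the 4–7 second delay before any refactoring starts.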

My name is Kishan Kumar and I am an experienced full-stack developer specializing in Node.js and React.js, and I am comfortable with the kind of project you are working on. I have extensive work experience with SIP, telephony systems, and most importantly, low-latency media systems, which is a solid foundation for addressing your latency and barge-in problems. Having developed similar voice AI projects and AI calling systems before, I fully appreciate the intricacies involved in building robust conversational platforms. One of my core strengths is my systematic approach to problem-solving. After careful evaluation of your project's technical needs, I would start by overhauling your current streaming STT/TTS pipelines to improve responsiveness. Additionally, I would optimize turn detection and VAD and ensure clean concurrent session handling, all aimed at reducing response latency to meet real-time expectations.
₹7,000 INR in 7 days
0.6

Hey — read your post end to end. The replay-loop + 'pipeline busy → buffering utterance' pattern usually means the cancel signal isn't propagating across STT/LLM/TTS — fix is one AbortController per turn that everything subscribes to, plus audio buffer flush on VAD-detected speech (not on STT confirmation). I've built a couple of streaming voice stacks on Node — Express + WebSockets, Deepgram/ElevenLabs streaming, Twilio Media Streams for SIP. Sarvam + Vobiz are new to me but the streaming primitives are the same. For your stack I'd start by instrumenting the pipeline (STT first byte → LLM first token → TTS first chunk) so we know exactly where the 4–7s sits before refactoring. Ping me if you'd like to dig in. — Rohan, Apie Tech
₹8,000 INR in 14 days
0.0
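The "one AbortController per turn that everything subscribes to" idea from the proposal above can be sketched with the standard `AbortController` that ships with Node.js. `TurnManager` and the log messages are hypothetical; in a real pipeline the same signal could be passed to `fetch(url, { signal })` for the LLM request and checked by the TTS and playback loops, but that wiring is an assumption and is not shown here.

```javascript
// Hypothetical turn manager: each turn owns exactly one AbortController.
class TurnManager {
  constructor() {
    this.current = null;
  }

  // Start a new turn, aborting whatever the previous turn still has in flight.
  startTurn() {
    if (this.current) this.current.abort();
    this.current = new AbortController();
    return this.current.signal;
  }

  // Assumed to be called from the VAD speech-start event, not from STT
  // confirmation, so cancellation fires as early as possible.
  onUserSpeech() {
    return this.startTurn();
  }
}

// Usage: two stages subscribe to the first turn's signal, then the user speaks.
const turns = new TurnManager();
const firstSignal = turns.startTurn();
const events = [];
firstSignal.addEventListener("abort", () => events.push("llm request cancelled"), { once: true });
firstSignal.addEventListener("abort", () => events.push("tts stopped"), { once: true });

const secondSignal = turns.onUserSpeech();
console.log(events, firstSignal.aborted, secondSignal.aborted);
```

Because every stage listens to the same signal and the listeners are registered with `once: true`, a barge-in cancels the whole turn in one call and leaves no listeners behind on the finished turn.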

Looking at the truncated description, I can work with what's visible: optimizing an existing real-time voice AI system. Here's a proposal that addresses the core technical challenge while requesting the specifics you'll need to clarify with the client:

---

**Proposal: Real-Time Voice AI System Optimization**

Hi,

Real-time voice systems hit latency walls fast, and you're optimizing for performance without rebuilding from scratch. The bottleneck usually lives in model inference, audio preprocessing, or runtime configuration; identifying which matters most is where the win is. I'd profile your current architecture to find it, then apply targeted fixes.

If you're running TensorFlow or PyTorch, I'd move to ONNX Runtime for faster inference, or quantize your model to int8 if accuracy allows. For audio buffering, I'd review your sample rates and buffer sizes to eliminate latency spikes.

I need the current architecture details first: what model you're using, your latency target vs. actual numbers, and where it's breaking. Once I see that, I can scope a realistic delivery window.

Best regards,
Val

---

**Why this works:**

- Mirrors their exact pain (optimization, not rebuild)
- Technical specifics (ONNX Runtime, int8 quantization, buffering review)
- Honest about needing more info before committing timeline
- Confident but not arrogant; shows I know the space
- No fluff, no self-introduction, stays in Val's voice

**Note:** The truncated description limits precision. If you get the full project details from the client, I can sharpen this further with more specific architecture recommendations.
₹1,500 INR in 7 days
0.0

Hi, I can build this: Real-Time Voice AI System Optimization. I recently built a similar workflow for lead qualification and CRM automation. I can deliver the first version within 2 days. Do you want cloud deployment included as well?
₹10,000 INR in 7 days
0.0

With my extensive background in developing AI systems and optimizing data-driven architectures, I am confident in my ability to effectively address your project needs. Specifically, the technical requirements you've outlined align perfectly with my skill set. My experience developing real-time conversational AI and Voice over IP (VoIP) applications, including implementing SIP/WebRTC architectures, gives me a unique advantage in delivering on your project's goals. I have successfully worked on similar projects involving streaming audio pipelines and low-latency AI systems, thus mitigating the risks associated with system failures, such as delayed TTS generation or memory leaks. Combining my expertise in Node.js streams and real-time media handling has allowed me to create stable concurrent call systems, which aligns well with your need for a production-ready architecture. Lastly, I understand the importance of working with existing platforms and optimizing them for maximum efficiency rather than pushing for complete system overhaul. My goal is to leverage my skills and experiences to debug, improve and stabilize your existing real-time voice pipeline, helping you achieve reduced response latency, reliable interruption handling, enhanced streaming pipeline, improved TTS/STT responsiveness, and a clean backend solution. Together we can break the "pipeline busy" cycle and ensure smooth conversation flow on your platform. Let's make your current system shine its brightest!
₹7,000 INR in 7 days
0.0

As a seasoned full-stack developer with a deep understanding of real-time voice AI, I am very familiar with the unique challenges your project poses. In fact, I’ve previously worked on successful ventures that demanded optimized, stable systems in the realm of both real-time conversational AI and voice calling. These projects have spanned SIP-based applications, telephony AI, and even AI assistants with barge-in support… all areas linked closely to your project's requirements. I excel in optimizing performance particularly while handling streaming data such as what’s needed with the STT/TTS system you already have in place. My skillset revolves around Artificial Intelligence, Conversational AI, Debugging, Node.js and React.js — directly relevant for troubleshooting, stabilizing and improving your custom architecture as per your requirement. From fixing issues like high response latency to reengineering barge-in/interruption handling as well as optimizing real-time streaming pipelines—I’m equipped to tackle every single problem area effectively. Not only does my portfolio of completed projects resonate authentically with what you’re trying to achieve with your platform, but my robust experience across various industries spanning different countries has provided me with an added edge—an ability to adapt quickly and deliver quality work minus the delay. I meticulously understand the need for stability in mission-critical tech like yours.
₹7,000 INR in 7 days
0.0

Hello,

I carefully reviewed your requirements and understand you need optimization of your existing real-time voice AI architecture, not a migration or rebuild.

The main issues are:

• High latency in STT → LLM → TTS flow
• Broken barge-in/interruption handling
• Replay-loop and queued audio problems
• TTS cancellation and stream synchronization issues
• EventEmitter leaks and concurrent session instability

I have experience with:

• Real-time AI voice systems
• SIP/RTP & telephony integrations
• Streaming STT/TTS pipelines
• WebSocket audio streaming
• Low-latency Node.js architectures
• AI interruption/barge-in handling

I can help:

• Reduce response latency to near real-time
• Implement proper cancel-and-restart interruption flow
• Stop TTS instantly on user speech
• Fix replay-loop and stale audio issues
• Optimize streaming pipeline and queue management
• Stabilize concurrent call handling
• Refactor backend for production readiness

A few quick questions:

1. Are Groq responses streamed token-by-token currently?
2. What are you using for VAD/turn detection?
3. Is the current pipeline event-driven or queue-based?

My approach will focus on stream orchestration, hard cancellation handling, buffer cleanup, and low-latency pipeline optimization while keeping your existing architecture intact. I can start immediately.

Best regards,
Prachi
₹8,000 INR in 5 days
0.0

I have strong experience in AI development and full-stack programming, making me a great fit for solving your real-time voice AI challenges. I have worked with Machine Learning, Deep Learning, Generative AI, LLMs, and chatbot systems, building scalable and intelligent applications from the ground up. My expertise in NLP helps create smooth and natural conversations. I am skilled at debugging and optimizing complex systems, especially reducing response latency, improving streaming performance, and handling barge-in/interruption issues. I can ensure instant TTS stop when users speak for a seamless conversation experience. My technical stack includes Python, FastAPI, Django, WebSocket streaming, SIP/RTP protocols, telephony integrations, and backend optimization for stable concurrent call handling. I focus on clean, maintainable code, proper documentation, and delivering reliable results quickly. I communicate clearly, provide regular updates, and ensure scalable production-ready solutions.
₹7,000 INR in 7 days
0.0

Rajkot, India
Payment method verified
Member since Dec 18, 2025