
Orpheus is an open-source text-to-speech model released by Canopy Labs. In this post, we'll provide a quick overview of why WebSockets make sense for real-time text-to-speech and walk through a complete implementation example on Baseten that you can adopt in any application today.
Traditional HTTP request-based inference creates significant friction for streaming applications due to per-request connection overhead and latency. WebSockets avoid this by maintaining a persistent, bidirectional connection between the server and client, enabling seamless real-time communication. Rather than requiring new headers and connections for each request, WebSockets establish a single channel for continuous data exchange.
This makes WebSockets ideal for real-time applications such as voice-enabled customer support, language translation, or interactive voice assistants for accessibility; clients can stream text while simultaneously receiving audio chunks without interruption.
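To make that contrast concrete, here is a minimal sketch of the two patterns in aiohttp. The example.com endpoints are hypothetical and purely for illustration: the HTTP version pays a full request/response round trip per chunk, while the WebSocket version performs one handshake and then exchanges messages freely in both directions.
import asyncio
from aiohttp import ClientSession

async def http_per_request(chunks):
    # HTTP: every chunk pays for headers plus a full request/response cycle
    async with ClientSession() as sess:
        for chunk in chunks:
            async with sess.post("https://example.com/tts", json={"text": chunk}) as resp:
                audio = await resp.read()  # must block on each response in turn

async def websocket_persistent(chunks):
    # WebSocket: one handshake, then messages flow freely in both directions
    async with ClientSession() as sess:
        async with sess.ws_connect("wss://example.com/tts") as ws:
            for chunk in chunks:
                await ws.send_str(chunk)      # send without waiting for a reply
            audio = await ws.receive_bytes()  # collect audio whenever it arrives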
Let's step through a code example that streams text to the Baseten Orpheus endpoint and plays the returned audio aloud.
Step 1: Deploy Orpheus TTS
Baseten created a custom implementation of Orpheus with support for streaming over WebSockets. Start by deploying this implementation from the model library.
After the model appears in your account, save the Baseten model ID.

Then, create a Baseten API key from the API keys page if you don't already have one, and save it as an environment variable:
export BASETEN_API_KEY=my_api_key
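The script below loads environment variables with python-dotenv, so you can alternatively keep the key in a .env file next to the script instead of exporting it in your shell:
# .env (picked up automatically by load_dotenv())
BASETEN_API_KEY=my_api_key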
Step 2: Understand WebSocket streaming in practice
While Orpheus starts up on Baseten, let's walk through the code we'll run: it sends text to the endpoint and plays the audio we receive back.
We first need to install the packages required for connecting over WebSockets, playing audio, and loading environment variables. You can paste this into your terminal:
pip install pyaudio aiohttp python-dotenv
With the Baseten API key from Step 1 set as the BASETEN_API_KEY environment variable, paste the ID of your deployed Orpheus model into the MODEL_ID variable. We also set the parameters we want for streaming.
import os, asyncio
import pyaudio
from aiohttp import ClientSession, WSMsgType
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.getenv("BASETEN_API_KEY")
MODEL_ID = "BASETEN_MODEL_ID"  # paste your deployed model ID here
WS_URL = f"wss://model-{MODEL_ID}.api.baseten.co/environments/production/websocket"

VOICE = "tara"
MAX_TOKENS = 2000
BUFFER_SIZE = 5      # words per chunk
SAMPLE_RATE = 24000  # audio quality (24 kHz)
WIDTH = pyaudio.paInt16
CHANNELS = 1
Now for the main utility function, stream_tts, where we first open a speaker output stream with a pyaudio object. Then we connect to the WebSocket endpoint with the appropriate authentication headers and send the metadata we set above.
async def stream_tts(text: str):
    pa = pyaudio.PyAudio()
    stream = pa.open(format=WIDTH, channels=CHANNELS, rate=SAMPLE_RATE, output=True)

    headers = {"Authorization": f"Api-Key {API_KEY}"}
    print(f"🔗 Connecting to WebSocket: {WS_URL}")
    async with ClientSession(headers=headers) as sess:
        try:
            async with sess.ws_connect(WS_URL) as ws:
                print("✅ WS connected")
                # send metadata once
                await ws.send_json({
                    "voice": VOICE,
                    "max_tokens": MAX_TOKENS,
                    "buffer_size": BUFFER_SIZE,
                })
                print("📤 metadata sent")
We also define an asynchronous receiver function that continuously listens for audio data from the server. When the WebSocket receives audio bytes from the Orpheus endpoint, they are played immediately through your computer speaker by writing to the pyaudio stream we opened above.
Since the WebSocket is a bidirectional channel, we can send words over the same connection in a for loop; below, we send some sample words for Orpheus to turn into audio.
                ...
                # start audio receiver
                async def receiver():
                    async for msg in ws:
                        if msg.type == WSMsgType.BINARY:
                            print(f"⏯️ playing {len(msg.data)} bytes")
                            stream.write(msg.data)
                        elif msg.type in (WSMsgType.CLOSE, WSMsgType.CLOSED):
                            print("🔒 server closed")
                            return

                recv = asyncio.create_task(receiver())

                # send words
                for w in text.strip().split():
                    await ws.send_str(w)
                print("📤 words sent")

                # signal end-of-text
                await ws.send_str("__END__")
                print("📤 END sentinel sent — waiting for audio")

                # wait until server closes
                await recv
        except Exception as e:
            print(f"❌ Connection error: {e}")
    stream.stop_stream()
    stream.close()
    pa.terminate()
    print("🎉 done")

if __name__ == "__main__":
    sample = (
        "Nothing beside remains. Round the decay of that colossal wreck, "
        "boundless and bare, The lone and level sands stretch far away."
    )
    asyncio.run(stream_tts(sample))
Step 3: Run the script
Paste all the parts of the code into a single file, save it as call.py, and run python call.py in your terminal.
You will now hear the text dictated realistically through your computer speaker.
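If you'd rather capture the audio than play it live, a small variation is to write the received PCM bytes to a WAV file with Python's built-in wave module. Here's a minimal sketch, assuming the same 16-bit mono 24 kHz format the script configures (open_wav is a hypothetical helper that reuses the constants defined above):
import wave

def open_wav(path: str) -> wave.Wave_write:
    wf = wave.open(path, "wb")
    wf.setnchannels(CHANNELS)                        # mono
    wf.setsampwidth(pyaudio.get_sample_size(WIDTH))  # 2 bytes for paInt16
    wf.setframerate(SAMPLE_RATE)                     # 24 kHz
    return wf

# Inside receiver(), replace stream.write(msg.data) with wf.writeframes(msg.data),
# then close the file alongside the PyAudio cleanup with wf.close().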
To recap, we showed how to run real-time text-to-speech via WebSockets that concurrently send text data and receive audio data from a production-grade Orpheus endpoint on Baseten. We can't wait to see what you build.
Streaming text-to-speech in production
You can run real-time text-to-speech via WebSockets that concurrently send text data and receive audio data from our Orpheus implementation. With WebSockets, each user maintains a connection with the server for the entire duration of their session. In production, it's important to set appropriate autoscaling settings so that you can support more concurrent users as the number of active connections grows.
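To see what that load pattern looks like from the client side, here's a simple sketch (client-side only, not part of the Baseten API) that simulates a few concurrent users by running several stream_tts sessions at once, each holding its own WebSocket for the full session:
import asyncio

async def simulate_users(texts):
    # Each stream_tts call opens and holds one WebSocket until the session ends,
    # so N concurrent users means N simultaneous connections on the server.
    await asyncio.gather(*(stream_tts(t) for t in texts))

# asyncio.run(simulate_users(["First caller.", "Second caller.", "Third caller."]))
Note that all sessions in this sketch share one speaker, so playback will overlap; it illustrates the connection load, not listening quality.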
Whatever text you want to turn into speech, run it on Baseten with production-grade real-time text-to-speech inference with Orpheus on WebSockets without worry.