MARS8-Flash
MARS8-Flash is an ultra-low-latency model designed for real-time conversational AI agents, with multilingual support.
Model details
Example usage
Quickstart
Here is a basic example payload:
```json
{
  "text": "The quick brown fox jumps over the lazy dog.",
  "language": "en-us",
  "output_duration": null,
  "reference_language": "en-us",
  "reference_audio": "https://github.com/Camb-ai/mars6-turbo/raw/refs/heads/master/assets/example.wav",
  "output_format": "flac"
}
```
For example, sending such a request in Python:

```python
import time

import httpx

url = "<YOUR PREDICTION ENDPOINT>"
headers = {"Authorization": "Api-Key <YOUR API KEY>"}

data = {
    "text": "The quick brown fox jumps over the lazy dog.",
    "language": "en-us",
    "output_duration": None,
    "reference_language": "en-us",
    "reference_audio": "https://github.com/Camb-ai/mars6-turbo/raw/refs/heads/master/assets/example.wav",
    "output_format": "flac"
}

prediction = []
st = time.time()  # request start time, including network delay
with httpx.stream("POST", url, headers=headers, json=data, timeout=300) as r:
    print(r.status_code, r.headers)
    dt = time.time()  # time the response started, i.e. excluding network delay
    for chunk in r.iter_bytes(4096 * 16):
        if chunk:
            # Each chunk is a bytes object containing the next piece of audio.
            # You can render the output piece by piece, or use an async receiver.
            print(f"Received chunk of size {len(chunk)} at {time.time() - st:.2f}s. w/o network delay: {time.time() - dt:.2f}s")
            prediction.append(chunk)
et = time.time()

full_audio_bytes = b"".join(prediction)
# You can now play back the full audio bytes, or save them as they are received.
# E.g. in a notebook:
# import IPython.display as ipd
# ipd.display(ipd.Audio(full_audio_bytes))
```

2. Payload parameters
2.1 Mandatory parameters
These parameters are strictly required to generate audio.

text (str): the text to generate.

language (str): the ISO language-locale code, e.g. "en-us", "nl-nl", "de-de", ... See the valid list at the end.

output_duration (float or null/None): a suggested output duration in seconds, if known. If specified, we try to make the generated audio close to the specified duration. If unspecified, the model infers an optimal output duration. This is useful if you intentionally want the output spoken at a faster or slower rate.
- Extreme/unreasonable values may lead to unpredictable outcomes.
- E.g. 2.154, None/null, 15.25, ...
- Note: does not work well with very long text inputs where the output duration would be >30s.

reference_audio (str): the reference audio to load, specified either as:
- a publicly accessible URL, e.g. "https://github.com/Camb-ai/mars6-turbo/raw/refs/heads/master/assets/example.wav", or
- a base64-encoded audio file in a supported format, e.g. "flac", "wav", "adts", "m4a", etc.
Note: re-using reference audio is recommended for speed. Sending the same reference audio string multiple times drastically speeds up inference, as the model caches references internally.

reference_language (str): the language of the reference audio, in the same format as language, e.g. "en-us".
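As a minimal sketch of the base64 option, the helper below assembles a payload from raw audio bytes read from a local file (the function name and file are illustrative, not part of the API):

```python
import base64


def build_payload(text: str, audio_bytes: bytes) -> dict:
    """Assemble a MARS8-Flash request payload with a base64-encoded reference."""
    reference_b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "text": text,
        "language": "en-us",
        "output_duration": None,
        "reference_audio": reference_b64,  # base64 string instead of a URL
        "reference_language": "en-us",     # must match the reference's language
        "output_format": "flac",
    }


# e.g. payload = build_payload("Hello there.", open("reference.wav", "rb").read())
payload = build_payload("Hello there.", b"RIFF...")  # placeholder bytes for illustration
```

Reusing the exact same encoded string across requests lets the internal reference cache kick in, as noted above.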
2.2 Optional parameters
Reference specification
The model requires a reference voice to define the tone, timbre, and style of the speech.
Parameter: reference_offset
Type: float
Default: 0.0
Description: Start time (in seconds) to begin reading the reference audio file, when the reference audio file is specified as a public URL.
Parameter: reference_duration
Type: float
Default: None
Description: How many seconds of the reference audio to use. If None, uses the whole file (or until end). Only applicable to public URLs.
Parameter: ref_loudness_target_db
Type: float
Default: -24.0
Description: Target loudness (decibels) to normalize the reference audio to before processing. In dB LUFS.
Parameter: apply_ref_loudness_norm
Type: bool
Default: False
Description: Whether to apply loudness normalization to the reference. Setting to True might increase stability, but at the cost of worse speaker similarity.
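The reference parameters above can be combined; a sketch, using a hypothetical URL, that trims the reference to seconds 2.0 to 12.0 of a longer recording and normalizes its loudness:

```python
# Illustrative values only; trimming applies to public-URL references.
payload = {
    "text": "A quick demonstration sentence.",
    "language": "en-us",
    "output_duration": None,
    "reference_audio": "https://example.com/long_recording.wav",  # hypothetical URL
    "reference_language": "en-us",
    "reference_offset": 2.0,         # start reading the reference at 2.0 s
    "reference_duration": 10.0,      # use 10 s of audio from that point
    "apply_ref_loudness_norm": True,
    "ref_loudness_target_db": -24.0,
}
```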
Output specification
Control the format and delivery of the generated file.
Parameter: output_format
Type: string
Default: "flac"
Description: Audio file format. Common options: "flac", "wav", "adts", or headerless media (one of "pcm_f32le", "pcm_f32be", "pcm_s16be", "pcm_s16le", "pcm_s32be", "pcm_s32le"). Headerless output is always 22050 Hz mono.
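Headerless formats contain raw samples with no container, so to play or save them with standard tools you need to wrap them yourself. A sketch for "pcm_s16le" using only the Python standard library:

```python
import io
import wave


def pcm_s16le_to_wav(pcm_bytes: bytes, sample_rate: int = 22050) -> bytes:
    """Wrap raw little-endian 16-bit mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # headerless output is mono
        w.setsampwidth(2)        # 16-bit samples = 2 bytes
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()


# e.g. one second of silence at 22050 Hz:
wav_bytes = pcm_s16le_to_wav(b"\x00\x00" * 22050)
```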
Inference and style settings
Advanced settings to tweak the performance, creativity, and stability of the voice generation.
Parameter: temperature
Type: float
Default: 1.0
Description: Controls randomness. Lower values make speech more stable/deterministic; higher values make it more varied. Recommended to keep at 1.0.
Parameter: cfg_weight
Type: float
Default: 4.2
Description: Guidance scale. Determines how strictly the model adheres to the style of the reference_audio and the text given. Lower values might be more expressive, but less stable.
Parameter: campp_speaker_nudge
Type: float
Default: 0.0
Description: Change the speaker to one of the default speakers by this fraction. This is useful if you do not want to clone the target speaker exactly, but just want to get vaguely close to the cloned voice. E.g. a value of 0.3 means that 30% of the voice identity will come from the default speakers, and 70% from the audio reference given. Aka speaker_similarity.
Parameter: acoustic_nudge
Type: tuple[float, float]
Default: (3.6, 3.5)
Description: The positive and negative classifier scores to use for acoustic quality. If your reference is noisy or abnormal (e.g. screaming, crying), or you notice other problems, setting this to have a higher differential may improve acoustic quality at the cost of worse speaker similarity. E.g. using [4.2, 1.5] will make the audio output sound 'more HD' at the cost of worse speaker similarity. This can help 'ground' unstable references.
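A sketch of a payload exercising these style knobs on a noisy reference, trading some speaker similarity for stability (the URL is hypothetical and the non-default values are illustrative, not recommendations for every reference):

```python
payload = {
    "text": "A quick demonstration sentence.",
    "language": "en-us",
    "reference_audio": "https://example.com/noisy_reference.wav",  # hypothetical URL
    "reference_language": "en-us",
    # Inference and style settings:
    "temperature": 1.0,            # keep at the recommended default
    "cfg_weight": 4.2,             # lower => more expressive, less stable
    "campp_speaker_nudge": 0.3,    # 30% default-speaker identity, 70% reference
    "acoustic_nudge": [4.2, 1.5],  # wider differential => cleaner sound, less similar
}
```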
Text processing specification
Helpers for handling specific words or text cleanup.
Parameter: apply_ner_nlp
Type: boolean
Default: True
Description: Enables Named Entity Recognition to better pronounce names, places, and acronyms. Depending on how hard the name is, this can increase runtime, sometimes by a lot. Recommended to set it to False and pass any hard names via the pronunciation dictionary instead.
Parameter: pronunciation_dictionary
Type: dict[str, str]
Default: {}
Description: A dictionary mapping words to specific IPA pronunciation to use (see guidelines below) (e.g., {"गजराज": "ɡədʒrˈɑːdʒ"}).
Parameter: apply_ref_mpsenet
Type: boolean
Default: False
Description: Applies a speech enhancement network to the reference audio before cloning (useful if reference is noisy). Can slightly increase inference time. Aka enhance_audio_reference_quality
Parameter: extended_number_verbalization_exp
Type: bool
Default: False
Description: Whether or not to apply experimental number normalization, where we try to correctly predict how to say certain numbers, e.g. $1,000.22, telephone numbers. It is experimental because it does not fully work on all languages in all scenarios.
3. Implementation notes
Supported languages
We support the below languages/locales in MARS8. These are the valid arguments for language and reference_language in the payload.
You are free to experiment with other languages, but they are not guaranteed to work.
en-us
en-in
zh-cn
fr-fr
de-de
ja-jp
ru-ru
ko-kr
es-es
es-mx
pt-pt
pt-br
ar-sa
id-id
hi-in
nl-nl
ar-xa
ar-sy
it-it
ar-eg
ta-in
te-in
bn-in
ar-ma
mr-in
kn-in
bn-bd
as-in
ml-in
fr-ca
pl-pl
tr-tr
pa-in
IPA pronunciation dictionary guidelines
When specifying custom IPA pronunciations, please follow the guidelines below for best results:
For the IPA transcription, do not include syllable breaks, but do maintain spaces between words.
As convention, any stress symbols (if present) should be placed before the stressed vowel, and not at the beginning of the stressed syllable. E.g. "Buddha" should become "/bˈʊdə/" and not "/ˈbʊdə/".
For names with prefixes/titles (e.g. "Dr.", "Mrs.", "Prof.", "Miss"), ensure their IPA pronunciation is included in the IPA transcription as well. E.g. "Miss Lebas" should become "/mˈɪs ləbˈa/".
Do not use ties in IPA. MARS8 does not support IPA tie symbols properly, and they can cause issues. E.g. for "Charlie", use tʃˈɑːrli instead of t͡ʃˈɑːrli.
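Putting the guidelines together, a sketch of a pronunciation_dictionary built from the examples above, with a quick check that no entry uses the unsupported tie character (U+0361):

```python
# Entries follow the guidelines: stress marks before the stressed vowel,
# spaces kept between words, titles transcribed, no syllable breaks, no ties.
pronunciation_dictionary = {
    "Buddha": "bˈʊdə",           # stress before the vowel, not the syllable
    "Miss Lebas": "mˈɪs ləbˈa",  # the title "Miss" is transcribed too
    "Charlie": "tʃˈɑːrli",       # "tʃ", never the tied "t͡ʃ"
}

# Sanity check: reject the IPA tie symbol (combining double inverted breve).
assert not any("\u0361" in ipa for ipa in pronunciation_dictionary.values())
```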
Tips for best outputs
MARS8 interprets punctuation from most languages, and they can be used to help control the pacing and style of the speech. E.g. including exclamations, commas, "...", and their equivalents in other languages can all help introduce pauses or specific styles/emotions.
The ideal reference duration is 8–20 seconds. Using an expressive noise-free reference will yield the best results.
MARS8 supports specific breath pauses with the special word <|breath|>. Wherever this word is included in the text, the generated output will have a breath pause at that location. E.g. "And then he said to her <|breath|> 'don't go'."
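A small sketch of assembling input text programmatically, combining punctuation-based pacing with an explicit breath pause (the sentence fragments are just the example above):

```python
BREATH = "<|breath|>"

fragments = [
    "And then he said to her",
    "'don't go'...",          # ellipsis introduces its own pause
    "but she was already gone!",
]
# Insert a breath pause before the quoted line only:
text = f"{fragments[0]} {BREATH} {fragments[1]} {fragments[2]}"
```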
Using the CambAI Python client:

```python
import base64

from camb.client import CambAI, save_stream_to_file

client = CambAI(
    tts_provider="baseten",
    provider_params={
        "api_key": "YOUR_BASETEN_API_KEY",
        "mars_pro_url": "https://model-xxxxxx.api.baseten.co/environments/production/predict"
    }
)

def main():
    response = client.text_to_speech.tts(
        text="Hello World and my dear friends",
        language="en-us",
        speech_model="mars-pro",
        request_options={
            "additional_body_parameters": {
                "reference_audio": base64.b64encode(open("audio.wav", "rb").read()).decode("utf-8"),
                "reference_language": "en-us"  # required
            },
            "timeout_in_seconds": 300
        }
    )
    save_stream_to_file(response, "tts_output.mp3")
    print("Success! Audio saved to tts_output.mp3")

if __name__ == "__main__":
    main()
```