Problem with transferring language to the TTS server. #46

@mitrokun

I faced an interesting challenge when creating a server with a new model - in addition to the voice, I needed to transmit the language of the pipeline.

Since the built-in HA client doesn't implement this and there's no reference implementation available, I had to experiment. I used my custom failover TTS proxy as a basis. I managed to extract the language from the pipeline and carry it through to the event creation stage. But then I ran into problems (either with my implementation or with the library).

I'm attaching an AI summary that will highlight the problem areas. I hope you can explain why I wasn't able to use the built-in function.


Technical Analysis: Wyoming Library Serialization/Deserialization Inconsistency

Context:
This analysis involves communication between a Custom Wyoming Client Implementation and a Wyoming Server (wyoming-supertonic).

Problem Summary:
A critical failure occurs in the transmission of nested data structures (specifically the language parameter within the voice object) when relying on the standard wyoming library abstractions. The library fails to correctly serialize multi-part messages as defined by the protocol, resulting in data loss.

1. Client-Side: The Custom Stream Processor

The "Standard" Implementation (Relying on Library Abstractions):
Initially, the custom client was implemented strictly adhering to the library's class structures. We relied on the library to handle the serialization of complex nested objects.

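# INITIAL IMPLEMENTATION (Failed)
# We instantiate the library's data classes, expecting them to be serialized correctly.
from wyoming.event import async_write_event
from wyoming.tts import SynthesizeStart, SynthesizeVoice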

voice_obj = SynthesizeVoice(name=server_info["voice"], language=language)
start_request = SynthesizeStart(voice=voice_obj)

# We pass the high-level object to the writer.
# EXPECTATION: The library generates a valid multi-part message containing the language.
# REALITY: The library generates a malformed message sequence where the 'language' attribute is lost during serialization.
await async_write_event(start_request.event(), writer)

The "Robust" Implementation (Workaround):
To resolve this, we bypassed the library's object serialization logic. Instead of using SynthesizeStart and SynthesizeVoice classes, we manually construct the data dictionary and wrap it in a raw Event object. This forces the library to send a single-part JSON message, effectively bypassing the multi-part handling bug.

# FIXED IMPLEMENTATION (Workaround)
# We manually construct the dictionary to ensure all data is present.
from wyoming.event import Event, async_write_event

voice_data = {}
if server_info.get("voice"):
    voice_data["name"] = server_info["voice"]
if language:
    voice_data["language"] = language

# We instantiate a raw Event object directly.
# This guarantees the data payload is embedded in the primary JSON object, avoiding multi-part split errors.
start_event = Event(type="synthesize-start", data={"voice": voice_data})

await async_write_event(start_event, writer)

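After digging into the code, it seems the actual cause is the strict branching in SynthesizeVoice.to_dict: language is written only when name is not set, so a voice that carries both a name and a language loses the language.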

def to_dict(self) -> Dict[str, str]:
    if self.name is not None:
        voice = {"name": self.name}
        if self.speaker is not None:
            voice["speaker"] = self.speaker
    elif self.language is not None:
        voice = {"language": self.language}
    else:
        voice = {}

    return voice
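
To make the effect of that branching concrete, here is a minimal sketch that does not import the library. VoiceSketch is a hypothetical stand-in that copies the to_dict logic quoted above, and to_dict_all_fields is only one possible alternative I'm assuming for comparison, not something the library provides.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VoiceSketch:
    """Hypothetical stand-in for the voice class; not the library's own code."""

    name: Optional[str] = None
    language: Optional[str] = None
    speaker: Optional[str] = None

    def to_dict(self) -> Dict[str, str]:
        # Same branching as the method quoted above:
        # language is only reached when name is None.
        if self.name is not None:
            voice = {"name": self.name}
            if self.speaker is not None:
                voice["speaker"] = self.speaker
        elif self.language is not None:
            voice = {"language": self.language}
        else:
            voice = {}
        return voice

    def to_dict_all_fields(self) -> Dict[str, str]:
        # Assumed alternative: emit every field that is set.
        fields = {
            "name": self.name,
            "language": self.language,
            "speaker": self.speaker,
        }
        return {key: value for key, value in fields.items() if value is not None}


both = VoiceSketch(name="example-voice", language="en")
lossy = both.to_dict()                # language is dropped here
complete = both.to_dict_all_fields()  # name and language both survive
```

With both name and language set, the first method returns only the name, which matches what I saw on the wire.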

The from_dict method has a similar restriction, so I had to extract the data manually.
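
For reference, the manual extraction can be as small as the sketch below. It assumes the event data has the same shape as in the workaround above ({"voice": {...}}); extract_voice_fields is a hypothetical helper name, not part of the library.

```python
from typing import Any, Dict, Optional, Tuple


def extract_voice_fields(
    event_data: Optional[Dict[str, Any]],
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Read name, language and speaker straight from the raw event data.

    Each field is read independently, so a language sent together
    with a name is not discarded.
    """
    voice = (event_data or {}).get("voice") or {}
    return voice.get("name"), voice.get("language"), voice.get("speaker")


name, language, speaker = extract_voice_fields(
    {"voice": {"name": "example-voice", "language": "en"}}
)
```

Missing or empty data simply yields None for every field, so the caller can fall back to its own defaults.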
