Text to Speech for HoloLens

I briefly mentioned the HoloToolkit for Unity before in my article Fragments found my couch, but how? The toolkit is an amazing bit of open source code from Microsoft that includes a large set of MonoBehaviors that make it easier to build HoloLens apps in Unity.

The toolkit makes it easy to support speech commands through the use of KeywordManager (or if you’d like more direct control you can use the SDK class KeywordRecognizer directly). But one of the features I found glaringly missing from the docs and the toolkit was the ability to generate speech using Text to Speech.

The Universal Windows Platform (UWP) offers the ability to generate speech using the SpeechSynthesizer class, but the main problem with using that class in Unity is that it doesn’t interface with the audio device directly. Instead, when you call SynthesizeTextToStreamAsync the class returns a SpeechSynthesisStream. In a XAML application you would then turn around and play that stream using a MediaElement, but in a pure Unity (non-xaml) app we don’t have access to MediaElement.

I asked around on the Unity forums and Tautvydas Zilys was kind enough to point me in the right direction. He suggested I somehow convert the SpeechSynthesisStream to a float[] array that could then be loaded into an AudioClip. This worked, it was just harder than I expected.

 

Technical Details

I’ll explain how the work was done here for those who would really like to dig into the technical interop between Unity and the UWP. If you’re only interested in using Text to Speech in your HoloLens application, feel free to jump down to the section Using TextToSpeechManager.

 

Audio Formats

The data in the stream returned by SpeechSynthesizer basically contains an entire .WAV file, header and all. But in order to dynamically load audio at runtime into Unity, this data needs to be converted into an array of floats that range from -1.0f to +1.0f.  Luckily I found an excellent sample provided by Jeff Kesselman in the Unity forums that showed how to load a WAV file from disk into the float[] that Unity needs. I removed the ability to load from disk and focused on the ability to load from a byte[]. Then I added a method called ToClip that would convert the WAV audio into a Unity AudioClip. That code became my Wav class, which you can find on GitHub here.

The main code I added to convert the WAV data to an AudioClip is:

public AudioClip ToClip(string name)
{
 // Create the audio clip
 var clip = AudioClip.Create(name, SampleCount, 1, Frequency, false); // TODO: Support stereo

 // Set the data
 clip.SetData(LeftChannel, 0);

 // Done
 return clip;
}

Stream to Bytes to WAV

Now that I had the ability to load WAV data and convert it to an AudioClip, now I needed a way to convert the SpeechSynthesisStream to a byte[]. This wasn’t overly complex but it did involve some intermediary classes.

First, you need to call GetInputStreamAt to access the input side of the SpeechSynthesisStream. That gives you an IInputStream which can be passed to a DataReader. From there, finally, we can call ReadBytes to read the contents of the stream into a byte[].

// Get the size of the original stream
var size = speechStream.Size;

// Create buffer
byte[] buffer = new byte[(int)size];

// Get input stream and the size of the original stream
using (var inputStream = speechStream.GetInputStreamAt(0))
{
 // Close the original speech stream to free up memory
 speechStream.Dispose();

 // Create a new data reader off the input stream
 using (var dataReader = new DataReader(inputStream))
 {
 // Load all bytes into the reader
 await dataReader.LoadAsync((uint)size);

 // Copy from reader into buffer
 dataReader.ReadBytes(buffer);
 }
}

// Load buffer as a WAV file
var wav = new Wav(buffer);

 

Threading and Async

The final hurtle was the biggest and it was due to the differences between the UWP threading model and Unity’s threading model. Many UWP methods follow the new async and await pattern which is extremely helpful to developers when building multi-threaded applications, but isn’t supported by Unity. In fact, Unity is very much a single-threaded engine and it uses tricks like coroutines to try and simulate support for multi-threading. This is where things get a little hairy.

TextToSpeechManager has a method called Speak, but Speak cannot be marked async since Unity doesn’t understand async.  Inside this method, however, we need to call other UWP methods that need to be awaited. Methods like SpeechSynthesizer.SynthesizeTextToStreamAsync.

The way we work around this is to have the Speak method return void, but inside the Speak method we spawn a new Task with an inline async anonymous function. That sounds really complicated, but the words in that sentence are probably more complicated to type than the actual code. Here’s what it looks like:

// Need await, so most of this will be run as a new Task in its own thread.
// This is good since it frees up Unity to keep running anyway.
Task.Run(async ()=>
{
 // We can do stuff here that needs to be awaited
}

As the comments above call out, the good news here is that anything running inside that block will run on it’s own thread which leaves the Unity thread completely free to continue rendering frames (a critically important task for any MR/AR/VR app).

The last “gotcha” comes when the speech data has been converted and it’s time to hand it back to Unity. Remember, Unity is single-threaded only and Unity classes must be created on Unity’s thread. This is true even of classes that don’t do any rendering, like AudioClip. If we try to create the AudioClip inside that Task.Run code block above, we will be creating it in a Threadpool thread created by UWP. This will fail with an error message in the Unity Error window explaining that an attempt was made to create an object outside the Unity thread. But we can’t put the code outside of the Task.Run block either because anything outside of the Task.Run block will run in parallel with code inside the block! We would be trying to create the AudioClip before speech was even done generating!

So we need to be able to run something on Unity’s thread, but we need to be able to do it inside the Task.Run block, and that can be accomplished by an undocumented helper method called UnityEngine.WSA.Application.InvokeOnAppThread. That helper method takes another anonymous block of code and will schedule that to run back on Unity’s main thread. It looks like this:

// The remainder must be done back on Unity's main thread
UnityEngine.WSA.Application.InvokeOnAppThread(()=>
{
 // Convert to an audio clip
 var clip = wav.ToClip("Speech");

 // Set the source on the audio clip
 audioSource.clip = clip;

 // Play audio
 audioSource.Play();
}, false);

 

Using TextToSpeechManager

If all that code above sounds confusing, don’t worry! We’ve made it super simple for you. Right now the code is only available in my github fork of the toolkit but I’ve already submitted a pull request to have it added to the main branch. You can either use my fork or you can just download the two files (Wav.cs and TextToSpeechManager.cs) until the request has been accepted.

Once you’ve got the two scripts, here’s what you need to do in order to get speech working:

  1. Create an AudioSource somewhere in the scene. I recommend creating it on an empty GameObject that is a child of the Main Camera and is positioned about 0.6 units above the camera.Voice Hierarchy
    Voice SceneWherever you place this AudioSource is where the sound of the voice will come from. Placing it 0.6 units above the camera and having it be a child of the camera means you’ll always hear the voice and the voice seems to come from a similar location as it does when hearing Cortana in the OS.
  2. Recommended: Configure the AudioSource to have a Spatial Blend of full 3D.

    Speech Frame InspectorThis will allow it to be positioned accurately in 3D space.

  3. Add a TextToSpeechManager component anywhere in your scene and link (drag and drop) the audio source you created in the last step to link it with the manager.TTS Manager Inspector
  4.  Anywhere in code, call TextToSpeechManager.Instance.Speak(“Your Text Here!”);

 

Known Issues

Here are the current known issues:

  • As mentioned above, currently this code only exists in my fork so you’ll need to get it from there or wait till the pull request is completed.
  • TextToSpeechManager currently doesn’t support SSML. I just didn’t think to add it. Once they accept my first pull request I’ll perform an update and add a SpeakSsml method to the manager.
  • Big one: Speech is not supported in the editor! Sorry, text to speech is a UWP API and therefore does not work in the Unity editor. You’ll see warnings in the debug window along with the text that would have been spoken.

You May Also Like

12 thoughts on “Text to Speech for HoloLens

    1. The .Instance property was from when I was originally treating TextToSpeech as a singleton. Before adding it to the toolkit we decided NOT to treat it as a singleton because you may want multiple subjects speaking and with different voices. The TextToSpeech component is now part of the Mixed Reality Toolkit. You can find it here.

  1. Hi Jared, great- I was looking for just this. However, the links to the files TextToSpeechManager.cs and Wav.cs are both yielding a 404 error. Also the same for the github toolkit. Could you pls send me the files to my email id or give me another link where these exist? Thanks!

  2. Is it still true that Speech is not supported in the editor? Also, TextToSpeechManager does not exist anymore, it’s just TextToSpeech. Do we need to include any header to use this? Thanks

    1. TextToSpeech is now part of the Mixed Reality Toolkit here. It still does not work within the Unity Editor because it requires an API that is not supported in Unity at design time. Instead you’ll see a message in the log window that reports what would have been said.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.