# Text to speech - Wio Terminal

In this part of the lesson, you will convert text to speech to provide spoken feedback.

## Text to speech

The speech services SDK that you used in the last lesson to convert speech to text can be used to convert text back to speech.

### Get a list of voices
When requesting speech, you need to provide the voice to use, as speech can be generated using a variety of different voices. Each language supports a range of different voices, and you can get the list of supported voices for each language from the speech services SDK. The limitations of microcontrollers come into play here - the response from the call to list the voices supported by the text to speech service is a JSON document of over 77KB in size, far too large to be processed by the Wio Terminal. At the time of writing, the full list contains 215 voices, each defined by a JSON document like the following:
```json
{
    "Name": "Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)",
    "DisplayName": "Aria",
    "LocalName": "Aria",
    "ShortName": "en-US-AriaNeural",
    "Gender": "Female",
    "Locale": "en-US",
    "StyleList": [
        "chat",
        "customerservice",
        "narration-professional",
        "newscast-casual",
        "newscast-formal",
        "cheerful",
        "empathetic"
    ],
    "SampleRateHertz": "24000",
    "VoiceType": "Neural",
    "Status": "GA"
}
```
This JSON is for the Aria voice, which has multiple voice styles. All that is needed when converting text to speech is the short name, `en-US-AriaNeural`.
Instead of downloading and decoding this entire list on your microcontroller, you will need to write some more serverless code to retrieve the list of voices for the language you are using, and call this from your Wio Terminal. Your code can then pick an appropriate voice from the list, such as the first one it finds.
### Task - create a serverless function to get a list of voices
- Open your `smart-timer-trigger` project in VS Code, and open the terminal ensuring the virtual environment is activated. If not, kill and re-create the terminal.

- Open the `local.settings.json` file and add settings for the speech API key and location:

    ```json
    "SPEECH_KEY": "<key>",
    "SPEECH_LOCATION": "<location>"
    ```

    Replace `<key>` with the API key for your speech service resource. Replace `<location>` with the location you used when you created the speech service resource.
- Add a new HTTP trigger to this app called `get-voices` using the following command from inside the VS Code terminal in the root folder of the functions app project:

    ```sh
    func new --name get-voices --template "HTTP trigger"
    ```

    This will create an HTTP trigger called `get-voices`.
- Replace the contents of the `__init__.py` file in the `get-voices` folder with the following:

    ```python
    import json
    import os
    import requests

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        location = os.environ['SPEECH_LOCATION']
        speech_key = os.environ['SPEECH_KEY']

        req_body = req.get_json()
        language = req_body['language']

        url = f'https://{location}.tts.speech.microsoft.com/cognitiveservices/voices/list'

        headers = {
            'Ocp-Apim-Subscription-Key': speech_key
        }

        response = requests.get(url, headers=headers)
        voices_json = json.loads(response.text)

        voices = filter(lambda x: x['Locale'].lower() == language.lower(), voices_json)
        voices = map(lambda x: x['ShortName'], voices)

        return func.HttpResponse(json.dumps(list(voices)), status_code=200)
    ```

    This code makes an HTTP request to the endpoint to get the voices. This voices list is a large block of JSON with voices for all languages, so it is filtered down to just the voices for the language passed in the request body, then the short name is extracted and returned as a JSON list. The short name is the value needed to convert text to speech, so only this value is returned.

    > 💁 You can change the filter as necessary to select just the voices you want.

    This reduces the size of the data from 77KB (at the time of writing) to a much smaller JSON document. For example, for US voices this is 408 bytes.
- Run your function app locally. You can then call this using a tool like curl in the same way that you tested your `text-to-timer` HTTP trigger. Make sure to pass your language as a JSON body:

    ```json
    {
        "language": "<language>"
    }
    ```

    Replace `<language>` with your language, such as `en-GB` or `zh-CN`.
> 💁 You can find this code in the `code-spoken-response/functions` folder.
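If you want to see the filter and map steps used in the function in isolation, here is a minimal standalone sketch. The voice entries below are abbreviated, illustrative samples - the real response from the voices list endpoint contains many more fields and voices:

```python
import json

# Abbreviated, illustrative sample of the voices list - the real response
# contains many more fields per voice and over 200 voices
voices_json = [
    {"ShortName": "en-US-AriaNeural", "Locale": "en-US"},
    {"ShortName": "en-GB-LibbyNeural", "Locale": "en-GB"},
    {"ShortName": "en-US-GuyNeural", "Locale": "en-US"},
]

language = "en-us"

# Keep only voices whose locale matches the requested language, case-insensitively
voices = filter(lambda x: x['Locale'].lower() == language.lower(), voices_json)
# Reduce each matching voice to just its short name
voices = map(lambda x: x['ShortName'], voices)

print(json.dumps(list(voices)))  # → ["en-US-AriaNeural", "en-US-GuyNeural"]
```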
### Task - retrieve the voice from your Wio Terminal
- Open the `smart-timer` project in VS Code if it is not already open.

- Open the `config.h` header file and add the URL for your function app:

    ```cpp
    const char *GET_VOICES_FUNCTION_URL = "<URL>";
    ```

    Replace `<URL>` with the URL for the `get-voices` HTTP trigger on your function app. This will be the same as the value for `TEXT_TO_TIMER_FUNCTION_URL`, except with a function name of `get-voices` instead of `text-to-timer`.

- Create a new file in the `src` folder called `text_to_speech.h`. This will be used to define a class to convert from text to speech.
- Add the following include directives to the top of the new `text_to_speech.h` file:

    ```cpp
    #pragma once

    #include <Arduino.h>
    #include <ArduinoJson.h>
    #include <HTTPClient.h>
    #include <Seeed_FS.h>
    #include <SD/Seeed_SD.h>
    #include <WiFiClient.h>
    #include <WiFiClientSecure.h>

    #include "config.h"
    #include "speech_to_text.h"
    ```
- Add the following code below this to declare the `TextToSpeech` class, along with an instance that can be used in the rest of the application:

    ```cpp
    class TextToSpeech
    {
    public:
    private:
    };

    TextToSpeech textToSpeech;
    ```
- To call your functions app, you need to declare a WiFi client. Add the following to the `private` section of the class:

    ```cpp
    WiFiClient _client;
    ```

- In the `private` section, add a field for the selected voice:

    ```cpp
    String _voice;
    ```
- To the `public` section, add an `init` function that will get the first voice:

    ```cpp
    void init()
    {
    }
    ```

- To get the voices, a JSON document needs to be sent to the function app with the language. Add the following code to the `init` function to create this JSON document:

    ```cpp
    DynamicJsonDocument doc(1024);
    doc["language"] = LANGUAGE;

    String body;
    serializeJson(doc, body);
    ```
- Next create an `HTTPClient`, then use it to call the functions app to get the voices, posting the JSON document:

    ```cpp
    HTTPClient httpClient;
    httpClient.begin(_client, GET_VOICES_FUNCTION_URL);

    int httpResponseCode = httpClient.POST(body);
    ```
- Below this, add code to check the response code, and if it is 200 (success), then extract the list of voices, retrieving the first one from the list:

    ```cpp
    if (httpResponseCode == 200)
    {
        String result = httpClient.getString();
        Serial.println(result);

        DynamicJsonDocument doc(1024);
        deserializeJson(doc, result.c_str());

        JsonArray obj = doc.as<JsonArray>();
        _voice = obj[0].as<String>();

        Serial.print("Using voice ");
        Serial.println(_voice);
    }
    else
    {
        Serial.print("Failed to get voices - error ");
        Serial.println(httpResponseCode);
    }
    ```
- After this, end the HTTP client connection:

    ```cpp
    httpClient.end();
    ```
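If you want to check the shape of the function's response from your development machine, the parsing that `init` performs - deserialize the JSON array and take the first entry - can be mirrored in a few lines of Python. The response string below is an illustrative sample, not captured output:

```python
import json

def pick_voice(response_text: str) -> str:
    # Parse the JSON array returned by the get-voices trigger and take the
    # first entry, just as the device code does with deserializeJson and obj[0]
    voices = json.loads(response_text)
    return voices[0]

# Illustrative sample response from the get-voices trigger
print(pick_voice('["en-US-AriaNeural", "en-US-GuyNeural"]'))  # → en-US-AriaNeural
```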