Speech to text - Wio Terminal
In this part of the lesson, you will write code to convert speech in the captured audio to text using the speech service.
Send the audio to the speech service
The audio can be sent to the speech service using the REST API. To use the speech service, first you need to request an access token, then use that token to access the REST API. These access tokens expire after 10 minutes, so your code should request them on a regular basis to ensure they are always up to date.
Task - get an access token
-
Open the
smart-timer
project if it's not already open. -
Add the following library dependencies to the
platformio.ini
file to access WiFi and handle JSON:seeed-studio/Seeed Arduino rpcWiFi @ 1.0.5
seeed-studio/Seeed Arduino rpcUnified @ 2.1.3
seeed-studio/Seeed_Arduino_mbedtls @ 3.0.1
seeed-studio/Seeed Arduino RTC @ 2.0.0
bblanchon/ArduinoJson @ 6.17.3 -
Add the following code to the
config.h
header file:const char *SSID = "<SSID>";
const char *PASSWORD = "<PASSWORD>";
const char *SPEECH_API_KEY = "<API_KEY>";
const char *SPEECH_LOCATION = "<LOCATION>";
const char *LANGUAGE = "<LANGUAGE>";
const char *TOKEN_URL = "https://%s.api.cognitive.microsoft.com/sts/v1.0/issuetoken";Replace
<SSID>
and<PASSWORD>
with the relevant values for your WiFi.Replace
<API_KEY>
with the API key for your speech service resource. Replace<LOCATION>
with the location you used when you created the speech service resource.Replace
<LANGUAGE>
with the locale name for language you will be speaking in, for exampleen-GB
for English, orzn-HK
for Cantonese. You can find a list of the supported languages and their locale names in the Language and voice support documentation on Microsoft docs.The
TOKEN_URL
constant is the URL of the token issuer without the location. This will be combined with the location later to get the full URL. -
Just like connecting to Custom Vision, you will need to use an HTTPS connection to connect to the token issuing service. To the end of
config.h
, add the following code:const char *TOKEN_CERTIFICATE =
"-----BEGIN CERTIFICATE-----\r\n"
"MIIF8zCCBNugAwIBAgIQAueRcfuAIek/4tmDg0xQwDANBgkqhkiG9w0BAQwFADBh\r\n"
"MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
"d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
"MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
"MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
"c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwNjCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
"ggIPADCCAgoCggIBALVGARl56bx3KBUSGuPc4H5uoNFkFH4e7pvTCxRi4j/+z+Xb\r\n"
"wjEz+5CipDOqjx9/jWjskL5dk7PaQkzItidsAAnDCW1leZBOIi68Lff1bjTeZgMY\r\n"
"iwdRd3Y39b/lcGpiuP2d23W95YHkMMT8IlWosYIX0f4kYb62rphyfnAjYb/4Od99\r\n"
"ThnhlAxGtfvSbXcBVIKCYfZgqRvV+5lReUnd1aNjRYVzPOoifgSx2fRyy1+pO1Uz\r\n"
"aMMNnIOE71bVYW0A1hr19w7kOb0KkJXoALTDDj1ukUEDqQuBfBxReL5mXiu1O7WG\r\n"
"0vltg0VZ/SZzctBsdBlx1BkmWYBW261KZgBivrql5ELTKKd8qgtHcLQA5fl6JB0Q\r\n"
"gs5XDaWehN86Gps5JW8ArjGtjcWAIP+X8CQaWfaCnuRm6Bk/03PQWhgdi84qwA0s\r\n"
"sRfFJwHUPTNSnE8EiGVk2frt0u8PG1pwSQsFuNJfcYIHEv1vOzP7uEOuDydsmCjh\r\n"
"lxuoK2n5/2aVR3BMTu+p4+gl8alXoBycyLmj3J/PUgqD8SL5fTCUegGsdia/Sa60\r\n"
"N2oV7vQ17wjMN+LXa2rjj/b4ZlZgXVojDmAjDwIRdDUujQu0RVsJqFLMzSIHpp2C\r\n"
"Zp7mIoLrySay2YYBu7SiNwL95X6He2kS8eefBBHjzwW/9FxGqry57i71c2cDAgMB\r\n"
"AAGjggGtMIIBqTAdBgNVHQ4EFgQU1cFnOsKjnfR3UltZEjgp5lVou6UwHwYDVR0j\r\n"
"BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
"JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
"CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
"Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
"aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
"cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
"MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
"cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
"AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQB2oWc93fB8esci/8esixj++N22meiGDjgF\r\n"
"+rA2LUK5IOQOgcUSTGKSqF9lYfAxPjrqPjDCUPHCURv+26ad5P/BYtXtbmtxJWu+\r\n"
"cS5BhMDPPeG3oPZwXRHBJFAkY4O4AF7RIAAUW6EzDflUoDHKv83zOiPfYGcpHc9s\r\n"
"kxAInCedk7QSgXvMARjjOqdakor21DTmNIUotxo8kHv5hwRlGhBJwps6fEVi1Bt0\r\n"
"trpM/3wYxlr473WSPUFZPgP1j519kLpWOJ8z09wxay+Br29irPcBYv0GMXlHqThy\r\n"
"8y4m/HyTQeI2IMvMrQnwqPpY+rLIXyviI2vLoI+4xKE4Rn38ZZ8m\r\n"
"-----END CERTIFICATE-----\r\n";This is the same certificate you used when connecting to Custom Vision.
-
Add an include for the WiFi header file and the config header file to the top of the
main.cpp
file:#include <rpcWiFi.h>
#include "config.h" -
Add code to connect to WiFi in
main.cpp
above thesetup
function:void connectWiFi()
{
while (WiFi.status() != WL_CONNECTED)
{
Serial.println("Connecting to WiFi..");
WiFi.begin(SSID, PASSWORD);
delay(500);
}
Serial.println("Connected!");
} -
Call this function from the
setup
function after the serial connection has been established:connectWiFi();
-
Create a new header file in the
src
folder calledspeech_to_text.h
. In this header file, add the following code:#pragma once
#include <Arduino.h>
#include <ArduinoJson.h>
#include <HTTPClient.h>
#include <WiFiClientSecure.h>
#include "config.h"
#include "mic.h"
class SpeechToText
{
public:
private:
};
SpeechToText speechToText;This includes some necessary header files for an HTTP connection, configuration and the
mic.h
header file, and defines a class calledSpeechToText
, before declaring an instance of that class that can be used later. -
Add the following 2 fields to the
private
section of this class:WiFiClientSecure _token_client;
String _access_token;The
_token_client
is a WiFi Client that uses HTTPS and will be used to get the access token. This token will then be stored in_access_token
. -
Add the following method to the
private
section:String getAccessToken()
{
char url[128];
sprintf(url, TOKEN_URL, SPEECH_LOCATION);
HTTPClient httpClient;
httpClient.begin(_token_client, url);
httpClient.addHeader("Ocp-Apim-Subscription-Key", SPEECH_API_KEY);
int httpResultCode = httpClient.POST("{}");
if (httpResultCode != 200)
{
Serial.println("Error getting access token, trying again...");
delay(10000);
return getAccessToken();
}
Serial.println("Got access token.");
String result = httpClient.getString();
httpClient.end();
return result;
}This code builds the URL for the token issuer API using the location of the speech resource. It then creates an
HTTPClient
to make the web request, setting it up to use the WiFi client configured with the token endpoints certificate. It sets the API key as a header for the call. It then makes a POST request to get the certificate, retrying if it gets any errors. Finally the access token is returned. -
To the
public
section, add a method to get the access token. This will be needed in later lessons to convert text to speech.String AccessToken()
{
return _access_token;
} -
To the
public
section, add aninit
method that sets up the token client:void init()
{
_token_client.setCACert(TOKEN_CERTIFICATE);
_access_token = getAccessToken();
}This sets the certificate on the WiFi client, then gets the access token.
-
In
main.cpp
, add this new header file to the include directives:#include "speech_to_text.h"
-
Initialize the
SpeechToText
class at the end of thesetup
function, after themic.init
call but beforeReady
is written to the serial monitor:speechToText.init();
Task - read audio from flash memory
-
In an earlier part of this lesson, the audio was recorded to the flash memory. This audio will need to be sent to the Speech Services REST API, so it needs to be read from the flash memory. It can't be loaded into an in-memory buffer as it would be too large. The
HTTPClient
class that makes REST calls can stream data using an Arduino Stream - a class that can load data in small chunks, sending the chunks one at a time as part of the request. Every time you callread
on a stream it returns the next block of data. An Arduino stream can be created that can read from the flash memory. Create a new file calledflash_stream.h
in thesrc
folder, and add the following code to it:#pragma once
#include <Arduino.h>
#include <HTTPClient.h>
#include <sfud.h>
#include "config.h"
class FlashStream : public Stream
{
public:
virtual size_t write(uint8_t val)
{
}
virtual int available()
{
}
virtual int read()
{
}
virtual int peek()
{
}
private:
};This declares the
FlashStream
class, deriving from the ArduinoStream
class. This is an abstract class - derived classes have to implement a few methods before the class can be instantiated, and these methods are defined in this class.✅ Read more on Arduino Streams in the Arduino Stream documentation
-
Add the following fields to the
private
section:size_t _pos;
size_t _flash_address;
const sfud_flash *_flash;
byte _buffer[HTTP_TCP_BUFFER_SIZE];This defines a temporary buffer to store data read from the flash memory, along with fields to store the current position when reading from the buffer, the current address to read from the flash memory, and the flash memory device.
-
In the
private
section, add the following method:void populateBuffer()
{
sfud_read(_flash, _flash_address, HTTP_TCP_BUFFER_SIZE, _buffer);
_flash_address += HTTP_TCP_BUFFER_SIZE;
_pos = 0;
}This code reads from the flash memory at the current address and stores the data in a buffer. It then increments the address, so the next call reads the next block of memory. The buffer is sized based on the largest chunk that the
HTTPClient
will send to the REST API at one time.💁 Erasing flash memory has to be done using the grain size, reading on the other hand does not.
-
In the
public
section of this class, add a constructor:FlashStream()
{
_pos = 0;
_flash_address = 0;
_flash = sfud_get_device_table() + 0;
populateBuffer();
}This constructor sets up all the fields to start reading from the start of the flash memory block, and loads the first chunk of data into the buffer.
-
Implement the
write
method. This stream will only read data, so this can do nothing and return 0:virtual size_t write(uint8_t val)
{
return 0;
} -
Implement the
peek
method. This returns the data at the current position without moving the stream along. Callingpeek
multiple times will always return the same data as long as no data is read from the stream.virtual int peek()
{
return _buffer[_pos];
} -
Implement the
available
function. This returns how many bytes can be read from the stream, or -1 if the stream is complete. For this class, the maximum available will be no more than the HTTPClient's chunk size. When this stream is used in the HTTP client it calls this function to see how much data is available, then requests that much data to send to the REST API. We don't want each chunk to be more than the HTTP clients chunk size, so if more than that is available, the chunk size is returned. If less, then what is available is returned. Once all the data has been streamed, -1 is returned.virtual int available()
{
int remaining = BUFFER_SIZE - ((_flash_address - HTTP_TCP_BUFFER_SIZE) + _pos);
int bytes_available = min(HTTP_TCP_BUFFER_SIZE, remaining);
if (bytes_available == 0)
{
bytes_available = -1;
}
return bytes_available;
} -
Implement the
read
method to return the next byte from the buffer, incrementing the position. If the position exceeds the size of the buffer, it populates the buffer with the next block from the flash memory and resets the position.virtual int read()
{
int retVal = _buffer[_pos++];
if (_pos == HTTP_TCP_BUFFER_SIZE)
{
populateBuffer();
}
return retVal;
} -
In the
speech_to_text.h
header file, add an include directive for this new header file:#include "flash_stream.h"
Task - convert the speech to text
-
The speech can be converted to text by sending the audio to the Speech Service via a REST API. This REST API has a different certificate to the token issuer, so add the following code to the
config.h
header file to define this certificate:const char *SPEECH_CERTIFICATE =
"-----BEGIN CERTIFICATE-----\r\n"
"MIIF8zCCBNugAwIBAgIQCq+mxcpjxFFB6jvh98dTFzANBgkqhkiG9w0BAQwFADBh\r\n"
"MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
"d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
"MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
"MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
"c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwMTCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
"ggIPADCCAgoCggIBAMedcDrkXufP7pxVm1FHLDNA9IjwHaMoaY8arqqZ4Gff4xyr\r\n"
"RygnavXL7g12MPAx8Q6Dd9hfBzrfWxkF0Br2wIvlvkzW01naNVSkHp+OS3hL3W6n\r\n"
"l/jYvZnVeJXjtsKYcXIf/6WtspcF5awlQ9LZJcjwaH7KoZuK+THpXCMtzD8XNVdm\r\n"
"GW/JI0C/7U/E7evXn9XDio8SYkGSM63aLO5BtLCv092+1d4GGBSQYolRq+7Pd1kR\r\n"
"EkWBPm0ywZ2Vb8GIS5DLrjelEkBnKCyy3B0yQud9dpVsiUeE7F5sY8Me96WVxQcb\r\n"
"OyYdEY/j/9UpDlOG+vA+YgOvBhkKEjiqygVpP8EZoMMijephzg43b5Qi9r5UrvYo\r\n"
"o19oR/8pf4HJNDPF0/FJwFVMW8PmCBLGstin3NE1+NeWTkGt0TzpHjgKyfaDP2tO\r\n"
"4bCk1G7pP2kDFT7SYfc8xbgCkFQ2UCEXsaH/f5YmpLn4YPiNFCeeIida7xnfTvc4\r\n"
"7IxyVccHHq1FzGygOqemrxEETKh8hvDR6eBdrBwmCHVgZrnAqnn93JtGyPLi6+cj\r\n"
"WGVGtMZHwzVvX1HvSFG771sskcEjJxiQNQDQRWHEh3NxvNb7kFlAXnVdRkkvhjpR\r\n"
"GchFhTAzqmwltdWhWDEyCMKC2x/mSZvZtlZGY+g37Y72qHzidwtyW7rBetZJAgMB\r\n"
"AAGjggGtMIIBqTAdBgNVHQ4EFgQUDyBd16FXlduSzyvQx8J3BM5ygHYwHwYDVR0j\r\n"
"BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
"JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
"CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
"Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
"aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
"cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
"MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
"cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
"AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQAlFvNh7QgXVLAZSsNR2XRmIn9iS8OHFCBA\r\n"
"WxKJoi8YYQafpMTkMqeuzoL3HWb1pYEipsDkhiMnrpfeYZEA7Lz7yqEEtfgHcEBs\r\n"
"K9KcStQGGZRfmWU07hPXHnFz+5gTXqzCE2PBMlRgVUYJiA25mJPXfB00gDvGhtYa\r\n"
"+mENwM9Bq1B9YYLyLjRtUz8cyGsdyTIG/bBM/Q9jcV8JGqMU/UjAdh1pFyTnnHEl\r\n"
"Y59Npi7F87ZqYYJEHJM2LGD+le8VsHjgeWX2CJQko7klXvcizuZvUEDTjHaQcs2J\r\n"
"+kPgfyMIOY1DMJ21NxOJ2xPRC/wAh/hzSBRVtoAnyuxtkZ4VjIOh\r\n"
"-----END CERTIFICATE-----\r\n"; -
Add a constant to this file for the speech URL without the location. This will be combined with the location and language later to get the full URL.
const char *SPEECH_URL = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s";
-
In the
speech_to_text.h
header file, in theprivate
section of theSpeechToText
class, define a field for a WiFi Client using the speech certificate:WiFiClientSecure _speech_client;
-
In the
init
method, set the certificate on this WiFi Client:_speech_client.setCACert(SPEECH_CERTIFICATE);
-
Add the following code to the
public
section of theSpeechToText
class to define a method to convert speech to text:String convertSpeechToText()
{
} -
Add the following code to this method to create an HTTP client using the WiFi client configured with the speech certificate, and using the speech URL set with the location and language:
char url[128];
sprintf(url, SPEECH_URL, SPEECH_LOCATION, LANGUAGE);
HTTPClient httpClient;
httpClient.begin(_speech_client, url); -
Some headers need to be set on the connection:
httpClient.addHeader("Authorization", String("Bearer ") + _access_token);
httpClient.addHeader("Content-Type", String("audio/wav; codecs=audio/pcm; samplerate=") + String(RATE));
httpClient.addHeader("Accept", "application/json;text/xml");This sets headers for the authorization using the access token, the audio format using the sample rate, and sets that the client expects the result as JSON.
-
After this, add the following code to make the REST API call:
Serial.println("Sending speech...");
FlashStream stream;
int httpResponseCode = httpClient.sendRequest("POST", &stream, BUFFER_SIZE);
Serial.println("Speech sent!");This creates a
FlashStream
and uses it to stream data to the REST API. -
Below this, add the following code:
String text = "";
if (httpResponseCode == 200)
{
String result = httpClient.getString();
Serial.println(result);
DynamicJsonDocument doc(1024);
deserializeJson(doc, result.c_str());
JsonObject obj = doc.as<JsonObject>();
text = obj["DisplayText"].as<String>();
}
else if (httpResponseCode == 401)
{
Serial.println("Access token expired, trying again with a new token");
_access_token = getAccessToken();
return convertSpeechToText();
}
else
{
Serial.print("Failed to convert text to speech - error ");
Serial.println(httpResponseCode);
}This code checks the response code.
If it is 200, the code for success, then the result is retrieved, decoded from JSON, and the
DisplayText
property is set into thetext
variable. This is the property that the text version of the speech is returned in.If the response code is 401, then the access token has expired (these tokens only last 10 minutes). A new access token is requested, and the call is made again.
Otherwise, an error is sent to the serial monitor, and the
text
is left blank. -
Add the following code to the end of this method to close the HTTP client and return the text:
httpClient.end();
return text; -
In
main.cpp
call this newconvertSpeechToText
method in theprocessAudio
function, then log out the speech to the serial monitor:String text = speechToText.convertSpeechToText();
Serial.println(text); -
Build this code, upload it to your Wio Terminal and test it out through the serial monitor. Once you see
Ready
in the serial monitor, press the C button (the one on the left-hand side, closest to the power switch), and speak. 4 seconds of audio will be captured, then converted to text.--- Available filters and text transformations: colorize, debug, default, direct, hexlify, log2file, nocontrol, printable, send_on_enter, time
--- More details at http://bit.ly/pio-monitor-filters
--- Miniterm on /dev/cu.usbmodem1101 9600,8,N,1 ---
--- Quit: Ctrl+C | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
Connecting to WiFi..
Connected!
Got access token.
Ready.
Starting recording...
Finished recording
Sending speech...
Speech sent!
{"RecognitionStatus":"Success","DisplayText":"Set a 2 minute and 27 second timer.","Offset":4700000,"Duration":35300000}
Set a 2 minute and 27 second timer.
💁 You can find this code in the code-speech-to-text/wio-terminal folder.
😀 Your speech to text program was a success!