Speech to text - Raspberry Pi

In this part of the lesson, you will write code to convert speech in the captured audio to text using the speech service.

Send the audio to the speech service

The audio can be sent to the speech service using the REST API. To use the speech service, first you need to request an access token, then use that token to access the REST API. These access tokens expire after 10 minutes, so your code should request them on a regular basis to ensure they are always up to date.
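
One possible way to keep tokens fresh without requesting a new one for every call is to cache the token and refetch it shortly before the 10 minute lifetime runs out. The sketch below is illustrative only and is not part of the lesson code: the `TokenCache` class, the one minute safety margin, and the `fetch_token` and `clock` parameters are all assumptions introduced for this example. In the app, `fetch_token` could be a function that requests a token from the speech service.

```python
import time

TOKEN_LIFETIME_SECONDS = 10 * 60   # tokens expire after 10 minutes
REFRESH_MARGIN_SECONDS = 60        # assumed safety margin, not a service requirement


class TokenCache:
    """Hypothetical helper that reuses a token until it is close to expiring."""

    def __init__(self, fetch_token, lifetime=TOKEN_LIFETIME_SECONDS,
                 margin=REFRESH_MARGIN_SECONDS, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable that returns a new token string
        self._lifetime = lifetime
        self._margin = margin
        self._clock = clock              # injectable so the cache can be tested
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return the cached token, fetching a new one if it is missing or stale."""
        now = self._clock()
        stale = (self._fetched_at is None or
                 now - self._fetched_at >= self._lifetime - self._margin)
        if stale:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

With a cache like this, code that needs a token would call `cache.get()` each time, and only roughly one request every nine minutes would reach the token endpoint.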

Task - get an access token

  1. Open the smart-timer project on your Pi.

  2. Remove the play_audio function. This is no longer needed as you don't want a smart timer to repeat back to you what you said.

  3. Add the following import to the top of the app.py file:

    import requests
  4. Add the following code above the while True loop to declare some settings for the speech service:

    speech_api_key = '<key>'
    location = '<location>'
    language = '<language>'

    Replace <key> with the API key for your speech service resource. Replace <location> with the location you used when you created the speech service resource.

    Replace <language> with the locale name for the language you will be speaking in, for example en-GB for English, or zh-HK for Cantonese. You can find a list of the supported languages and their locale names in the Language and voice support documentation on Microsoft docs.

  5. Below this, add the following function to get an access token:

    def get_access_token():
        headers = {
            'Ocp-Apim-Subscription-Key': speech_api_key
        }

        token_endpoint = f'https://{location}.api.cognitive.microsoft.com/sts/v1.0/issuetoken'
        response = requests.post(token_endpoint, headers=headers)
        return str(response.text)

    This calls a token issuing endpoint, passing the API key as a header. This call returns an access token that can be used to call the speech services.

  6. Below this, declare a function to convert speech in the captured audio to text using the REST API:

    def convert_speech_to_text(buffer):
  7. Inside this function, set up the REST API URL and headers:

        url = f'https://{location}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1'

        headers = {
            'Authorization': 'Bearer ' + get_access_token(),
            'Content-Type': f'audio/wav; codecs=audio/pcm; samplerate={rate}',
            'Accept': 'application/json;text/xml'
        }

        params = {
            'language': language
        }

    This builds a URL using the location of the speech services resource. It then populates the headers with the access token from the get_access_token function, as well as the sample rate used to capture the audio. Finally it defines some parameters to be passed with the URL containing the language in the audio.

  8. Below this, add the following code to call the REST API and get back the text:

        response = requests.post(url, headers=headers, params=params, data=buffer)
        response_json = response.json()

        if response_json['RecognitionStatus'] == 'Success':
            return response_json['DisplayText']
        else:
            return ''

    This calls the URL and decodes the JSON value that comes in the response. The RecognitionStatus value in the response indicates if the call was able to extract speech into text successfully, and if this is Success then the text is returned from the function, otherwise an empty string is returned.

  9. Above the while True: loop, define a function to process the text returned from the speech to text service. This function will just print the text to the console for now.

    def process_text(text):
        print(text)
  10. Finally replace the call to play_audio in the while True loop with a call to the convert_speech_to_text function, passing the text to the process_text function:

    text = convert_speech_to_text(buffer)
    process_text(text)
  11. Run the code. Press the button and speak into the microphone. Release the button when you are done, and the audio will be converted to text and printed to the console.

    pi@raspberrypi:~/smart-timer $ python3 app.py 
    Hello world.
    Welcome to IoT 01 for Beginners.

    Try different types of sentences, along with sentences where words sound the same but have different meanings. For example, if you are speaking in English, say 'I want to buy two bananas and an apple too', and notice how it will use the correct to, two and too based on the context of the word, not just its sound.
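
To see how the response handling from the steps above behaves without calling the service, the check on RecognitionStatus can be tried against hand-written dictionaries. This is a minimal sketch: the `extract_text` helper and both sample payloads are made up for illustration (they are not output captured from the speech service), and using `.get` so that a missing key yields an empty string is a design choice of this example rather than something the lesson code does.

```python
def extract_text(response_json):
    """Return the recognized text, or an empty string if recognition did not succeed."""
    if response_json.get('RecognitionStatus') == 'Success':
        return response_json.get('DisplayText', '')
    return ''


# Illustrative payloads shaped like the JSON the lesson code reads.
# The Offset and Duration values are placeholders.
sample_success = {
    'RecognitionStatus': 'Success',
    'DisplayText': 'I want to buy two bananas and an apple too.',
    'Offset': 0,
    'Duration': 0,
}
sample_no_match = {'RecognitionStatus': 'NoMatch'}

print(extract_text(sample_success))         # the DisplayText value
print(repr(extract_text(sample_no_match)))  # an empty string
```

Returning an empty string for any status other than Success matches the behavior of convert_speech_to_text, so process_text always receives a string.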

💁 You can find this code in the code-speech-to-text/pi folder.

😀 Your speech to text program was a success!