Transcribing audio from live audio using the Speech framework

Learn how to create a SwiftUI application that transcribes audio to text using the Speech framework.

In "Transcribing audio from a file using the Speech framework", we saw how to transcribe speech from an audio file using the Speech supporting framework included in CoreML. This tutorial will focus on implementing a live transcriber feature using the microphone to recognize speech in real-time.

By the end of this tutorial, you will understand how to access the audio buffer coming from the microphone and make it available to the Speech framework to process and convert into text.

Step 1 - Define the logic

The first step is creating a new class responsible for accessing the microphone and starting the recognition process.

Start by creating a new file named SpeechRecognizer and importing the AVFoundation and Speech frameworks. Then we will define all the needed properties.


import Foundation
import AVFoundation
import Speech

@Observable
class SpeechRecognizer {
    
    // 1.
    var recognizedText: String = "No speech recognized"
    var startedListening: Bool = false
    // 2.
    var audioEngine: AVAudioEngine!
    // 3.
    var speechRecognizer: SFSpeechRecognizer!
    // 4.
    var recognitionRequest: SFSpeechAudioBufferRecognitionRequest!
    // 5.
    var recognitionTask: SFSpeechRecognitionTask!
    
}
  1. The recognizedText property contains the text recognized from the speech input. Initially, it’s set to "No speech recognized" and will then be updated with the actual recognized speech. The startedListening property is used to check when the transcription is active.
  2. The audioEngine property is used to handle the audio input from the microphone.
  3. The speechRecognizer property manages the recognition process (see the note after this list for choosing a specific locale).
  4. The recognitionRequest property is a request that provides the audio input from the audioEngine to the speechRecognizer.
  5. The recognitionTask property manages the status of the transcription.
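
By default, SFSpeechRecognizer() works with the device’s current locale. If you want to transcribe a specific language, you can pass a Locale to the initializer, which returns nil when that locale is not supported. A minimal sketch (the Italian locale is only an example and is not part of this project):

// Hypothetical variation: force a specific language instead of the device locale.
// The initializer returns nil if the locale is not supported by speech recognition.
let italianRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "it-IT"))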

Step 2 - Enable microphone usage

Now that we have all the necessary properties, we need to prompt the user for permission to use speech recognition. We will do that by defining a new method, setupSpeechRecognition().

@Observable
class SpeechRecognizer {
    
    // Properties declared in Step 1
    ...
    
    init() {
        setupSpeechRecognition()
    }
    
    func setupSpeechRecognition() {
        // 1.
        audioEngine = AVAudioEngine()
        speechRecognizer = SFSpeechRecognizer()
        
        // 2.
        SFSpeechRecognizer.requestAuthorization { authStatus in
            DispatchQueue.main.async {
                switch authStatus {
                case .authorized:
                    print("Speech recognition authorized")
                case .denied, .restricted, .notDetermined:
                    print("Speech recognition not authorized")
                @unknown default:
                    fatalError("Unknown authorization status")
                }
            }
        }
    }
    
}
  1. Initialize the audioEngine and the speechRecognizer properties.
  2. Use the requestAuthorization method to request the user's permission to use speech recognition services. The result of the request is returned in the authStatus parameter (a short sketch after this list shows how to check the current status before asking again).
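
Since the system shows the speech recognition alert only once, it can be useful to check the current status before requesting it again. A minimal sketch, assuming the same imports as above (the checkSpeechAuthorization() function is illustrative and not part of the tutorial’s class):

func checkSpeechAuthorization() {
    // SFSpeechRecognizer.authorizationStatus() returns the current permission synchronously.
    switch SFSpeechRecognizer.authorizationStatus() {
    case .notDetermined:
        // The user has not been asked yet, so request the permission.
        SFSpeechRecognizer.requestAuthorization { _ in }
    case .authorized:
        print("Speech recognition already authorized")
    case .denied, .restricted:
        print("Speech recognition unavailable")
    @unknown default:
        break
    }
}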

Additionally, you need to fill in the following fields in the Info.plist (the corresponding raw keys are shown after this list):

  1. Go to your project settings and select the Info tab of your app’s target;
  2. Add a new key, Privacy - Microphone Usage Description, in the Custom iOS Target Properties;
  3. Add a string value describing why the app needs access to the microphone;
  4. Add a new key, Privacy - Speech Recognition Usage Description, in the Custom iOS Target Properties;
  5. Add a string value describing why the app needs access to the speech recognition feature.
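
If your project contains an Info.plist file and you prefer to edit it as source code, these two entries correspond to the raw keys below. The description strings are only placeholders:

<key>NSMicrophoneUsageDescription</key>
<string>We use the microphone to capture your voice for live transcription.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>We use speech recognition to convert your voice into text.</string>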

Step 3 - Accessing the microphone

After requesting all the necessary permissions, we can define the method responsible for transcribing the incoming audio into text.

import Foundation
import AVFoundation
import Speech

@Observable
class SpeechRecognizer {
    
    ...
    
    func setupSpeechRecognition() { ... }
    
    func startListening() {
        // 1.
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        recognitionRequest.shouldReportPartialResults = true
        startedListening = true
        
        // 2.
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.removeTap(onBus: 0)
        
        // 3.
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, when in
            self.recognitionRequest.append(buffer)
        }
        
        // 4.
        audioEngine.prepare()
        try! audioEngine.start()
        
        // 5.
        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
            if let result = result {
                // Update the recognized text on the main actor so SwiftUI refreshes immediately.
                Task { @MainActor in
                    self.recognizedText = result.bestTranscription.formattedString
                }
            }
            
            if error != nil || result?.isFinal == true {
                self.audioEngine.stop()
                inputNode.removeTap(onBus: 0)
                
                self.recognitionRequest = nil
                self.recognitionTask = nil
            }
        }
    }
}
  1. Initialize the recognitionRequest variable and set its shouldReportPartialResults property to true. This way the recognizer starts transcribing the audio as soon as it receives input, instead of waiting until the entire audio is processed.
  2. The inputNode variable gives access to the audio input coming from the microphone, while with the recordingFormat constant we retrieve the audio format of a specific bus.

    To ensure that no tap is already installed on bus 0, we use the removeTap(onBus:) method.
  3. We can now install a new tap (a point where audio data is observed as it passes through the audio node) where the audio will be processed.

    A copy of the audio is accessible through the buffer parameter and is added to the recognitionRequest object using the append() method.
  4. Once the audio buffer is ready to receive input, we use the prepare() method to properly set up the audioEngine and then the start() method to start capturing audio (see the sketch after this list for a variant that handles the possible error instead of force-trying).
  5. We are now ready to start a new recognitionTask, passing the recognitionRequest and storing the returned task in the recognitionTask property. The closure provides the recognition results and any errors encountered during the process. If a result is available, it updates recognizedText with the best transcription.
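
In the code above, audioEngine.start() is forced with try!, which crashes the app if the engine fails to start. As a minimal sketch of a more defensive variant, the two lines of step 4 could be replaced with the following (the error handling is only an example):

        // 4. (alternative) Start the engine and bail out gracefully if it fails.
        audioEngine.prepare()
        do {
            try audioEngine.start()
        } catch {
            print("Audio engine could not start: \(error)")
            startedListening = false
            return
        }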

Step 4 - Managing microphone access

Now that we are able to process the audio from the microphone into text, we can define an additional method named stopListening() that we can use to stop the recognition process.

import Foundation
import AVFoundation
import Speech

@Observable
class SpeechRecognizer {
    
    // Properties declared in Step 1
    ...
    
    func setupSpeechRecognition() {
        ...
    }
    
    func startListening() {
        ...
    }
    
    func stopListening() {
        // 1.
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        // 2.
        recognitionRequest.endAudio()
        recognitionRequest = nil
        recognitionTask = nil
        startedListening = false
    }
}
  1. Using the stop() method and removing the tap defined before, we stop capturing audio from the microphone.
  2. Using the endAudio() method we signal that no more audio will be appended to the request, and then we reset the recognitionRequest and the recognitionTask to nil (a short usage example of the class follows).
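
Before wiring the class into a view, here is a minimal usage sketch showing the call order, assuming the permissions have already been granted:

let recognizer = SpeechRecognizer()

// Start capturing audio from the microphone and transcribing it.
recognizer.startListening()

// ... speak into the microphone ...

// Stop the capture and read the transcribed text.
recognizer.stopListening()
print(recognizer.recognizedText)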

Step 5 - Showing the transcribed text

Now that we have our logic defined, we can have fun creating a new SwiftUI view where the user can start the recognition process and visualize the processed text.

struct ContentView: View {
    // 1.
    @State private var speechRecognizer = SpeechRecognizer()
    
    var body: some View {
        
        VStack(spacing: 50) {
            // 2.
            Text(speechRecognizer.recognizedText)
                .padding()
            
            Button {
                if speechRecognizer.audioEngine.isRunning {
                    speechRecognizer.stopListening()
                } else {
                    speechRecognizer.startListening()
                }
            } label: {
                Image(systemName: speechRecognizer.startedListening ? "ear.badge.waveform" : "ear")
                    .font(.system(size: 100))
                    .foregroundColor(.white)
                
                    .symbolEffect(.bounce, value: speechRecognizer.startedListening)
                    .symbolEffect(.variableColor, isActive: speechRecognizer.startedListening)
                
                    .background {
                        Circle().frame(width: 200, height: 200)
                    }
                    .padding()
            }
            
        }
        // 3.
        .onAppear {
            speechRecognizer.setupSpeechRecognition()
        }
    }
}
  1. Create a new instance of the class that we defined in the previous steps.
  2. Create a Text view to visualize the processed text and a button that starts or stops the recognition depending on whether the audio engine is running.
  3. As soon as the view appears, we call setupSpeechRecognition() to prompt the user for the required permissions.
In the example above, to convey that the transcription has started, we use a simple animation with SF Symbols. There are many symbol effects available, and you can get an overview in our article: Animating SF Symbols with the symbol effect modifier

Final Result

If you followed the previous steps, you can now run the app on your phone and try it out.

Implementing speech recognition in a SwiftUI application can significantly enhance user interaction by providing a natural and intuitive way to input data and control the app.