Identifying individual sounds in an audio file

Learn how to add sound recognition capabilities to a SwiftUI app with the Sound Analysis framework.

The ability to identify specific sounds within an audio file is essential for applications such as speech recognition and extraction, simplifying the editing and analysis of a target sound. With Apple's Sound Analysis framework, we can identify various classes of sound in audio data, either with the help of the built-in machine learning model or with our own custom ones.

While fundamental for processing live audio streams, the framework can also be used to examine pre-recorded audio files. This opens up a wider range of audio analysis tasks, such as batch processing, working with high-quality recordings, and offline analysis.

This short tutorial will discuss creating a simple sound classification app using SwiftUI and Apple’s Sound Analysis framework. The app will analyze an audio file, classify sounds, and display them in a list.

By the end of the tutorial, you will have an app like the following:

This tutorial has three small steps:

  1. Definition of the ResultsObserver class, responsible for handling the sound classification results;
  2. Set-up of the startDetection() function to classify sounds with the Sound Analysis framework;
  3. Creation of the SwiftUI user interface to run the startDetection() method and show the results.

Step 1: Define the ResultsObserver Class

The ResultsObserver is an observer class that listens for classification events from the Sound Analysis framework, processes them and notifies the app when changes occur in the analysis progression. It conforms to the SNResultsObserving protocol and communicates the analysis results when a sound is detected and classified.

Create a new Swift file named ResultsObserver, import the SoundAnalysis framework, and define the ResultsObserver type:

import SoundAnalysis

@Observable
final class ResultsObserver: NSObject, SNResultsObserving {

    // 1.
    struct IdentifiedSound: Equatable {
        let identifier: String
        let confidence: String
        let time: String
    }

    // 2.
    var identifiedSounds: [IdentifiedSound] = []

    // 3.
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let classificationResult = result as? SNClassificationResult,
              let classification = classificationResult.classifications.first else { return }

        let time = String(format: "%.2f", classificationResult.timeRange.start.seconds)
        let confidence = String(format: "%.2f%%", classification.confidence * 100)
        let identifier = classification.identifier.replacingOccurrences(of: "_", with: " ").capitalized

        let identifiedSound = IdentifiedSound(identifier: identifier, confidence: confidence, time: time)
        self.identifiedSounds.append(identifiedSound)
    }

    // 4.
    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Detection failed: \\(error.localizedDescription)")
    }

    // 5.
    func requestDidComplete(_ request: SNRequest) {
        print("Detection completed.")
    }
}
  1. IdentifiedSound: Represents an identified sound with three properties - identifier (the sound's name), confidence (how certain the classifier is), and time (the timestamp, in seconds, at which the classification window starts relative to the beginning of the analysis).
  2. identifiedSounds: The array that keeps track of the identified sounds, appending them as they are detected.
  3. request(_:didProduce:): Plays a pivotal role in receiving the classified sounds and formatting the results for display in the UI. It handles each sound classification result by extracting the top classification, formatting it as an IdentifiedSound value, and appending it to the identifiedSounds array.
  4. request(_:didFailWithError:): Logs an error message if the sound analysis request fails.
  5. requestDidComplete(_:): Prints a completion message when the sound analysis request finishes successfully.
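
As an optional refinement, you might want to drop weak matches before they reach the UI. The helper below is only a sketch and is not part of the tutorial's observer: its name and the 0.5 threshold are assumptions, and you could call it from request(_:didProduce:) in place of the guard statement shown above.

import SoundAnalysis

// A minimal sketch, assuming low-confidence matches should be ignored:
// returns the top classification only when its confidence clears a threshold.
// The function name and the 0.5 default are illustrative assumptions.
func topClassification(
    from result: SNResult,
    minimumConfidence: Double = 0.5
) -> SNClassification? {
    guard let classificationResult = result as? SNClassificationResult,
          let top = classificationResult.classifications.first,
          top.confidence >= minimumConfidence else { return nil }
    return top
}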

Step 2: Implement the startDetection function

In the ContentView of your Xcode project, import the SoundAnalysis framework and define the function that triggers the sound analysis.

import SwiftUI
import SoundAnalysis

struct ContentView: View {
    
    // 1.
    @State private var detectionStarted = false
    private var resultsObserver = ResultsObserver()

    
    var body: some View {
        // TODO: Implemented later
    }

    // 2.
    private func startDetection() async {
        // 2.1
        guard let audioFileURL = Bundle.main.url(forResource: "farm", withExtension: "wav") else {
            print("Audio file not found.")
            return
        }
        do {
            // 2.2
            let audioAnalyzer = try SNAudioFileAnalyzer(url: audioFileURL)
            // 2.3
            let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
            request.overlapFactor = 0.5
            request.windowDuration = CMTimeMakeWithSeconds(0.5, preferredTimescale: 48_000)
            // 2.4
            try audioAnalyzer.add(request, withObserver: resultsObserver)
            detectionStarted = true
            // 2.5
            await audioAnalyzer.analyze()
        } catch {
            print("Error starting detection: \\(error.localizedDescription)")
        }
    }
}
  1. Declare a boolean variable named detectionStarted to track whether detection has begun, then declare and initialize a ResultsObserver object to handle the classification results.
  2. Declare the startDetection() function and mark it as asynchronous, since we don’t want to freeze the interface while the detection takes place. In detail, this is what we have to perform inside the function:
    • We load the URL of the audio file bundled within our app. In this case, we bundled a farm.wav audio file within the app and referred to it in our code. You can adapt this part to refer to your own audio file, or let users select their files while using the app instead (see the sketch after this list).
    • With the URL of the audio file we want to identify the sounds in, we create an SNAudioFileAnalyzer instance to manage the audio file analysis.
    • We create an SNClassifySoundRequest analysis request with the sound classifier built into the Sound Analysis framework (version1) and configure its parameters (overlapFactor and windowDuration) if needed.

      The overlapFactor determines how much successive analysis windows overlap during sound classification. In this case, it is set to 0.5, meaning there is a 50% overlap so that sounds are captured near the center of at least one window. Lower values risk missing sounds that fall across window boundaries, while higher values improve accuracy but increase computation cost.

      The windowDuration defines each analysis window's duration. At 0.5 seconds, it balances responsiveness and computation: shorter windows are faster, while longer ones improve accuracy for extended sounds. There is also the read-only windowDurationConstraint property, which exposes the window durations the request supports, useful for checking whether the duration you set is actually valid for the analysis configuration.
    • After configuring the sound analysis request, we add it to the analyzer together with the ResultsObserver instance that will take care of its results. The analyzer can host multiple requests, each with its own observer for the classification results, and manages the analysis lifecycle.
    • Then, we call the analyze method of SNAudioFileAnalyzer to process the entire audio file. There are three variations of this method: a synchronous one, an async one that can be awaited (the one we used), and an asynchronous one that takes a completion handler closure.
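
If you prefer letting users pick the audio file themselves, as mentioned above, one common approach is SwiftUI’s fileImporter modifier. The snippet below is only a sketch under that assumption: the view name and the selectedFileURL property are not part of the tutorial’s code, and you would pass the chosen URL to the analyzer instead of the bundled farm.wav file.

import SwiftUI
import UniformTypeIdentifiers

struct FilePickerExample: View {
    // Hypothetical state, not part of the tutorial's ContentView.
    @State private var isImporterPresented = false
    @State private var selectedFileURL: URL?

    var body: some View {
        Button("Choose audio file") {
            isImporterPresented = true
        }
        // Present the system file picker restricted to audio content.
        .fileImporter(
            isPresented: $isImporterPresented,
            allowedContentTypes: [.audio]
        ) { result in
            if case .success(let url) = result {
                // Build the SNAudioFileAnalyzer with this URL instead of
                // the bundled file.
                selectedFileURL = url
            }
        }
    }
}

Depending on where the chosen file lives, you may also need to call startAccessingSecurityScopedResource() on the returned URL before reading it.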

For this short tutorial, we used SNClassifySoundRequest(classifierIdentifier:) to initialize a request that uses version1, the framework’s built-in sound classification model.

If we want to use a custom Core ML model, we can simply use the SNClassifySoundRequest(mlModel:) initializer instead. In this case, after dragging and dropping the file of our custom machine learning model into our project in Xcode, we refer to it by typing SNClassifySoundRequest(mlModel: MyCustomMLModel().model).
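
As a hedged illustration of that initializer, the sketch below assumes a custom Core ML sound classifier named MySoundClassifier was added to the project; the model name and the helper function are hypothetical.

import SoundAnalysis
import CoreML

// A minimal sketch, assuming a custom Core ML sound classifier named
// MySoundClassifier exists in the project (the name is hypothetical).
func makeCustomRequest() throws -> SNClassifySoundRequest {
    let customModel = try MySoundClassifier(configuration: MLModelConfiguration()).model
    // The request is then added to the analyzer exactly like the built-in one.
    return try SNClassifySoundRequest(mlModel: customModel)
}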

Be aware that we declare the function as asynchronous because we chose the asynchronous analyze() method and we do not want to block the interface when it is called. Unlike live audio buffers, files are analyzed as a whole block, which takes time.

Step 3: Create the user interface with SwiftUI

In the body of the ContentView, provide the user with the option to start the analysis. Once the analysis is done, the app displays the results of the sound detection, showing each detected sound, its confidence score, and when it occurred in the audio file.

import SwiftUI
import SoundAnalysis

struct ContentView: View {
    
    @State private var detectionStarted = false
    private var resultsObserver = ResultsObserver()

    var body: some View {
        NavigationStack {
            VStack {
                if detectionStarted && resultsObserver.identifiedSounds.isEmpty {
                    // 1.
                    ProgressView("Classifying sound...")
                } else if !resultsObserver.identifiedSounds.isEmpty {
                    // 2.
                    List(resultsObserver.identifiedSounds, id: \.time) { sound in
                        VStack(alignment: .leading) {
                            Text(sound.identifier).font(.headline)
                            Text("\(sound.confidence) confidence at \(sound.time)s").font(.subheadline)
                        }
                    }
                } else {
                    // 3.
                    Spacer()
                    ContentUnavailableView(
                        "No Sound Classified",
                        systemImage: "waveform.badge.magnifyingglass",
                        description: Text("Tap the button to start identifying sounds")
                    )
                    Spacer()
                    Button {
                        Task { await startDetection() }
                    } label: {
                        Text("Start")
                            .padding()
                            .background(.blue)
                            .foregroundColor(.white)
                            .clipShape(RoundedRectangle(cornerRadius: 10))
                    }
                }
            }
            .navigationTitle("Sound Classifier")
        }
    }

    private func startDetection() async { ... }
}
  1. When detection is started but no sounds are identified yet, a ProgressView with the message "Classifying sound..." is displayed to indicate the ongoing analysis.
  2. When sounds are identified, a List dynamically displays the results from resultsObserver.identifiedSounds, showing each sound's identifier as the headline and its confidence with the timestamp as a subheadline.
  3. When detection hasn't started, or there are no results:
    • ContentUnavailableView is shown with a placeholder message and icon.
    • A Start button is displayed below it which, when tapped, triggers the startDetection() function asynchronously to begin the sound analysis.

We’re done! You can now use the full potential of the Sound Analysis framework to identify individual sounds in audio files, opening up new possibilities for your audio processing projects.


Final Result


In this short tutorial, we’ve demonstrated how to use the Sound Analysis framework to identify sounds in a pre-recorded audio file.

By setting up the SNAudioFileAnalyzer, configuring a ResultsObserver, and displaying the results in SwiftUI, we can build an application that automatically detects and classifies sounds. This functionality is perfect for creating interactive applications that respond to specific audio cues.

The capability for non-live analysis opens up new possibilities. When clarity and precision are paramount, projects can benefit from handling pre-recorded, high-quality audio files rather than relying on a microphone’s live input. This allows for a deeper dive into the data, enabling more thorough and detailed analyses through extensive, time-consuming calculations without the constraints of live processing. Incorporating these techniques can further elevate your audio analysis projects, yielding deeper insights and higher accuracy.

As with any feature built on Apple’s native frameworks, you might consider enhancing this audio file sound analysis with additional advanced features or variations. For example, you could analyze multiple audio files sequentially, which is useful for processing a library of audio files. For lengthy audio files, you may want to explore segmented analysis to handle audio in smaller chunks, improving performance and managing memory more efficiently.
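
As a rough sketch of the sequential approach, reusing the types from this tutorial and assuming one ResultsObserver per file, batch analysis could look something like this:

import SoundAnalysis

// A minimal sketch (not from the tutorial): analyze several files one after
// another, collecting the results of each file with its own observer.
func analyzeFiles(at urls: [URL]) async {
    for url in urls {
        do {
            let analyzer = try SNAudioFileAnalyzer(url: url)
            let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
            let observer = ResultsObserver()   // one observer per file keeps results separate
            try analyzer.add(request, withObserver: observer)
            await analyzer.analyze()           // files are processed one after another
            print("\(url.lastPathComponent): \(observer.identifiedSounds.count) sounds identified")
        } catch {
            print("Skipping \(url.lastPathComponent): \(error.localizedDescription)")
        }
    }
}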

These enhancements can make your sound detection system even more robust and versatile.

Finally, don't overlook the benefits of pre-processing: techniques like filtering or normalization applied before the analysis can significantly improve the accuracy of your results. By cleaning up your audio files beforehand, you ensure that the subsequent analysis is as precise and reliable as possible.
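
As a hedged example of such pre-processing, here is what a simple peak normalization pass might look like with AVFoundation. The function name is an assumption, the whole file is loaded into memory, and you would still need to write the normalized buffer back to a file (or feed it to an SNAudioStreamAnalyzer) before classifying it.

import AVFoundation

// A minimal sketch: load an audio file, find its peak sample, and scale every
// sample so the peak reaches full scale. Assumes the file fits in memory.
func normalizedBuffer(from url: URL) throws -> AVAudioPCMBuffer? {
    let file = try AVAudioFile(forReading: url)
    guard let buffer = AVAudioPCMBuffer(
        pcmFormat: file.processingFormat,
        frameCapacity: AVAudioFrameCount(file.length)
    ) else { return nil }
    try file.read(into: buffer)

    guard let channels = buffer.floatChannelData else { return buffer }
    let channelCount = Int(buffer.format.channelCount)
    let frameCount = Int(buffer.frameLength)

    // Find the absolute peak across all channels.
    var peak: Float = 0
    for channel in 0..<channelCount {
        for frame in 0..<frameCount {
            peak = max(peak, abs(channels[channel][frame]))
        }
    }
    guard peak > 0 else { return buffer }

    // Apply a uniform gain so the loudest sample hits full scale.
    let gain = 1.0 / peak
    for channel in 0..<channelCount {
        for frame in 0..<frameCount {
            channels[channel][frame] *= gain
        }
    }
    return buffer
}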

If you want to dive deeper into Sound Analysis and Machine Learning, consider the following material:

And if you’re curious about how this works with live audio buffers, here is another tutorial you can take a look at:

Identify individual sounds in a live audio buffer
Learn how to create a SwiftUI app that uses the Sound Analysis framework to identify sounds with an audio buffer.