Detecting documents in an image with the Vision framework

Learn how to use the Vision framework to detect documents in images.

When it comes to understanding which part of an image contains a document, we can leverage the machine learning capabilities of the Vision framework. The DetectDocumentSegmentationRequest type provides a simple way to accomplish that.

private func detectDocument() async throws -> DetectedDocumentObservation? {
    // 1. Set up the request
    let request = DetectDocumentSegmentationRequest()
    
    // 2. The image to perform the detection on
    guard let uiImage = UIImage(named: "document-sample"),
          let image = CIImage(image: uiImage) else { return nil }
    
    // 3. Perform the request
    guard let observation = try await request.perform(on: image) else { return nil }
    
    // 4. The result
    return observation
}

Performing the document segmentation request works as follows:

  1. Start with the creation of an instance of the request.
  2. Adjust its settings if and as needed, and set the object to perform the request on.
  3. Perform the request using one of the perform methods, like perform(on:orientation:), as shown in the sketch after this list.
  4. Return the resulting observation.
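
For instance, when the source image might be rotated, as with frames coming from the camera, the orientation can be passed explicitly. Below is a minimal sketch reusing the "document-sample" asset from above; it assumes the orientation parameter accepts a CGImagePropertyOrientation value like .right.

import Vision
import UIKit

func detectRotatedDocument() async throws -> DetectedDocumentObservation? {
    // Set up the request
    let request = DetectDocumentSegmentationRequest()
    
    // The image to perform the detection on
    guard let uiImage = UIImage(named: "document-sample"),
          let image = CIImage(image: uiImage) else { return nil }
    
    // Perform the request, telling Vision the image content is rotated
    // 90 degrees clockwise (assumed CGImagePropertyOrientation value)
    return try await request.perform(on: image, orientation: .right)
}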

The resulting value is of type DetectedDocumentObservation, an observation object that stores the following properties (see the sketch after this list):

  1. The confidence of the observation - a float value stating the observation’s accuracy, from 0 to 1;
  2. The four corners of the region containing the document in the analyzed image as NormalizedPoint values:
    1. topLeft
    2. topRight
    3. bottomRight
    4. bottomLeft
  3. The boundingBox - the bounding box of the object with coordinates normalized to the dimensions of the processed image, with the origin at the lower-left corner of the picture;
  4. The globalSegmentationMask - a PixelBufferObservation representing a segmentation mask for the detected document.
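
Here is a short sketch of reading these properties; it assumes the toImageCoordinates(_:origin:) conversion available on NormalizedPoint and NormalizedRect, and uses a hypothetical image size.

import Vision

func inspect(_ observation: DetectedDocumentObservation) {
    // A hypothetical size of the processed image, in pixels
    let imageSize = CGSize(width: 1000, height: 1400)
    
    // 1. The confidence of the observation, from 0 to 1
    print("Confidence: \(observation.confidence)")
    
    // 2. A corner of the document region, converted to image coordinates
    let topLeft = observation.topLeft.toImageCoordinates(imageSize, origin: .upperLeft)
    print("Top-left corner: \(topLeft)")
    
    // 3. The bounding box, converted to image coordinates
    let box = observation.boundingBox.toImageCoordinates(imageSize, origin: .upperLeft)
    print("Bounding box: \(box)")
}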

This request comes in handy for tasks like highlighting the document or extracting the region containing it directly, as in the example below:

import SwiftUI
import Vision
import CoreImage.CIFilterBuiltins

struct ContentView: View {

    @State private var image: CGImage?
    
    var body: some View {
        VStack {
            if let image = image {
                Image(uiImage: UIImage(cgImage: image))
                    .resizable()
                    .scaledToFit()
            } else {
                Image("document")
                    .resizable()
                    .scaledToFit()
            }
            
            Button(action: {
                self.highlightDocument()
            }, label: {
                Text("Highlight Document")
            })
        }
        .padding()
    }
    
    private func highlightDocument() {
        Task {
            guard let uiImage = UIImage(named: "document"),
                  let observation = try await detectDocument(image: uiImage),
                  let highlightedDocument = try await applyFilter(startImage: uiImage, observation: observation) else { return }
            
            self.image = highlightedDocument
        }
    }
    
    // Detect the document segmentation
    private func detectDocument(image: UIImage) async throws -> DetectedDocumentObservation? {
        // The image to perform the detection on
        guard let image = CIImage(image: image) else { return nil }
        
        do {
            // Set up the Request
            let request = DetectDocumentSegmentationRequest()
            
            // Perform the request
            guard let observation = try await request.perform(on: image) else { return nil }
            return observation
        } catch {
            print("Encountered an error when performing the request: \(error.localizedDescription)")
        }
        
        return nil
    }
    
    // Apply a mauve color to the parts of the image that are not included in the detected document
    private func applyFilter(startImage: UIImage, observation: DetectedDocumentObservation) async throws -> CGImage? {
        // 1. The CIImage of original image and the CGImage from the observation
        guard let image = CIImage(image: startImage) else { return nil }
        let maskCGImage = try observation.globalSegmentationMask.cgImage
        
        // 2. The size of the original image
        let originalExtent = image.extent
        
        // 3. The CIImage from the mask, scaled to match the original image
        let ciMaskImage = CIImage(cgImage: maskCGImage).transformed(by: CGAffineTransform(
            scaleX: originalExtent.width / CGFloat(maskCGImage.width),
            y: originalExtent.height / CGFloat(maskCGImage.height)
        ))
        
        // 4. Create a mauve background
        let mauveBackground = CIImage(color: CIColor(red: 1.0, green: 0.5, blue: 1.0))
            // 4.a Crop it to match the size of the original image
            .cropped(to: image.extent)
        
        // 5. Composite the original image over the mauve background using the mask
        let blendFilter = CIFilter.blendWithMask()
        blendFilter.inputImage = image
        blendFilter.backgroundImage = mauveBackground
        blendFilter.maskImage = ciMaskImage
        
        // 6. Render the composite image
        let context = CIContext()
        
        guard let outputImage = blendFilter.outputImage,
              let cgImage = context.createCGImage(outputImage, from: outputImage.extent) else { return nil }
        
        return cgImage
    }
}

In this SwiftUI app, a button triggers the detection and filtering process, allowing users to differentiate the document from its background for enhanced visual readability.

It first extracts the document’s segmentation mask and then applies a mauve background to non-document areas using Core Image filters. The processed image replaces the original in the UI.
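
For the extracting variant mentioned earlier, a hedged sketch below crops the original image to the detected bounding box. It assumes the same toImageCoordinates(_:origin:) conversion and relies on CIImage sharing Vision's lower-left coordinate origin.

import Vision
import CoreImage

func cropDocument(from ciImage: CIImage) async throws -> CGImage? {
    // Detect the document in the image
    let request = DetectDocumentSegmentationRequest()
    guard let observation = try await request.perform(on: ciImage) else { return nil }
    
    // Convert the normalized bounding box to pixel coordinates,
    // keeping the lower-left origin that CIImage also uses
    let rect = observation.boundingBox.toImageCoordinates(ciImage.extent.size, origin: .lowerLeft)
    
    // Crop to the document region and render it as a CGImage
    let context = CIContext()
    return context.createCGImage(ciImage.cropped(to: rect), from: rect)
}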