How to recognize text in images with Vision in Swift

Issue #1041

The Vision framework has provided text recognition since iOS 13 and macOS 10.15. If you have ever needed to extract text from a screenshot, a photo of a receipt, or a scanned document, this is the tool to reach for. Starting with iOS 18 and macOS 15, Apple introduced a redesigned Swift-native API that works directly with structured concurrency, making the implementation considerably cleaner.

The older approach

The original API is built around VNRecognizeTextRequest and VNImageRequestHandler. You create a request (optionally with a completion handler), hand it to the handler, and call perform. The results land on the request itself, and in the closure if you supplied one.

import Vision

func recognizeText(in image: CGImage) -> String? {
    let request = VNRecognizeTextRequest()
    request.automaticallyDetectsLanguage = true // available since iOS 16
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image, orientation: .up, options: [:])
    do {
        try handler.perform([request])
        // results is already typed as [VNRecognizedTextObservation]?,
        // so no downcast is needed.
        let observations = request.results ?? []
        return observations
            .flatMap { $0.topCandidates(1) }
            .map(\.string)
            .joined(separator: "\n")
    } catch {
        return nil
    }
}

One thing worth noting: VNImageRequestHandler.perform(_:) is synchronous. The completion handler on VNRecognizeTextRequest fires during perform, so the results are available on request.results right after the call returns. Bridging this to async/await using withCheckedContinuation is tempting but fragile. If perform throws after the completion has already fired, you end up calling resume twice, which crashes at runtime.
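
If you do need the old API under async/await, say to support earlier OS versions, a safer bridge is to skip the continuation entirely and run the synchronous call off the current actor. A minimal sketch, assuming CGImage's Sendable conformance in recent SDKs; the helper name is mine:

import Vision

// Bridges the synchronous API to async/await without a continuation:
// run perform(_:) in a detached task and read the results once it returns.
func recognizeTextInBackground(in image: CGImage) async throws -> String {
    try await Task.detached(priority: .userInitiated) {
        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .accurate
        let handler = VNImageRequestHandler(cgImage: image, orientation: .up, options: [:])
        try handler.perform([request]) // synchronous; results are set on return
        return (request.results ?? [])
            .flatMap { $0.topCandidates(1) }
            .map(\.string)
            .joined(separator: "\n")
    }.value
}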

The modern API

iOS 18 and macOS 15 introduced RecognizeTextRequest, a Swift-native struct that conforms to Sendable and integrates directly with async/await. There is no completion handler and no separate handler object.

import Vision

@available(macOS 15, iOS 18, *)
func recognizeText(in image: CGImage) async -> String? {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.automaticallyDetectsLanguage = true

    do {
        let observations = try await request.perform(on: image)
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        return lines.isEmpty ? nil : lines.joined(separator: "\n")
    } catch {
        return nil
    }
}

perform(on:) is truly asynchronous and throws on failure, so error handling fits naturally into the do/catch pattern. The request is a value type, so you configure it before performing rather than passing callbacks around.
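
The new request types also provide perform(on:) overloads for other inputs. As a sketch, assuming the URL-based overload, recognizing text straight from an image file looks like this (the helper name is mine):

import Vision

// Reads an image from disk and recognizes its text in one call,
// using the URL-based perform(on:) overload.
@available(macOS 15, iOS 18, *)
func recognizeText(at fileURL: URL) async throws -> String {
    var request = RecognizeTextRequest()
    request.automaticallyDetectsLanguage = true
    let observations = try await request.perform(on: fileURL)
    return observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}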

Recognition accuracy and language detection

Both the old and new APIs share the same recognition levels. .accurate runs a more thorough analysis and is the right default when correctness matters more than latency. .fast trades some precision for speed, which can matter when processing many images in sequence or handling live video frames.
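
To make the trade-off concrete, here is a sketch of sequential batch recognition with the fast level; the helper name and the per-image joining are my own choices:

import Vision

// Recognizes text in a batch of images sequentially, favoring speed.
@available(macOS 15, iOS 18, *)
func recognizeAll(in images: [CGImage]) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .fast // trade precision for throughput
    var transcripts: [String] = []
    for image in images {
        let observations = try await request.perform(on: image)
        transcripts.append(
            observations
                .compactMap { $0.topCandidates(1).first?.string }
                .joined(separator: "\n")
        )
    }
    return transcripts
}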

Setting automaticallyDetectsLanguage to true lets Vision choose the appropriate model based on what it finds in the image. If you know the language in advance, you can set recognitionLanguages instead, which can improve accuracy for specific scripts. For the new API, supportedRecognitionLanguages is an async property that returns the languages available on the current device.

@available(macOS 15, iOS 18, *)
func availableLanguages() async throws -> [Locale.Language] {
    try await RecognizeTextRequest().supportedRecognitionLanguages
}
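
Conversely, if you know the content is in a single language, pinning the model can help. A minimal sketch; the German locale is just an example:

import Vision

// Restricts recognition to German; recognitionLanguages takes Locale.Language values.
@available(macOS 15, iOS 18, *)
func recognizeGermanText(in image: CGImage) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.automaticallyDetectsLanguage = false
    request.recognitionLanguages = [Locale.Language(identifier: "de-DE")]
    let observations = try await request.perform(on: image)
    return observations.compactMap { $0.topCandidates(1).first?.string }
}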

Working with multi-line documents

topCandidates(1) returns the single highest-confidence string for each observation. An observation corresponds roughly to one line of text. Joining them with newlines gives a reasonable reconstruction of multi-line content, though the order follows the visual layout Vision detects.
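
Each candidate also carries a confidence score, so you can drop lines the model is unsure about. A minimal sketch; the 0.5 cutoff is an arbitrary example:

import Vision

// Keeps only lines whose top candidate clears a confidence threshold.
@available(macOS 15, iOS 18, *)
func confidentLines(in observations: [RecognizedTextObservation]) -> [String] {
    observations.compactMap { observation -> String? in
        guard let candidate = observation.topCandidates(1).first,
              candidate.confidence > 0.5 else { return nil }
        return candidate.string
    }
}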

For more structured documents, Apple later added RecognizeDocumentsRequest in iOS 26 and macOS 26. It returns paragraphs rather than individual lines, which is useful for reading longer prose or documents with clear sectioning.
