Tokenizing text with the Natural Language framework

Learn how to split text into manageable parts using the tokenization capabilities of the Natural Language framework.

Tokenization is the process of breaking a piece of text down into smaller units, or tokens, which can be words, sentences, or even characters. In natural language processing (NLP), tokenization is typically the first step in analyzing textual data, as it extracts meaningful parts of the text for further analysis.

Tokenizing text into individual words or sentences makes it simpler to analyze sentiment, identify the language, or categorize the content. Tokenizing by word also enables more accurate query matching in search engines.

The following sample code defines a function that tokenizes the input text into sentences and prints them.

import Foundation
// 1.
import NaturalLanguage

// 2.
func getTokens(from text: String) -> [String]? {
    // 3.
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    // Set the language explicitly so the tokenization rules match the input.
    tokenizer.setLanguage(.english)

    var tokenizedText = [String]()

    // 4.
    tokenizer.enumerateTokens(in: text.startIndex ..< text.endIndex) { tokenIndexRange, _ in
        tokenizedText.append(String(text[tokenIndexRange]))
        // Return true to keep enumerating the remaining tokens.
        return true
    }

    return tokenizedText.isEmpty ? nil : tokenizedText
}
  1. Import the NaturalLanguage framework.
  2. Define a function that takes a string as a parameter and returns an optional array of strings.
  3. Set up an NLTokenizer that breaks the text down into sentences by choosing the appropriate NLTokenUnit; you could also tokenize by word, paragraph, or document. Then create an empty array named tokenizedText to store the resulting tokens (sentences).

    With setLanguage(_:) you can set the language explicitly, or determine it dynamically from the input for language-dependent tokenization, as shown in the sketch below.
  4. Loop through the sentence tokens with the tokenizer's enumerateTokens(in:using:) method, appending each token to tokenizedText.

In the end, the function returns the array tokenizedText if it's not empty, or nil if no tokens are detected.
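
If the input language isn't known ahead of time, one option is to detect it before tokenizing. The following is a minimal sketch that combines NLLanguageRecognizer with the tokenizer; the function name detectedLanguageTokens(from:) is illustrative, not part of the framework.

import NaturalLanguage

// A hypothetical variant of getTokens(from:) that detects the input language first.
func detectedLanguageTokens(from text: String) -> [String]? {
    // Determine the dominant language of the input text.
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)

    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    // Fall back to English when no dominant language can be determined.
    tokenizer.setLanguage(recognizer.dominantLanguage ?? .english)

    var tokenizedText = [String]()
    tokenizer.enumerateTokens(in: text.startIndex ..< text.endIndex) { tokenIndexRange, _ in
        tokenizedText.append(String(text[tokenIndexRange]))
        return true
    }
    return tokenizedText.isEmpty ? nil : tokenizedText
}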

let text = "Friends, Romans, countrymen, lend me your ears. I come to bury Caesar, not to praise him. The evil that men do lives after them. The good is oft interred with their bones…"

if let tokenizedText = getTokens(from: text) {
    // Print the tokenized text
    print("TOKENIZED TEXT:")
    tokenizedText.forEach { textToken in
        print("- \(textToken)")
    }
}

The NLTokenizer allows you to tokenize text at various levels of granularity. These are the different kinds of NLTokenUnit you can use:

  1. word: This breaks the text down into individual words. It's useful when you want to analyze the text at the word level, such as building a word frequency counter (see the sketch after this list) or creating a word cloud.
  2. sentence: This breaks the text down into complete sentences. It's helpful for tasks like sentiment analysis, where context matters and it's essential to evaluate the text in coherent chunks rather than individual words.
  3. paragraph: This divides the text into paragraphs. It's handy when dealing with larger blocks of text, like articles or reports, where you want to process each paragraph as a unit.
  4. document: This treats the entire text input as a single token. It can be helpful for high-level analysis, such as topic modeling, where the whole document is analyzed to extract themes or patterns.
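
For instance, the word unit makes it straightforward to build the word frequency counter mentioned above. Here is a small sketch; the function name wordFrequencies(in:) is illustrative.

import NaturalLanguage

// Count how often each word appears in the input text.
func wordFrequencies(in text: String) -> [String: Int] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text

    var counts = [String: Int]()
    tokenizer.enumerateTokens(in: text.startIndex ..< text.endIndex) { tokenIndexRange, _ in
        // Lowercase each token so "The" and "the" count as the same word.
        counts[text[tokenIndexRange].lowercased(), default: 0] += 1
        return true
    }
    return counts
}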

Once the text is tokenized, you can use the Natural Language framework alongside Create ML to train and deploy custom natural language models. Tokenization helps ensure that the text is split into manageable units for training better models.
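
As a rough sketch of that workflow, assuming a macOS target with Create ML available and a hypothetical labeled dataset reviews.json containing "text" and "label" columns, training a text classifier could look like this:

import CreateML
import Foundation

// Load a hypothetical labeled dataset with "text" and "label" columns.
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "reviews.json"))
let (training, testing) = data.randomSplit(by: 0.8, seed: 5)

// Train a text classifier; Create ML tokenizes the text column internally.
let classifier = try MLTextClassifier(trainingData: training,
                                      textColumn: "text",
                                      labelColumn: "label")

// Evaluate on the held-out rows and export the trained model.
let metrics = classifier.evaluation(on: testing, textColumn: "text", labelColumn: "label")
print("Classification error: \(metrics.classificationError)")
try classifier.write(to: URL(fileURLWithPath: "TextClassifier.mlmodel"))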