Enumerating Strings By Sentence

In order to properly test Table View Cells With Varying Row Heights I needed a good source of input text. My intention was to have some suitable text that I could feed into the table view, with each sentence filling one table view cell. I decided to use a version of Huckleberry Finn from Project Gutenberg, but that left me with the problem of parsing the input text into sentences.

Stanford Natural Language Processing Library

The approach I started with was the Stanford Core Natural Language Processing (CoreNLP) tools. This is a set of Java packages that can parse raw text, tokenizing it into words and sentences. Rather than using the Java tools directly I installed the stanford-core-nlp Ruby gem (I have Ruby 1.9.3 and Java 1.6 installed). A short script was then all that was required to read in the raw text files, annotate them with the CoreNLP tools and output the resulting sentences one line at a time. The quick and dirty Ruby script I used to process each of the chapter files is included below:

require "stanford-core-nlp"
StanfordCoreNLP.log_file = 'log.txt'

# Load the tokenizer and sentence-splitter annotators once.
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)

ARGV.each do |filename|
  begin
    File.open(filename, "rb") do |file|
      # Wrap the raw text in an annotation and run the pipeline over it.
      text = StanfordCoreNLP::Annotation.new(file.read)
      pipeline.annotate(text)
      # Output each detected sentence on its own line.
      text.get(:sentences).each do |sentence|
        puts sentence
      end
    end
  rescue => err
    puts err
  end
end

Some extra manipulation in BBEdit with regular expression search and replace commands surrounded each sentence with \<string\>\</string\> tags so that the result could be pasted into a plist file in the Xcode project. You can judge the results for yourself from the file I included in the Huckleberry project I previously posted to my GitHub repository. In spite of how well it worked, it seems a bit of a cheat to rely on Java and Ruby to produce data for an iOS project.
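For reference, the BBEdit grep search and replace was something along these lines, assuming the script output has one sentence per line (the exact pattern here is illustrative rather than taken from the original workflow):

Find:    ^(.+)$
Replace: <string>\1</string>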

Text Analysis with Cocoa

It turns out that both Mac OS X and iOS have had some sophisticated text analysis capability for some time. In the case of iOS, the method we need was added way back in iOS 4.0. In hindsight, I would have to conclude that using Cocoa to solve this problem would have been easier than using the CoreNLP library.

A quick snippet of code should be sufficient to explain how it works. We start by reading a file, added to the project bundle, containing the first chapter of the raw text (UTF-8 encoded) downloaded from Project Gutenberg:

NSError *error = nil;
NSURL *fileURL = [[NSBundle mainBundle] URLForResource:@"001"
                                        withExtension:@"txt"];
NSString *textInput = [NSString
                       stringWithContentsOfURL:fileURL
                       encoding:NSUTF8StringEncoding
                       error:&error];

Next, allocate a mutable array to hold the sentences we will parse from the input text:

NSMutableArray *sentences = [[NSMutableArray alloc] init];

We will also need the set of whitespace and newline characters to clean up the results:

NSCharacterSet *whiteSpaceSet = [NSCharacterSet 
                whitespaceAndNewlineCharacterSet];

Now comes the cool part: using enumerateSubstringsInRange:options:usingBlock: it is trivial to iterate over the input text one substring at a time, with the options parameter specifying the type of substring we are interested in. Specifying NSStringEnumerationBySentences is what we need in this case to get each sentence. Other options include NSStringEnumerationByLines, NSStringEnumerationByParagraphs and NSStringEnumerationByWords.

[textInput enumerateSubstringsInRange:
           NSMakeRange(0, [textInput length])
           options:NSStringEnumerationBySentences
           usingBlock:^(NSString *substring, 
                        NSRange substringRange,
                        NSRange enclosingRange,
                        BOOL *stop) {
  NSString *sentence = [substring 
      stringByTrimmingCharactersInSet:whiteSpaceSet];
  if ([sentence length] > 0) {
    [sentences addObject:sentence];
  }
}];
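The other enumeration options work in exactly the same way. As a quick sketch (not part of the Huckleberry project), counting the words in the same input might look something like this, where wordCount is just an illustrative variable name:

__block NSUInteger wordCount = 0;
[textInput enumerateSubstringsInRange:
           NSMakeRange(0, [textInput length])
           options:NSStringEnumerationByWords
           usingBlock:^(NSString *substring,
                        NSRange substringRange,
                        NSRange enclosingRange,
                        BOOL *stop) {
  // Each substring is a single word with surrounding
  // punctuation and whitespace already stripped.
  wordCount++;
}];
NSLog(@"%lu words", (unsigned long)wordCount);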

Each time the block is called, the substring parameter contains the next sentence. After stripping any leading or trailing whitespace and skipping any blank results, we add the sentence to the array of sentences. Finally, once the whole input text has been parsed, we write the array of processed sentences to a plist file in the application's Documents directory:

NSFileManager *fileManager = [NSFileManager defaultManager];
NSURL *documentDirectory = [fileManager
       URLForDirectory:NSDocumentDirectory
       inDomain:NSUserDomainMask
       appropriateForURL:NULL
       create:YES
       error:&error];
NSURL *plistURL = [documentDirectory
                   URLByAppendingPathComponent:@"001.plist"];
[sentences writeToURL:plistURL atomically:YES];
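To feed the table view, the plist can then be read straight back into an array whenever it is needed. A minimal sketch, assuming the file was written as above (the loadedSentences name is just for illustration):

// Read the sentences back from the plist written above.
NSArray *loadedSentences = [NSArray arrayWithContentsOfURL:plistURL];
NSLog(@"Loaded %lu sentences", (unsigned long)[loadedSentences count]);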

Comparing Results

The text of Huckleberry Finn provides a tough test for any natural language parser. Consider the second sentence of the first chapter:

That book was made by Mr. Mark Twain, and he told the truth, mainly.

The Stanford CoreNLP library has no problem with this, but the Cocoa approach stumbles over the full stop in “Mr.”, splitting the sentence into two:

That book was made by Mr.
Mark Twain, and he told the truth, mainly.

An even tougher challenge is a sentence such as the following:

Directly I could just barely hear a “me-yow! me-yow!” down there.

Neither approach does a good job on this sentence; both parse it into three sentences as follows:

Directly I could just barely hear a "me-yow!
me-yow!"
down there.

You don’t see this in the posted Xcode project data as I cheated and manually cleaned up some of the first chapter. It did not really matter for my purposes, since I just wanted some sentences of varying length for testing.

For more details on analysing text on both Mac OS X and iOS, I recommend watching Session 215, Text and Linguistic Analysis, from the WWDC 2012 videos (developer account required).