Dec 03

We often need to parse text and have various classes at our disposal to help us. For example, String.split() is often used to break up a string on a given delimiter.

But what if we want to break up a paragraph of text into sentences? We could split on the period (.) character, but this is unreliable because a period may also occur within a sentence (in an abbreviation or a decimal number, for example). Different languages may also use a different character to mark the end of a sentence.

The BreakIterator class solves this problem for us, providing implementations for breaking a string into sentences, words, lines or characters. The following example shows its usage for breaking a paragraph into sentences.

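// Requires: import java.text.BreakIterator;
BreakIterator bi = BreakIterator.getSentenceInstance();
bi.setText(text);
int index = 0;
// next() returns the offset of the next sentence boundary, or DONE at the end of the text
while (bi.next() != BreakIterator.DONE) {
    String sentence = text.substring(index, bi.current());
    System.out.println("Sentence: " + sentence);
    index = bi.current();
}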
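
The same pattern works for the other break types. As a rough sketch (the class name WordSplitter and the method words() are made up for this example, and filtering on the first character is just one simple way to drop the whitespace and punctuation segments), here is how the word instance could be used to collect the words in a string:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
class WordSplitter {

    // Collects the words in text using the word-break rules for the given locale.
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String segment = text.substring(start, end);
            // The word iterator also returns spaces and punctuation as segments,
            // so keep only the segments that start with a letter or digit.
            if (Character.isLetterOrDigit(segment.charAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, How, are, you]
        System.out.println(words("Hello, world! How are you?", Locale.ENGLISH));
    }
}
```

Passing the Locale explicitly means the result does not depend on the default locale of the machine.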

written by objects

Jul 30

The following code splits a string into chunks, each of length ‘nchars’ (except possibly the last chunk, which may be shorter).

int len = string.length();
for (int i = 0; i < len; i += nchars)
{
    // The last chunk may be shorter than nchars, so clamp the end index to len.
    String part = string.substring(i, Math.min(len, i + nchars));
}
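
The loop above only computes each part. If we want to keep the parts, one possible way is to wrap the loop in a small helper that collects them in a list. This is only a sketch: the class name Chunker and the method chunk() are made up for this example, and rejecting a non-positive ‘nchars’ is my own assumption about sensible behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the JDK.
class Chunker {

    // Splits string into parts of nchars characters; the last part may be shorter.
    static List<String> chunk(String string, int nchars) {
        if (nchars <= 0) {
            // Assumption: a non-positive chunk size is a caller error
            // (it would otherwise make the loop below run forever).
            throw new IllegalArgumentException("nchars must be positive");
        }
        List<String> parts = new ArrayList<>();
        int len = string.length();
        for (int i = 0; i < len; i += nchars) {
            parts.add(string.substring(i, Math.min(len, i + nchars)));
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [abc, def, gh]
        System.out.println(chunk("abcdefgh", 3));
    }
}
```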

Another, more exotic option uses a regular expression and the split() method. The number of dots in the regexp controls the length of each part.

// Splits string into 3 character long pieces.
String[] parts = string.split("(?<=\\G...)");

See how to fill a string with a character for how to dynamically generate a regexp for any number of characters. Though I’d probably go with the first method.
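
As an alternative to generating a run of dots, the regexp can probably be built with a counted quantifier instead. The sketch below swaps the "..." for ".{n}". The class name RegexChunker and the method chunk() are made up for this example. Note that the dot does not match line terminators by default, so this version is only meant for single-line strings.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the JDK.
class RegexChunker {

    // Splits string into nchars long pieces using a lookbehind anchored at \G,
    // the position where the previous match ended.
    static String[] chunk(String string, int nchars) {
        if (nchars <= 0) {
            // Assumption: a non-positive chunk size is a caller error.
            throw new IllegalArgumentException("nchars must be positive");
        }
        // For nchars = 3 this builds (?<=\G.{3}), equivalent to (?<=\G...)
        String regex = "(?<=\\G.{" + nchars + "})";
        return string.split(regex);
    }

    public static void main(String[] args) {
        // prints [abc, def, gh]
        System.out.println(Arrays.toString(chunk("abcdefgh", 3)));
    }
}
```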

written by objects

Aug 25

Typically String.split() or the StringTokenizer class is used to break up a string into tokens. The problem with these methods is that they do not handle quoted text as you may require.

For example, consider the following string:

The mans name was "Big Fred"

Splitting this on spaces with split() or StringTokenizer would give us 6 tokens: (The) (mans) (name) (was) ("Big) (Fred"). Typically this is not what we want.

This is where the StreamTokenizer class comes in handy as it gives better control over the parsing process including identifying quoted text. The following example shows its usage:

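// Requires: import java.io.StreamTokenizer; import java.io.StringReader;
// Note that nextToken() can throw an IOException.
String s = "The mans name was \"Big Fred\"";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.quoteChar('"');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
    // sval holds the text of a word or quoted string (it is null for numbers)
    System.out.println(st.sval);
}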

Now we get the 5 tokens as required: (The) (mans) (name) (was) (Big Fred)
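
The example above relies on every token being a word or a quoted string. By default StreamTokenizer also parses numbers, which end up in nval rather than sval, so a slightly more defensive version might check the token type. The following is only a sketch: the class name QuotedTokenizer and the method tokenize() are made up for this example, and turning numbers and single characters back into strings is just one way to handle them.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the JDK.
class QuotedTokenizer {

    // Breaks s into tokens, keeping double-quoted text together as one token.
    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.quoteChar('"');
        List<String> tokens = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_WORD || type == '"') {
                    // Words and double-quoted strings are returned in sval.
                    tokens.add(st.sval);
                } else if (type == StreamTokenizer.TT_NUMBER) {
                    // Numbers are parsed as doubles and returned in nval.
                    tokens.add(String.valueOf(st.nval));
                } else {
                    // Anything else is an ordinary character; the token type is the character itself.
                    tokens.add(String.valueOf((char) type));
                }
            }
        } catch (IOException e) {
            // Unlikely when reading from a StringReader; rethrown unchecked for convenience.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [The, mans, name, was, Big Fred]
        System.out.println(tokenize("The mans name was \"Big Fred\""));
    }
}
```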

written by objects