Tuning sentence detection
Lots of data will resist the charms of IndoEuropeanSentenceModel, so this recipe provides a starting point for modifying sentence detection to handle new kinds of sentences. Unfortunately, this is a very open-ended area of system building, so we will focus on techniques rather than on likely formats for sentences.
How to do it...
This recipe will follow a well-worn pattern: create evaluation data, set up evaluation, and start hacking. Here we go:
Haul out your favorite text editor and mark up some data; we will stick to the [ and ] markup approach. The following is an example that runs afoul of our standard IndoEuropeanSentenceModel:
[All decent people live beyond their incomes nowadays, and those who aren't respectable live beyond other people's.] [A few gifted individuals manage to do both.]
We will put the preceding example in data/saki.sentDetected.txt and run it:
java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar: com.lingpipe.cookbook.chapter5.EvaluateAnnotatedSentences...
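To make the [ and ] markup concrete, here is a minimal sketch of how such an annotated file could be read back into plain text plus sentence spans, which is the reference data an evaluation needs. It is not the cookbook's EvaluateAnnotatedSentences code: the class name BracketAnnotations and the helper type Parsed are hypothetical, and the sketch assumes that brackets never occur in the text itself and are never nested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe or the cookbook jar.
public class BracketAnnotations {

    /** Plain text plus the [start, end) character span of each annotated sentence. */
    public static final class Parsed {
        public final String text;
        public final List<int[]> spans;

        Parsed(String text, List<int[]> spans) {
            this.text = text;
            this.spans = spans;
        }

        /** Returns the i-th annotated sentence as a substring of the plain text. */
        public String sentence(int i) {
            int[] span = spans.get(i);
            return text.substring(span[0], span[1]);
        }
    }

    /** Strips the brackets and records where each bracketed sentence falls. */
    public static Parsed parse(String annotated) {
        StringBuilder text = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        int open = -1; // offset in the stripped text where the current sentence began
        for (int i = 0; i < annotated.length(); i++) {
            char c = annotated.charAt(i);
            if (c == '[') {
                if (open >= 0) {
                    throw new IllegalArgumentException("nested '[' at " + i);
                }
                open = text.length();
            } else if (c == ']') {
                if (open < 0) {
                    throw new IllegalArgumentException("unmatched ']' at " + i);
                }
                spans.add(new int[] {open, text.length()});
                open = -1;
            } else {
                text.append(c); // text between sentences is kept as-is
            }
        }
        if (open >= 0) {
            throw new IllegalArgumentException("unclosed '['");
        }
        return new Parsed(text.toString(), spans);
    }

    public static void main(String[] args) {
        String annotated = "[All decent people live beyond their incomes nowadays, "
            + "and those who aren't respectable live beyond other people's.] "
            + "[A few gifted individuals manage to do both.]";
        Parsed parsed = parse(annotated);
        for (int i = 0; i < parsed.spans.size(); i++) {
            int[] span = parsed.spans.get(i);
            System.out.println(span[0] + "-" + span[1] + ": " + parsed.sentence(i));
        }
    }
}
```

Keeping character offsets rather than just the sentence strings makes it straightforward to compare these reference spans against whatever spans a sentence detector proposes for the same stripped text.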