Your interests/skills: Deep Learning, Natural Language Processing
Consider the following example text, which consists of two sentences:
Prename M. Lastname (brn. 1930 - ?) was a U.S. citizen working for A.B.C. Inc. She owned buildings Pl. Someplace Nos. 1-3, e.g. No. 1.
This is an invented example, but each of its difficult cases happens to appear in some Wikipedia article. Neither spaCy's sentence segmenter nor NLTK's performs the correct split after “Inc.” The two segmenters split the text into 6 and 9 sentences, respectively, which amounts to 5 and 8 wrong splits.
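This comparison can be reproduced with a few lines of Python. The following is a minimal sketch; the exact sentence counts depend on the installed library versions and models (here NLTK's default Punkt model and spaCy's en_core_web_sm, which are assumptions, not requirements of the project).

```python
# Compare the default sentence segmenters of NLTK and spaCy on the example text.
# Setup (assumed): pip install nltk spacy; python -m spacy download en_core_web_sm
import nltk
import spacy
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # Punkt model behind sent_tokenize ("punkt_tab" in newer NLTK versions)

text = ("Prename M. Lastname (brn. 1930 - ?) was a U.S. citizen working for "
        "A.B.C. Inc. She owned buildings Pl. Someplace Nos. 1-3, e.g. No. 1.")

nltk_sentences = sent_tokenize(text)
print(len(nltk_sentences), nltk_sentences)

nlp = spacy.load("en_core_web_sm")
spacy_sentences = [sent.text for sent in nlp(text).sents]
print(len(spacy_sentences), spacy_sentences)
```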
We expect that the problem can be solved almost perfectly with machine learning techniques. Furthermore, a learning approach has the advantage that it can be used for multiple languages, simply by training it on different datasets. However, the training dataset has to reflect the difficult cases and must be of appropriate size. As an additional requirement, we want to be able to segment large text corpora (e.g. the entire Wikipedia) in little time, which rules out the use of large machine learning models.
You could begin your work as follows:
- Find existing sentence segmenters (e.g. from NLTK, spaCy, CoreNLP, OpenNLP, Stanza). Play with them and detect cases where the segmentation goes wrong.
- As always: review the research to see which solutions exist already.
- Get labeled training and test data. Check whether you can find existing datasets or whether you have to generate a dataset yourself.
- Find corpora that are already sentencized. Training data can then be generated by concatenating randomly selected sentences (see the data-generation sketch after this list). You have to ensure that enough difficult examples end up in the training data.
- Select difficult examples from Wikipedia and label them manually (only suitable for evaluation, not for training, where you need many examples).
- Implement and test your approach(es).
- As a baseline, see how far you can get with handcrafted rules (see the rule-based sketch after this list).
- Use statistical methods to automatically generate such rules and exceptions (like the “Punkt” tokenizer behind NLTK's sentence segmentation).
- We expect the best segmentations from a (bidirectional) recurrent neural network. However, this will be the slowest model. (Advanced: a Transformer should in principle work equally well or better, and is faster.)
- Little context could be enough to solve the problem, e.g. the 5 characters to the left and right of a potential split position. Therefore, a convolutional or dense neural network, both of which are faster than a recurrent model, could solve the task (see the character-window sketch after this list).
- Can we transform our learned model into a set of rules, or a regular expression? Long training times are acceptable if prediction is fast.
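For the data-generation idea above, a toy sketch could look as follows. The corpus file name and the number of sentences per example are hypothetical choices; the key point is that the gold split positions are known by construction.

```python
# Build a training example by concatenating randomly drawn sentences from an
# already sentencized corpus; the gold boundaries are known by construction.
import random

def make_example(sentences, k=3, sep=" "):
    """Join k random sentences and return the text plus the character offsets
    directly after each gold sentence boundary."""
    chosen = random.sample(sentences, k)
    text, boundaries = "", []
    for i, sentence in enumerate(chosen):
        text += sentence
        if i < k - 1:
            boundaries.append(len(text))  # split position: right after this sentence
            text += sep
    return text, boundaries

# Hypothetical usage, assuming one sentence per line in the corpus file:
# sentences = open("sentencized_corpus.txt").read().splitlines()
# text, gold_boundaries = make_example(sentences)
```

Difficult cases (abbreviations, initials, numbers) then have to be ensured by the choice of corpus or by oversampling sentences that contain them.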
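As a concrete starting point for the rule-based baseline, the following sketch splits after sentence-final punctuation followed by whitespace and an uppercase letter, with a small handcrafted exception list. The abbreviation list is purely illustrative; such lists are never complete, which is exactly where a learned model should help.

```python
# Minimal rule-based baseline: split after '.', '!' or '?' if followed by
# whitespace and an uppercase letter, unless a known abbreviation or a run of
# initials (e.g. "A.B.C.") precedes the split candidate.
import re

NO_SPLIT_AFTER = {"U.S.", "No.", "Nos.", "e.g.", "i.e.", "Pl.", "Dr.", "Prof.", "brn."}  # illustrative

def rule_based_split(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        if last_token in NO_SPLIT_AFTER or re.fullmatch(r"(?:[A-Z]\.)+", last_token):
            continue  # looks like an abbreviation or initials, do not split here
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```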
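The character-window idea from the list can be prototyped with a very small network. Everything below (window width, byte-level vocabulary, layer sizes, PyTorch as the framework) is an illustrative assumption, not a prescription.

```python
# Classify each candidate split position (e.g. every '.', '!' or '?') from the
# few characters to its left and right, using a small convolutional network.
import torch
import torch.nn as nn

WINDOW = 5          # characters on each side of the candidate position
VOCAB_SIZE = 256    # byte-level vocabulary for simplicity

class SplitClassifier(nn.Module):
    def __init__(self, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(hidden * 2 * WINDOW, 1)

    def forward(self, windows):                    # windows: (batch, 2 * WINDOW) character ids
        x = self.embed(windows).transpose(1, 2)    # (batch, embed_dim, 2 * WINDOW)
        x = torch.relu(self.conv(x))               # (batch, hidden, 2 * WINDOW)
        return self.out(x.flatten(1)).squeeze(-1)  # logit: split here or not

def window_at(text, pos):
    """Character ids of the WINDOW characters left and right of position pos, zero-padded."""
    left = text[max(0, pos - WINDOW):pos].rjust(WINDOW, "\0")
    right = text[pos + 1:pos + 1 + WINDOW].ljust(WINDOW, "\0")
    return torch.tensor([min(ord(c), VOCAB_SIZE - 1) for c in left + right])  # clamp non-ASCII

# Hypothetical usage for one candidate position i:
# logit = SplitClassifier()(window_at(text, i).unsqueeze(0))
```

Training such a model on the labels from the generated data (binary cross-entropy over all candidate positions) is straightforward, and inference only touches a handful of characters per candidate, which matters for the speed requirement.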
If the problem turns out to be too easy, we can make it harder by introducing noise (typos, missing spaces, missing punctuation) or by focusing on a less clean domain, for example tweets.
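If that variant is chosen, noise can be injected along the following lines. The noise types and probabilities are arbitrary illustrative choices.

```python
# Inject simple noise (typos, missing spaces, missing periods) into clean text.
import random

def add_noise(text, p_drop_space=0.05, p_drop_period=0.05, p_typo=0.01):
    noisy = []
    for ch in text:
        if ch == " " and random.random() < p_drop_space:
            continue                                          # missing space
        if ch == "." and random.random() < p_drop_period:
            continue                                          # missing punctuation
        if ch.isalpha() and random.random() < p_typo:
            ch = random.choice("abcdefghijklmnopqrstuvwxyz")  # typo
        noisy.append(ch)
    return "".join(noisy)
```

Note that the gold split positions have to be adjusted accordingly when characters are dropped.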
References
Insightful blog post by Grammarly: https://www.grammarly.com/blog/engineering/how-to-split-sentences/
Gillick 2009: Sentence Boundary Detection and the Problem with the U.S. https://www.aclweb.org/anthology/N09-2061.pdf
Kiss and Strunk 2006: Unsupervised Multilingual Sentence Boundary Detection (the “Punkt” tokenizer). https://www.mitpressjournals.org/doi/pdfplus/10.1162/coli.2006.32.4.485
Mikheev 2002: Periods, Capitalized Words, etc. https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120102760275992
Palmer: http://www.sfs.uni-tuebingen.de/~keberle/Lit/ch2.pdf, and Palmer 1997
Read et al. 2012: Sentence Boundary Detection: A Long Solved Problem? https://www.aclweb.org/anthology/C12-2096.pdf