This is text. As I presume you are human, you will understand the words and their meaning. Some words have multiple meanings, like the word "like". Also, as English isn't my native tongue, there will be errors in my writing, but you will understand it anyway. Our brain does a fantastic job of inferring meaning from context. This is far more difficult for machines.
Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris have written a book about the difficulties you might encounter when processing text with machines, and about ways to solve them. Taming Text not only shows you the theory of extracting, searching and classifying information in text but also introduces different open source projects that you can integrate into your application.
Each chapter focuses on one problem space, and most of them can even be read in isolation. You will learn about the difficulties in understanding text, mostly caused by ambiguous meanings and the context words appear in. Tokenization and entity recognition are introduced along with some basics of linguistics. Searching in text is covered well, with details on analysis, the inverted index and the vector space model, which is also important for clustering and classification. Fuzzy string matching, the process of looking up similar strings, is shown using the famous Levenshtein distance, n-grams and tries. Finally, a larger part of the book focuses on text clustering, the unsupervised process of putting documents into clusters, and on classification and categorization, a learning process that needs some precategorized data.
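To give an idea of what fuzzy string matching looks like in code, here is a minimal sketch of the Levenshtein distance and of character n-grams in plain Java. It is not taken from the book; the class name FuzzyMatch and both method names are made up for this illustration, and it uses the textbook dynamic programming approach rather than anything from the libraries mentioned here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only; not code from the book.
final class FuzzyMatch {

    private FuzzyMatch() {
    }

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    // Only two rows of the dynamic programming table are kept in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b
        // takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string
            // takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // The current row becomes the previous row for the next step.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // All overlapping character n-grams of s, in order of appearance.
    // Returns an empty list if s is shorter than n.
    static List<String> charNGrams(String s, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }
}
```

Calling FuzzyMatch.levenshtein("kitten", "sitting") returns 3, and FuzzyMatch.charNGrams("text", 2) returns the bigrams te, ex and xt. Two strings that share many n-grams are likely to be similar, which is why n-grams are often used to find candidates cheaply before computing the more expensive edit distance.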
Throughout the chapters the authors introduce sample applications in Java using one or more of the open source projects that are covered. You will see an application that searches text in Mary Shelley's Frankenstein using Apache Lucene and does entity recognition to identify people and places using Apache OpenNLP. Apache Solr is mostly used for searching, and OpenNLP can do extensive analysis of text like tokenization, sentence detection or part-of-speech tagging. Content is extracted from different file formats using Apache Tika. Text clustering is shown using Carrot² for search result clustering in Solr. Apache Mahout is mainly used for document clustering and classification, with some help from Lucene, Solr and OpenNLP. The final example of the book builds on the knowledge of all the preceding chapters, showing you an example question answering system similar to IBM Watson that accepts natural language questions and tries to give correct answers from a data set extracted from Wikipedia.
This book is exceptional in that it covers many different topics, but the authors manage to combine them in a coherent example. It is one of the books in this year's Jolt Awards for a good reason. If you are doing anything with text, be it searching or analytics, you are advised to get a copy for yourself. I know that I will come back to mine in the future when I need to refresh some of the information.
About Florian Hopf
I work as a freelance software developer and consultant in Karlsruhe, Germany. If you liked this post, you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me, please contact me directly.