FlashText – A library faster than Regular Expressions for NLP tasks

People like me who work in Natural Language Processing almost always come across the task of replacing words in a text. The reasons for replacing words vary. Some of them are:
1. “would’ve” and “would have” mean the same thing, so changing all occurrences of “would’ve” to “would have” is one such task.
2. Changing all case variations to a single form, e.g. Python, pytHon, pYthon, pythoN etc. to python.
3. Changing all the synonyms of a word to a common word, e.g. happy, joyous, delightful etc. to happy.
Now, if the number of words to replace and the size of the text corpus are not huge, i.e. within the thousands, then regular expressions have always been my solution. But as I started working on bigger and bigger datasets, with tens of thousands of documents and sometimes millions, I noticed that performing the above tasks started taking days. That is far more time than anyone wants to invest in a very simple but important task. So earlier, it came down to optimizing both the number of words that needed to be changed and the time required to replace them.
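To make the regular-expression approach concrete, here is a minimal sketch using only Python's standard `re` module. The replacement table and the `normalize` helper are hypothetical names made up for illustration; they simply cover the three kinds of replacement listed above.

```python
import re

# Hypothetical replacement table: contraction, case variants, synonyms.
# Keys are lowercase because lookups are done on the lowercased match.
REPLACEMENTS = {
    "would've": "would have",
    "would’ve": "would have",   # same contraction with a curly apostrophe
    "python": "python",         # matched case-insensitively, so pYthon -> python
    "joyous": "happy",
    "delightful": "happy",
}

# One alternation over all terms, longest first so longer terms win.
# The lookarounds keep matches from starting or ending inside a longer word.
_terms = sorted(REPLACEMENTS, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(t) for t in _terms) + r")(?!\w)",
    re.IGNORECASE,
)

def normalize(text: str) -> str:
    """Replace every listed term with its canonical form."""
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(0).lower()], text)

print(normalize("I would've used pYthon, a joyous choice"))
```

This works well for a small table. The pattern grows with every term added, which is where the approach starts to get slow on large vocabularies and corpora.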
But in early November, I found FlashText, a blazingly fast library that reduces days of replacement computation time to minutes.


Anomaly detection by robust statistics

Real data often contain anomalous cases, also known as outliers. These may spoil the resulting analysis but they may also contain valuable information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from it. We present an overview of several robust methods and the resulting graphical outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimating location and scatter, linear regression, principal component analysis, classification, clustering, and functional data analysis. The challenging new topic of cellwise outliers is also introduced.
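As a small illustration of the "fit the majority, then flag what deviates" idea in the univariate case, the sketch below uses the median as a robust location estimate and the median absolute deviation (MAD) as a robust scale estimate. This is one common textbook choice, not necessarily the procedure the overview itself recommends; the function name `mad_outliers`, the cutoff of 3.0, and the sample values are all assumptions made for the example.

```python
from statistics import median

def mad_outliers(values, cutoff=3.0):
    """Return indices of points far from the median in robust-scale units.

    cutoff is an assumed threshold on the robust z-score.
    """
    center = median(values)                        # robust location
    deviations = [abs(v - center) for v in values]
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the bulk of the data is roughly normal.
    scale = 1.4826 * median(deviations)            # robust scale
    if scale == 0:
        # More than half the points are identical: flag anything that differs.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d / scale > cutoff]

# Made-up measurements with one obvious anomaly at index 5.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
print(mad_outliers(data))
```

In this toy sample the classical z-score of 25.0, computed from the mean and standard deviation, is only about 2, because the outlier itself inflates both estimates. The median and MAD are barely affected by it, so the point stands out clearly.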