Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many — chatbots, recommender systems, search, virtual assistants, etc.
So it would be beneficial for budding data scientists to at least understand the basics of NLP, even if their career takes them in a completely different direction. And who knows, some topics extracted through NLP might just give your next model that extra analytical boost. In this post, we look at why topic modeling is important and how it helps us as data scientists.
Topic modeling, just as it sounds, is using an algorithm to discover the topic or set of topics that best describes a given text document. You can think of each topic as a word or a set of words.
Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It bears a lot of similarities with something like PCA, which identifies the key quantitative trends (those explaining the most variance) within your features. The outputs of PCA are a way of summarizing our features; for example, they allow us to go from something like 500 features to 10 summary features. These 10 summary features are basically topics.
NMF:
Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so there is no labeled set of topics for the model to train on. NMF works by decomposing (or factorizing) high-dimensional vectors into a lower-dimensional representation whose coefficients are all constrained to be non-negative.
Using the original matrix (A), NMF gives you two matrices, W and H, such that A ≈ W × H. If A is articles by words (the original document-term matrix), then W is articles by topics (how strongly each article expresses each topic) and H is topics by words (the word weights that define each topic).
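As a quick illustration (not the article's data), here is a minimal sketch of the factorization and the resulting shapes, assuming scikit-learn's NMF and a small random non-negative matrix:

import numpy as np
from sklearn.decomposition import NMF

A = np.random.rand(100, 500)                 # 100 articles x 500 words, stand-in for non-negative counts
model = NMF(n_components=10, init='nndsvd', random_state=0, max_iter=500)
W = model.fit_transform(A)                   # 100 articles x 10 topics
H = model.components_                        # 10 topics x 500 words
print(W.shape, H.shape)                      # (100, 10) (10, 500)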
Getting The Data:
This is one of the most crucial steps in the process. As the old adage goes, ‘garbage in, garbage out’. When dealing with text as our features, it’s critical to reduce the number of unique words (i.e. features), since there will be many of them. This is our first defense against having too many features.
I searched far and wide for an exciting dataset and finally selected the 20 Newsgroups dataset. I’m just being sarcastic — I selected a dataset that is both easy to interpret and load in Scikit Learn. The dataset is easy to interpret because the 20 Newsgroups are known and the generated topics can be compared to the known topics being discussed. Headers, footers, and quotes are excluded from the dataset.
from sklearn.datasets import fetch_20newsgroups
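A minimal loading sketch; the subset argument is my choice, and remove=('headers', 'footers', 'quotes') matches the exclusion described above:

from sklearn.datasets import fetch_20newsgroups

# Load the posts with headers, footers, and quoted replies stripped out.
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data        # list of raw text posts
print(len(documents))              # about 18,000 posts across the 20 groups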
Once the text is processed, we can use it to create features by turning the words into numbers. There are a few different ways to do this; I use word counts as features.
Data Preprocessing:
I use three basic strategies:
1. Tokenization
2. Stemming
3. Lemmatization
First, we remove stopwords: very common words (like ‘the’ and ‘is’) that add noise without carrying much topical signal. Stemming is the process of reducing a word to its word stem by stripping affixes (prefixes and suffixes), while lemmatization reduces a word to its dictionary root, known as the lemma. Both are important in natural language understanding (NLU) and natural language processing (NLP).
Recognizing more forms of a word lets a system retrieve more results: when an inflected form is matched back to its base, search results that would otherwise be missed can be returned. That extra recall is why stemming is integral to search queries and information retrieval.
Often, the best results come from working with the basic morphological form of a word, the lemma. Lemmatization recovers the lemma using a vocabulary and morphological analysis, whereas stemming applies simpler heuristic rules to cut a word down to its base form from whatever inflected form is encountered.
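A rough preprocessing sketch using NLTK; the specific tokenizer, stopword list, stemmer, and lemmatizer here are my assumptions, not necessarily what the original code uses:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # 1. Tokenization: split the raw text into lowercase alphabetic tokens.
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    # 2. Stopword removal: drop very common words with little topical signal.
    tokens = [t for t in tokens if t not in stop_words]
    # 3. Stemming: chop affixes to collapse inflected forms onto a common stem.
    stems = [stemmer.stem(t) for t in tokens]
    # 4. Lemmatization: map each token to its dictionary form (lemma).
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas

print(preprocess("The cats were running faster than the dogs"))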
Next, we build a counts design matrix using scikit-learn’s CountVectorizer module. The transformation returns a matrix of size (documents x features), where the value of each cell is the number of times that feature (word) appears in that document.
To reduce the size of the matrix and speed up computation, we set the maximum feature size to 500, which keeps only the 500 most frequent terms in the corpus.
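A sketch of the vectorization step, reusing the documents list from the loading sketch above; max_features=500 matches the text, and the English stopword list is my addition:

from sklearn.feature_extraction.text import CountVectorizer

# Keep only the 500 most frequent terms so the document-term matrix stays manageable.
vectorizer = CountVectorizer(max_features=500, stop_words='english')
A = vectorizer.fit_transform(documents)    # sparse matrix: documents x 500 word counts
print(A.shape)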
Clustering:
Clustering is the task of dividing the data points into a number of groups such that points in the same group are more similar to each other than to points in other groups.
Here, the documents are partitioned into 20 clusters. Cluster centers are formed and distances are recalculated at each iteration, so that each document ends up assigned to the cluster whose center it is closest to.
We will implement the following five methods, which split the k-means algorithm into manageable parts (a minimal sketch follows the list):
initialise_centroids
assign_clusters
update_centroids
fit_kmeans
predict
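Here is a minimal from-scratch sketch built around those five methods. It is illustrative only (dense NumPy input, Euclidean distance, random initialization, and it assumes no cluster goes empty); the actual implementation lives in the Github repo linked below:

import numpy as np

class KMeansFromScratch:
    def __init__(self, k=20, max_iter=100, seed=0):
        self.k = k
        self.max_iter = max_iter
        self.rng = np.random.default_rng(seed)

    def initialise_centroids(self, X):
        # Pick k distinct documents at random as the starting centroids.
        idx = self.rng.choice(X.shape[0], size=self.k, replace=False)
        return X[idx]

    def assign_clusters(self, X, centroids):
        # Assign each document to its nearest centroid (Euclidean distance).
        dists = np.stack([np.linalg.norm(X - c, axis=1) for c in centroids], axis=1)
        return dists.argmin(axis=1)

    def update_centroids(self, X, labels):
        # Recompute each centroid as the mean of the documents assigned to it.
        return np.array([X[labels == i].mean(axis=0) for i in range(self.k)])

    def fit_kmeans(self, X):
        centroids = self.initialise_centroids(X)
        for _ in range(self.max_iter):
            labels = self.assign_clusters(X, centroids)
            new_centroids = self.update_centroids(X, labels)
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        self.centroids = centroids
        return labels

    def predict(self, X):
        return self.assign_clusters(X, self.centroids)

# Usage sketch: labels = KMeansFromScratch(k=20).fit_kmeans(A.toarray())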
Topic Modelling:
As mentioned previously, these algorithms cannot automatically determine the number of topics; this value must be set when running the algorithm. Comprehensive documentation on the available parameters is provided for NMF. Initializing the W and H matrices with ‘nndsvd’ rather than random initialization reduces the time NMF takes to converge.
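A sketch of how this step might look, reusing the count matrix A and the vectorizer from above; the number of topics (20) follows the dataset, and printing the ten strongest words per topic is my choice:

from sklearn.decomposition import NMF

n_topics = 20
# 'nndsvd' initialization of W and H typically converges faster than random initialization.
nmf = NMF(n_components=n_topics, init='nndsvd', random_state=0, max_iter=400)
W = nmf.fit_transform(A)                   # documents x topics
H = nmf.components_                        # topics x words

feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")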
You can see the resulting topics, topic by topic, in my Github.
I tested the algorithm on the 20 Newsgroups dataset, which contains thousands of newsgroup posts spanning many subjects. Because the main topics discussed in this dataset are known beforehand, I could verify that NMF was identifying them correctly.
The code is quite simple and fast to run. You can find it on my Github. I encourage you to pull it and try it.
Github link: https://github.com/priyanka9707/NLP