  • Priyanka P. Pattnaik

tfidf-matcher: the SUPER-FAST string matching package


If you work in the field of data, you can picture the person who is searching for the correct data in a large dataset. It is crucial and very time-consuming work. In artificial intelligence, getting your data right means you have cleared the first hurdle. This blog is about one of my projects, where I needed to find matching strings in a large database while keeping an eye on the time complexity.


From a layman's point of view, time complexity is the total time taken to get the output of a piece of work. More formally, the time complexity of an algorithm describes how the time the program needs to run to completion grows with the size of its input. It is most commonly expressed in big O notation, an asymptotic notation for that growth.
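To make the idea concrete, here is a minimal sketch of why naive string matching gets slow. It compares every string in one list with every string in another, so the number of comparisons is the product of the two list sizes, i.e. O(n × m). The function name brute_force_match and the sample strings are made up for illustration, and difflib.SequenceMatcher from the standard library is only a stand-in for a fuzzy scorer, not the method used in this project.

```python
from difflib import SequenceMatcher


def brute_force_match(originals, lookups):
    """Compare every original string with every lookup string.

    Returns the best match per original and the number of comparisons made.
    """
    comparisons = 0
    results = {}
    for original in originals:
        best, best_score = None, -1.0
        for candidate in lookups:
            comparisons += 1
            # Similarity ratio between 0 and 1 (stand-in for a fuzzy scorer).
            score = SequenceMatcher(None, original.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
        results[original] = (best, best_score)
    return results, comparisons


# Hypothetical sample data.
originals = ["Tech Mahindra Ltd", "BPUT Odisha"]
lookups = ["Tech Mahindra Limited", "Biju Patnaik University", "CET Bhubaneswar"]

results, comparisons = brute_force_match(originals, lookups)
print(comparisons)  # 2 originals x 3 lookups = 6 comparisons
```

With 10,000 strings on each side the same loop would run 100 million comparisons, which is why the server time grows so quickly on a large dataset.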

In my last blog, I did this work using the fuzzy-wuzzy package. The results were good, but the server took a long time, especially on a large dataset. So I searched for alternatives and found this amazing package.




Before finding matches in a dataset, we need to prepare it, because matching is easier when the data is clean and sorted. So our first task is to deal with the dataset.

  1. Import your dataset using pandas.

  2. Use n-grams to clean the strings and to split each one into contiguous sequences of n items.

  3. Turn the items into a TF-IDF matrix using TfidfVectorizer (from sklearn.feature_extraction.text import TfidfVectorizer).

  4. Fit a K-Nearest Neighbours model to the sparse matrix.

  5. Vectorize the list of strings to be matched and pass it to the KNN model to calculate the cosine distance. With the package, this means import tfidf_matcher as tm and then calling the matcher function with tm.matcher().

  6. Match the results against your lookup data.
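The steps above can be sketched directly with pandas and scikit-learn. This is a minimal illustration under assumptions, not the internal code of tfidf-matcher: the ngrams helper, the sample strings, and the column names are made up, and in practice the data would come from a file via pandas instead of in-memory lists.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def ngrams(text, n=3):
    """Clean a string and split it into overlapping character n-grams."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())          # drop punctuation
    text = " " + re.sub(r"\s+", " ", text).strip() + " "   # pad word edges
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Step 1: load the data (hypothetical in-memory sample instead of a file).
lookup = pd.Series([
    "Tech Mahindra Limited",
    "College of Engineering and Technology",
    "Biju Patnaik University",
])
original = pd.Series(["tech mahindra ltd", "Colege of Engg and Technology"])

# Steps 2-3: n-gram the lookup strings and build the sparse TF-IDF matrix.
vectorizer = TfidfVectorizer(analyzer=ngrams)
lookup_matrix = vectorizer.fit_transform(lookup)

# Step 4: fit a nearest-neighbours model on the sparse matrix.
knn = NearestNeighbors(n_neighbors=1, metric="cosine")
knn.fit(lookup_matrix)

# Step 5: vectorize the strings to be matched and query the model.
distances, indices = knn.kneighbors(vectorizer.transform(original))

# Step 6: map the neighbour indices back to the lookup data.
matches = pd.DataFrame({
    "original": original,
    "match": lookup.iloc[indices[:, 0]].to_numpy(),
    "confidence": 1 - distances[:, 0],  # cosine similarity
})
print(matches)
```

The vectorizer is fitted once on the lookup data and reused for the strings to be matched, so both sides share the same n-gram vocabulary. The confidence column here is simply one minus the cosine distance.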

In my work, I got 9859 matched rows from a dataset in a few seconds, so it is indeed very quick. The matches created with this method are of good quality, and the match ratio gives us a confidence percentage to check for each row. The biggest advantage is speed.


Brought to you by:

COE-AI (CET-BBSR): an initiative by CET-BBSR, Tech Mahindra, and BPUT to provide solutions to real-world problems through ML and IoT
