As a data scientist, machine learning engineer, or even AI researcher, the very first thing you need is access to data. Data is all around us in various forms, such as images, words, sound waves, and so on. Collecting, processing, and working with this data is what gives rise to the fields of machine learning and data science. But the main question is: how can someone start working on a problem statement without data? And even with an idea in hand, how do you start searching for relevant data to work on?
Data can be gathered from many places, but broadly speaking, there are three main sources you can get it from:
1 - Open-source data repositories such as Kaggle, Google Public Data Explorer, etc.
2 - Private sources owned by certain individuals or organizations.
3 - Scraping data from public domains and building your own dataset.
Let's dive into each of these categories:
Publicly available datasets
There are many datasets available publicly on the internet. They are free and open-source, and you can use them to build your own projects and models, or simply to practice on. Some examples of portals that host free, open datasets include:
Kaggle - With over 80,000 datasets from various sources, you can find almost any kind of data here, and you can contribute your own as well.
World Bank Open Data
As one of the world's most comprehensive repositories of data on what is happening in different countries, World Bank Open Data is a vital source of open data. It also provides access to other datasets, which are listed in its data catalog.
Many more datasets can be found online; you can check a list here.
Private sources
Many datasets are available only through private individuals or organizations. These organizations grant access to their data either for free or on a paid basis, depending on the type of data and the organization you are requesting it from. Examples include requesting medical data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, extracting map data through the Google Maps or Azure Maps APIs, and so on. The right source varies with the kind of task you are doing, and such sources can usually be found by googling for the relevant information.
Scraping Data
In many cases, though, data won't be available to you directly in the form of a ready-made dataset. You can still find the data online from various sources, but getting it into a workable format is another matter. This is where data scraping comes into play. Data is available in abundance across the internet, and scraping and wrangling it forms a major part of a data scientist's role. Examples of such sources include NASA's Earthdata for satellite-based data, as well as e-commerce websites such as Amazon, Flipkart, and so on. You can scrape data from almost any online page using Python or another language, or even a web crawler/scraper, and then start working on the data with ease. Let's dive into an example.
Scraping customer review data from an Amazon product
Let's take the example of a smartphone, the MI Poco M2 Pro (Renewed). Our task will be to extract relevant information from its customer review page and build a dataset from it. We start by importing the necessary libraries.
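A minimal sketch of the imports, assuming the requests and beautifulsoup4 packages are installed:

```python
# Library for making HTTP requests to the web page
import requests

# Library for parsing and navigating the returned HTML
from bs4 import BeautifulSoup
```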
The requests library lets us request the contents of a web page, while BeautifulSoup gives us the functionality to parse the extracted HTML and pull the necessary information out of it.
Now, we provide the URL of the web page we want to scrape, use requests to fetch the page, and parse its HTML with BeautifulSoup.
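A sketch of how this step might look; the URL below is a hypothetical placeholder for the product's review page, and the browser-like User-Agent header is an assumption (Amazon often rejects requests that use the default headers):

```python
# Hypothetical URL standing in for the product's customer review page
URL = "https://www.amazon.in/product-reviews/B08696XB4B"

# Browser-like User-Agent header (assumption: many sites block default request headers)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Download the page and parse the raw HTML into a navigable soup object
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Print the parsed HTML to confirm the extraction worked
print(soup.prettify())
```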
The above code does exactly what we described: it fetches the page and extracts the data as parsed HTML. Now we can start pulling relevant information out of it.
Say we want to find the names of all the people who have posted a review on the given web page. To do that, we again use BeautifulSoup; the code below shows how to achieve that:
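A sketch of that step; the CSS class name a-profile-name is an assumption based on Amazon's review markup at the time and may change:

```python
# Find all elements that hold reviewer names
# (the class name is an assumption about Amazon's markup and may change over time)
name_tags = soup.find_all("span", class_="a-profile-name")

# Pull out the text and drop the first two entries, which duplicate the
# "Top positive review" and "Top critical review" boxes
reviewer_names = [tag.get_text(strip=True) for tag in name_tags][2:]
print(reviewer_names)
```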
As we can see, we are able to filter out just the users' names from the extracted data. We slice the list because two of the users have their reviews featured in the "Top positive review" and "Top critical review" boxes in addition to appearing in the overall review list, as you can see by visiting the URL we are using here.
In a similar manner, we can extract the other fields, such as "Review Title", "Review Body", "Review Star Rating", and so on. Finally, we can combine everything we have gathered into a single dataset.
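A sketch of how the pieces could be combined with pandas, building on the snippets above; the class names used for titles, bodies, and ratings are again assumptions about Amazon's review markup:

```python
import pandas as pd

# Extract review titles, bodies, and star ratings using their (assumed) class
# names in Amazon's review markup; slice off the two featured reviews as before
titles  = [t.get_text(strip=True) for t in soup.find_all("a", class_="review-title")][2:]
bodies  = [b.get_text(strip=True) for b in soup.find_all("span", class_="review-text")][2:]
ratings = [r.get_text(strip=True) for r in soup.find_all("i", class_="review-rating")][2:]

# Combine everything into a single tabular dataset
reviews_df = pd.DataFrame({
    "Reviewer Name": reviewer_names,
    "Review Title": titles,
    "Review Body": bodies,
    "Review Star Rating": ratings,
})

# Save the dataset for later use and preview it
reviews_df.to_csv("poco_m2_pro_reviews.csv", index=False)
print(reviews_df.head())
```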
And there we go: using simple Python and a little web scraping, we were able to extract data from a web page and turn it into a usable format, much like the publicly available datasets we saw earlier.
At the end of the day, it all depends on the task you are trying to achieve; based on that, you will have to search for and find the data relevant to your problem statement. There are also other ways to scrape data, using frameworks such as Scrapy, ParseHub, and so on.
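For instance, a minimal Scrapy spider for the same kind of task might look like the sketch below; the URL and CSS selector are the same hypothetical placeholders/assumptions used earlier:

```python
import scrapy

class ReviewSpider(scrapy.Spider):
    name = "reviews"
    # Hypothetical URL for the review page to crawl
    start_urls = ["https://www.amazon.in/product-reviews/B08696XB4B"]

    def parse(self, response):
        # The CSS class is an assumption about the page's markup
        for reviewer in response.css("span.a-profile-name::text").getall():
            yield {"reviewer": reviewer}
```

Saved as reviews_spider.py, it can be run with `scrapy runspider reviews_spider.py -o reviews.json` to dump the scraped items to a JSON file.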