"Correcting the Corrupted" from the Big Data _ DATA CLEANING !

Jaya Darshana Singam
Sep 25, 2020

How easy it is to ask Google, Alexa or Siri about the day's weather, traffic updates, navigation, podcasts, the latest application updates, reminders and much more. And of course, we depend on this technology to some extent to keep up with our daily schedules. But imagine what would happen if we got incorrect information?

To avoid being misled by inaccurate or wrong details and facts, data must be cleaned before it is fed to ML, DL or statistical models, because a model built on faulty data is of little use.

DATA CLEANING is the procedure of correcting or removing inaccurate and corrupt data from a data set, including a very large one (Big Data).

That gives a brief idea of what data cleaning is and why it is important.

So, how is data cleaning done, where is it applied, and what is Big Data?

" BIG DATA " - is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Machine Learning, Deep Learning and other statistical models and algorithms are typically applied to such data instead. [google]

Data with many cases (rows) offers greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data was originally associated with three key concepts: Volume, Variety and Velocity. Variability and Veracity are often added to that list.

" DATA CLEANING with numpy and pandas Libraries " :

The main job of data cleaning is to identify the cells in a dataset (.csv/Excel file) that are inappropriate, missing (empty) or not suited to the kind of information the column is supposed to hold.

Step 1: Identify the cells containing NaN (Not a Number), NA (Not Available) or N/A (no answer / not applicable), and mark each of them as "True" (the data truly is missing or inappropriate).
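Step 1 can be sketched with pandas roughly as follows. The sample data, the column names and the extra "na" / "Na" markers are assumptions made up for illustration; they are not from any real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a real .csv file.
raw = io.StringIO(
    "city,temp_c,humidity\n"
    "Pune,31,N/A\n"
    "Delhi,,40\n"
    "Chennai,33,na\n"
)

# pandas already treats empty cells, "NaN", "NA" and "N/A" as missing.
# Lowercase "na" and "Na" are not in its default list, so they are
# passed explicitly through na_values (an assumption about this data).
df = pd.read_csv(raw, na_values=["na", "Na"])

# True marks a missing cell, False marks a cell that holds a value.
missing_mask = df.isnull()
print(missing_mask)
```

The resulting table has the same shape as the dataset, with True wherever a value is missing.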

Step 2: Count the number of missing values in a specific column, and plot a heatmap of the dataset for a visual representation (easy identification of missing values).
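A minimal sketch of the counting part of Step 2, again on a made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing cells.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Chennai", "Kochi"],
    "temp_c": [31.0, np.nan, 33.0, np.nan],
    "humidity": [np.nan, 40.0, 55.0, 60.0],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of missing values in one specific column.
missing_in_temp = int(df["temp_c"].isnull().sum())

# Number of missing values in the whole dataset.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
```

For the heatmap, the boolean table returned by df.isnull() can be handed to a plotting library. One common choice, assuming seaborn is installed, is seaborn.heatmap(df.isnull()).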

Step 3: Remove every column that contains NaN or N/A cells.
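Step 3 might look like this in pandas. The data is hypothetical, and the thresh variant is an extra option shown for comparison, not something the steps above require.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing cells.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Chennai", "Kochi"],
    "temp_c": [31.0, np.nan, 33.0, np.nan],
    "humidity": [np.nan, 40.0, 55.0, 60.0],
})

# axis=1 drops columns (not rows) that contain at least one missing value.
cleaned = df.dropna(axis=1)

# Softer variant: keep a column only if it has at least 3 non-missing values.
mostly_complete = df.dropna(axis=1, thresh=3)

print(list(cleaned.columns))
print(list(mostly_complete.columns))
```

Dropping whole columns is simple but can throw away a lot of good data, which is why filling the gaps is often preferred when only a few cells are missing.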

Step 4: Fill in the missing values. This can be done in several ways:

a. Forward Fill - propagates the last valid observation forward into the missing cell

b. Back Fill - propagates the next valid value backward into the missing cell

c. Interpolate - fills the missing cell with a value estimated from its neighbours; with linear interpolation and a single gap, this is the average of the previous and next values

d. Replace - or else, a chosen value can be written directly into the cell that has to be corrected.
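The filling options in Step 4 can be compared side by side on a small made-up series. The values and the fill constant 0.0 are assumptions chosen only to make the differences visible.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing cells.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# a. Forward fill: copy the last valid value forward.
forward = s.ffill()

# b. Back fill: copy the next valid value backward.
backward = s.bfill()

# c. Interpolate: linear by default, so a single gap gets the
#    average of its two neighbours.
linear = s.interpolate()

# d. Replace: write a chosen value into every missing cell.
constant = s.fillna(0.0)

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "interpolate": linear,
    "fillna(0)": constant,
}))
```

Which option fits best depends on the data: forward and back fill suit values that change slowly, interpolation suits ordered numeric data such as time series, and a fixed replacement suits cases where a sensible default is known.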

" VARIOUS APPLICATIONS WHERE BIG DATASETS ARE APPLICABLE TO ML, AI, STATISTICAL AND DL MODELS "

Web data

Transactional data

Sensor data

Data in the news

Spatiotemporal data

Time-stamped data

Personal Data

" SUMMARY " -

Data cleaning is one of the most essential and basic practices for any Big Data or other data set that is going to be fed into models to build applications, prediction models and so on.

REFERENCES:

Google Photos
https://realpython.com/python-data-cleaning-numpy-pandas/

CoE in Artificial Intelligence
