HealthTech Research

Data Cleaning and Preprocessing for Beginners

When our team’s project scored first in the text subtask of this year’s CALL Shared Task challenge, one of the key components of our success was careful preparation and cleaning of data.

Data cleaning and preparation is the most critical first step in any data project. Surveys of practitioners consistently report that data scientists spend the majority of their time, by some estimates up to 70%, on cleaning and preparing data.

In this blog post, we’ll guide you through these initial steps of data cleaning and preprocessing in Python, starting from importing the most popular libraries to actual encoding of features.

Step 1. Loading the data set

Importing libraries

The very first thing you need to do is import the libraries for data preprocessing. There are many libraries available, but the most popular and important Python libraries for working with data are NumPy, pandas, and Matplotlib. NumPy is the library used for numerical and mathematical operations. pandas is the best tool available for importing and managing data sets. Matplotlib (matplotlib.pyplot) is the library for building charts.

To make it easier for future use, you can import these libraries with a shortcut alias:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Loading data into pandas

Once you have downloaded your data set and saved it as a .csv file, you need to load it into a pandas DataFrame so you can explore it and perform some basic cleaning tasks, such as removing information you don't need that would make data processing slower.

Usually, such tasks include:

my_dataset = pd.read_csv('data/my_dataset.csv', skiprows=1, low_memory=False)
my_dataset = my_dataset.drop(['url'], axis=1)
half_count = len(my_dataset) // 2  # keep columns that are at least half populated
my_dataset = my_dataset.dropna(thresh=half_count, axis=1)

It's also good practice to give the filtered data set a different name to keep it separate from the raw data. This ensures you still have the original data in case you need to go back to it.
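As a minimal, self-contained sketch of these loading-and-filtering steps, here a tiny hand-built DataFrame stands in for a downloaded .csv file (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw data standing in for a downloaded .csv file.
raw_data = pd.DataFrame({
    "url": ["a.com", "b.com", "c.com", "d.com"],
    "age": [34, 45, None, 29],
    "income": [None, None, None, None],   # mostly empty column
    "approved": [1, 0, 1, 0],
})

# Work on a separately named copy so the raw data stays intact.
half_count = len(raw_data) // 2  # keep columns that are at least half populated
my_dataset = (
    raw_data
    .drop(["url"], axis=1)               # drop a column we don't need
    .dropna(thresh=half_count, axis=1)   # drop mostly empty columns
)
```

After this runs, `my_dataset` keeps only `age` and `approved`, while `raw_data` is left untouched for later reference.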

Step 2. Exploring the data set

Understanding the data

Now you have your data set loaded, but you should still spend some time exploring it and understanding what feature each column represents. Such a manual review of the data set is important to avoid mistakes in the data analysis and the modelling process.

To make the process easier, you can create a DataFrame with the names of the columns, data types, the first row’s values, and description from the data dictionary.
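A small sketch of building such a summary DataFrame (the example columns are invented; a real data dictionary would supply the descriptions):

```python
import pandas as pd

# Hypothetical data set for illustration.
df = pd.DataFrame({
    "age": [34, 45, 29],
    "income": [55000.0, 72000.0, 48000.0],
    "approved": [1, 0, 1],
})

# One row per column: its name, its data type, and the first row's value.
column_summary = pd.DataFrame({
    "name": df.columns,
    "dtype": df.dtypes.values,
    "first_value": df.iloc[0].values,
})
print(column_summary)
```

Scanning this summary side by side with the data dictionary makes mismatched types and suspicious first values easy to spot.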

As you explore the features, pay attention to any column that is poorly formatted, requires additional preprocessing to be usable, or contains information from the future, since these issues can hurt your analysis if handled incorrectly.

You should also pay attention to data leakage, which can cause the model to overfit: the model would also be learning from features that won't be available when we use it to make predictions. We need to be sure our model is trained using only the data it would have at the point of a loan application.
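One simple defence against leakage is to list and drop the columns that would not exist at prediction time. A sketch with invented loan columns (`total_payments_received` is a hypothetical example of a value known only after the loan is issued):

```python
import pandas as pd

# Hypothetical loan data for illustration.
loans = pd.DataFrame({
    "loan_amount": [5000, 12000, 7500],
    "credit_score": [690, 720, 650],
    "total_payments_received": [5400, 0, 8100],  # known only AFTER the loan is issued
    "approved": [1, 0, 1],
})

# Columns unavailable at application time leak the outcome; drop them before training.
leaky_columns = ["total_payments_received"]
training_data = loans.drop(columns=leaky_columns)
```

The data dictionary is the place to check which columns are recorded before versus after the event you are predicting.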

Deciding on a target column

With the filtered data set explored, you need to create a matrix of features (the independent variables) and a vector of the target (the dependent variable). First, decide which column to use as the target for modelling, based on the question you want to answer. For example, if you want to predict the development of cancer, or the chance that a credit application will be approved, you need to find the column with the status of the disease or of the loan grant and use it as the target column.

For example, if the target column is the last one, you can create the feature matrix by typing:

X = dataset.iloc[:, :-1].values

The first colon (:) means that we want to take all the rows in our data set, and :-1 means that we want to take all of the columns except the last one. The .values at the end means that we want the underlying values as a NumPy array. […]
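Putting this together on a tiny invented data set, the same iloc pattern also yields the target vector by selecting only the last column:

```python
import pandas as pd

# Hypothetical data set; the target column ("approved") is last.
dataset = pd.DataFrame({
    "age": [34, 45, 29],
    "income": [55000, 72000, 48000],
    "approved": [1, 0, 1],
})

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, the last column only
```

`X` is then a 3x2 NumPy array of features and `y` a length-3 array of labels, ready to hand to a model.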
