
Data Preprocessing


- Overview

The raw data that you get directly from your sources is rarely in the format you need to perform analysis on. There are two main goals in the data preprocessing step.

The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis. A very important part of data preparation is addressing quality issues in your data. Real-world data is messy.

In order to address data quality issues effectively, knowledge about the application is important, such as how the data was collected, the user population, and the intended uses of the application. This domain knowledge is essential for making informed decisions about how to handle incomplete or incorrect data.

The second part of preparing data is to manipulate the cleaned data into the format needed for analysis. This step is known by many names: data manipulation, data preprocessing, data wrangling, and even data munging. Common operations in this step include scaling, transformation, feature selection, and dimensionality reduction.
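As a brief illustration, here is a minimal pandas sketch of a few of these operations (cleaning, imputation, and scaling); the column names and values are hypothetical placeholders, not taken from any particular dataset.

    import pandas as pd

    # Hypothetical raw data; column names and values are placeholders.
    raw = pd.DataFrame({
        "age": [25, 32, None, 41],
        "salary": [50000, 64000, 58000, None],
        "country": ["CA", "US", "CA", "UK"],
    })

    # Cleaning: drop rows where every value is missing, fill numeric gaps with the median.
    clean = raw.dropna(how="all").copy()
    clean["age"] = clean["age"].fillna(clean["age"].median())
    clean["salary"] = clean["salary"].fillna(clean["salary"].median())

    # Transformation (scaling): rescale salary to a 0-1 range as a new feature.
    clean["salary_scaled"] = (clean["salary"] - clean["salary"].min()) / (
        clean["salary"].max() - clean["salary"].min()
    )
    print(clean)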

 

- The Tools and Libraries for Data Preprocessing in ML

Tools and libraries for data preprocessing in ML include:

  • Python Libraries, such as Pandas
  • R Libraries, such as dplyr
  • OpenRefine
  • Apache Spark
  • RapidMiner
  • WEKA
  • KNIME
  • Orange

 

- Steps in Data Preprocessing in ML

Data preprocessing in ML is a critical step that improves data quality and facilitates the extraction of meaningful insights from data.

Data preprocessing in ML refers to the techniques of preparing (cleaning and organizing) raw data to make it suitable for building and training ML models. In short, data preprocessing in ML is a data mining technique that transforms raw data into an understandable and readable format.

When creating an ML model, data preprocessing is the first step that marks the start of the process. Real-world data is often incomplete, inconsistent, inaccurate (it contains errors or outliers), and frequently lacks specific attribute values or trends.

This is where data preprocessing comes into the picture - it helps to clean, format and organize raw data so that it is ready for machine learning models.

Here are the seven important steps of data preprocessing in machine learning:

  

- Step 1: Acquire the Dataset

Acquiring a dataset is the first step in data preprocessing in ML. To build and develop ML models, you must first acquire relevant datasets. 

The dataset will consist of data collected from a number of different sources and combined in an appropriate format.

Dataset formats vary by use case. For example, a commercial dataset will be completely different from a medical dataset. Business datasets will contain relevant industry and business data, while medical datasets will contain healthcare-related data.


- Step 2: Import All the Key Libraries

Python is the most widely used language for this work and a favorite of data scientists around the world, and its predefined libraries can perform specific data preprocessing jobs.

Importing all key libraries is an important step in data preprocessing for ML.

The three core Python libraries used for data preprocessing in ML are listed below, followed by a minimal import sketch:

  • NumPy - NumPy is the foundational package for scientific computing in Python. It is used to perform mathematical operations in your code and provides support for large multidimensional arrays and matrices.
  • Pandas - Pandas is an excellent open source Python library for data manipulation and analysis. It is widely used to import and manage datasets. It includes high-performance, easy-to-use data structures and data analysis tools for Python.
  • Matplotlib - Matplotlib is a Python 2D plotting library for drawing a wide variety of charts. It produces publication-quality figures across platforms (IPython shell, Jupyter notebooks, web application servers, etc.) in a variety of hardcopy formats and interactive environments.
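A minimal import sketch follows; the aliases np, pd, and plt are conventional but not required.

    # Conventional imports for data preprocessing in Python.
    import numpy as np               # numerical arrays and mathematical operations
    import pandas as pd              # tabular data structures (DataFrame) and dataset I/O
    import matplotlib.pyplot as plt  # plotting and visual inspection of the data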
 

- Step 3: Importing the Dataset

In this step, you need to import the dataset collected for the ML project at hand. Importing datasets is one of the important steps in data preprocessing in ML.
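For example, a CSV file can be loaded with pandas as sketched below. The file name data.csv and the assumption that the target variable sits in the last column are placeholders for whatever dataset the project actually uses.

    import pandas as pd

    # Hypothetical file name; replace with the actual dataset path.
    dataset = pd.read_csv("data.csv")

    # A common convention: features (independent variables) in X, target in y.
    X = dataset.iloc[:, :-1].values   # all columns except the last
    y = dataset.iloc[:, -1].values    # the last column as the target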

 

- Step 4: Identifying and Handling Missing Values

In data preprocessing, it is crucial to identify and properly handle missing values; otherwise, you may draw inaccurate conclusions and inferences from the data. Needless to say, this can get in the way of your ML project.

Basically, there are two ways of dealing with missing data:

  • Deleting Particular Rows – In this method, you delete specific rows with null values, or entire columns where more than 75% of the values are missing. However, this method is not 100% effective, and it is recommended only when the dataset has enough samples. You also have to make sure that removing the data does not introduce bias.
  • Calculating the Mean – This method is useful for features with numeric data such as age, salary, or year. Here you calculate the mean, median, or mode of the feature that contains missing values and replace the missing values with the result. This approach preserves the sample size and counteracts the data loss of the first approach, although it can distort the variance of the dataset. Another option is to approximate a missing value from neighboring values (interpolation), which works best for linear data. Both deletion and mean imputation are sketched in the code after this list.
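Both approaches can be sketched with pandas and scikit-learn; the DataFrame below is a hypothetical example, and SimpleImputer's strategy can also be set to "median" or "most_frequent".

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical data with missing values (NaN).
    df = pd.DataFrame({"age": [25, np.nan, 41, 35],
                       "salary": [50000, 64000, np.nan, 58000]})

    # Option 1: delete rows that contain any missing value.
    dropped = df.dropna()

    # Option 2: replace missing values with the column mean.
    imputer = SimpleImputer(strategy="mean")
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)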
 

- Step 5: Encoding the Categorical Data

Categorical data refers to information in a data set that has a specific category. Machine learning models are primarily based on mathematical equations. 

So you can intuitively see that keeping categorical data in the equation causes problems, since the equations work only with numbers. Categorical values therefore need to be encoded numerically, for example with label encoding or one-hot encoding.
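One common encoding technique is one-hot encoding, sketched below with pandas get_dummies on a hypothetical country column; label encoding and scikit-learn's OneHotEncoder are alternatives.

    import pandas as pd

    # Hypothetical data with one categorical column.
    df = pd.DataFrame({"country": ["CA", "US", "CA", "UK"],
                       "salary": [50000, 64000, 58000, 61000]})

    # One-hot encoding: each category becomes its own 0/1 indicator column.
    encoded = pd.get_dummies(df, columns=["country"])
    print(encoded)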

 

- Step 6: Splitting the Dataset

Splitting datasets is the next step in data preprocessing in machine learning. Every dataset for a machine learning model must be split into two separate sets - a training set and a test set.

The training set is the subset of the dataset used to train the machine learning model; here, you already know the expected output. The test set is the subset used to evaluate the trained model by having it predict outcomes on data it has not seen.

Typically, datasets are split in a 70:30 or 80:20 ratio. This means you use 70% or 80% of the data to train the model and hold back the other 30% or 20% for testing. The exact split varies depending on the shape and size of the dataset in question.
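With scikit-learn, an 80:20 split can be written as in the sketch below; the feature matrix X and target vector y are randomly generated placeholders standing in for a real dataset.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical feature matrix (100 rows, 3 features) and binary target vector.
    X = np.random.rand(100, 3)
    y = np.random.randint(0, 2, size=100)

    # 80% of the rows go to training, 20% are held back for testing.
    # random_state fixes the shuffle so the split is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)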

 

- Step 7: Feature Scaling

Feature scaling marks the end of data preprocessing in machine learning. It is a method of standardizing the independent variables of a data set within a certain range. 

In other words, feature scaling limits the range of variables so that you can compare them on common ground.   
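Two common scaling techniques are standardization (zero mean, unit variance) and min-max normalization (rescaling each feature to a fixed range such as 0 to 1). The scikit-learn sketch below shows both on hypothetical data; note that each scaler is fit on the training set only and then applied to the test set, so no information leaks from the test data.

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Hypothetical train/test feature matrices (columns: age, salary).
    X_train = np.array([[25, 50000], [32, 64000], [41, 58000]], dtype=float)
    X_test = np.array([[35, 61000]], dtype=float)

    # Standardization: zero mean, unit variance (fit on training data only).
    std = StandardScaler()
    X_train_std = std.fit_transform(X_train)
    X_test_std = std.transform(X_test)

    # Min-max normalization: rescale each feature to the [0, 1] range.
    mm = MinMaxScaler()
    X_train_mm = mm.fit_transform(X_train)
    X_test_mm = mm.transform(X_test)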

 

[More to come ...]

 

 

 