
Feature Engineering: From Raw Data to Training Set


- Overview

Feature engineering, a core part of data preprocessing, is the process of transforming raw data into features that can be used to develop machine learning (ML) models.

Feature engineering involves: 

  • Extracting and transforming variables from raw data
  • Selecting, combining, and crafting attributes that capture the relationships between variables
  • Adding, deleting, combining, or mutating data to improve ML model training


Feature engineering helps:

  • Increase the model's accuracy on new, unseen data
  • Enhance the predictive power of ML models, leading to better overall performance


Feature engineering can draw on raw data such as price lists, product descriptions, and sales volumes.

Some examples of features in a dataset include:

  • Numerical features: numerical values such as height, weight, and so on
  • Categorical features: values drawn from a set of classes or categories, such as gender or color; these are often one-hot encoded, as sketched below
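
As a minimal sketch (using pandas, with illustrative column names), one-hot encoding converts a categorical feature into numerical indicator columns:

  import pandas as pd

  # Illustrative dataset with one numerical and one categorical feature
  df = pd.DataFrame({
      "height_cm": [170.0, 182.5, 165.3],  # numerical feature
      "color": ["red", "blue", "red"],     # categorical feature
  })

  # One-hot encode the categorical column into numerical indicator columns
  encoded = pd.get_dummies(df, columns=["color"])
  print(encoded)  # columns: height_cm, color_blue, color_red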


Principal component analysis (PCA) is a feature engineering technique for dimensionality reduction. It involves:

  • Standardizing data
  • Computing the covariance matrix
  • Performing eigenvalue decomposition

 

- Principal Component Analysis

Principal component analysis (PCA) is a statistical method that summarizes large data tables into a smaller set of "summary indices". These indices can be more easily visualized and analyzed. 

PCA is a dimensionality reduction method that transforms a large set of variables into a smaller one. The smaller set still contains most of the information in the large set. 

The new axes that capture this variation are called principal components (PCs): PC1 captures the most variation in the data, and PC2 the second most.

The outcome of PCA can be visualized in low-dimensional scatterplots, ideally with only minimal loss of information.

Here are the main steps for solving PCA problems (a worked sketch follows the list):

  • Standardize the dataset
  • Find the eigenvalues and eigenvectors of the covariance matrix
  • Arrange the eigenvalues in descending order
  • Form the feature vector from the top eigenvectors
  • Transform the original dataset onto the new axes
  • Reconstruct the data from the retained components
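
As a minimal sketch of these steps, assuming a small illustrative NumPy array as the raw data:

  import numpy as np

  # Illustrative raw data: rows are samples, columns are variables
  X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                [1.9, 2.2], [3.1, 3.0]])

  # 1. Standardize the dataset (zero mean, unit variance per column)
  X_std = (X - X.mean(axis=0)) / X.std(axis=0)

  # 2. Eigenvalues and eigenvectors of the covariance matrix
  cov = np.cov(X_std, rowvar=False)
  eigvals, eigvecs = np.linalg.eigh(cov)

  # 3. Arrange eigenvalues (and their eigenvectors) in descending order
  order = np.argsort(eigvals)[::-1]
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  # 4. Form the feature vector from the top k eigenvectors
  k = 1
  W = eigvecs[:, :k]

  # 5. Transform the original dataset onto the principal components
  X_pca = X_std @ W

  # 6. Reconstruct an approximation of the standardized data
  X_rec = X_pca @ W.T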

 
PCA can be based on either the covariance matrix or the correlation matrix. The new variables (the PCs) depend on the dataset, rather than being pre-defined basis functions.
PCA is a popular unsupervised algorithm that has been used across many applications, including data analysis, data compression, de-noising, and dimensionality reduction.

 

- Feature Extraction

Feature extraction refers to the process of converting raw data into processable numerical features while retaining the information in the original dataset. It often produces better results than applying machine learning directly to raw data.

There are two main methods of performing feature extraction: manual and automatic. 

  • Manual feature extraction: applying domain knowledge and human intuition to select or design features suited to the problem. For example, image processing techniques can be used to extract edges, corners, or regions of interest from images. Manual feature extraction can be effective and highly customized, but it can also be labor-intensive and subjective.
  • Automatic feature extraction: using ML algorithms to learn features from data without human intervention. For example, principal component analysis (PCA) can reduce the dimensionality of data by finding the directions of maximum variation. Automatic feature extraction can be efficient and objective, but it can also be complex and opaque (see the sketch below).
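
As a brief illustration of the automatic case, a scikit-learn sketch on synthetic data (the array shape and component count are assumptions chosen for illustration):

  import numpy as np
  from sklearn.decomposition import PCA

  # Synthetic high-dimensional data: 100 samples, 10 raw variables
  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 10))

  # Automatically extract 3 features along the directions of maximum variation
  pca = PCA(n_components=3)
  features = pca.fit_transform(X)

  print(features.shape)                 # (100, 3)
  print(pca.explained_variance_ratio_)  # share of variance captured per component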

 


- Feature Engineering for ML and Data Analytics

Feature engineering is an ML process that transforms raw data into features that better represent a predictive model's underlying problem. It is often the first step in developing a predictive model, and it can be time-consuming and error-prone.

In contrast, deep learning (DL) algorithms can process large raw datasets without explicit feature engineering, because the neural network creates features automatically as it learns. However, DL algorithms still require careful preprocessing and cleaning of the input data.

Feature engineering can help improve a model's accuracy on unseen data by selecting the most useful predictor variables. It can also help make DL models easier to interpret by removing non-predictive features and building more predictive ones.

Some examples of numerical features include: age, height, weight, and income.

Some examples of categorical features include: gender, color, and zip code.

Some techniques for feature engineering, a few of which are sketched below, include:

  • Target encoding: replacing a categorical value with a statistic (such as the mean) of the target variable for that category, which can help capture the relationship between categories and the target
  • Imputation: filling in missing values, for example with the mean, median, or mode of a column
  • Dropping rows or columns: a simple way to handle missing values, although there is no universally optimal threshold for dropping
  • Replacing outliers: outliers can be replaced with the mean, median, or other quantile values
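
A minimal pandas sketch of three of these techniques, using illustrative column names and synthetic values:

  import pandas as pd

  df = pd.DataFrame({"income": [35000, None, 42000, 1_000_000, 39000]})

  # Imputation: fill the missing value with the column median
  df["income"] = df["income"].fillna(df["income"].median())

  # Replacing outliers: clip values beyond the 5th/95th percentiles
  low, high = df["income"].quantile([0.05, 0.95])
  df["income"] = df["income"].clip(lower=low, upper=high)

  # Target encoding: replace each category with the mean target per category
  df2 = pd.DataFrame({"city": ["a", "b", "a"], "target": [1, 0, 1]})
  df2["city_enc"] = df2["city"].map(df2.groupby("city")["target"].mean())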

 

- Feature Selection and Extraction

To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is often a desirable preprocessing step for data mining.

Feature selection and extraction are two approaches to dimension reduction.

  • Feature selection: Selecting the most relevant attributes
  • Feature extraction: Combining attributes into a new reduced set of features


Feature selection and feature extraction are both machine learning techniques for handling irrelevant and redundant features. The main difference is that feature selection keeps a subset of the original features, while feature extraction creates new features from the original data. 

Unlike feature selection, which selects and retains the most significant attributes, feature extraction actually transforms the attributes. The transformed attributes, or features, are typically linear combinations of the original attributes (as in PCA). The feature extraction process results in a much smaller but richer set of attributes.
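
To make the contrast concrete, a short scikit-learn sketch on synthetic data: selection keeps three of the original columns, while extraction builds three new features as linear combinations of all ten:

  from sklearn.datasets import make_classification
  from sklearn.decomposition import PCA
  from sklearn.feature_selection import SelectKBest, f_classif

  # Synthetic classification data: 200 samples, 10 original features
  X, y = make_classification(n_samples=200, n_features=10, random_state=0)

  # Feature selection: keep the 3 most relevant original attributes
  selected = SelectKBest(f_classif, k=3).fit_transform(X, y)

  # Feature extraction: create 3 new features as linear combinations
  extracted = PCA(n_components=3).fit_transform(X)

  print(selected.shape, extracted.shape)  # (200, 3) (200, 3)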

 

[More to come ...]

 

 

 