Feature Engineering: From Raw Data to Training Set
- Overview
Feature engineering is a preprocessing step in machine learning (ML) and statistical modeling that transforms raw data into a more effective set of inputs for training and prediction.
The goal of feature engineering is to improve the performance of ML algorithms by selecting the most relevant aspects of the data for the predictive task and model type.
Feature engineering involves extracting and transforming variables from raw data, such as price lists, product descriptions, and sales volumes. This process may include:
- Data preparation: Manipulating and consolidating raw data from different sources into a standardized format
- Designing features: Creating the input variables, sometimes called dimensions, that a predictive model uses to generate predictions
Some common types of features include:
- Numerical: Values with numeric types, such as age, salary, or height
- Categorical: Features that can take one of a limited number of values, such as gender, color, or T-shirt size
- Binary: A special case of categorical features with only two categories, such as is_smoker or has_subscription
- Text: Features that contain textual data
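As a rough illustration, the four feature types above might be converted into a single numeric vector like this; the record, field names, and encodings below are all invented for the sketch:

```python
# A hedged sketch: turning one raw record with the four common feature
# types into a numeric vector. The record and categories are made up.
record = {
    "age": 34,                      # numerical
    "tshirt_size": "M",             # categorical
    "has_subscription": True,       # binary
    "bio": "Likes hiking and ML",   # text
}

SIZES = ["S", "M", "L"]             # known categories, in a fixed order

features = []
features.append(float(record["age"]))                       # numeric: use as-is
features.extend(1.0 if record["tshirt_size"] == s else 0.0  # categorical: one-hot
                for s in SIZES)
features.append(1.0 if record["has_subscription"] else 0.0) # binary: 0/1
features.append(float(len(record["bio"].split())))          # text: crude word count

print(features)  # [34.0, 0.0, 1.0, 0.0, 1.0, 4.0]
```

Real pipelines would use richer text representations than a word count (e.g. bag-of-words or embeddings), but the shape of the transformation is the same.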
Please refer to the following for more information:
- Wikipedia: Feature Engineering
- Features
In artificial intelligence (AI), a "feature" is a measurable property or characteristic of data used as input to a machine learning (ML) model. Essentially, it is a specific piece of information extracted from raw data that helps the model understand and classify something, such as age, gender, or color in an image recognition system.
Features are the inputs that a ML algorithm uses to learn and make predictions based on the data provided. Raw data usually needs to be transformed into features through a process called "feature engineering" to be suitable for ML models.
Only relevant features that have a significant impact on the prediction should be used.
- Data Feature Engineering
Data feature engineering, often performed as part of data preprocessing, is the process of transforming raw data into features that can be used to develop machine learning (ML) models.
Features are variables that can be defined and observed. For example, in a healthcare setting, features might include gender, height, weight, resting heart rate, or blood sugar level.
Feature engineering involves:
- Extracting and transforming variables from raw data
- Selecting, combining, and crafting attributes that capture the relationships between variables
- Adding, deleting, combining, or mutating data to improve ML model training
Feature engineering helps:
- Increase the model's accuracy on new, unseen data
- Enhance the predictive power of ML models
- Lead to better performance and greater accuracy
Feature engineering can involve raw data such as price lists, product descriptions, and sales volumes.
Some examples of features in a dataset include:
- Numerical features: Numerical values such as height, weight, and so on
- Categorical features: Values from a limited set of classes/categories, such as gender, color, and so on
Feature engineering is the process of reworking datasets to improve ML model training. By adding, deleting, combining, or mutating data in the dataset, data scientists can expertly customize training data to ensure that the resulting ML models meet their business use cases.
Data scientists use feature engineering to prepare input data sets that are best suited to support the intended business purpose of ML algorithms.
For example, one approach involves handling outliers. Because outliers fall well outside the expected range, they can hurt the accuracy of the model's predictions. A common way to deal with outliers is pruning: simply removing them so they do not contaminate the training data.
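A minimal sketch of pruning, assuming the common 1.5×IQR fence rule (the sales figures below are invented):

```python
# Prune outliers: drop points outside the 1.5*IQR fences around the
# quartiles. The data is a toy example with one obvious outlier.
import statistics

sales = [12, 14, 15, 13, 16, 14, 15, 200]     # 200 is an obvious outlier

q1, _, q3 = statistics.quantiles(sales, n=4)  # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

pruned = [x for x in sales if low <= x <= high]
print(pruned)  # [12, 14, 15, 13, 16, 14, 15] -- the 200 is removed
```

Other strategies (winsorizing, capping at a quantile) keep the rows but replace the extreme values instead of dropping them.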
- Feature Extraction
Feature extraction refers to the process of converting raw data into processable numerical features while retaining the information in the original data set. It produces better results than directly applying machine learning (ML) to raw data.
To make predictions and recommendations more accurately, ML involves large data sets that require significant resources to process. Feature extraction is an effective method to reduce the amount of resources required without losing important information. Feature extraction plays a key role in improving the efficiency and accuracy of machine learning models.
There are two main methods of performing feature extraction: manual and automatic.
- Manual feature extraction: Applying domain knowledge and human intuition to select or design features suited to the problem. For example, you can use image processing techniques to extract edges, corners, or regions of interest from images. Manual feature extraction can be effective and highly customized, but it can also be labor-intensive and subjective.
- Automatic feature extraction: Using ML algorithms to learn features from data without human intervention. For example, you can use principal component analysis (PCA) to reduce the dimensionality of your data by finding the directions of maximum variance. Automatic feature extraction can be efficient and objective, but it can also be complex and opaque.
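The PCA approach just described can be sketched directly with NumPy; the data, dimensions, and number of retained components below are invented for illustration:

```python
# A hedged sketch of automatic feature extraction with PCA, implemented
# via the SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 raw features

Xc = X - X.mean(axis=0)                # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                  # keep the 2 directions of largest variance
X_reduced = Xc @ Vt[:k].T              # project onto the top-k principal components

print(X_reduced.shape)  # (100, 2)
```

In practice a library implementation such as scikit-learn's `PCA` would typically be used, but the projection it performs is the one shown here.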
- Feature Engineering for ML and Data Analytics
Feature engineering is a ML process that transforms raw data into features that better represent a predictive model's problem. It's the first step in developing a predictive model, and it can be time-consuming and error-prone.
In contrast, deep learning (DL) algorithms can process large raw datasets without manual feature engineering, as the neural network learns features automatically during training. However, DL algorithms still require careful preprocessing and cleaning of the input data.
Feature engineering can help improve the accuracy of a model on unseen data by selecting the most useful predictor variables for the model. It can also make models easier to interpret by removing non-predictive features and building more predictive ones.
Some examples of numerical features include: age, height, weight, and income.
Some examples of categorical features include: gender, color, and zip code.
Some techniques for feature engineering include:
- Target encoding: Replaces each category with a statistic of the target, capturing the relationship between the category and the target
- Imputation: Fills in missing values, for example with the mean, median, or mode
- Dropping rows or columns: A simple solution for missing values, but there's no optimum threshold for dropping
- Replacing outliers: Can be replaced with mean, median, or any quantile values
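Two of the techniques above can be sketched on toy data (all values and column names here are invented): median imputation for missing values, and target encoding for a categorical column.

```python
# Hedged sketches of imputation and target encoding on invented data.
import statistics
from collections import defaultdict

# Imputation: fill missing heights with the column median.
heights = [170, None, 165, 180, None, 175]       # None marks a missing entry
median = statistics.median([h for h in heights if h is not None])
heights_filled = [h if h is not None else median for h in heights]
print(heights_filled)  # [170, 172.5, 165, 180, 172.5, 175]

# Target encoding: replace each category with the mean of the target
# observed for that category.
colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]                      # e.g. purchased (1) or not (0)
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1
encoding = {c: sums[c] / counts[c] for c in sums}
print(encoding)  # red: ~0.67, blue: 0.5, green: 0.0
```

Note that naive target encoding as shown can leak information from the target into the features; production implementations usually add smoothing or compute the encoding out-of-fold.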
- Feature Selection and Extraction
To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is sometimes a desirable preprocessing step for data mining.
Feature selection and extraction are two approaches to dimension reduction.
- Feature selection: Selecting the most relevant attributes
- Feature extraction: Combining attributes into a new reduced set of features
Feature selection and feature extraction are both machine learning techniques for handling irrelevant and redundant features. The main difference is that feature selection keeps a subset of the original features, while feature extraction creates new features from the original data.
Unlike feature selection, which selects and retains the most significant attributes, feature extraction transforms the attributes. The transformed attributes, or features, are combinations of the original attributes (linear combinations, in methods such as PCA). The feature extraction process results in a much smaller and richer set of attributes.
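The contrast can be sketched on a toy matrix: selection keeps a subset of the original columns, while extraction (here, PCA-style) builds new columns as combinations of all of them. The data, the variance-based selection criterion, and the number of retained features below are all illustrative choices:

```python
# A hedged sketch contrasting feature selection and feature extraction.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[:, 2] *= 0.01                        # make one column near-constant (low variance)

# Feature selection: keep the 2 original columns with the highest variance.
variances = X.var(axis=0)
keep = np.argsort(variances)[-2:]      # indices of the 2 most variable columns
X_selected = X[:, keep]                # still the original features, unchanged

# Feature extraction: build 2 new features as linear combinations (PCA-style).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T            # new features, mixtures of all columns

print(X_selected.shape, X_extracted.shape)  # (50, 2) (50, 2)
```

Both results have the same reduced shape, but the selected features remain directly interpretable, while the extracted ones trade interpretability for capturing variance from every original column.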
[More to come ...]