Training Data
- Overview
In machine learning (ML), training data is the dataset, typically a large one, used to train a model or algorithm. It teaches prediction models which features of the input are relevant to the task at hand, such as a business goal.
Training data can include labeled images, text documents, audio recordings, and sensor data. There are several types of data labeling, including:
- Audio processing: Converting speech and other sounds into written text.
- Computer vision: Classifying and tagging visual content, such as images, so that it can be grouped into a dataset.
- Language processing: Also called natural language processing (NLP), this refers to identifying text and classifying or tagging it.
Training data is used in three main types of ML: supervised learning, unsupervised learning, and semi-supervised learning.
In supervised learning, the training data must be labeled. This allows the model to learn a mapping from the input features to the associated label. In general, the more training data a model has, the better it can make predictions, provided the data is representative and accurately labeled.
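The idea of learning a mapping from features to labels can be illustrated with a minimal sketch. It uses a 1-nearest-neighbor rule, chosen here only because it is short; it is not the method the text prescribes. The feature values, the labels, and the `predict` helper are all hypothetical.

```python
import math

# Hypothetical labeled training data: (features, label) pairs.
# Features are [height_cm, weight_kg]; the labels are made up for illustration.
training_data = [
    ([150.0, 50.0], "small"),
    ([155.0, 55.0], "small"),
    ([180.0, 85.0], "large"),
    ([185.0, 90.0], "large"),
]

def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(example[0], features)
    )
    return nearest_label

# Unseen inputs receive the label of their nearest labeled neighbor.
print(predict([152.0, 52.0], training_data))
print(predict([182.0, 88.0], training_data))
```

Without the labels there would be nothing for `predict` to return, which is why supervised learning requires labeled training data.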
Here are some steps for preparing data for ML:
- Transform all the data files into a common format
- Explore the dataset using a data preparation tool such as Tableau or the Python library pandas
- Clean the data, for example by using mathematical operations to fill in missing values or remove outliers
- Pick feature variables from the dataset using feature selection methods
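The steps above can be sketched with the Python standard library alone. This is a toy illustration under stated assumptions: the CSV content, the column names, the mean-imputation cleaning rule, and the variance-based selection rule are all hypothetical choices, and a real project would more likely use a tool such as pandas.

```python
import csv
import io
import statistics

# Hypothetical raw data, already converted to one common format (CSV).
RAW_CSV = """age,income,constant_flag,label
25,40000,1,0
32,,1,0
47,82000,1,1
51,90000,1,1
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, turning empty cells into None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({k: (float(v) if v != "" else None) for k, v in record.items()})
    return rows

def summarize(rows, column):
    """Explore one column: count missing values and average the rest."""
    present = [r[column] for r in rows if r[column] is not None]
    return {"missing": len(rows) - len(present), "mean": statistics.mean(present)}

def impute_mean(rows, column):
    """Clean one column by replacing missing values with the column mean."""
    mean = summarize(rows, column)["mean"]
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

def select_features(rows, candidates, min_variance=0.0):
    """Keep only the columns whose variance exceeds `min_variance`."""
    return [
        c for c in candidates
        if statistics.pvariance([r[c] for r in rows]) > min_variance
    ]

rows = load_rows(RAW_CSV)
print(summarize(rows, "income"))          # exploration: one income value is missing
rows = impute_mean(rows, "income")        # cleaning: fill the gap with the mean
features = select_features(rows, ["age", "income", "constant_flag"])
print(features)                           # constant_flag never varies, so it is dropped
```

A column that takes the same value in every row cannot help distinguish one example from another, which is the reasoning behind the variance filter used here.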
Data transformation includes dimensionality reduction, feature selection, and creation of new features. These steps help reduce data noise and improve the ML model's ability to make accurate predictions.
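As a rough sketch of two of these transformations, the example below creates a new feature from two existing ones and then keeps only the derived column, which reduces each record from two dimensions to one. The records, the column names, and the helpers `add_bmi` and `keep_columns` are hypothetical. Replacing columns with a derived one is a deliberately simple stand-in; statistical techniques such as principal component analysis (PCA) are not shown.

```python
# Hypothetical records; the column names are made up for illustration.
records = [
    {"height_m": 1.60, "weight_kg": 64.0},
    {"height_m": 1.80, "weight_kg": 72.9},
    {"height_m": 2.00, "weight_kg": 100.0},
]

def add_bmi(record):
    """Create a new feature from two existing ones: bmi = weight / height^2."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return {**record, "bmi": bmi}

def keep_columns(record, columns):
    """Reduce dimensionality by keeping only the named columns."""
    return {c: record[c] for c in columns}

# Each record goes from two raw columns to one derived column.
transformed = [keep_columns(add_bmi(r), ["bmi"]) for r in records]

for row in transformed:
    print(row)
```

Whether a derived feature like this actually improves predictions depends on the task, so it is worth comparing model accuracy with and without the transformation.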