Training and Testing Data, Labeled and Unlabeled Data
- Overview
In machine learning (ML), training data is the data that you use to train an ML algorithm or model. Training data requires some human involvement to analyze or process the data for use in ML.
How people get involved depends on the type of ML algorithms you use and the types of problems they are intended to solve.
- With supervised learning, humans participate in selecting the data features to be used in the model. The training data must be labeled (that is, enriched or annotated) to teach the machine how to recognize the outcomes your model is designed to detect.
- Unsupervised learning uses unlabeled data to find patterns in the data, for example by clustering data points.
- Semi-supervised learning is a hybrid of supervised and unsupervised learning. The model has a relatively small dataset with labels and a larger dataset without them. The goal is to learn relationships from the small amount of labeled information and then apply and refine those relationships on the unlabeled dataset.
- Reinforcement learning differs from the approaches above in that it does not require a prepared training dataset. Instead, the model learns by trial and error, guided by a reward system that scores its actions.
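The difference between learning with and without labels can be made concrete with a small sketch. The data, the label names, and the helper functions `fit_centroids`, `predict`, and `two_means` below are all hypothetical and chosen only for illustration: the supervised part uses the labels to compute one mean per class, while the unsupervised part groups the same values into two clusters without ever seeing a label.

```python
from statistics import mean

# Toy 1-D feature values with human-provided labels (hypothetical data).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one mean value per class."""
    classes = sorted(set(ys))
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in classes}


def predict(centroids, x):
    """Assign x to the class whose mean is nearest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two clusters without using any labels.

    Assumes xs contains at least two distinct values.
    """
    low, high = min(xs), max(xs)  # initial cluster centers
    for _ in range(iterations):
        near_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        near_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(near_low), mean(near_high)
    return near_low, near_high


centroids = fit_centroids(points, labels)  # needs labels
prediction = predict(centroids, 4.5)       # predicts a named class
clusters = two_means(points)               # finds groups, but no names
```

Note that the clustering step recovers the two groups but cannot say what they mean; naming them still takes a human or a labeled example.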
- AI Training Data - Part of a Continuous Flywheel
The development process of AI is like a continuous flywheel, and data is the link that makes the flywheel turn. Because everything starts with AI training data, that data must be of high quality before you can proceed with confidence.
Whether you are examining what went right, what went wrong, or why your model behaved as it did, many of the issues you identify will trace back to the quality, quantity, and completeness of your AI training data.
Take self-driving cars as an example: how can a model learn correctly if its training data does not distinguish a car from a street sign? It cannot reasonably be expected to.
So how does this affect other parts of the AI development flywheel? Once you have trained your model, you will want to verify that it was trained correctly. You will need test data, kept separate from the training data, to see how the model performs, and you may then need more training data to further tune the model in areas where it did not or could not make accurate predictions.
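As a minimal sketch of this train, test, and refine loop, the example below holds out part of a labeled dataset as test data, fits a deliberately simple one-threshold model on the rest, and reports accuracy per class so that weak areas stand out. The dataset, the "sign" and "car" labels on a single numeric feature, and the function names are assumptions made for illustration, not a real perception pipeline.

```python
import random
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical labeled samples: (feature value, label).
samples = [(x / 10, "sign" if x < 50 else "car") for x in range(100)]


def split(data, test_fraction=0.2, seed=0):
    """Shuffle the labeled data and hold out a fraction for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_threshold(train):
    """Fit a one-parameter model: the midpoint between the class means."""
    by_label = defaultdict(list)
    for x, y in train:
        by_label[y].append(x)
    return (mean(by_label["sign"]) + mean(by_label["car"])) / 2


def predict(threshold, x):
    return "sign" if x < threshold else "car"


def per_class_accuracy(threshold, test):
    """Accuracy per label, to show where more training data may help."""
    correct, total = Counter(), Counter()
    for x, y in test:
        total[y] += 1
        correct[y] += predict(threshold, x) == y
    return {label: correct[label] / total[label] for label in total}


train, test = split(samples)
threshold = train_threshold(train)
accuracy = per_class_accuracy(threshold, test)
```

A class with noticeably lower accuracy than the others is a candidate for collecting and labeling more training examples before the next round.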
Once your model behaves the way you want it to, it becomes critical to update it regularly so that it evolves with human behavior.
- Comparing Labeled and Unlabeled Data
Labeled data is a set of samples that have each been tagged with one or more labels. Unlabeled data consists of samples that have not been tagged with labels identifying features, attributes, or categories. Unlabeled data is used in several forms of machine learning.
- Unsupervised learning uses unlabeled data while supervised learning uses labeled data.
- Unlabeled data is easier to obtain and store than labeled data, and therefore cheaper and more convenient.
- Compared to labeled data, unlabeled data has a more limited range of applications in providing actionable insights (e.g., predicting activity). Even so, unsupervised learning techniques can help discover new clusters in the data, which can in turn suggest new labels.
- To reduce the need for manually labeled data while still producing large annotated datasets, semi-supervised learning combines a small amount of labeled data with a larger amount of unlabeled data.
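One common way to combine the two kinds of data is pseudo-labeling, sketched below under simplifying assumptions: a model fitted on the small labeled set assigns labels to unlabeled points, and only the confident assignments are added to the training set. The data, the two class names, the `margin` parameter, and the function names are hypothetical, and this is one illustrative technique rather than the only form of semi-supervised learning.

```python
from statistics import mean

# Small labeled set and larger unlabeled pool (hypothetical 1-D features).
labeled = [(1.0, "A"), (1.4, "A"), (8.0, "B"), (8.6, "B")]
unlabeled = [0.5, 1.1, 2.0, 4.4, 4.9, 7.5, 8.2, 9.0]


def class_means(pairs):
    """Compute the mean feature value of each labeled class."""
    classes = sorted({y for _, y in pairs})
    return {c: mean(x for x, y in pairs if y == c) for c in classes}


def pseudo_label(pairs, pool, margin=2.0):
    """Label pool points that are clearly closer to one class mean.

    Assumes exactly two classes. A point is 'confident' when its distance
    to the nearer class mean beats the other distance by at least `margin`.
    """
    means = class_means(pairs)
    confident, uncertain = [], []
    for x in pool:
        (d1, c1), (d2, _) = sorted((abs(x - m), c) for c, m in means.items())
        if d2 - d1 >= margin:
            confident.append((x, c1))
        else:
            uncertain.append(x)
    return confident, uncertain


confident, uncertain = pseudo_label(labeled, unlabeled)
augmented = labeled + confident  # larger training set for the next round
```

Points that remain in `uncertain` are the ones where a human label would add the most information, so manual effort can be focused there.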
Data labeling is a critical step in developing a high-performance ML model. Though labeling appears simple, it’s not always easy to implement. As a result, companies must consider multiple factors and methods to determine the best approach to labeling.