Training Data and Test Data

[University of New South Wales, Australia]


- Overview

In machine learning (ML), a dataset is typically split into two subsets: training data and test data. The training data is used to fit the model. The test data is used to evaluate how accurately the trained model performs on data it did not see during training.

The two subsets play different roles:

  • Training data: The subset of the original data used to train a model. It is typically larger than the test subset. It can include photos, videos, text, or audio files. In supervised learning, the data is labeled with classes or tags so that the algorithm can learn to make predictions.
  • Test data: A subset of the original data that is held out from training and used to test the model's performance. Test data is kept separate from the training data, and its labels are hidden from the model: the model predicts an output for each data point, and those predictions are then compared with the known labels. Test data is used to assess how well training has worked and how the model is likely to perform on new data. Tuning or optimizing a model against the test data makes the resulting score overly optimistic, so such adjustments are usually made with a separate validation subset.
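
As an illustration of the split described above, here is a minimal sketch in plain Python. The helper name train_test_split, the 80/20 ratio, the fixed seed, and the toy dataset are all assumptions made for this example; they are not taken from any particular library or from the original text.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of records and split it into disjoint train and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled dataset: (feature, label) pairs.
dataset = [(x, "even" if x % 2 == 0 else "odd") for x in range(10)]

train, test = train_test_split(dataset, test_fraction=0.2, seed=42)
print(len(train), "training records,", len(test), "test records")
```

Shuffling before slicing matters when the original data is ordered (for example, sorted by class), because an unshuffled slice could leave a class out of one subset entirely.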

 

Data labeling takes different forms depending on the kind of data, including:

  • Audio processing: Converting sounds, such as speech, into written text.
  • Computer vision: Classifying and tagging visual content, such as images, so that it can be grouped into a data set.
  • Language processing: Also called natural language processing (NLP), this refers to identifying text and classifying or tagging it.

 

Training and testing a model in ML involves several steps:

  • Data collection
  • Data preprocessing
  • Data splitting: Train-test split
  • Data augmentation (optional)
  • Model training
  • Model evaluation: testing

 


 

 
