Data Wall
- Overview
In machine learning (ML) and AI, the "data wall" describes the point at which a model's performance stops improving because no new or higher-quality training data is available. This can happen when all the usable training data on the internet has already been consumed, or when the remaining data is hidden behind paywalls, blocked by robots.txt, or restricted by exclusive deals. The research firm Epoch AI estimates that the internet's supply of high-quality text data will be exhausted by 2028.
Some AI companies believe they can overcome the data wall with synthetic data, that is, data generated by AI systems. Synthetic data has limitations, however: it can exaggerate biases present in the original dataset, and it can omit the rare exceptions that appear only in real data. As a result, training on synthetic data could worsen a model's tendency to hallucinate, or yield models that produce nothing new.
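The loss of rare exceptions can be illustrated with a toy simulation. The sketch below is not a model of any real training pipeline: it assumes a hypothetical dataset of 5 common and 50 rare categories, and treats each "generation" as a model trained only on samples drawn from the previous generation's output. The function name `simulate_generations`, the category names, and all counts are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_generations(real_counts, sample_size, generations, seed=0):
    """Return how many distinct categories survive in each generation.

    Generation 0 is the real data. Every later generation is resampled
    from the empirical distribution of the generation before it.
    """
    rng = random.Random(seed)
    counts = Counter(real_counts)
    surviving = [len(counts)]
    for _ in range(generations):
        categories = list(counts.keys())
        weights = [counts[c] for c in categories]
        # The next "model" only ever sees samples from the previous one,
        # so a category that is missed once can never reappear.
        samples = rng.choices(categories, weights=weights, k=sample_size)
        counts = Counter(samples)
        surviving.append(len(counts))
    return surviving


# Hypothetical "real" dataset: 5 common categories and 50 rare ones.
real_counts = {f"common_{i}": 1000 for i in range(5)}
real_counts.update({f"rare_{i}": 1 for i in range(50)})

history = simulate_generations(real_counts, sample_size=2000, generations=10)
print(history)  # distinct categories per generation, starting with the real data
```

Because each generation can only reproduce categories it has already seen, the number of surviving categories never increases, and the rare ones tend to disappear first. Real generative models are far more complex, but the same one-way loss of tail data is the concern described above.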
Other ways to address the data wall include:
- Being more efficient with data
- Developing new techniques for data collection and use
- Moving beyond data-driven learning paradigms
- Using tools that leverage live context from workflows to generate accurate responses