The Data Layer
- [Data Collection for ML - Yuji Roh]
- Overview
The AI Stack Data Layer serves as the foundational "fuel" for AI systems, responsible for ingesting, cleaning, storing, and preparing structured and unstructured data.
Key components include storage platforms (e.g., Snowflake, Google Cloud BigQuery, vector databases), data pipelines for ingestion and transformation (e.g., Databricks, Kafka), and labeling tools (e.g., Labelbox, SageMaker Ground Truth) that supply the high-quality annotations supervised learning depends on.
This layer produces "AI-Ready" data: data that is high-quality, relevant, and consistent enough to support model training and deployment.
(A) Core Components of the Data Layer:
1. Storage & Management: Modern AI relies on platforms that store both structured and unstructured data to provide a "single source of truth".
- Data Lakes & Warehouses: Platforms like Snowflake and Google Cloud BigQuery store massive datasets, allowing for efficient analytics and AI preparation.
- Vector Databases: Store high-dimensional embeddings, essential for retrieval-augmented generation (RAG) and semantic search (e.g., Pinecone and Milvus); a similarity-search sketch follows this list.
2. Data Pipelines & ETL (Extract, Transform, Load): These tools move, clean, and prepare data for AI models. Examples include Databricks for unified data processing and Apache Kafka for real-time data streaming; a minimal ETL sketch also appears after this list.
3. Data Labeling & Annotation: Crucial for supervised learning, these tools help tag and structure data for model training. Key tools include Labelbox and Amazon SageMaker Ground Truth; an example of a labeled record follows below.
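To make the vector-database idea concrete, here is a minimal sketch of what such a store does at its core: rank stored embeddings by cosine similarity to a query embedding. The vectors here are random placeholders; production systems such as Pinecone and Milvus add approximate nearest-neighbor indexes (e.g., HNSW) so search stays fast at millions of vectors.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# Toy "vector database": each row is the embedding of one stored document.
rng = np.random.default_rng(seed=0)
doc_embeddings = rng.normal(size=(1000, 128))   # 1,000 docs, 128-dim embeddings
query_embedding = rng.normal(size=128)

scores = cosine_similarity(query_embedding, doc_embeddings)
top_k = np.argsort(scores)[::-1][:5]            # indices of the 5 most similar docs
print("Top-5 document ids:", top_k, "scores:", scores[top_k])
```

In a RAG setup, the top-k documents retrieved this way are passed to the model as context alongside the user's question.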
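The pipeline item above can be illustrated with a tiny extract-transform-load sketch. The file name, field names, and SQLite target are all hypothetical; real deployments use platforms like Databricks or Kafka, but the three-stage shape is the same.

```python
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Extract: read raw records from a CSV export (path is hypothetical)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: normalize fields and drop rows missing required values."""
    cleaned = []
    for row in rows:
        email = row.get("email", "").strip().lower()
        if not email:                      # skip incomplete records
            continue
        cleaned.append((email, row.get("name", "").strip()))
    return cleaned

def load(rows: list[tuple], db_path: str = "users.db") -> None:
    """Load: write cleaned rows into a local SQLite table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS users (email TEXT PRIMARY KEY, name TEXT)"
        )
        conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows)

# Usage (assumes a raw_users.csv export exists):
load(transform(extract("raw_users.csv")))
```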
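And for labeling, the sketch below shows roughly what annotated records look like once exported. The task, fields, and schema are invented for illustration; Labelbox and SageMaker Ground Truth each have their own export formats, though JSONL records of this general shape are common.

```python
import json

# Hypothetical labeled examples for a sentiment-classification task.
labeled = [
    {"text": "The battery lasts all day.",    "label": "positive", "annotator": "a01"},
    {"text": "Screen cracked within a week.", "label": "negative", "annotator": "a02"},
]

# One JSON record per line (JSONL), a common format for training data.
with open("labels.jsonl", "w", encoding="utf-8") as f:
    for record in labeled:
        f.write(json.dumps(record) + "\n")
```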
(B) Key Functions:
- Data Ingestion & Cleaning: Automating workflows that collect data and enforce consistency before it is fed to models.
- Data Transformation: Converting raw data into formats models can consume, e.g., vectorizing text into embeddings; a short cleaning-and-transformation sketch follows this list.
- Storage & Governance: Ensuring data security and compliance within a centralized repository.
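A small sketch of the ingestion-cleaning-transformation flow, using pandas with made-up column names: drop incomplete rows, deduplicate on a key, and normalize a field before the data moves on to vectorization or training.

```python
import pandas as pd

# Hypothetical raw records; columns and values are invented for illustration.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age":     [34, None, 29, 29, 41],   # missing value to clean
    "country": ["US", "de", "DE", "FR", "us"],
})

clean = (
    raw.dropna(subset=["age"])                  # cleaning: drop incomplete rows
       .drop_duplicates(subset="user_id")       # ingestion: dedupe on key
       .assign(country=lambda df: df["country"].str.upper())  # normalize casing
)
print(clean)
```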
[More to come ...]

