Optimize data storage formats for AI training and inference

Applicable Role: Provider and Consumer

Description

Data storage and access form a significant part of AI and ML systems. During development, large datasets are collected and processed for training. During runtime, especially in retrieval-augmented generation (RAG) systems, data and embeddings are frequently accessed for retrieval and inference.

Inefficient data storage formats and access patterns increase storage requirements, data transfer volumes, and processing overhead. This leads to higher energy consumption and infrastructure usage.

Using efficient data storage and access patterns improves data retrieval performance and reduces the overall resource footprint of both training and runtime systems.

Solution

Use columnar storage formats such as Parquet or ORC for structured datasets
Avoid text-based formats like CSV for large-scale workloads when more efficient alternatives are available
Compress data where appropriate to reduce storage and transfer size
Optimize data schemas to reduce redundancy and improve access efficiency
Use storage systems that support efficient querying, indexing, and partial reads
For retrieval systems, use optimized vector storage and indexing techniques to reduce compute during similarity search

SCI Impact

SCI = (E × I) + M per R

E (Energy): Efficient storage, retrieval, and vector search reduce compute required for data processing and runtime inference.

M (Embodied Carbon): Reduced storage requirements decrease infrastructure needs and associated embodied emissions.

Cost Impact

Storage costs: Reduced through efficient formats (Parquet vs. CSV) and compression
Data transfer costs: Lower egress charges due to smaller data sizes
Compute costs: Reduced query and retrieval costs from optimized indexing
Tooling costs: Vector DB licensing (Milvus, Pinecone) may add operational expense
Trade-off: Storage efficiency gains offset by vector indexing infrastructure costs

Assumptions

Data storage formats and systems can be updated without breaking downstream applications
Compression and indexing strategies do not introduce excessive processing overhead

Considerations

Compatibility with existing tools and pipelines must be evaluated
Retrieval workloads may require both efficient storage and optimized indexing strategies
Compression should be balanced with decompression cost
Vector storage and indexing choices can significantly impact retrieval performance and energy usage

Description​

Solution​

SCI Impact​

Cost Impact​

Assumptions​

Considerations​

References​