Ask HN: Distributed SQL engine for ultra-wide tables

January 22, 2026

Managing "ultra-wide tables", datasets with hundreds of thousands to millions of columns, is a significant hurdle in domains like machine learning feature engineering and multi-omics data analysis. Traditional SQL databases hit practical caps around 1,000-1,600 columns, and even columnar formats like Parquet typically push this kind of data into complex Spark or Python pipelines. OLAP engines, though fast, are generally optimized for narrower schemas, and feature stores frequently fall back on exploding the data across joins or multiple tables, which introduces its own complexity.

The core problem shifts from handling a massive number of rows to efficiently managing an overwhelming number of columns, where metadata handling, query planning, and even SQL parsing become bottlenecks.

The Need for Ultra-Wide Tables

A concrete use case driving this need is multi-omics research. A single study might combine thousands of gene expression values, hundreds of thousands to a million SNPs (Single Nucleotide Polymorphisms), thousands of proteins and metabolites, and clinical metadata, all measured per patient. Historically, this data isn't stored in relational tables but in files and in-memory matrices, so researchers spend substantial effort repeatedly rebuilding these wide matrices just to explore subsets of features or patient cohorts. A robust "wide table" solution provides a persistent, queryable representation of these conceptual matrices, simplifying data integration and exploration through direct SELECT statements.
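To make the "persistent, queryable matrix" idea concrete, here is a minimal sketch of the access pattern such a system would serve. The table and column names (omics_wide, rs429358, TP53_expr, CRP_mg_L) are invented, and DuckDB is used only because it runs in-process and accepts plain SQL; the real use case needs an engine that tolerates hundreds of thousands of such columns.

    # Hypothetical sketch: one row per patient, one column per feature.
    # Names and values are invented; DuckDB stands in for the wide-table engine.
    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE omics_wide (
            patient_id VARCHAR,
            cohort     VARCHAR,
            rs429358   TINYINT,   -- SNP genotype (0/1/2)
            rs7412     TINYINT,
            TP53_expr  DOUBLE,    -- gene expression
            CRP_mg_L   DOUBLE     -- clinical measurement
            -- ...a real study would continue for hundreds of thousands of columns
        )
    """)
    con.execute("""
        INSERT INTO omics_wide VALUES
            ('P001', 'discovery', 1, 0, 7.2, 3.1),
            ('P002', 'discovery', 2, 1, 5.8, 0.9)
    """)
    # Exploring a feature subset for a cohort becomes an ordinary projection.
    print(con.execute("""
        SELECT patient_id, rs429358, TP53_expr
        FROM omics_wide
        WHERE cohort = 'discovery'
    """).fetchall())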

Experimental Architectures and Solutions

One experimental approach to tackle this involves a highly specialized design that foregoes traditional relational database features like joins and transactions, focusing instead on distributing columns rather than rows, with SELECT as the primary operation. This design enables native SQL selects on tables with hundreds of thousands to millions of columns, delivering predictable sub-second latency when accessing a subset of columns. Performance figures include creating a 1M-column table in ~6 minutes, inserting a single column with 1M values in ~2 seconds, and selecting ~60 columns over ~5,000 rows in ~1 second on a modest two-server cluster.

The underlying mechanism for this experimental system is to spread table data across many MariaDB servers. Each table uses user-defined hash key columns for automatic partitioning. Wide tables are split into chunks, where each chunk (the hash key plus a subset of the columns) resides on a single MariaDB server, or even a single database within a server. Operations follow a "massive fork policy", and a dedicated, mirrored data dictionary tracks the table and column metadata.
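The post does not publish the implementation, but the described layout can be sketched. Assuming each chunk maps to a (server, table) pair recorded in the data dictionary, planning a SELECT amounts to grouping the requested columns by chunk and issuing one query per chunk, presumably in parallel under the fork policy. Chunk width, server names, and function names below are invented.

    # Hypothetical sketch of column-chunk routing. Chunk width, server names,
    # and the dictionary layout are assumptions, not the author's implementation.
    from collections import defaultdict

    COLUMNS_PER_CHUNK = 1000                 # assumed chunk width
    SERVERS = ["maria-01", "maria-02"]       # the two-server cluster from the post

    def build_dictionary(columns):
        """Toy data dictionary: column name -> (server, chunk table)."""
        d = {}
        for i, col in enumerate(columns):
            chunk_id = i // COLUMNS_PER_CHUNK
            d[col] = (SERVERS[chunk_id % len(SERVERS)], f"t_chunk_{chunk_id:05d}")
        return d

    def plan_select(requested_cols, hash_key, dictionary):
        """Group requested columns by chunk; each group becomes one per-server
        query to run in parallel and stitch back together on the hash key."""
        per_chunk = defaultdict(list)
        for col in requested_cols:
            per_chunk[dictionary[col]].append(col)
        return [(server, f"SELECT {hash_key}, {', '.join(cols)} FROM {table}")
                for (server, table), cols in per_chunk.items()]

    columns = [f"feat_{i:06d}" for i in range(100_000)]   # 100k columns for brevity
    dd = build_dictionary(columns)
    for server, sql in plan_select(["feat_000042", "feat_077001"], "sample_id", dd):
        print(server, "->", sql)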

Diverse Approaches from the Community

Several alternative strategies and tools have emerged for handling ultra-wide datasets:

  • Specialized File Formats and Query Engines:

    • Vortex File Format: A columnar format designed for high performance, especially relevant for analytical workloads. It can be queried efficiently by tools like DuckDB locally, and its creators also offer SpiralDB as a distributed SQL engine built upon it.
    • DuckDB: A fast, in-process analytical database that can directly query formats like Parquet and has an extension for Vortex, making it a powerful option for local analysis of wide datasets and potentially removing the need for distributed processing when the data fits on a single machine (see the first sketch after this list).
  • Array Databases:

    • TileDB: An array database purpose-built for multi-dimensional array data, which often characterizes scientific and analytical workloads with high dimensionality. TileDB provides a flexible storage format and API for managing and querying such data.
  • Sparse Matrix Formats for Genomics:

    • For genomics data, which is typically numeric, mostly write-once, and usually analyzed offline by a single client, the overhead of SQL may be unnecessary. Sparse matrix formats (CSC/CSR) store numeric data very compactly, around 12 bytes per non-zero entry, and combine well with small, easily managed metadata tables. Projects like BPCells demonstrate efficient storage and querying for high-dimensional RNA-seq matrices (see the second sketch after this list).
  • Custom Object Stores:

    • One intriguing approach is an object store whose metadata tags are managed in key-value stores. These tags can be used to construct wide, sparse relational tables, demonstrated at up to 10,000 columns and tested to 1 million, with sub-second query times for column subsets even on a single desktop, and with both analytical and transactional capabilities.
  • OLAP and Columnar Databases (with limitations at extreme width):

    • Tools like ClickHouse, Scuba, StarRocks, and Exasol are highly efficient for OLAP workloads and support columnar storage. They excel at fast analytics and handle dozens to hundreds of columns very well. The experience shared in the discussion, however, is that at the extreme widths that motivate the original question (tens to hundreds of thousands, or millions, of columns), these systems can still hit bottlenecks in metadata handling, query planning, or even column enumeration. They access referenced columns efficiently, but the sheer volume of potential columns is taxing, which pushes designs toward architectural shifts such as dropping joins and transactions and making explicit projection part of the contract.
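As a follow-up to the DuckDB point above, here is a minimal sketch of the local route; the file, column names, and widths are generated stand-ins, and the same pattern would apply to the Vortex extension with the file format swapped. The practical benefit is that DuckDB reads only the referenced columns, so projecting a handful of columns from a very wide file stays cheap.

    # Sketch of local analysis with DuckDB over a wide Parquet file.
    # The file and column names are generated stand-ins for illustration.
    import duckdb
    import numpy as np
    import pandas as pd

    # Write a toy "wide" Parquet file (500 columns; real cases would be far wider).
    df = pd.DataFrame(np.random.rand(1_000, 500),
                      columns=[f"feat_{i:04d}" for i in range(500)])
    df.insert(0, "sample_id", [f"s{i}" for i in range(1_000)])
    df.to_parquet("features_wide.parquet")

    # Query it in place: only sample_id, feat_0123, and feat_0456 are read.
    con = duckdb.connect()
    print(con.execute("""
        SELECT sample_id, feat_0123, feat_0456
        FROM read_parquet('features_wide.parquet')
        WHERE feat_0123 > 0.9
        LIMIT 5
    """).df())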
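The sparse-matrix route mentioned above is also easy to sketch with SciPy; the roughly 12 bytes per non-zero entry corresponds to an 8-byte float value plus a 4-byte row index in compressed form, with column pointers adding little on top. The shapes, density, and feature names below are invented.

    # Sketch of the sparse-matrix alternative for write-once numeric genomics data:
    # keep the numeric block as a CSC matrix and map feature names to column indices
    # with a small metadata table. Shapes, density, and names are invented.
    import numpy as np
    from scipy.sparse import csc_matrix

    rng = np.random.default_rng(0)
    n_samples, n_features, nnz = 5_000, 1_000_000, 500_000

    # Random non-zero entries standing in for, e.g., sparse SNP or expression data.
    rows = rng.integers(0, n_samples, size=nnz)
    cols = rng.integers(0, n_features, size=nnz)
    vals = rng.random(nnz)
    X = csc_matrix((vals, (rows, cols)), shape=(n_samples, n_features))

    # Storage is driven by the non-zeros: an 8-byte value plus a 4-byte row index
    # (~12 bytes each when 32-bit indices suffice), plus one pointer per column.
    print(f"{X.nnz} non-zeros, ~{(X.data.nbytes + X.indices.nbytes) / X.nnz:.1f} bytes/non-zero")

    # A small name-to-index table replaces the million-column SQL schema; selecting
    # a feature subset is a cheap column slice in CSC form.
    feature_index = {f"feat_{i:07d}": i for i in range(n_features)}
    wanted = [feature_index[name] for name in ("feat_0000042", "feat_0500007")]
    subset = X[:, wanted].toarray()    # dense 5,000 x 2 block, ready for analysis
    print(subset.shape)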

Rethinking Data Modeling and SQL Paradigms

Some discussion also touched on fundamental data-modeling choices. For instance, the practice of exploding categorical variables with one-hot encoding (OHE) and storing the result as columns is questioned; storing the untransformed feature data and applying transformations like OHE at read time (parameterized by the training data subset) is often recommended, as sketched below. Other suggestions for staying within conventional SQL included compressing columns with custom datatypes (potentially requiring database internals such as PostgreSQL C extensions) and hiding join complexity behind views. The consensus, however, is that for truly ultra-wide datasets, a departure from some traditional SQL paradigms, particularly joins and transactions, often becomes necessary.
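A brief sketch of the transform-on-read suggestion, using scikit-learn's OneHotEncoder on invented column names: the encoder is fit on the training subset only, so the stored table keeps a single narrow categorical column and the wide one-hot columns exist only in memory at read time.

    # Sketch of applying one-hot encoding at read time rather than storing the
    # exploded columns. Data and column names are invented; scikit-learn's
    # OneHotEncoder is one way to parameterize the transform by the training subset.
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Stored form: one categorical column, not one column per category.
    stored = pd.DataFrame({
        "sample_id": ["s1", "s2", "s3", "s4"],
        "tissue":    ["liver", "kidney", "liver", "brain"],
        "split":     ["train", "train", "test", "test"],
    })

    # Fit the encoder on the training subset only...
    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(stored.loc[stored["split"] == "train", ["tissue"]])

    # ...and expand to wide form only when a slice is actually read.
    wide = enc.transform(stored[["tissue"]]).toarray()
    print(enc.get_feature_names_out())   # categories learned from training data
    print(wide)                          # 'brain' was unseen in training -> all zeros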
