Data Agents Finally Get Real: DAComp & DP-Bench Crush the 'Perfect Query' Myth


DAComp is a real-world benchmark for data AI agents with 210 tasks covering the entire data lifecycle, from grabbing data to making actual business decisions. Forget the old “perfect query” nonsense; DAComp throws agents into real enterprise workflows: cleaning messy datasets, exploring patterns, building models, visualizing results, and even suggesting next steps. No more pretending models understand databases when they’re still guessing how to handle real data.

Let’s be real: most NL2SQL tests (Spider, BIRD) are just a single step—translate to SQL. But real analysts don’t stop there. They grab data, clean it, build models, and actually decide what to do next. DAComp throws LLMs into the actual chaos of enterprise workflows: handling messy files, picking the right Python library, fixing errors, and even drafting business recommendations. No more “perfect query” fantasy.
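To make the “messy files” part concrete, here’s a minimal, hypothetical sketch of the kind of cleanup a DAComp-style engineering task demands before any SQL or modeling can happen. The column names and values are invented for illustration; the point is that the raw export has inconsistent casing, stray whitespace, and numbers stored as strings:

```python
import pandas as pd

# Hypothetical raw export: dirty column names, mixed casing,
# whitespace, and a non-numeric value hiding in a numeric column.
raw = pd.DataFrame({
    "Region ": [" north", "South", "north ", None],
    "revenue": ["1200", "980", "n/a", "1500"],
})

# Normalize column names, then the values themselves.
df = raw.rename(columns=lambda c: c.strip().lower())
df["region"] = df["region"].str.strip().str.title()
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Drop rows that failed cleaning (missing region or unparsable revenue).
df = df.dropna()

print(df["revenue"].sum())  # total revenue after cleaning: 2180.0
```

A single-step NL2SQL test never sees any of this; the benchmark’s premise is that the query only matters after the data is usable.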
Built by 8 real data engineers (not just AI researchers), DAComp has two parts:
- 73 enterprise SaaS setups with 400+ columns each, filled with synthetic but realistic data and split into three phases.
- 100 complex live databases plus analysis layers from the data-engineering (DE) side. For each table, annotators draft 8 open-ended questions, then a 5-person voting panel picks the top 2 that would make a real analyst sweat (e.g., “Why did sales drop 30% in Q3?”).
DAComp isn’t a toy test. Even GPT-4o cringes at the engineering tasks—only 20% success rate in DE, and way lower for strategy-level decisions. This isn’t about “getting SQL right.” It’s about building agents that actually work in the wild. Finally, a benchmark that stops pretending.
This isn’t just another NL2SQL benchmark. DP-Bench is the first test for data product generation systems—where “data product” means real business value, like predicting customer churn before they cancel, so support teams can actually do something about it.
Forget “just generate SQL” nonsense. DP-Bench forces models to find the relevant tables, select columns, derive metrics, and prove where each one came from.
No more “perfect query” fantasy. Every metric in DP-Bench has a traceable SQL behind it—so you can actually see how the model got there.

Let’s be honest: most Text-to-SQL tests (BIRD, Spider) only care about “translate to SQL.” But real data work? It’s messy. You need to clean data, derive metrics, track where they came from. DP-Bench makes models actually do the whole thing—starting from a business request (DPR), finding relevant tables, selecting columns, and proving how they built each derived metric.
No more pretending LLMs understand databases when they’re still guessing how to handle real business needs.
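A hypothetical sketch of what a “traceable metric” could look like in practice: the derived value ships together with the exact SQL that produced it, so a reviewer can re-run and audit it. The table, metric name, and output format here are invented for illustration, not DP-Bench’s actual schema:

```python
import sqlite3

# Tiny in-memory database standing in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, churned INTEGER);
    INSERT INTO customers VALUES (1, 0), (2, 1), (3, 0), (4, 1), (5, 1);
""")

# The metric definition lives as SQL, not as an opaque number.
metric_sql = """
    SELECT 1.0 * SUM(churned) / COUNT(*) AS churn_rate
    FROM customers
"""
churn_rate = conn.execute(metric_sql).fetchone()[0]

# Metric and lineage travel together: anyone can see how it was built.
print({"metric": "churn_rate", "value": churn_rate, "sql": metric_sql.strip()})
```

The design choice is the whole point: because the SQL is part of the artifact, “how did the model get 0.6?” has an answer you can execute.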
Built from BIRD’s real database schemas + ELT-Bench’s transformation pipelines—no more “fake” data.
DP-Bench isn’t just another benchmark. It finally tests if models can handle real business data work—not just spit out SQL. But here’s the catch: 71% of initial requests needed zero tweaks to work. Meaning? We’re still not testing enough messy business edge cases.
It’s not the finish line—it’s the first step toward Data Mesh. Think of it as NL2SQL’s awkward cousin who actually tries to understand business.
SQLFlash is your AI-powered SQL Optimization Partner.
Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.
Join us and experience the power of SQLFlash today!