Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Fangyu Lei*1, Jixuan Chen*1, Yuxiao Ye1, Ruisheng Cao1, Dongchan Shin1,
Hongjin Su1, Zhaoqing Suo1, Hongcheng Gao1, Wenjing Hu1, Pengcheng Yin4,
Victor Zhong6, Caiming Xiong2, Ruoxi Sun5, Qian Liu3, Sida Wang, Tao Yu1
1The University of Hong Kong 2Salesforce Research 3Sea AI Lab
4Google DeepMind 5Google Cloud AI Research 6University of Waterloo


Abstract

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations ranging from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often contain over 1,000 columns, and are stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that our code agent framework based on o1-preview successfully solves only 17.0% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation, especially in prior text-to-SQL benchmarks, they require significant improvement to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents a crucial step towards developing intelligent, autonomous code agents for real-world enterprise settings.

News

  • Nov. 12, 2024: We released the Spider 2.0 full paper, data, and code!
  • Aug. 28, 2024: We released a smaller version of Spider 2.0 (~25% of the full dataset) containing 190 examples to give users early access. As this is a preliminary release, there may be errors; your feedback would be invaluable in refining the dataset. Stay tuned!

Why Spider 2.0?

In 2018, we introduced Spider 1.0, SParC, and CoSQL as part of the Yale Semantic Parsing and Text-to-SQL Challenge Series, attracting over 300 submissions from leading research labs worldwide.

Now, in the era of Large Language Models (LLMs), we present Spider 2.0 to advance code generation, particularly text-to-SQL capabilities.

This new benchmark offers a more realistic and challenging test of LLMs' performance on complex enterprise-level text-to-SQL workflows, involving complex data environments (e.g., >3000 columns), multiple SQL dialects (e.g., BigQuery, Snowflake), and diverse operations (e.g., transformation, analytics).
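To make the dialect challenge concrete, here is a minimal sketch, assuming the open-source sqlglot library (an illustration only; not part of Spider 2.0), of how the same intent must be expressed differently across dialects. The table and column names are hypothetical.

```python
# A minimal sketch of the dialect gap, using the open-source sqlglot
# library (an illustration only; not part of Spider 2.0). The table and
# column names (`orders`, `created_at`) are hypothetical.
import sqlglot

# In BigQuery, DATE_TRUNC takes the expression first, then the date part.
bq_sql = "SELECT DATE_TRUNC(created_at, MONTH) AS month, COUNT(*) AS n FROM orders GROUP BY 1"

# Transpile to Snowflake, where DATE_TRUNC takes the date part first.
print(sqlglot.transpile(bq_sql, read="bigquery", write="snowflake")[0])
# Expected output (roughly):
#   SELECT DATE_TRUNC('MONTH', created_at) AS month, COUNT(*) AS n FROM orders GROUP BY 1
```

Real Spider 2.0 tasks compound such dialect differences across queries that can exceed 100 lines and schemas with thousands of columns.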

Notably, even the most advanced LLM, o1-preview, solves only 17.0% of Spider 2.0 tasks. For widely used models like GPT-4o, the success rate is only 10.1% on Spider 2.0, compared to 86.6% on Spider 1.0, underscoring the substantial challenges posed by Spider 2.0.

Setting | Task Type | #Examples | Databases | Cost
Spider 2.0 | Code agent task | 632 | BigQuery (214), Snowflake (198), Postgres (10), ClickHouse (7), SQLite (135), DuckDB (DBT) (68) | Some cost incurred
Spider 2.0-Snow | Text-to-SQL task | 547 | Snowflake (547) | No cost! 😊
Spider 2.0-Lite | Text-to-SQL task | 547 | BigQuery (214), Snowflake (198), SQLite (135) | Some cost incurred

Spider 2.0-Lite

To meet research interest in the traditional text-to-SQL setting, we also release Spider 2.0-Lite, a more self-contained subset of Spider 2.0 that supports faster development and evaluation.

Spider 2.0-Snow

Spider 2.0-Snow includes 547 examples, all hosted on Snowflake, which offers participants free quotas. If you want to test performance on a single SQL dialect, don't hesitate to use Spider 2.0-Snow.
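As a minimal sketch of how one might query the Spider 2.0-Snow databases with the official snowflake-connector-python package; the account, user, password, and warehouse values below are placeholders, not credentials distributed with the benchmark.

```python
# A minimal sketch using the official snowflake-connector-python package.
# All connection parameters below are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",      # placeholder
    user="your_user",            # placeholder
    password="your_password",    # placeholder
    warehouse="your_warehouse",  # placeholder
)
cur = conn.cursor()
try:
    cur.execute("SELECT CURRENT_VERSION()")  # sanity-check the connection
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```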

Submission

Refer to the Quick Start to run your experiments on Spider 2.0, Spider 2.0-Snow, or Spider 2.0-Lite. For submission, provide a clear README, compressed code that passes your dev evaluation, any additional API keys required, and a report of prompt token counts for cost estimation. Follow the Submission Guideline for evaluation on the full dataset. We will usually return your results within 10 days!
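The guidelines do not prescribe a particular tokenizer for the cost report; as one hedged example, prompt token counts for OpenAI-style models could be estimated with tiktoken (an assumption, not a requirement):

```python
# One possible way to estimate prompt token counts for the cost report;
# tiktoken is an assumption here (the guidelines don't mandate a tool).
import tiktoken

def count_prompt_tokens(prompts: list[str], model: str = "gpt-4o") -> int:
    """Total token count across all prompts sent during an experiment."""
    enc = tiktoken.encoding_for_model(model)
    return sum(len(enc.encode(p)) for p in prompts)

print(count_prompt_tokens(["SELECT 1;", "Explain this schema ..."]))
```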

Acknowledgement

We thank Snowflake for their generous support in hosting the Spider 2.0 Challenge. We also thank Tianbao Xie, Yiheng Xu, Fan Zhou, Yuting Lan, Per Jacobsson, Yiming Huang, Canwen Xu, Zhewei Yao, and Binyuan Hui for their helpful feedback on this work. The leaderboard submission guidelines are greatly inspired by BIRD-SQL, and we thank them for their contributions.


Data Examples


Have Questions?

Ask us questions at our GitHub issues page, or contact Fangyu Lei, Jixuan Chen, Ruisheng Cao, or Yuxiao Ye for more information.

Leaderboard

Spider 2.0 is a comprehensive code-generation agent task comprising 632 examples. The agent must interactively explore various database systems, such as BigQuery, Snowflake, Postgres, ClickHouse, DuckDB, and SQLite, engage with complex SQL workflows, process extensive contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines across multiple interactions.
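For intuition, the sketch below shows the kind of observe-act loop such an agent runs; `llm` and `env` are hypothetical stand-ins for a language model client and a database/workflow environment, not Spider-Agent's actual API.

```python
# A highly simplified sketch of a code agent's observe-act loop on a
# Spider 2.0 task. `llm` and `env` are hypothetical interfaces, not
# Spider-Agent's actual API.
def run_agent(llm, env, task: str, max_steps: int = 30):
    history = [task]
    for _ in range(max_steps):
        action = llm.generate("\n".join(history))  # e.g. a SQL query or shell command
        observation = env.execute(action)          # run it; capture output or error
        history += [action, observation]           # accumulate long interaction context
        if env.task_finished():                    # the agent submitted a final answer
            break
    return env.result()
```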
Rank | Method | Date | Score
1 | Spider-Agent + o1-preview | Nov 2, 2024 | 17.01
2 | Spider-Agent + GPT-4o | Nov 2, 2024 | 10.13
3 | Spider-Agent + Claude-3.5-Sonnet | Nov 2, 2024 | 9.02
4 | Spider-Agent + GPT-4 | Nov 2, 2024 | 8.86
5 | Spider-Agent + Qwen2.5-72B | Nov 2, 2024 | 6.17
6 | Spider-Agent + DeepSeek-V2.5 | Nov 2, 2024 | 5.22
7 | Spider-Agent + Gemini-Pro-1.5 | Nov 2, 2024 | 2.53
8 | Spider-Agent + Llama-3.1-405B | Nov 2, 2024 | 2.21