
Reflections on SIGMOD/PODS 2024: Insights and Highlights

2024/07/02



The SIGMOD/PODS 2024 conference, sponsored by MotherDuck and several tech giants, was full of groundbreaking research, innovative technologies, and engaging discussions. It was a hub of intellectual exchange and collaboration, which made for an inspiring and productive week in Santiago, Chile for the MotherDuck team.

This post gives an overview of MotherDuck’s presence at SIGMOD and covers the key highlights and innovation themes that caught our attention. Our biggest takeaway? There has never been a better time to be part of the database community, and we look forward to seeing how these advancements progress in the future.

MotherDuck’s Presence at SIGMOD

MotherDuck showcased our contributions at SIGMOD/PODS 2024 with a series of presentations.

Peter Boncz from Centrum Wiskunde & Informatica (CWI), who is currently on sabbatical at MotherDuck, delivered an inspiring keynote, "Making Data Management Better with Vectorized Query Processing," in which he made the case for doing data systems research with impact in the real world. And he did not fail to mention all the exciting work currently in progress at MotherDuck!

Peter Boncz from CWI presenting his SIGMOD keynote on data systems research

Till Döhmen, AI/ML Lead, presented his research on "SchemaPile: A Large Collection of Relational Database Schemas" and provided the community with a corpus of 221,171 database schemas containing rich metadata to improve various data management applications.

Effy Xue Li, PhD Intern, introduced innovative approaches through her research, "Towards Efficient Data Wrangling with LLMs using Code Generation," to demonstrate how LLM-based data wrangling through code generation significantly improves data transformation tasks at lower computational costs.

(From left to right): Effy Xue Li, PhD Intern, and Till Döhmen, AI/ML Lead, in front of their co-authored paper “Towards Efficient Data Wrangling with LLMs using Code Generation”

Stephanie Wang, Founding Engineer, and Till Döhmen collaborated on a sponsor talk, "Simplifying Data Warehousing for Efficient and User-Friendly Data Management," that emphasized MotherDuck's commitment to making data warehousing more accessible and efficient for users.

(Pictured from left to right): Effy Xue Li, PhD Intern, and Stephanie Wang, Founding Engineer, at MotherDuck’s demo station

MotherDuck’s sponsorship of SIGMOD and deep involvement in the academic community underscore our commitment to fostering innovation and supporting cutting-edge database research. In the following sections, we’ll outline highlights and themes that caught our attention at SIGMOD 2024.

Disaggregated Memory

A key conference theme was disaggregated memory. Disaggregated memory systems separate memory from compute, which requires advanced networking technologies to enable low-latency, high-bandwidth communication between the two.

Alibaba Cloud showcased PolarDB-MP, their multi-primary cloud-native database that leverages disaggregated shared memory, and also presented scalable distributed inverted list indexes designed for disaggregated memory.

Adaptive Lossless Floating-Point Compression (ALP)

The CWI research group, the origin of DuckDB, presented a new floating-point compression method.

A series of floating-point codecs has been introduced in recent years, starting with Facebook’s Gorilla encoding and followed by Chimp and Patas.

Notably, ALP outperforms these codecs in compression speed, decompression speed, and compression ratio. The algorithm was published at SIGMOD 2024 and presented by PhD student Leonardo Kuffo. It has already been incorporated into DuckDB 0.10, which means MotherDuck customers can already take advantage of its efficiency benefits!
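
To give a feel for why this kind of codec can compress so well, here is a rough Python sketch of one general idea behind lossless decimal-based float encoding: many stored doubles started life as decimals, so they can be scaled by a power of ten into small integers, with any value that does not survive the round trip kept verbatim as an exception. This is a simplified illustration under our own assumptions, not the published ALP algorithm or DuckDB's implementation; the function names and the fixed `exponent` parameter are hypothetical.

```python
import math


def encode(values, exponent):
    """Scale doubles into integers; keep non-round-tripping values as exceptions."""
    factor = 10 ** exponent
    ints, exceptions = [], {}
    for i, v in enumerate(values):
        ok = False
        if math.isfinite(v):
            n = round(v * factor)
            candidate = n / factor
            # Lossless only if the value (including the sign of zero) comes back exactly.
            ok = candidate == v and math.copysign(1.0, candidate) == math.copysign(1.0, v)
        if ok:
            ints.append(n)
        else:
            ints.append(0)      # placeholder slot for an exception
            exceptions[i] = v   # stored uncompressed, keyed by position
    return ints, exceptions


def decode(ints, exceptions, exponent):
    """Invert encode(): divide by the same power of ten and patch in exceptions."""
    factor = 10 ** exponent
    return [exceptions.get(i, n / factor) for i, n in enumerate(ints)]


values = [1.5, 20.25, 3.0, 0.1 + 0.2]
ints, exceptions = encode(values, exponent=2)
restored = decode(ints, exceptions, exponent=2)
```

In this toy run the first three values become the small integers 150, 2025, and 300, which are cheap to bit-pack, while `0.1 + 0.2` does not round-trip at two decimal places and is kept as an exception.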

(Pictured from left to right): Stephanie Wang, Peter Boncz, Ilaria Battiston, Effy Xue Li, and Leonardo Kuffo

SQL Alternatives and Additions

While SQL has existed since the early 1970s, there has never been a more opportune moment to innovate on its syntax and make analytics more intuitive. This is one of many reasons MotherDuck has adopted DuckDB’s intuitive, highly flexible SQL dialect, and we’re excited about the possibilities in this area and its potential applications. SIGMOD 2024 showcased new ideas on this topic by proposing novel SQL alternatives and additions.

TypeDB showcased TypeQL, a new query language inspired by natural language. It offers an expressive type system that promises to revolutionize how we interact with databases.

Looker by Google introduced Measures in SQL, which brings composable calculations to SQL. Measures are context-sensitive expressions that can be attached to tables, and tables with measures remain composable and closed when used in queries. This addition is a significant enhancement to traditional SQL capabilities.

Proactive and Hybrid Resource Allocation

Proactive Resource Allocation

There was a significant focus on distributed systems at SIGMOD 2024, emphasizing proactive and hybrid resource allocation.

Microsoft presented its proactive resource allocation strategies for millions of serverless Azure SQL databases, while Alibaba showcased Flux, a cloud-native workload auto-scaling platform designed for AnalyticDB. Flux offers decoupled auto-scaling for heterogeneous query workloads.

Amazon also introduced RAIS, Redshift’s next-generation AI-powered Scaling, which includes new optimization techniques for intelligent scaling in Amazon Redshift.

Hybrid Resource Allocation

Microsoft’s scalable Container-As-A-Service Performance Enhanced Resizing algorithm for the cloud (CaaSPER) stood out in the hybrid resource allocation category. CaaSPER uses a combination of reactive and predictive approaches based on historical time-series data to make informed decisions about CPU requirements for monolithic applications.
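
As a loose illustration of what combining reactive and predictive signals can look like, the Python sketch below sizes CPU for the next interval from a usage history. It is not the CaaSPER algorithm: the function name, the seasonal-naive forecast, the averaging window, and the headroom factor are all assumptions made for the example.

```python
from statistics import mean


def recommend_cpu(history, period, window=3, headroom=1.25):
    """Suggest CPU cores for the next interval.

    history  -- cores used per interval, oldest first (hypothetical input)
    period   -- length of the assumed seasonal cycle, in intervals
    window   -- number of recent intervals used by the reactive signal
    headroom -- safety multiplier applied to the chosen estimate
    """
    if len(history) < max(period, window):
        raise ValueError("not enough history for the chosen period/window")
    # Reactive signal: average of the most recent usage.
    reactive = mean(history[-window:])
    # Predictive signal: seasonal-naive forecast, i.e. what usage looked like
    # exactly one period before the interval we are sizing for.
    predictive = history[-period]
    # Take the larger estimate so neither a known peak nor a fresh surge is missed.
    return max(reactive, predictive) * headroom


# A recurring peak every 4th interval that is about to happen again.
periodic = recommend_cpu([2, 2, 8, 2, 2, 2, 8, 2, 2, 2], period=4)
# A sudden surge with no precedent in the previous cycle.
surge = recommend_cpu([1, 1, 1, 1, 1, 4, 4, 4], period=4)
```

In the first case the predictive signal anticipates the recurring peak even though recent usage is low; in the second, the reactive signal picks up a surge that the history alone would not have predicted.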

Generative AI and Large Language Models (LLMs)

This year, there were many talks on Generative AI and LLMs and their applications in data management. With dozens of research papers and industry sessions, four (!) workshops, and two keynotes, it was impossible to ignore that Generative AI has arrived in the data management world and is here to stay.

Natural Language Interfaces

Text2SQL came up in many panel discussions, industry talks, and hallway conversations. The importance of context was repeatedly emphasized, particularly rich schema metadata and query history. Several sessions also highlighted responsible AI safeguards and downstream feedback mechanisms. Preferred architectural patterns are converging towards a combination of Foundation Models (FM) and Retrieval-Augmented Generation (RAG), with optionally fine-tuned foundation models. It was exciting to see that progress has also continued in the development of smaller Text2SQL models. Notably, Renmin University of China presented the CodeS model, which achieved a new top score on the Spider benchmark.

Industry presentations underscored how Text2SQL solutions are primarily used today as co-pilots for SQL analysts and data scientists to yield significant productivity gains. Solutions such as semantic layers seem promising for enabling natural language interfaces for business users, particularly those that allow essential business metrics (e.g., organization-specific definitions of revenue) to be represented on the language layer.

Data Discovery

Finding the right data in a data lake with hundreds or thousands of tables is often a challenging problem for data analysts and data scientists. Madelon Hulsebos from UC Berkeley presented an insightful user study on how users actually want to use data search systems. Simple search features that help users quickly identify the most relevant dataset are the most effective, and information about data freshness and semantics is crucial for identifying that dataset quickly.

Cocoon, a semantic data profiling tool built by Zezhou Huang on DuckDB, fits in here very well! The future of dataset search is moving towards more interactive and flexible search solutions that go beyond keyword search. One fascinating example is Ver, a view discovery system. We look forward to seeing how this space evolves and how MotherDuck can help make data sharing and discovery more intuitive.

Looking Ahead

The SIGMOD/PODS 2024 conference highlighted ongoing advancements in database technologies and the importance of collaboration between academia and industry.

At MotherDuck, we look forward to seeing how these innovations will shape the data management landscape in the years to come.

Stay tuned for more updates and reflections on our involvement in upcoming conferences, and learn more about events and talks we’re giving worldwide!

