<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Database Reading Group | DIPr Lab at PSU</title><link>https://diprlab.github.io/dbrg/</link><atom:link href="https://diprlab.github.io/dbrg/index.xml" rel="self" type="application/rss+xml"/><description>Database Reading Group</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 06 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://diprlab.github.io/media/logo_hu_b20e6a1540b35ad9.png</url><title>Database Reading Group</title><link>https://diprlab.github.io/dbrg/</link></image><item><title>Winter 2026 Week 9</title><link>https://diprlab.github.io/dbrg/events/2026/winter/09/</link><pubDate>Fri, 06 Mar 2026 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2026/winter/09/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;
BridgeScope: A Universal Toolkit for Bridging Large Language Models and Databases
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authors&lt;/td&gt;
&lt;td&gt;
Lianggui Weng, Dandan Liu, Rong Zhu, Bolin Ding, Jingren Zhou
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstract&lt;/td&gt;
&lt;td&gt;
As large language models (LLMs) demonstrate increasingly powerful reasoning and orchestration capabilities, LLM-based agents are rapidly adopted for complex data-related tasks. Despite this progress, the current design of how LLMs interact with databases exhibits critical limitations in usability, security, privilege management, and data transmission efficiency. To address these challenges, we introduce BridgeScope, a universal toolkit that bridges LLMs and databases through three key innovations. First, it modularizes SQL operations into fine-grained tools for context retrieval, CRUD execution, and ACID-compliant transaction management. This design enables more precise, LLM-friendly controls over database functionality. Second, it aligns tool implementations with database privileges and user-defined security policies to steer LLMs away from unsafe or unauthorized operations, which not only safeguards database security but also enhances task execution efficiency by enabling early identification and termination of infeasible tasks. Third, it introduces a proxy mechanism that supports seamless data transfer between tools, thereby bypassing the transmission bottlenecks via LLMs. All of these designs are database-agnostic and can be transparently integrated with existing agent architectures. We also release an open-source implementation of BridgeScope for PostgreSQL. Evaluations on two novel benchmarks demonstrate that BridgeScope enables LLM agents to interact with databases more effectively. It reduces token usage by up to 80% through improved security awareness and uniquely supports data-intensive workflows beyond existing toolkits. These results establish BridgeScope as a robust foundation for next-generation intelligent data automation.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Winter 2026 Week 8</title><link>https://diprlab.github.io/dbrg/events/2026/winter/08/</link><pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2026/winter/08/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;
Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authors&lt;/td&gt;
&lt;td&gt;
Ted Shaowang, Shinan Liu, Jonatas Marques, Nick Feamster, Sanjay Krishnan
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstract&lt;/td&gt;
&lt;td&gt;
Machine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. This raises significant privacy concerns, necessitating the application of data minimization – a foundational principle in emerging data regulations, which mandates that service providers only collect data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary “relevant and necessary” rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. Through our approach, we demonstrate that our framework can reduce user identifiability by up to 16.7% while maintaining accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Winter 2026 Week 5</title><link>https://diprlab.github.io/dbrg/events/2026/winter/05/</link><pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2026/winter/05/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;
I Can’t Believe It’s Not Yannakakis: Pragmatic Bitmap Filters in Microsoft SQL Server
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authors&lt;/td&gt;
&lt;td&gt;
Hangdong Zhao, et al.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstract&lt;/td&gt;
&lt;td&gt;
The quest for optimal join processing has reignited interest in the Yannakakis algorithm, as researchers seek to realize its theoretical ideal in practice via bitmap filters instead of expensive semijoins. While this academic pursuit may seem distant from industrial practice, our investigation into production databases led to a startling discovery: over the last decade, Microsoft SQL Server has built an infrastructure for bitmap pre-filtering that subsumes the very spirit of Yannakakis! This is not a story of academia leading industry; but rather of industry practice, guided by pragmatic optimization, outpacing academic endeavors. This paper dissects this discovery. As a crucial contribution, we prove how SQL Server’s bitmap filters, pull-based execution, and Cascades optimizer conspire to not only consider, but often generate, instance-optimal plans, when it truly minimizes the estimated cost! Moreover, its rich plan search space reveals novel, largely overlooked pre-filtering opportunities on intermediate results, which approach strong semi-robust runtime for arbitrary join graphs. Instead of a verdict, this paper is an invitation: by exposing a system design that is long-hidden, we point our community towards a challenging yet promising research terrain.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Winter 2026 Week 4</title><link>https://diprlab.github.io/dbrg/events/2026/winter/04/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2026/winter/04/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;
LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authors&lt;/td&gt;
&lt;td&gt;
Yiming Lin, Daokun Jiang, Roberto Yus, Georgios Bouloukakis, Andrew Chio, Sharad Mehrotra, Nalini Venkatasubramanian
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstract&lt;/td&gt;
&lt;td&gt;
This paper explores the data cleaning challenges that arise in using WiFi connectivity data to locate users at semantic indoor locations such as buildings, regions, and rooms. WiFi connectivity data consists of sporadic connections between devices and nearby WiFi access points (APs), each of which may cover a relatively large area within a building. Our system, entitled semantic LOCATion cleanER (LOCATER), postulates semantic localization as a series of data cleaning tasks: first, it treats the problem of determining the AP to which a device is connected between any two of its connection events as a missing value detection and repair problem. It then associates the device with the semantic subregion (e.g., a conference room in the region) by postulating it as a location disambiguation problem. LOCATER uses a bootstrapping semi-supervised learning method for coarse localization and a probabilistic method to achieve finer localization. The paper shows that LOCATER can achieve significantly high accuracy at both the coarse and fine levels.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Winter 2026 Week 2</title><link>https://diprlab.github.io/dbrg/events/2026/winter/02/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2026/winter/02/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;Title&lt;/td&gt;
&lt;td&gt;
LLM-Driven Auto Configuration for Transient IoT Device Collaboration
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authors&lt;/td&gt;
&lt;td&gt;
Hetvi Shastri, Walid A. Hanafy, Li Wu, David Irwin, Mani Srivastava, Prashant Shenoy
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Abstract&lt;/td&gt;
&lt;td&gt;
Today's Internet of Things (IoT) has evolved from simple sensing and actuation devices to those with embedded processing and intelligent services, enabling rich collaborations between users and their devices. However, enabling such collaboration becomes challenging when transient devices need to interact with host devices in temporarily visited environments. In such cases, fine-grained access control policies are necessary to ensure secure interactions; however, manually implementing them is often impractical for non-expert users. Moreover, at run-time, the system must automatically configure the devices and enforce such fine-grained access control rules. Additionally, the system must address the heterogeneity of devices.&lt;br /&gt;&lt;br /&gt;
In this paper, we present CollabIoT, a system that enables secure and seamless device collaboration in transient IoT environments. CollabIoT employs a Large language Model (LLM)-driven approach to convert users' high-level intents to fine-grained access control policies. To support secure and seamless device collaboration, CollabIoT adopts capability-based access control for authorization and uses lightweight proxies for policy enforcement, providing hardware-independent abstractions.&lt;br /&gt;&lt;br /&gt;
We implement a prototype of CollabIoT's policy generation and auto configuration pipelines and evaluate its efficacy on an IoT testbed and in large-scale emulated environments. We show that our LLM-based policy generation pipeline is able to generate functional and correct policies with 100% accuracy. At runtime, our evaluation shows that our system configures new devices in ~150 ms, and our proxy-based data plane incurs network overheads of up to 2 ms and access control overheads of up to 0.3 ms.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Fall 2025 Week 9</title><link>https://diprlab.github.io/dbrg/events/2025/fall/09/</link><pubDate>Wed, 26 Nov 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/fall/09/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
SIEVE: Effective Filtered Vector Search with Collection of Indexes
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Zhaoheng Li, et al.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Real-world tasks such as recommending videos tagged "kids" can be reduced to finding similar vectors associated with hard predicates. This task, filtered vector search, is challenging as prior state-of-the-art graph-based (unfiltered) similarity search techniques degenerate when hard constraints are considered: effective graph-based filtered similarity search relies on sufficient connectivity for reaching similar items within a few hops. To consider predicates, recent works propose modifying graph traversal to visit only items that satisfy predicates. However, they fail to offer the just-a-few-hops property for a wide range of predicates: they must restrict predicates significantly or lose efficiency if only a few items satisfy the predicates. &lt;br /&gt; &lt;br /&gt;
We propose an opposite approach: instead of constraining traversal, we build many indexes each serving different predicate forms. For effective construction, we devise a three-dimensional analytical model capturing relationships among index size, search time, and recall, with which we follow a workload-aware approach to pack as many useful indexes as possible into a collection. At query time, the analytical model is employed yet again to discern the one that offers the fastest search at a given recall. We show superior performance and support on datasets with varying selectivities and forms: our approach achieves up to 8.06x speedup while having as low as 1% build time versus other indexes, with less than 2.15x memory of a standard HNSW graph and modest knowledge of past workloads.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Fall 2025 Week 8</title><link>https://diprlab.github.io/dbrg/events/2025/fall/08/</link><pubDate>Wed, 19 Nov 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/fall/08/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Adaptive Differentially Private Structural Entropy Minimization for Unsupervised Social Event Detection
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Zhiwei Yang, et al.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Social event detection refers to extracting relevant message clusters from social media data streams to represent specific events in the real world. Social event detection is important in numerous areas, such as opinion analysis, social safety, and decision-making. Most current methods are supervised and require access to large amounts of data. These methods need prior knowledge of the events and carry a high risk of leaking sensitive information in the messages, making them less applicable in open-world settings. Therefore, conducting unsupervised detection while fully utilizing the rich information in the messages and protecting data privacy remains a significant challenge. To this end, we propose a novel social event detection framework, ADP-SEMEvent, an unsupervised social event detection method that prioritizes privacy. Specifically, ADP-SEMEvent is divided into two stages, i.e., the construction stage of the private message graph and the clustering stage of the private message graph. In the first stage, an adaptive differential privacy approach is used to construct a private message graph. In this process, our method can adaptively apply differential privacy based on the events occurring each day in an open environment to maximize the use of the privacy budget. In the second stage, to address the reduction in data utility caused by noise, a novel 2-dimensional structural entropy minimization algorithm based on optimal subgraphs is used to detect events in the message graph. The highlight of this process is that it is unsupervised and does not compromise differential privacy. Extensive experiments on two public datasets demonstrate that ADP-SEMEvent can achieve detection performance comparable to state-of-the-art methods while maintaining reasonable privacy budget parameters.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Fall 2025 Week 7</title><link>https://diprlab.github.io/dbrg/events/2025/fall/07/</link><pubDate>Wed, 12 Nov 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/fall/07/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Scribe: How Meta transports terabytes per second in real time
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Manos Karpathiotakis, et al.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Millions of web servers and a multitude of applications are producing ever-increasing amounts of data in real time at Meta. Regardless of how data is generated and how it is processed, there is a need for infrastructure that can accommodate the transport of arbitrarily large data streams from their generation location to their processing location with low latency. &lt;br /&gt; &lt;br /&gt;
This paper presents Scribe, a multi-tenant message queue service that natively supports the requirements of Meta’s data-intensive applications, ingesting &gt; 15 TB/s and serving &gt; 110 TB/s to its consumers. Scribe relies on a multi-hop write path and opportunistic data placement to maximise write availability, whereas its read path adapts replica placement and representation based on the incoming workload as a means to minimise resource consumption for both Scribe and its downstreams. The wide range of Scribe use cases can pick from a range of offered guarantees, based on the trade-offs favourable for each one.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Fall 2025 Week 6</title><link>https://diprlab.github.io/dbrg/events/2025/fall/06/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/fall/06/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Delta Sharing: An Open Protocol for Cross-Platform Data Sharing
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Krishna Puttaswamy, et al.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Organizations across industries increasingly rely on sharing data to drive collaboration, innovation, and business performance. However, securely and efficiently sharing live data across diverse platforms and adhering to varying governance requirements remains a significant challenge. Traditional approaches, such as FTP and proprietary in-data-warehouse solutions, often fail to meet the demands of interoperability, cost, scalability, and low overhead. This paper introduces Delta Sharing, an open protocol we developed in collaboration with industry partners, to overcome these limitations. Delta Sharing leverages open formats like Delta Lake and Apache Parquet alongside simple HTTP APIs to enable seamless, secure, and live data sharing across heterogeneous systems. Since its launch in 2021, Delta Sharing has been adopted by over 4000 enterprises and supported by hundreds of major software and data vendors. We discuss the key challenges in developing Delta Sharing and how our design addresses them. We also present, to our knowledge, the first large-scale study of production data sharing workloads offering insights into this emerging data platform capability.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Summer 2025 Week 4</title><link>https://diprlab.github.io/dbrg/events/2025/summer/04/</link><pubDate>Wed, 20 Aug 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/summer/04/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
TSB-UAD: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
John Paparrizos, Yuhao Kang, Paul Boniol, Ruey S. Tsay, Themis Palpanas, Michael J. Franklin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
The detection of anomalies in time series has gained ample academic and industrial attention. However, no comprehensive benchmark exists to evaluate time-series anomaly detection methods. It is common to use (i) proprietary or synthetic data, often biased to support particular claims; or (ii) a limited collection of publicly available datasets. Consequently, we often observe methods performing exceptionally well in one dataset but surprisingly poorly in another, creating an illusion of progress. To address the issues above, we thoroughly studied over one hundred papers to identify, collect, process, and systematically format datasets proposed in the past decades. We summarize our effort in TSB-UAD, a new benchmark to ease the evaluation of univariate time-series anomaly detection methods. Overall, TSB-UAD contains 13766 time series with labeled anomalies spanning different domains with high variability of anomaly types, ratios, and sizes. TSB-UAD includes 18 previously proposed datasets containing 1980 time series and we contribute two collections of datasets. Specifically, we generate 958 time series using a principled methodology for transforming 126 time-series classification datasets into time series with labeled anomalies. In addition, we present data transformations with which we introduce new anomalies, resulting in 10828 time series with varying complexity for anomaly detection. Finally, we evaluate 12 representative methods demonstrating that TSB-UAD is a robust resource for assessing anomaly detection methods. We make our data and code available at www.timeseries.org/TSB-UAD. TSB-UAD provides a valuable, reproducible, and frequently updated resource to establish a leaderboard of univariate time-series anomaly detection methods.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Summer 2025 Week 3</title><link>https://diprlab.github.io/dbrg/events/2025/summer/03/</link><pubDate>Wed, 06 Aug 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/summer/03/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
HoneyBee: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Hongbin Zhong, Matthew Lentz, Nina Narodytska, Adriana Szekeres, Kexin Rong
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
As vector databases gain traction in enterprise applications, robust access control has become critical to safeguard sensitive data. Access control in these systems is often implemented through hybrid vector queries, which combine nearest neighbor search on vector data with relational predicates based on user permissions. However, existing approaches face significant trade-offs: creating dedicated indexes for each user minimizes query latency but introduces excessive storage redundancy, while building a single index and applying access control after vector search reduces storage overhead but suffers from poor recall and increased query latency. This paper introduces HoneyBee, a dynamic partitioning framework that bridges the gap between these approaches by leveraging the structure of Role-Based Access Control (RBAC) policies. RBAC, widely adopted in enterprise settings, groups users into roles and assigns permissions to those roles, creating a natural "thin waist" in the permission structure that is ideal for partitioning decisions. Specifically, HoneyBee produces overlapping partitions where vectors can be strategically replicated across different partitions to reduce query latency while controlling storage overhead. By introducing analytical models for the performance and recall of the vector search, HoneyBee formulates the partitioning strategy as a constrained optimization problem to dynamically balance storage, query efficiency, and recall. Evaluations on RBAC workloads demonstrate that HoneyBee reduces storage redundancy compared to role partitioning and achieves up to 6x faster query speeds than row-level security (RLS) with only 1.4x storage increase, offering a practical middle ground for secure and efficient vector search.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Summer 2025 Week 2</title><link>https://diprlab.github.io/dbrg/events/2025/summer/02/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/summer/02/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
An Elephant Under the Microscope: Analyzing the Interaction of Optimizer Components in PostgreSQL
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Rico Bergmann, Claudio Hartmann, Dirk Habich, Wolfgang Lehner
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Despite an ever-growing corpus of novel query optimization strategies, the interaction of the core components of query optimizers is still not well understood. This situation can be problematic for two main reasons: On the one hand, this may cause surprising results when two components influence each other in an unexpected way. On the other hand, this can lead to wasted effort in regard to both engineering and research, e.g., when an improvement for one component is dwarfed or entirely canceled out by problems of another component. Therefore, we argue that making improvements to a single optimization component requires a thorough understanding of how these changes might affect the other components. To achieve this understanding, we present results of a comprehensive experimental analysis of the interplay in the traditional optimizer architecture using the widely-used PostgreSQL system as prime representative. Our evaluation and analysis revisit the core building blocks of such an optimizer, i.e. per-column statistics, cardinality estimation, cost model, and plan generation. In particular, we analyze how these building blocks influence each other and how they react when faced with faulty input, such as imprecise cardinality estimates. Based on our results, we draw novel conclusions and make recommendations on how these should be taken into account.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Summer 2025 Week 1</title><link>https://diprlab.github.io/dbrg/events/2025/summer/01/</link><pubDate>Wed, 09 Jul 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/summer/01/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Botong Huang, Boyuan Zhang, Igor Belianski, Ling Geng, Matt Uhlar, Nikhil Shah, Olivia Zhou, Saras Nowak, Sasha Lionheart, Vlad Lifliand, Wendy Grus, Yiwen Zhu, Ankur Sharma, Dzmitry Pauliukevich, Enrico Sartorello, Ilaria Battiston, Ivan Kalev, Lawrence Benson, Leon Papke, Niklas Semmler, Till Merker, Yi Huang
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Streaming data pipelines remain challenging and expensive to build and maintain, despite significant advancements in stronger consistency, event time semantics, and SQL support over the last decade. Persistent obstacles continue to hinder usability, such as the need for manual incrementalization, semantic discrepancies across SQL implementations, and the lack of enterprise-grade operational features (e.g. granular access control, disaster recovery). While the rise of incremental view maintenance (IVM) as a way to integrate streaming with databases has been a huge step forward, transaction isolation in the presence of IVM remains underspecified, which leaves the maintenance of application-level invariants as a painful exercise for the user. Meanwhile, most streaming systems optimize for latencies of 100 milliseconds to 3 seconds, whereas many practical use cases are well-served by latencies ranging from seconds to tens of minutes.
&lt;p&gt;In this paper, we present delayed view semantics (DVS), a conceptual foundation that bridges the semantic gap between streaming and databases, and introduce Dynamic Tables, Snowflake&amp;rsquo;s declarative streaming transformation primitive designed to democratize analytical stream processing. DVS formalizes the intuition that stream processing is primarily a technique to eagerly compute derived results asynchronously, while also addressing the need to reason about the resulting system end to end. Dynamic Tables then offer two key advantages: ease of use through DVS, enterprise-grade features, and simplicity; as well as scalable cost efficiency via IVM with an architecture designed for diverse latency requirements. We first develop extensions to transaction isolation that permit the preservation of invariants in streaming applications. We then detail the implementation challenges of Dynamic Tables and our experience operating it at scale. Finally, we share insights into user adoption and discuss our vision for the future of stream processing.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 9</title><link>https://diprlab.github.io/dbrg/events/2025/spring/09/</link><pubDate>Fri, 30 May 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/09/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
In-Database Time Series Clustering
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Yunxiang Su, Kenny Ye Liang, Shaoxu Song
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Time series data are often clustered repeatedly across various time ranges to mine frequent subsequence patterns from different periods, which could further support downstream applications. Existing state-of-the-art (SOTA) time series clustering methods, such as K-Shape, can proficiently cluster time series data according to their shapes. However, the in-database time series clustering problem has been neglected, especially in IoT scenarios with large-volume data and high efficiency demands. Most time series databases employ LSM-Tree based storage to support intensive writes, leaving the underlying data points out of order in timestamps. Therefore, to apply existing out-of-database methods, all data points must be fully loaded into memory and chronologically sorted. Additionally, out-of-database methods must cluster from scratch each time, making them inefficient when handling queries across different time ranges. In this work, we propose an in-database adaptation of the SOTA time series clustering method K-Shape. Moreover, to solve the problem that K-Shape cannot efficiently handle long time series, we propose Medoid-Shape, as well as its in-database adaptation for further acceleration. Extensive experiments are conducted to demonstrate the higher efficiency of our proposals, with comparable effectiveness. Remarkably, all proposals have already been implemented in an open-source commodity time series database, Apache IoTDB.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 8</title><link>https://diprlab.github.io/dbrg/events/2025/spring/08/</link><pubDate>Fri, 23 May 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/08/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Highly Efficient and Scalable Access Control Mechanism for IoT Devices in Pervasive Environments
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Alian Yu, Jian Kang, Wei Jiang, Dan Lin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
With the continuous advancement of sensing, networking, controlling, and computing technologies, there is a growing number of IoT (Internet of Things) devices emerging that are expected to integrate into public infrastructure in the near future. However, the deployment of these smart devices in public venues presents new challenges for existing access control mechanisms, particularly in terms of efficiency. To address these challenges, we have developed a highly efficient and scalable access control mechanism that enables automatic and fine-grained access control management while incurring low overhead in large-scale settings. Our mechanism includes a dual-hierarchy access control structure and associated information retrieval algorithms, which we have used to develop a large-scale IoT device access control system called FACT+. FACT+ overcomes the efficiency issues of granting and inquiring access control status over millions of devices in pervasive environments. Additionally, our system offers a pay-and-consume scheme and plug-and-play device management for convenient adoption by service providers. We have conducted extensive experiments to demonstrate the practicality, effectiveness, and efficiency of our access control mechanism.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 6</title><link>https://diprlab.github.io/dbrg/events/2025/spring/06/</link><pubDate>Fri, 09 May 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/06/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Grouping, Subsumption, and Disjunctive Join Optimizations in Oracle
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Rafi Ahmed, Krishna Kantikiran Pasupuleti, Sriram Tirupattur, Lei Sheng, Hong Su, Mohamed Ziauddin
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Query optimization must evolve with new workloads. As analytic and data warehouse workloads become more ubiquitous, optimization techniques that reduce the amount of data processed during query execution, enable shared computation, and avoid expensive data access and joins must be rigorously explored. In this paper, we present aggregate-decomposition techniques as enhancements to an existing query transformation that performs grouping before joins. Consequently, the transformation generates more query rewrite candidates and can also be applied to a larger set of queries. Further, we introduce two new query transformations: i) subsumption of views and subqueries, which explores opportunities for sharing computation, and ii) a union-all duplicator transformation for queries with disjunctive join predicates, which removes the need for multiple data accesses and joins. These techniques are applicable to commonly observed query patterns in customer workloads and provide significant performance benefits, as indicated in our performance study. They have been implemented in the Oracle RDBMS.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 4</title><link>https://diprlab.github.io/dbrg/events/2025/spring/04/</link><pubDate>Fri, 25 Apr 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/04/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
How Good Are Query Optimizers, Really?
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, Thomas Neumann
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques, comparing exhaustive dynamic programming with heuristic algorithms, and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 3</title><link>https://diprlab.github.io/dbrg/events/2025/spring/03/</link><pubDate>Fri, 18 Apr 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/03/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
PDX: A Data Layout for Vector Similarity Search
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Leonardo Kuffo, Elena Krippner, and Peter Boncz (CWI Amsterdam, The Netherlands)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX, stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple vectors at a time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (avg 40% faster), relying only on scalar code that gets auto-vectorized. We combined the PDX layout with the recent dimension-pruning algorithms ADSampling and BSA, which accelerate approximate vector search. We found that on the horizontal vector layout these algorithms can lose to SIMD-optimized linear scans, even when they themselves are SIMD-optimized. However, when used on PDX, their benefit is restored to 2-7x. We find that search on PDX is especially fast when a limited number of dimensions has to be scanned fully, which is what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data "as-is" without preprocessing, making it attractive for vector databases with frequent updates.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item><item><title>Spring 2025 Week 1</title><link>https://diprlab.github.io/dbrg/events/2025/spring/01/</link><pubDate>Fri, 04 Apr 2025 00:00:00 +0000</pubDate><guid>https://diprlab.github.io/dbrg/events/2025/spring/01/</guid><description>&lt;table&gt;
&lt;tr&gt;
&lt;td&gt;
Title
&lt;/td&gt;
&lt;td&gt;
Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Authors
&lt;/td&gt;
&lt;td&gt;
Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Abstract
&lt;/td&gt;
&lt;td&gt;
Given a query vector, approximate nearest neighbor search (ANNS) aims to retrieve similar vectors from a set of high-dimensional base vectors. However, many real-world applications jointly query both vector data and structured data, imposing label constraints such as attributes and keywords on the search, known as filtered ANNS. Effectively incorporating filtering conditions with vector similarity presents significant challenges, including indexing a dynamically filtered search space, handling agnostic query labels, computational overhead for label-irrelevant vectors, and potential inadequacy in returning results. To tackle these challenges, we introduce a novel approach called the Label Navigating Graph, which encodes the containment relationships among the label sets of all vectors. Built upon graph-based ANNS methods, we develop a general framework termed the Unified Navigating Graph (UNG) to bridge the gap between label set containment and vector proximity relations. UNG offers several advantages: versatility in supporting any query label size and specificity, fidelity in exclusively searching filtered vectors, completeness in providing sufficient answers, and adaptability in integrating with most graph-based ANNS algorithms. Extensive experiments on real datasets demonstrate that the proposed framework outperforms all baselines, achieving 10x speedups at the same accuracy.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</description></item></channel></rss>