By Quang Vinh Ngo
Energy Network Day 2025
Last week, several of us from RELAX-DN at Chalmers joined the Energy Network Day 2025, organized by the Area of Advance Energy at Chalmers University of Technology. This is an event where researchers from academia and industry come together to stay updated and explore new collaboration opportunities in the field of energy. The day includes networking, inspiring examples, knowledge exchange, and a poster exhibition.
Our team presented two research posters at the event. The first, titled “Data-driven support for intelligent stream analytics: Energy-efficient processing using summarization”, outlined our theoretical framework for optimizing data pipelines through summarization techniques. The second poster, “Summarization-Based Clustering: A Use Case in Electricity Grid Data”, demonstrated a practical implementation of these concepts. In the rest of this blog post, we explore the motivation behind and the application of data summarization in general, a concept that applies to many fields, such as finance, security, and energy.
Context & Motivation
Today, there is a wide variety of heterogeneous data sources: vehicles, sensors, smart meters, edge devices, smart homes, IoT devices, and more, producing enormous amounts of data. Analyzing this data can yield many valuable insights. For example, smart meter data helps detect anomalous consumption patterns, enables predictive maintenance for utility equipment, supports personalized energy-saving recommendations, and improves grid management through better load forecasting. However, leveraging this data presents significant challenges, which are well captured by the set of characteristics known as the 5 V’s of big data: Volume (the amount of data being generated and collected), Velocity (the speed at which data is generated, streamed, and needs to be processed), Variety (the diverse types of data and structures), Veracity (the trustworthiness or accuracy of the data), and Value (the potential to turn data into meaningful insights and tangible benefits).
To address these challenges, organizations implement data pipeline infrastructure—systems that coordinate and automate the transfer and processing of data from source to destination. These pipelines ingest data from heterogeneous sources, transform it through various processing stages, and deliver clean, structured data for downstream applications such as machine learning, data analytics, and monitoring systems.
A data pipeline provides a way to deal with the aforementioned 5 V’s of big data. To deal with high “Volume”, distributed processing frameworks can be implemented to scale computing horizontally across multiple machines. To handle high “Velocity”, pipelines incorporate stream processing components that process data in motion rather than waiting for complete batch collection. The “Variety” challenge is addressed through standardized ingestion patterns with flexible schema handling and format conversion modules. For “Veracity”, pipelines include validation rules, anomaly detection, and data quality checks at various stages to filter out corrupt or inaccurate information. Finally, to maximize “Value”, they incorporate purpose-built transformations that filter, aggregate, and enrich raw data into formats optimized for specific analytical needs.
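To make the stream-processing and data-quality ideas above a bit more concrete, here is a minimal, illustrative Python sketch of one such pipeline stage. All names, fields, and thresholds below are hypothetical, not part of any specific framework: the stage validates smart-meter readings as they arrive (Veracity) and emits per-meter aggregates per window (Velocity and Value), rather than waiting for a complete batch.

```python
from collections import defaultdict

def is_valid(reading):
    # Veracity: drop corrupt or implausible records (bounds are hypothetical).
    return reading.get("kwh") is not None and 0.0 <= reading["kwh"] < 100.0

def process_stream(readings, window_size=60):
    """Consume an iterable of smart-meter readings and emit per-meter totals
    for each window of `window_size` valid readings (data processed in motion)."""
    totals = defaultdict(float)
    count = 0
    for reading in readings:
        if not is_valid(reading):
            continue  # data-quality check applied inline, as the data flows
        totals[reading["meter_id"]] += reading["kwh"]
        count += 1
        if count % window_size == 0:  # close the window and forward the aggregate
            yield dict(totals)
            totals.clear()

# Usage: a small synthetic stream of readings from two meters
stream = ({"meter_id": m, "kwh": 0.5} for _ in range(120) for m in ("A", "B"))
for window in process_stream(stream, window_size=60):
    print(window)  # e.g. {'A': 15.0, 'B': 15.0}
```

A real pipeline would of course run such stages on a distributed stream-processing framework rather than in a single Python process, but the shape of the logic, validate, aggregate, forward, is the same.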
Despite these capabilities, traditional data pipelines often struggle with performance bottlenecks when handling extreme data volumes and rates under near-real-time or real-time processing requirements. This challenge is particularly evident in domains like financial services, where high-frequency trading systems require microsecond-level data processing and even minor delays can result in millions of dollars in missed opportunities or losses. Similarly, fraud detection systems must analyze transaction patterns in real time across millions of accounts to prevent financial crimes before they complete. When these systems fail, e.g. when trading platforms experience outages during periods of market volatility, they can cause significant financial and reputational damage across entire organizations. These examples highlight the consequences when data processing infrastructure fails to maintain performance under extreme conditions.
To alleviate the challenges of massive data volumes and the associated processing costs (hardware, energy, time), data summarization offers a compelling way forward.
Scalable/Energy-efficient processing using summarization
Data summarization is based on the observation that big data is often very large but also often ephemeral, and that the value brought by different pieces of data is uneven. Instead of storing, processing, and indexing the entire volume of data, we can extract the useful information from massive data sets into synopsis data structures, which typically require much less space and computation.
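As a concrete illustration of what a synopsis data structure can look like, below is a minimal Python sketch of a Count-Min sketch, a classic synopsis for approximate frequency counting over a stream. The width and depth parameters are arbitrary choices for illustration; this is not our specific research method.

```python
import hashlib

class CountMinSketch:
    """Answers approximate frequency queries using a small, fixed amount of
    memory instead of storing and indexing every item in the stream."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row from a salted digest of the item.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Counters can only overestimate, so the minimum across rows is closest.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

# Usage: approximate per-meter event counts without keeping the raw stream.
cms = CountMinSketch()
for meter_id in ["A", "B", "A", "C", "A"]:
    cms.add(meter_id)
print(cms.estimate("A"))  # approximately 3
```

The trade-off is the usual one for synopses: a bounded, tunable approximation error in exchange for memory and processing costs that no longer grow with the size of the data.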
Summarization algorithms build on a number of basic algorithmic and mathematical concepts, such as histograms, clustering, frequency moments, and various mathematical transformations, to produce compact representations of the data. Take time-series analysis of financial data: trading systems traditionally must process both recent and historical data to identify correlations between prices, market events, and temporal patterns. Data summarization techniques approximate the essential shape and features of these time series while significantly reducing data volume. These approximations preserve critical patterns and anomalies while discarding noise, which can enable effective analysis with much lower computational requirements. By implementing effective data summarization strategies, organizations can maintain high-performance data pipelines even under extreme conditions that would otherwise overwhelm traditional processing approaches.
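To illustrate the time-series case, here is a minimal sketch of Piecewise Aggregate Approximation (PAA), one simple, well-known way to preserve the overall shape of a series while drastically reducing its volume. The segment count and the synthetic load curve are made-up values for illustration, not data from our use cases.

```python
import numpy as np

def paa(series, n_segments):
    """Reduce `series` to `n_segments` values, one mean per equal-length chunk."""
    series = np.asarray(series, dtype=float)
    chunks = np.array_split(series, n_segments)
    return np.array([chunk.mean() for chunk in chunks])

# Usage: a noisy synthetic daily load curve compressed from 1440 minute-level
# samples down to 24 hourly means that still trace the consumption pattern.
rng = np.random.default_rng(0)
minutes = np.arange(1440)
load = 1.0 + 0.5 * np.sin(2 * np.pi * minutes / 1440) + rng.normal(0, 0.05, 1440)
summary = paa(load, n_segments=24)
print(summary.round(2))
```

Downstream tasks such as clustering or anomaly detection can then operate on the 24-value summary instead of the full-resolution series, at a fraction of the storage and compute cost.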
Moreover, having simple, fast, and efficient algorithms as baselines offers further substantial long-term benefits, such as scalability and energy efficiency.
Our research
At DCS @ Chalmers (the Distributed Computing & Systems group), our research focuses on advancing data summarization techniques to enable lean, efficient data pipelines. We investigate enhanced, efficient data structures and transformation methods that preserve analytical value while minimizing resource consumption. Our work also examines how summarization algorithms can better utilize modern hardware architectures, from multi-core processors to specialized accelerators, to achieve higher throughput with less energy expenditure. This theoretical work finds practical application through our collaborations with industrial partners, including Volvo Group, Gothenburg Energy, and possibly other partners within SESBC (https://www.sesbc.se/), where we apply our summarization techniques to real-world challenges such as grid management support, forecasting, and customer segmentation.