Xianzhi Zeng, Wenchao Jiang, Shuhao Zhang
Matrix multiplication (MM) is pivotal in fields from deep learning to scientific computing, driving the quest for improved computational efficiency. Accelerating MM encompasses strategies such as complexity reduction, parallel and distributed computing, hardware acceleration, and approximate computing, notably approximate matrix multiplication (AMM) algorithms. Amid growing concerns over the resource demands of large language models (LLMs), AMM has garnered renewed focus. However, understanding of the nuances that govern AMM’s effectiveness remains incomplete. This study delves into AMM by examining algorithmic strategies, operational specifics, dataset characteristics, and their application in real-world tasks. Through comprehensive testing across diverse datasets and scenarios, we analyze how these factors affect AMM’s performance, finding that the choice of AMM approach significantly influences the balance between efficiency and accuracy, with factors such as memory access playing a pivotal role. Additionally, dataset attributes prove vital to the success of AMM in applications. Our results advocate tailored algorithmic approaches and careful strategy selection to enhance AMM’s effectiveness. To aid the practical application and ongoing research of AMM, we introduce LibAMM, a toolkit offering a wide range of AMM algorithms, benchmarks, and tools for experiment management. LibAMM aims to facilitate research and application in AMM, guiding future developments towards more adaptive and context-aware computational solutions.
Approximate Computing; Matrix Multiplication
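To make the AMM family concrete, the sketch below shows one classic sampling-based approximation: estimating A·B by Monte Carlo sampling of inner-dimension terms with norm-proportional probabilities. The function name and pure-Python style are illustrative assumptions for this summary, not LibAMM's API or any specific algorithm benchmarked in the paper.

```python
import random

def sampled_amm(A, B, s, seed=0):
    """Monte Carlo AMM sketch: approximate A @ B by sampling s of the
    k inner-dimension outer products, with probability proportional to
    the product of column/row norms, and rescaling for unbiasedness."""
    random.seed(seed)
    m, k, n = len(A), len(A[0]), len(B[0])
    # Sampling weight for term t: |A[:, t]| * |B[t, :]|
    w = []
    for t in range(k):
        col_norm = sum(A[i][t] ** 2 for i in range(m)) ** 0.5
        row_norm = sum(B[t][j] ** 2 for j in range(n)) ** 0.5
        w.append(col_norm * row_norm)
    total = sum(w)
    p = [x / total for x in w]
    C = [[0.0] * n for _ in range(m)]
    for t in random.choices(range(k), weights=p, k=s):
        scale = 1.0 / (s * p[t])  # importance-sampling correction
        for i in range(m):
            a = A[i][t] * scale
            for j in range(n):
                C[i][j] += a * B[t][j]
    return C
```

With more samples `s`, the estimate concentrates around the exact product; the efficiency/accuracy trade-off the paper studies corresponds to how small `s` can be made for a given error budget.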
Xianzhi Zeng, Shuhao Zhang, Hongbin Zhong, et al.
Stream Window Join (SWJ), a vital operation in stream analytics, struggles to balance accuracy and latency due to out-of-order data arrivals. Existing methods predominantly rely on adaptive buffering, but often fall short in performance, thereby constraining practical applications. We introduce PECJ, a solution that proactively incorporates unobserved data to enhance accuracy while reducing latency, thus requiring robust predictive modeling of stream oscillation. At the heart of PECJ lies a mathematical formulation of the posterior distribution approximation (PDA) problem using variational inference (VI). This approach circumvents error propagation while meeting the low-latency demands of SWJ. We detail the implementation of PECJ, striking a balance between complexity and generality, and discuss both analytical and learning-based approaches. Experimental evaluations reveal PECJ’s superior performance. The successful integration of PECJ into a multi-threaded SWJ benchmark testbed further establishes its practical value, demonstrating promising advancements in enhancing data stream processing capabilities amidst out-of-order data.
Streams and complex event processing
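The accuracy/latency tension can be seen in the adaptive-buffering baseline that PECJ improves upon: a watermark lags the maximum observed event time by a lateness bound, and tuples arriving behind the watermark are dropped. The sketch below is a minimal illustration under assumed tuple shapes and names; it is not PECJ's algorithm.

```python
def buffered_join(arrivals, lateness):
    """Adaptive-buffering SWJ baseline sketch. Tuples arrive as
    (event_time, side 'R'/'S', key). The watermark trails the largest
    event time seen by `lateness`; a tuple older than the watermark on
    arrival is dropped, trading accuracy for bounded latency."""
    watermark = float("-inf")
    buf = {"R": [], "S": []}
    out, dropped = [], 0
    for et, side, key in arrivals:
        watermark = max(watermark, et - lateness)
        if et < watermark:
            dropped += 1  # too late: lost accuracy
            continue
        other = "S" if side == "R" else "R"
        for et2, k2 in buf[other]:
            if k2 == key:
                out.append((key, min(et, et2), max(et, et2)))
        buf[side].append((et, key))
    return out, dropped
```

A larger `lateness` drops fewer tuples but delays results; PECJ instead predicts the contribution of unobserved tuples rather than waiting for them.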
Xianzhi Zeng, Shuhao Zhang
Data stream compression has attracted much attention recently due to the rise of IoT applications. Thanks to their balanced computational power and energy consumption, asymmetric multicores are widely used in IoT devices. This paper introduces CStream, a novel framework for parallelizing stream compression on asymmetric multicores to minimize energy consumption without violating a user-specified compression latency constraint. Existing works cannot effectively utilize asymmetric multicores for stream compression, primarily due to the non-trivial asymmetric computation and asymmetric communication effects. To this end, CStream is developed with the following two novel designs: 1) fine-grained decomposition, which decomposes a stream compression procedure into multiple fine-grained tasks to better expose the task-core affinities under the asymmetric computation effects; and 2) asymmetry-aware task scheduling, which schedules the decomposed tasks based on a novel cost model to exploit the exposed task-core affinities while accounting for asymmetric communication effects. To validate our proposal, we evaluate CStream against five competing mechanisms for parallelizing stream compression algorithms on a recent asymmetric multicore processor. Our extensive experiments, based on a benchmark consisting of three algorithms and four datasets, show that CStream outperforms alternative approaches with up to 53% lower energy consumption and no violation of the compression latency constraint.
Stream compression, Edge Computing and IoT, Asymmetric Hardware
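The core idea of asymmetry-aware scheduling can be sketched as a greedy assignment driven by a cost model: for each decomposed task, pick the core that minimizes energy among those that still meet the latency constraint. The cost model below (per-core speed and power, energy = runtime × power) and all names are simplifying assumptions for illustration, not CStream's actual model.

```python
def schedule(tasks, cores, deadline):
    """Greedy asymmetry-aware scheduling sketch. Each task is a cycle
    count; each core has a speed (cycles per ms) and a power draw.
    Assign each task to the feasible core (finish time within the
    deadline) that minimizes energy = runtime * power."""
    finish = [0.0] * len(cores)  # current finish time per core
    plan = []
    for cycles in tasks:
        best = None
        for i, core in enumerate(cores):
            runtime = cycles / core["speed"]
            end = finish[i] + runtime
            if end > deadline:
                continue  # would violate the latency constraint
            energy = runtime * core["power"]
            if best is None or (energy, end) < best[:2]:
                best = (energy, end, i)
        if best is None:
            raise ValueError("no feasible core under the deadline")
        _, end, i = best
        finish[i] = end
        plan.append(cores[i]["name"])
    return plan
```

With a big core (fast, power-hungry) and a little core (slow, frugal), the scheduler prefers the little core for energy until its queue would break the deadline, then spills onto the big core.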
Xianzhi Zeng, Shuhao Zhang
Data stream compression has attracted vast interest in emerging IoT (Internet of Things) applications. However, adopting stream compression in IoT applications is non-trivial due to divergent demands, i.e., low energy consumption, high throughput, low latency, high compressibility, and tolerable information loss, which sometimes conflict with each other. This is particularly challenging when adopting stateful stream compression algorithms, which rely on internal state, e.g., a dictionary or a model. This paper presents our vision of CStream, a hardware-conscious stateful stream compression framework for IoT applications. Through careful hardware-conscious optimizations, CStream will minimize energy consumption while striving to satisfy the divergent performance demands of parallelizing complex stateful stream compression algorithms for IoT applications.
Stream compression, IoT and Edge Computing, Asymmetric and Heterogeneous Hardware
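To illustrate what "stateful" means here, a minimal LZW-style encoder is shown below: the growing dictionary is the shared state that makes parallelization hard, since every output code depends on all earlier input. This is a textbook sketch, not CStream's compression algorithm.

```python
def lzw_encode(data):
    """Minimal LZW encoder sketch: the dictionary, built incrementally
    from the input, is the mutable state of a stateful compressor."""
    dictionary = {bytes([i]): i for i in range(256)}  # initial codes
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc  # extend the current phrase
        else:
            out.append(dictionary[w])       # emit code for known phrase
            dictionary[wc] = len(dictionary)  # state update: new entry
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out
```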
Hao Zhang, Xianzhi Zeng, Shuhao Zhang, Xinyi Liu, Mian Lu, Zhao Zheng, Yuqiang Chen
OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference. The online interval join (OIJ), i.e., joining two input streams over relative time intervals, is becoming a core operation in OpenMLDB. Its costly nature and intrinsic parallelism opportunities have created significant interest in accelerating OIJ on modern multicore processors. In this work, we first present an in-depth empirical study of an existing parallel OIJ algorithm (Key-OIJ), which applies a key-partitioned parallelization strategy. Key-OIJ has been implemented in Apache Flink and used in real-world applications. However, our study points out the limitations of Key-OIJ and reveals that it cannot fully exploit modern multicore processors. Based on our analysis, we propose a new approach, the Scale-OIJ algorithm, with a set of optimization techniques. Compared with Key-OIJ, Scale-OIJ is particularly efficient for workloads involving fewer keys, large time intervals, and large lateness configurations. Extensive experiments using real workloads have demonstrated the superior performance of Scale-OIJ. Furthermore, we have partially integrated and tested Scale-OIJ in the latest version of OpenMLDB, demonstrating its practicality in a machine learning database.