



At the 4th International Workshop on Testing Distributed Internet of Things Systems (TDIS ’26), Babar Ali presented ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge–Cloud Speculative LLM Serving. TDIS was held in Edinburgh, Scotland, as part of the broader EuroSys conference (April 27–30, 2026).
TDIS focuses on tools and frameworks for testing and evaluating distributed IoT systems, an increasingly critical area as emerging applications spread computation across the edge–cloud continuum. This year’s edition brought together researchers working on distributed systems, edge computing, and AI inference, making it an ideal venue for our work on collaborative LLM deployment across edge and cloud.
What is ConfigSpec?
Speculative decoding has emerged as a promising way to split LLM inference between edge and cloud. In this approach, a lightweight “draft” model on the edge device quickly proposes a sequence of candidate tokens, which are then sent to a powerful “target” model in the cloud for verification. The accepted tokens become part of the final response, and the rejected ones are discarded. Speculative decoding thus distributes the work between edge and cloud while preserving the response quality of the “target” model.
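The draft-then-verify round described above can be sketched in a few lines. This is a minimal illustration using the common greedy-verification variant, not necessarily the scheme used in the paper: the `NextToken` interface, the toy models, and the rule that the target contributes one token of its own per round are all assumptions made for the example.

```python
from typing import Callable, List

Token = str
# Hypothetical model interface: given a context, return the next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify round with greedy verification."""
    # Edge side: the draft model proposes k candidate tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Cloud side: the target model checks the candidates in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every candidate was accepted; the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


# Toy stand-ins for real models (assumptions for illustration only).
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(ctx: List[Token]) -> Token:
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target everywhere except at position 3.
    return "cat" if len(ctx) == 3 else toy_target(ctx)


first = speculative_step(toy_draft, toy_target, [], k=4)
print(first)   # ['the', 'quick', 'brown', 'fox']
second = speculative_step(toy_draft, toy_target, first, k=2)
print(second)  # ['jumps', 'over', 'the']
```

In the first round the fourth draft token is rejected and replaced by the target's choice. In the second round both draft tokens are accepted, so three tokens are produced for a single verification call.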
If not used carefully, however, speculative decoding can delay responses and increase cost, energy, and memory consumption across heterogeneous hardware. The reason is the enormous number of configuration choices involved: Which draft model should be deployed? At what quantization level? From which model family? How many tokens should be speculated at a time? And on which hardware? Searching for the best configuration by trial and error is time-consuming and costly, which is what motivated this work.
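To get a feel for how quickly these choices multiply, the sketch below enumerates a small hypothetical grid. The device and family names follow the platforms mentioned in this post; the draft sizes, quantization levels, and speculation lengths are placeholder values chosen for illustration, not the search space used in the paper.

```python
from itertools import product

# Hardware and model families named in this post.
devices = ["raspberry-pi-4b", "raspberry-pi-5", "jetson-agx-orin"]
families = ["llama3", "qwen3"]

# Placeholder options (assumptions for illustration only).
draft_sizes = ["small", "medium", "large"]
quant_levels = ["q4", "q8", "fp16"]
spec_lengths = [2, 4, 6, 8]

# Every combination is one candidate deployment configuration.
configs = list(product(devices, families, draft_sizes, quant_levels, spec_lengths))
print(len(configs))  # 3 * 2 * 3 * 3 * 4 = 216
```

Even this modest grid yields 216 candidate deployments, each of which would need to be run and measured if the search were done by brute force.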
ConfigSpec is our framework for navigating this space systematically. It profiles each edge device and draft model, measuring drafting speed, power draw, and how well the draft model aligns with the target model. From these measurements, it analytically evaluates three integral deployment objectives:
- Goodput – How many verified tokens are produced per second?
- Cost efficiency – How many accepted tokens are produced per dollar of cloud API spend?
- Energy efficiency – How many joules are consumed per verified token on the edge device, given the heterogeneous hardware?
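As a rough illustration of how the three objectives can be derived analytically from profiled measurements, the sketch below uses a deliberately simplified model. It is not the formulation from the paper: the `Profile` fields, the independent per-token acceptance assumption, and the per-token cloud pricing are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical measurements for one device and draft model pair."""
    draft_tok_per_s: float        # drafting speed on the edge device
    edge_power_w: float           # average edge power draw while drafting
    acceptance_rate: float        # chance the target accepts a draft token
    verify_latency_s: float       # cloud round trip to verify one batch
    usd_per_verified_tok: float   # assumed cloud price per token verified


def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected output tokens per round, assuming independent acceptances."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def objectives(p: Profile, k: int) -> dict:
    """Estimate the three objectives for speculation length k."""
    tokens = expected_tokens_per_round(p.acceptance_rate, k)
    draft_time = k / p.draft_tok_per_s
    round_time = draft_time + p.verify_latency_s
    return {
        "goodput_tok_per_s": tokens / round_time,
        "tok_per_usd": tokens / (k * p.usd_per_verified_tok),
        "joules_per_tok": p.edge_power_w * draft_time / tokens,
    }


# Made-up numbers, purely to show how the estimates shift with k.
profile = Profile(draft_tok_per_s=10.0, edge_power_w=5.0,
                  acceptance_rate=0.5, verify_latency_s=0.3,
                  usd_per_verified_tok=0.001)
for k in (2, 4, 8):
    print(k, objectives(profile, k))
```

Because each objective weighs drafting time, acceptance, and cloud spend differently, the speculation length that maximises one need not maximise the others, which is why a profiling step per device and model is useful.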
We evaluated ConfigSpec across three edge platforms (Raspberry Pi 4B, Raspberry Pi 5, and NVIDIA Jetson AGX Orin) and two LLM families (Llama 3 and Qwen3) using the Databricks Dolly dataset, and three patterns stood out. First, goodput favours the smallest, fastest draft model, with a speculation length that depends on the device. Second, cost efficiency always favours the largest draft model at a speculation length of 2, with the Qwen family demonstrating better cost efficiency. Finally, energy efficiency agrees with goodput on model size but also converges to a speculation length of 2.
Consequently, no single configuration wins on all three fronts, confirming that profiling-based selection is not just helpful but necessary. ConfigSpec provides practical, industry-led insights for navigating diverse configuration spaces.
Read the Paper
The full paper is available via the ACM Digital Library: