Invited Talk – Dr Luo Mai – Bringing LLM Inference to Wafer Scale Systems – DIPSA: Data-Intensive Parallel Systems and Algorithms

Abstract

Emerging AI accelerators increasingly adopt wafer-scale integration, combining hundreds of thousands of cores with massive on-chip memory and ultra-high bandwidth. Yet existing LLM inference systems, designed primarily for GPUs, cannot fully exploit this architecture. In this talk, I will present WaferLLM, the first LLM inference system designed specifically for wafer-scale accelerators. WaferLLM introduces new approaches for wafer-scale prefill and decode parallelism, KV-cache management, and high-performance kernels (MeshGEMM and MeshGEMV) to maximise hardware utilisation. On commodity hardware (Cerebras WSE-2), WaferLLM achieves 2,700 tokens per second for a single user, which translates to roughly 0.37 milliseconds per token and demonstrates its potential for efficient scaling in test-time compute.

Bio

Luo Mai is a Reader (Associate Professor) at the University of Edinburgh, where he leads the Large-Scale Machine Learning Systems Group and co-directs the UK EPSRC Centre for Doctoral Training in Machine Learning Systems. His research spans systems, AI, and data, and has been recognised with multiple research and rising-star awards from academia and industry. He has co-authored a textbook on ML systems and co-founded several widely used open-source AI system libraries. Previously, he worked at Imperial College London and Microsoft Research and received his PhD from Imperial College London with support from a Google Doctoral Fellowship.

Date and Venue

2 March 2026
Ashby Building, Queen’s University Belfast