by: Weihao Cui, Yukang Chen, Xiaoze Fan, Han Zhao, Ziyi Xu, Xusheng Chen, Bingsheng He, Quan Chen, Sep 28, 2025

This post highlights our initial efforts to support a new serving paradigm, PD-Multiplexing, in SGLang. It is designed to deliver higher goodput in LLM serving. PD-Multiplexing leverages GreenContext, a new NVIDIA GPU capability that allows lightweight, fine-grained partitioning of GPU resources across tasks within the same process. We envision this paradigm as a promising new approach to LLM service deployment, delivering stronger SLO guarantees and higher goodput for Model-as-a-Service (MaaS).

Delivering MaaS at scale demands that LLM serving systems consistently meet stringent Service Level Objectives (SLOs) without sacrificing throughput. In practice, this translates into satisfying the well-established latency SLOs for both stages of inference: Time-to-First-Token (TTFT) during the prefill phase, and Inter-Token Latency (ITL), also referred to as Time-Between-Tokens (TBT), during the decode phase. The challenge arises because prefill and decode interleave on the same serving instance, creating contention for GPU resources. Two common approaches have emerged to enforce SLO compliance: PD-disaggregation, which runs the prefill and decode phases on separate instances, and chunked prefill, which splits prefill into small chunks and interleaves them with decode batches. When targeting a tight SLO threshold for practical LLM services, however, the shortcomings of both disaggregation and chunking become increasingly pronounced.

To this end, we propose a new serving paradigm, PD-Multiplexing, for achieving higher goodput. It multiplexes the prefill and decode phases within the same instance through intra-GPU spatial sharing.
It offers several important benefits, most notably stronger SLO guarantees and higher goodput. Figure 1 presents an overview of PD-Multiplexing, which consists of two core modules: a bubble-less multiplex engine that independently and efficiently executes the prefill and decode phases, and an SLO-aware dispatcher that iteratively generates SLO-compliant multiplexing plans.

We built the new paradigm on top of GreenContext, a capability introduced in NVIDIA GPUs starting with CUDA 12.4. GreenContext enables lightweight intra-process spatial sharing: since CUDA 12.6, we can create multiple CUDA streams, each backed by a dedicated SM allocation, so that kernels run concurrently on disjoint SM sets. With GreenContext, GPU resources can be dynamically partitioned between the prefill and decode phases, adapting to SLO requirements, workload patterns, and other serving needs in real time.

To preserve the existing serving architecture, we adopt single-thread scheduling for multiplexing prefill and decode, rather than creating a separate thread for each. This choice is also motivated by the fact that Python's Global Interpreter Lock (GIL) still prevents true parallel execution and will remain enabled by default in upcoming Python versions. Fortunately, both prefill and decode are dispatched asynchronously, which makes this design feasible: by alternating dispatch between prefill and decode on their dedicated GreenContext streams, a single thread can drive both phases concurrently.

However, such an integration of GreenContext introduces GPU bubbles. As illustrated in Figure 2(a), these bubbles arise for two reasons: (1) launching a prefill phase takes significantly longer than launching a decode phase (which involves only a single CUDA graph launch); in some cases, launching a prefill phase takes longer than executing an entire decode phase, leaving GPU resources idle; and (2) the number of iterations in the decode phase is non-deterministic.
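Before turning to how we eliminate these bubbles, the single-thread dispatch itself can be made concrete with a minimal, runnable sketch. The two Python lists stand in for the dedicated GreenContext-backed CUDA streams, and every name here is illustrative rather than an SGLang API:

```python
# Minimal sketch of single-thread multiplexing: one scheduling thread
# alternates asynchronous launches onto two per-phase "streams".
# In the real engine these would be GreenContext-backed CUDA streams;
# plain lists keep the control flow runnable anywhere.

class MultiplexDispatcher:
    def __init__(self):
        # Stand-ins for the two dedicated GreenContext streams.
        self.prefill_stream = []
        self.decode_stream = []

    def launch_prefill(self, batch):
        # Asynchronous launch: enqueue work and return immediately.
        self.prefill_stream.append(("prefill", batch))

    def launch_decode(self, batch):
        self.decode_stream.append(("decode", batch))

    def step(self, prefill_batch, decode_batch):
        # One scheduler iteration: dispatch decode first (latency-critical),
        # then prefill, never blocking the scheduling thread.
        if decode_batch:
            self.launch_decode(decode_batch)
        if prefill_batch:
            self.launch_prefill(prefill_batch)

d = MultiplexDispatcher()
d.step(prefill_batch=["req-A"], decode_batch=["req-B", "req-C"])
```

Because every launch returns immediately, a single Python thread suffices despite the GIL: the GPU, not the interpreter, provides the parallelism.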
When all requests in a decode batch finish early, the pre-allocated SMs may remain underutilized if a prefill has already been launched.

To address this, we split the prefill phase into smaller prefill blocks, as shown in Figure 2(b). Since prefill is typically far more compute-intensive than decode, this block-level splitting incurs negligible overhead while effectively eliminating GPU bubbles during multiplexing.

With the bubble-less multiplex engine in place, the next challenge is scheduling prefill blocks and decode batches. Offline profiling shows that the two phases compete for resources under GreenContext. The root cause is that while GreenContext partitions SMs, it does not partition memory bandwidth, making contention difficult to model analytically. To address this, we profile representative workloads offline and use the results to train a latency predictor that drives our SLO-aware scheduling policies. Since the modeling depends on the specific model and hardware environment, we omit the details here; we will provide a detailed tutorial with practical, step-by-step guidance in the future.

The intuition behind the scheduling policy is simple: allocate just enough SMs to the decode phase to guarantee the ITL SLO, then dedicate all remaining SMs to prefill. At the same time, we determine the number of prefill blocks to launch in each iteration. This way, decode always runs under strict SLO guarantees, while prefill proceeds as quickly as possible to enlarge the decode batch size.

We evaluate PD-Multiplexing against multiple baselines across a range of workloads and devices. We first present an experiment that is easy to reproduce, then demonstrate the advantages of PD-Multiplexing using real-world traces and diverse tasks. Finally, we provide a zoomed-out visualization of runtime scheduling details. In our extensive evaluations, PD-Multiplexing improves goodput by up to 3.06× over state-of-the-art baselines. Note that the following results are presented for research purposes.
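As a concrete illustration of the block splitting and the SLO-aware allocation policy described above, here is a hedged sketch; `TOTAL_SMS`, `predict_itl`, and all function names are our illustrative assumptions, not SGLang internals:

```python
# Illustrative sketch of the SLO-aware dispatcher logic (not SGLang code).
# `predict_itl` stands in for the offline-trained latency predictor and is
# assumed to decrease monotonically as decode receives more SMs.

TOTAL_SMS = 132  # e.g., an H100/H200-class GPU (illustrative)

def split_into_blocks(prompt_tokens, block_size):
    """Split one prefill input into fixed-size blocks (last may be short)."""
    return [prompt_tokens[i:i + block_size]
            for i in range(0, len(prompt_tokens), block_size)]

def min_decode_sms(predict_itl, batch_size, itl_slo_ms, total_sms=TOTAL_SMS):
    """Smallest SM share whose predicted ITL still meets the SLO."""
    for sms in range(1, total_sms + 1):
        if predict_itl(sms, batch_size) <= itl_slo_ms:
            return sms
    return total_sms  # SLO unreachable: give decode everything

def make_plan(predict_itl, batch_size, itl_slo_ms):
    # Decode gets just enough SMs for its SLO; prefill gets the remainder.
    decode_sms = min_decode_sms(predict_itl, batch_size, itl_slo_ms)
    return decode_sms, TOTAL_SMS - decode_sms

# Toy predictor: ITL grows with batch size, shrinks with decode SMs.
def toy_itl(sms, batch_size):
    return 20.0 * batch_size / sms
```

The linear scan over SM counts is shown for clarity; since the predictor is assumed monotone in the SM count, a binary search works equally well.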
In real-world applications, SLO requirements are often more specific; here, we use these benchmarks to illustrate the potential of PD-Multiplexing.

We first evaluate PD-Multiplexing against chunked prefill with varying chunk sizes on a single H200 GPU running CodeLlama-34b-hf. Figure 3 reports the 99th-percentile TTFT and ITL. We set the ITL SLO target to 60 ms and impose no SLO constraint on TTFT; instead, we report P99 TTFT to demonstrate the efficiency of PD-Multiplexing. In the figure, solid points indicate that the corresponding configuration meets the ITL SLO, while hollow points indicate that it violates it.

The results highlight a clear advantage: PD-Multiplexing delivers the fastest TTFT while consistently meeting the stringent ITL SLO target. In contrast, chunked prefill often must reduce the chunk size below 1024 to satisfy such a strict ITL requirement, which degrades prefill performance and leaves GPU resources underutilized. This benefit becomes even more pronounced for long-context workloads such as LooGLE, where the inefficiency of chunking is magnified. Reproduction instructions are available here.

We have also evaluated PD-Multiplexing on a real-world trace, Mooncake-Tool&Agent, comparing it with chunked prefill and PD-disaggregation. All three systems are built on the SGLang codebase for a fair comparison. This experiment is conducted on a server with 8 A100 GPUs; the chunk size for chunked prefill is 512, the prefill-to-decode instance ratio in disaggregation is 1:1, and prefix cache sharing is enabled.

Figure 4(a) presents the TTFT and ITL results. Compared with chunked prefill, PD-Multiplexing improves both metrics. Relative to PD-disaggregation, it achieves noticeably shorter TTFT, while both methods meet the decode-phase SLO. To assess its impact on goodput, we gradually increase the request rate and measure SLO attainment.
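The SLO-attainment and goodput metrics used in these experiments can be computed from per-request latency records. The following sketch uses a nearest-rank P99 and an illustrative record format of `(ttft_ms, [itl_ms, ...])`; neither is taken from the actual evaluation harness:

```python
# Illustrative post-processing of per-request latency records.
# A record is (ttft_ms, [itl_ms, ...]); all names here are our own.
import math

def p99(values):
    """99th percentile via the nearest-rank method."""
    s = sorted(values)
    return s[max(0, math.ceil(0.99 * len(s)) - 1)]

def slo_attainment(records, itl_slo_ms=60.0):
    """Fraction of requests whose every inter-token latency meets the SLO."""
    ok = sum(1 for _, itls in records if all(t <= itl_slo_ms for t in itls))
    return ok / len(records)

def goodput(records, duration_s, itl_slo_ms=60.0):
    """SLO-compliant requests completed per second."""
    return slo_attainment(records, itl_slo_ms) * len(records) / duration_s
```

A production harness would typically also gate goodput on a TTFT SLO; we omit that here to mirror the experiment above, which constrains only ITL.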
As shown in Figure 5, PD-Multiplexing sustains significantly higher goodput, delivering up to 3.06× and 1.62× improvements over chunked prefill and PD-disaggregation, respectively.

We further evaluate PD-Multiplexing on three representative tasks: OpenThoughts, ShareGPT, and LooGLE. These tasks exhibit contrasting workload patterns: OpenThoughts features the shortest prefill input with the longest decode output; ShareGPT involves longer prefill input but shorter decode output; and LooGLE stresses the system with the longest prefill input and the shortest decode output.

Figure 5 reports the results across these tasks. PD-Multiplexing consistently maintains strong performance and achieves significantly higher goodput than the baselines. To illustrate how this is realized in practice, Figure 6 presents a zoomed-out runtime timeline of scheduling decisions. As shown, PD-Multiplexing dynamically adapts resource allocation: for OpenThoughts, it assigns minimal resources to prefill, while for LooGLE, it minimizes the resources given to decode. The timeline further demonstrates how the scheduler seamlessly switches between SM allocation plans as workloads vary, ensuring both SLO compliance and high efficiency.

PD-Multiplexing shows strong promise for MaaS deployments, and we will focus our next steps on refining and extending it. We already have a proof-of-concept implementation of PD-Multiplexing, and the roadmap for full integration into SGLang is underway.