Andes: A GPU Task Scheduling Framework
Abstract
GPU programming today follows a host-centric model: the CPU orchestrates every kernel launch, acting as a scheduler that must synchronize with the device between stages. This model works well for regular, bulk-synchronous workloads, but breaks down for algorithms with irregular or data-dependent parallelism — graph traversal, adaptive mesh refinement, iterative solvers, and producer-consumer pipelines — where the amount of work at each stage is unknown ahead of time and host round-trips become a bottleneck.
We introduce Andes, a GPU task scheduling framework that removes the CPU from the execution loop entirely. Andes exposes a composable, type-safe dataflow API, built around primitives such as transform, filter, reduce, for_each, and subflow, which programmers use to express task graphs as functional pipelines. Once a graph is submitted, execution proceeds entirely on the device through a round-based persistent executor built on the second-generation CUDA Dynamic Parallelism interface (CDP2) and its tail-launch semantics: each scheduling round launches the next from within the GPU itself, with no host involvement.
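To make the round-based model concrete, the following is a minimal host-side C++ sketch; it is not device code and not the Andes implementation. The `run_rounds` driver and `RoundStats` record are hypothetical names introduced for illustration. On the GPU, the step marked in the comments would plausibly correspond to a tail launch of the next scheduling round rather than a loop iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical bookkeeping record (illustrative, not part of Andes).
struct RoundStats {
    std::size_t rounds;     // number of scheduling rounds executed
    std::size_t tasks_run;  // total number of tasks executed
};

// Sequential model of a round-based executor. Each round drains the current
// work list; tasks may append follow-up work, which runs in the next round.
template <typename Body>
RoundStats run_rounds(std::vector<int> current, Body body) {
    RoundStats stats{0, 0};
    while (!current.empty()) {
        std::vector<int> next;          // work discovered during this round
        for (int item : current) {
            body(item, next);           // a task may spawn data-dependent work
            ++stats.tasks_run;
        }
        current = std::move(next);      // stands in for launching the next round
        ++stats.rounds;
    }
    // Every round ran at least one task, so tasks_run is never below rounds.
    assert(stats.tasks_run >= stats.rounds);
    return stats;
}

// Example workload: split an interval of width 8 until every piece has
// width 1. The amount of work per round is not known ahead of time.
RoundStats demo_split() {
    return run_rounds({8}, [](int width, std::vector<int>& next) {
        if (width > 1) {
            next.push_back(width / 2);
            next.push_back(width - width / 2);
        }
    });
}
```

With an initial width of 8, this sketch runs 4 rounds and 15 tasks (1 + 2 + 4 + 8). In a host-centric design, each of those round boundaries would require a synchronization with the CPU.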
A central contribution of Andes is automatic kernel fusion. Adjacent fusable operations in a pipeline — sequences of transforms and filters — are detected at compile time and merged into a single kernel, eliminating redundant passes over global memory without any programmer annotation. Task scheduling is managed by three rotating MultiQueue instances that organize work by thread requirements, while a dedicated holder queue defers tasks awaiting asynchronous DeviceFuture results. Andes integrates with the Gallatin GPU memory manager to support dynamic task spawning with low-overhead device-side allocation throughout execution.
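As an illustration of what fusing adjacent transforms and filters achieves, here is a minimal C++ sketch under stated assumptions. The `transform`, `filter`, and `fused_pass` helpers are hypothetical stand-ins, not the Andes API; all stages keep the element type; and a host loop stands in for a single kernel. The stage chain is composed at compile time through templates, so each element flows through every stage in one pass with no intermediate buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stage wrappers (illustrative names, not the Andes API).
template <typename F> struct Transform { F fn; };
template <typename P> struct Filter { P pred; };

template <typename F> Transform<F> transform(F fn) { return {std::move(fn)}; }
template <typename P> Filter<P> filter(P pred) { return {std::move(pred)}; }

// Apply one stage to a live element. A transform always yields a value;
// a filter yields nothing when the predicate rejects the element.
template <typename T, typename F>
std::optional<T> apply_stage(const Transform<F>& stage, T value) {
    return stage.fn(value);
}
template <typename T, typename P>
std::optional<T> apply_stage(const Filter<P>& stage, T value) {
    if (stage.pred(value)) return value;
    return std::nullopt;
}

// Walk the stage tuple at compile time, stopping early once an element
// has been filtered out.
template <std::size_t I = 0, typename T, typename... Stages>
std::optional<T> run_fused(const std::tuple<Stages...>& stages, T value) {
    if constexpr (I == sizeof...(Stages)) {
        return value;
    } else {
        std::optional<T> next = apply_stage(std::get<I>(stages), value);
        if (!next) return std::nullopt;
        return run_fused<I + 1>(stages, *next);
    }
}

// One traversal of the input for the whole chain: the analogue of a single
// fused kernel instead of one kernel (and one buffer) per stage.
template <typename T, typename... Stages>
std::vector<T> fused_pass(const std::vector<T>& in, Stages... stages) {
    const std::tuple<Stages...> chain{std::move(stages)...};
    std::vector<T> out;
    for (const T& x : in) {
        if (auto r = run_fused(chain, x)) out.push_back(*r);
    }
    assert(out.size() <= in.size());  // transforms and filters never add elements
    return out;
}

std::vector<int> demo_fused() {
    const std::vector<int> in{1, 2, 3, 4, 5, 6};
    return fused_pass(in,
                      transform([](int x) { return x * 3; }),
                      filter([](int x) { return x % 2 == 0; }),
                      transform([](int x) { return x + 1; }));
}
```

Here `demo_fused()` triples the inputs 1 through 6, keeps the even results, and adds 1, yielding 7, 13, 19 in a single traversal. An unfused version of the same pipeline would make three passes and materialize two intermediate arrays.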