GPU Acceleration
Roadmap for GPU Support in Elmfire.jl
Elmfire.jl is designed with GPU acceleration in mind. This page describes the current state, the planned approach, and the computational profile that motivates the work.
Status
Phase 0 (complete): Core types have been refactored to support GPU arrays.
- `FireState{T, A}` is parameterized on array type `A`, so it can hold either `Matrix{T}` (CPU) or GPU arrays.
- `FuelModelArray{T}` provides a dense struct-of-arrays fuel model lookup, replacing the `Dict`-based `FuelModelTable`, which cannot run on GPU.
- `active_mask()` converts the narrow band's `Set` to a dense `BitMatrix` for GPU kernel masking.
- `surface_spread_rate_flat()` computes Rothermel spread rates from `FuelModelArray` using array indexing instead of struct access.
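The two dense representations above can be illustrated in plain Julia. This is a minimal sketch, not Elmfire.jl's code. The names `active_mask_sketch`, `FuelArraySketch` and `fuel_depth` are hypothetical, and the narrow band is assumed to be a `Set{CartesianIndex{2}}`. The fuel numbers are placeholders, not real fuel model data.

```julia
# Hypothetical sketch: the narrow band lives on the CPU as a Set of cell indices
# and is expanded into a dense BitMatrix that a kernel can test per cell.
function active_mask_sketch(band::Set{CartesianIndex{2}}, nx::Int, ny::Int)
    mask = falses(nx, ny)          # dense mask, all cells inactive
    for I in band
        mask[I] = true             # mark each narrow band cell active
    end
    return mask
end

# Hypothetical struct-of-arrays fuel table: one dense vector per parameter,
# indexed by fuel model number, instead of a Dict of per-model structs.
struct FuelArraySketch{T}
    sigma::Vector{T}   # surface-area-to-volume ratio per model (placeholder values)
    depth::Vector{T}   # fuel bed depth per model (placeholder values)
end

# A per-cell lookup is plain array indexing, which GPU kernels support.
fuel_depth(f::FuelArraySketch, model::Integer) = f.depth[model]

band = Set([CartesianIndex(2, 3), CartesianIndex(4, 1)])
mask = active_mask_sketch(band, 5, 5)
fuels = FuelArraySketch([3500.0, 2000.0], [1.0, 2.5])
depth2 = fuel_depth(fuels, 2)
```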
Phase 1 (complete): GPU kernels for velocity, CFL, RK2, and simulation driver.
`simulate_gpu!` and `simulate_gpu_uniform!` run complete fire simulations on GPU.
- Spread rate kernel computes Rothermel + elliptical spread + velocity components per cell.
- CFL reduction kernel for adaptive timestep control.
- RK2 level set update kernel with narrow band masking.
- Supports `accel_time_constant` for fire acceleration (computed on host, passed as a scalar).
- Tested with the `KernelAbstractions.CPU()` backend in CI.
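The following serial CPU reference makes the CFL and masked RK2 kernels concrete. It advances the level set equation phi_t + u·∇phi = 0 by one timestep. It is a sketch under stated assumptions:

- The function names are hypothetical.
- The spatial discretization is first-order upwind for brevity; Elmfire.jl's actual scheme may differ.
- The RK2 variant is Heun's method (SSP-RK2).
- The CFL number 0.4 is an arbitrary choice.

A GPU kernel would run the body of each loop once per cell.

```julia
# Sketch only: first-order upwind advection term u·∇phi at interior cell (i, j).
function advection(phi, ux, uy, i, j, dx)
    dphidx = ux[i, j] > 0 ? (phi[i, j] - phi[i-1, j]) / dx : (phi[i+1, j] - phi[i, j]) / dx
    dphidy = uy[i, j] > 0 ? (phi[i, j] - phi[i, j-1]) / dx : (phi[i, j+1] - phi[i, j]) / dx
    return ux[i, j] * dphidx + uy[i, j] * dphidy
end

# CFL reduction: the fastest active cell limits the timestep.
function cfl_dt(ux, uy, mask, dx; cfl = 0.4)
    umax = zero(eltype(ux))
    for I in findall(mask)
        umax = max(umax, abs(ux[I]) + abs(uy[I]))
    end
    return umax > 0 ? cfl * dx / umax : oftype(umax, Inf)
end

# Masked two-stage RK2 (Heun) update; only interior cells with mask == true change.
function rk2_step!(phi, ux, uy, mask, dx, dt)
    nx, ny = size(phi)
    phi1 = copy(phi)                      # stage 1: forward Euler predictor
    for j in 2:ny-1, i in 2:nx-1
        mask[i, j] || continue
        phi1[i, j] = phi[i, j] - dt * advection(phi, ux, uy, i, j, dx)
    end
    phi_new = copy(phi)                   # stage 2: average predictor and corrector
    for j in 2:ny-1, i in 2:nx-1
        mask[i, j] || continue
        phi2 = phi1[i, j] - dt * advection(phi1, ux, uy, i, j, dx)
        phi_new[i, j] = (phi[i, j] + phi2) / 2
    end
    copyto!(phi, phi_new)
    return phi
end

n = 6
phi = [i - 3.0 for i in 1:n, j in 1:n]    # planar front at i = 3, moving in +x
ux = ones(n, n); uy = zeros(n, n)
mask = trues(n, n); mask[4, 4] = false    # one cell outside the narrow band
dx = 1.0
dt = cfl_dt(ux, uy, mask, dx)             # 0.4 * dx / max|u|
rk2_step!(phi, ux, uy, mask, dx, dt)
```

For a uniform velocity and a planar front, the active interior values drop by u·dt, so the zero contour moves forward by about 0.4 cells. The masked cell keeps its old value.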
Phase 2+ (planned): GPU ensemble support and optimization. See below.
Computational Profile
On a 512x512 grid with ~1,000 active cells in the narrow band, each simulation timestep breaks down roughly as:
| Component | Time | Parallelizable? |
|---|---|---|
| Velocity calculation (Rothermel + wind) | ~100 ms | Yes (per-cell, independent) |
| Crown fire | ~70 ms | Yes (per-cell, independent) |
| Level set RK2 | ~10 ms | Yes (stencil, structured) |
| Weather interpolation | ~10 ms | Yes (per-cell lookup) |
| Narrow band updates | ~10 ms | No (set operations) |
The first four components (~190 ms) are candidates for GPU offloading. The narrow band management (~10 ms) stays on the CPU.
Approach
KernelAbstractions.jl
GPU kernels are written with KernelAbstractions.jl (KA), which provides a vendor-agnostic `@kernel` macro. The same kernel code runs on:
- NVIDIA GPUs (via CUDA.jl)
- AMD GPUs (via AMDGPU.jl)
- Apple Silicon (via Metal.jl)
- CPU (for testing without GPU hardware)
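Here is a minimal KernelAbstractions.jl sketch of a masked, per-cell kernel of the kind described above. It is launched on the `CPU()` backend that the CI tests use. The kernel name, its arguments and the input values are hypothetical. The sketch only shows the KA pattern: one work-item per cell, Cartesian indexing and a mask check. To target a GPU, you would pass a GPU backend (for example CUDA.jl's `CUDABackend()`) and device arrays instead.

```julia
using KernelAbstractions

# Hypothetical masked per-cell kernel: each work-item handles one grid cell.
# Cells outside the narrow band mask get zero velocity.
@kernel function masked_velocity_kernel!(ux, uy, @Const(rate), @Const(normx), @Const(normy), @Const(mask))
    I = @index(Global, Cartesian)
    if mask[I]
        @inbounds ux[I] = rate[I] * normx[I]
        @inbounds uy[I] = rate[I] * normy[I]
    else
        @inbounds ux[I] = zero(eltype(ux))
        @inbounds uy[I] = zero(eltype(uy))
    end
end

backend = CPU()                                  # same backend as the CI tests
n = 8
rate = fill(2.0, n, n)                           # placeholder spread rate
normx = fill(0.6, n, n); normy = fill(0.8, n, n) # placeholder unit normal
mask = falses(n, n); mask[2, 3] = true           # a single active cell
ux = zeros(n, n); uy = zeros(n, n)

kernel! = masked_velocity_kernel!(backend, (4, 4))   # static 4x4 workgroup
kernel!(ux, uy, rate, normx, normy, mask; ndrange = size(ux))
KernelAbstractions.synchronize(backend)
```

The backend is chosen at launch time, not in the kernel source. That is what lets the same `@kernel` function be exercised on `CPU()` in CI.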
Package Extension
GPU support will be shipped as a package extension so that users without GPUs pay no dependency cost:
```toml
# Project.toml
[weakdeps]
KernelAbstractions = "..."
Adapt = "..."

[extensions]
ElmfireKAExt = ["KernelAbstractions", "Adapt"]
```
CPU/GPU Boundary
The narrow band is managed on the CPU because it relies on `Set` operations. Each timestep:
- CPU collects active cells and builds a mask
- Mask is uploaded to GPU
- GPU runs velocity, CFL, and RK2 kernels
- Newly burned cell indices are downloaded to CPU
- CPU updates narrow band and burned status
Grid arrays (`phi`, `ux`, `uy`) stay resident on the GPU between timesteps to minimize host-device transfers.
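The CPU side of this loop, after newly burned indices come back from the device, can be sketched as follows. It assumes, hypothetically, that:

- the narrow band is a `Set{CartesianIndex{2}}`;
- burned status is a `BitMatrix`;
- an unburned cell joins the band when one of its 8 neighbors burns.

`update_narrow_band!` is an illustrative name. The real band width and pruning rules in Elmfire.jl may differ; for example, this sketch never removes cells from the band.

```julia
# Hypothetical CPU-side narrow band update for one timestep.
function update_narrow_band!(band::Set{CartesianIndex{2}}, burned::BitMatrix,
                             newly_burned::Vector{CartesianIndex{2}})
    R = CartesianIndices(burned)                 # valid grid indices
    offsets = CartesianIndices((-1:1, -1:1))     # 8-neighborhood plus the cell itself
    for I in newly_burned
        burned[I] = true                         # record burned status
        for off in offsets
            J = I + off
            if J in R && !burned[J]              # in bounds and still unburned
                push!(band, J)                   # neighbor joins the narrow band
            end
        end
    end
    return band
end

burned = falses(5, 5)
band = Set{CartesianIndex{2}}()
update_narrow_band!(band, burned, [CartesianIndex(1, 1)])   # corner cell ignites
```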
Planned Phases
Phase 2: GPU Ensemble Support
Multiple ensemble members will run concurrently on separate GPU streams. Each stream executes an independent simulation, reusing the `run_ensemble_threaded!` pattern with GPU-accelerated inner loops.
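One way to picture the planned pattern is to spawn one Julia task per ensemble member and fetch the results. CUDA.jl gives each Julia task its own stream by default, so one task per member is a natural fit for concurrent GPU execution. In this sketch, `run_member` is a hypothetical CPU placeholder for a full simulation. It is not Elmfire.jl's `run_ensemble_threaded!`.

```julia
# Hypothetical stand-in for one ensemble member's simulation. A GPU version
# would launch kernels here; this placeholder does CPU work so the sketch runs anywhere.
function run_member(seed::Int, nsteps::Int)
    x = float(seed)
    for _ in 1:nsteps
        x = 0.5 * x + 1.0        # placeholder "timestep"
    end
    return x
end

# One task per member; under CUDA.jl each task would get its own stream.
function run_ensemble_sketch(nmembers::Int, nsteps::Int)
    tasks = [Threads.@spawn run_member(m, nsteps) for m in 1:nmembers]
    return fetch.(tasks)         # wait for all members, keep results in order
end

results = run_ensemble_sketch(4, 10)
```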
Phase 3: Optimization
- Dual-mode launch: 1D kernel over an index list for small fires (<5K active cells); 2D grid kernel with mask for large fires.
- Persistent GPU state: Eliminate per-timestep mask uploads by detecting newly burned cells on GPU.
- `Float32` fast path: Already supported by the type system; precision still needs to be validated on GPU.
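The dual-mode launch idea can be sketched on the CPU:

- Below the threshold, iterate a compact list of active indices (the 1D launch).
- Above it, sweep every cell and test the mask (the 2D launch).

The 5,000-cell threshold comes from the roadmap. The function names and the per-cell update `bump!` are hypothetical. The example applies both paths to the same mask, and they must produce identical results.

```julia
# Hypothetical dual-mode dispatch between an index-list sweep and a masked grid sweep.
const SMALL_FIRE_THRESHOLD = 5_000   # roadmap: <5K active cells uses the 1D path

function apply_indexed!(f!, A, idx::Vector{CartesianIndex{2}})
    for I in idx                      # 1D launch: one work-item per active cell
        f!(A, I)
    end
end

function apply_masked!(f!, A, mask::BitMatrix)
    for I in CartesianIndices(A)      # 2D launch: one work-item per grid cell
        mask[I] && f!(A, I)
    end
end

function dual_mode_apply!(f!, A, mask::BitMatrix; threshold = SMALL_FIRE_THRESHOLD)
    if count(mask) < threshold
        apply_indexed!(f!, A, findall(mask))
        return :indexed
    else
        apply_masked!(f!, A, mask)
        return :masked
    end
end

bump!(A, I) = (A[I] += 1.0)           # placeholder per-cell update

mask = falses(4, 4); mask[1, 2] = mask[3, 4] = true
A = zeros(4, 4)
mode = dual_mode_apply!(bump!, A, mask)                    # few active cells
B = zeros(4, 4)
mode2 = dual_mode_apply!(bump!, B, mask; threshold = 1)    # force the masked path
```

The index list costs a `findall` per step but touches only active cells. The masked sweep touches every cell but needs no index construction, which is why it pays off once the fire is large.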
Projected Performance
Estimates assume a mid-to-high-end GPU (e.g. NVIDIA A100) on a 512x512 grid:
| Component | CPU | GPU (est.) | Speedup |
|---|---|---|---|
| Velocity calculation (incl. crown fire) | ~170 ms | ~2-5 ms | 30-80x |
| RK2 level set | ~10 ms | ~0.2 ms | 50x |
| CFL reduction | ~1 ms | ~0.1 ms | 10x |
| Narrow band (CPU) | ~10 ms | ~10 ms | 1x |
| Total per timestep | ~200 ms | ~15-20 ms | 10-15x |
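Once the kernels run on the GPU, the ~10 ms of CPU-side narrow band management becomes the bottleneck, accounting for half or more of the projected ~15-20 ms per timestep. For larger grids or more active cells, the GPU advantage grows because the CPU-side cost is a proportionally smaller share of the total.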
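A short Amdahl's-law bound makes the bottleneck explicit. Assume only the ~10 ms of CPU narrow band time is left unaccelerated. This is an idealization that ignores host-device transfers. The per-timestep speedup over the ~200 ms CPU baseline is then capped however fast the kernels become:

```math
S = \frac{T_{\text{CPU}}}{T_{\text{band}} + T_{\text{GPU}}} \le \frac{T_{\text{CPU}}}{T_{\text{band}}} \approx \frac{200\ \text{ms}}{10\ \text{ms}} = 20
```

With the projected ~15-20 ms total per timestep, the speedup ranges from 200/20 = 10 to 200/15 ≈ 13.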
These are projections, not benchmarks. Actual performance will depend on hardware, grid size, and fire characteristics.