Autotuning¶
FlashInfer includes an autotuner that selects the best kernel implementation (runner and tactic) for each operation and input shape by profiling at runtime.
What Is Autotuning?¶
Several FlashInfer operations – GEMM and MoE – support multiple backend implementations (runners). Each runner may also expose several low-level tactics (e.g. tile sizes, pipeline stages). The best choice depends on the hardware, data types, and input shapes of your workload.
Without autotuning, FlashInfer picks a default (fallback) tactic. With autotuning enabled, the autotuner profiles every candidate for a given shape and automatically selects the fastest one.
Enabling Autotuning¶
Wrap the portion of your code that you want to tune inside the
flashinfer.autotune context manager:
import flashinfer
with flashinfer.autotune():
    # All FlashInfer ops executed here will be profiled.
    output = flashinfer.gemm.bmm_fp8(A, B, A_scale, B_scale, dtype=out_dtype)
The first time an operation runs inside the context, the autotuner benchmarks
all available runners and tactics for that (operation, backend, shape)
combination. Subsequent calls with the same shape reuse the cached result
without re-profiling.
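The profile-once, reuse-afterwards behavior can be illustrated with a minimal sketch. This is not FlashInfer's implementation: choose_fastest, its arguments, and the module-level _cache are hypothetical names, and plain Python callables stand in for GPU kernels.

```python
import time

# Hypothetical in-memory cache: (op_name, shape) -> index of the fastest candidate.
_cache = {}

def choose_fastest(op_name, shape, candidates, args, repeats=3):
    """Return the index of the fastest callable, profiling only on a cache miss."""
    key = (op_name, shape)
    if key in _cache:
        # Same operation and shape seen before: reuse the result, no profiling.
        return _cache[key]

    best_idx, best_time = 0, float("inf")
    for idx, fn in enumerate(candidates):
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_idx, best_time = idx, elapsed

    _cache[key] = best_idx
    return best_idx
```

A real autotuner would also synchronize the GPU around each measurement; the sketch omits that because it only times host-side callables.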
You can also pass tune_mode=True explicitly (the default):
with flashinfer.autotune(True):
    output = flashinfer.gemm.bmm_fp8(A, B, A_scale, B_scale, dtype=out_dtype)
Note that flashinfer.autotune(), flashinfer.autotune(True), and flashinfer.autotune(tune_mode=True) are all equivalent.
Passing tune_mode=False, or equivalently flashinfer.autotune(False), puts the
context manager in a no-profiling mode that uses only previously cached or loaded configs:
with flashinfer.autotune(False):
    # No profiling -- uses default/fallback tactic if nothing is cached.
    model(inputs)
Autotuning in the Benchmark Harness¶
The FlashInfer benchmark harness supports autotuning via the --autotune
flag:
python3 flashinfer_benchmark.py \
--routine mm_fp4 --m 4 --n 7168 --k 4608 \
--out_dtype bfloat16 --backends cudnn cutlass trtllm \
--use_128x4_sf_layout --use_nvfp4 \
--autotune
Config Lookup Priority¶
When a FlashInfer operation executes, the autotuner resolves the best
(runner, tactic) by searching these sources in order:
1. In-memory profiling cache: results from live autotuning in the current process.
2. User-loaded file configs: loaded via load_configs() or autotune(cache=...).
3. Bundled package configs: legacy .py config files shipped with FlashInfer (only when the FLASHINFER_AUTOTUNER_LOAD_FROM_FILE=1 environment variable is set and tuning mode is off).
4. Fallback tactic (−1): a safe default that every runner must implement.
Notably, user-loaded file configs (level 2) are always consulted, even during tuning mode, so that already-tuned shapes from a cache file are never re-profiled.
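The lookup order can be sketched with plain dictionaries. This is an illustrative model under assumptions, not the library's code: resolve, its parameters, and the load_bundled flag (standing in for the environment variable) are hypothetical.

```python
FALLBACK_TACTIC = -1  # the safe default tactic described above

def resolve(key, profiling_cache, file_configs, bundled_configs,
            default_runner, tuning=False, load_bundled=False):
    """Return (runner, tactic) for key by walking the sources in priority order."""
    # 1. Results profiled in the current process.
    if key in profiling_cache:
        return profiling_cache[key]
    # 2. Configs loaded from a user file; consulted even in tuning mode.
    if key in file_configs:
        return file_configs[key]
    # 3. Bundled configs; only when explicitly enabled and tuning mode is off.
    if load_bundled and not tuning and key in bundled_configs:
        return bundled_configs[key]
    # 4. Fallback. In tuning mode a miss here would trigger profiling instead;
    #    this sketch models the lookup only.
    return (default_runner, FALLBACK_TACTIC)
```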
Config Caching¶
Note
Config caching (the cache parameter) is experimental.
Single-process and multi-threaded use is fully supported.
Multi-process and multi-node use is best-effort: concurrent writes to
a shared cache file may result in lost updates from race conditions.
By default, autotuning results live only in memory and are lost when the
process exits. The cache parameter on flashinfer.autotune lets you
persist results to a JSON file and load them back in future runs, avoiding
repeated profiling.
Saving Tuned Configs¶
Pass a file path to cache when tuning. On exit, all profiled configs are
written to that file:
import flashinfer
with flashinfer.autotune(True, cache="my_configs.json"):
    model(inputs)
# On exit, tuned configs are saved to my_configs.json.
Loading Cached Configs¶
To reuse previously tuned configs without profiling, pass tune_mode=False
with the same cache path:
import flashinfer
with flashinfer.autotune(False, cache="my_configs.json"):
    # Configs are loaded on entry. No profiling occurs.
    model(inputs)
Incremental Tuning¶
Cache files support incremental updates. When autotune(True, cache=path)
exits, save_configs performs the following merge:
Previously loaded configs (from the file read on entry) are used as a base.
Newly profiled configs are overlaid (new results take priority for duplicate keys).
The file on disk is re-read and merged, so that configs saved by other sessions since entry are also preserved (in-memory results still win on overlap).
The merged result is atomically written back to the same file.
This means you can run multiple tuning sessions – for example different batch sizes or sequence lengths – and accumulate all configs in a single file:
# Session 1: tune with batch_size=1
with flashinfer.autotune(True, cache="configs.json"):
    run_model(batch_size=1)

# Session 2: tune with batch_size=32 (configs.json now has both)
with flashinfer.autotune(True, cache="configs.json"):
    run_model(batch_size=32)
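The four merge steps listed earlier in this section can be approximated as follows. This is a sketch under assumptions rather than the actual save_configs code: merge_and_save and its arguments are hypothetical, and configs are modeled as plain JSON-serializable dictionaries.

```python
import json
import os
import tempfile

def merge_and_save(path, loaded, profiled):
    """Merge loaded, newly profiled, and on-disk configs, then write atomically."""
    merged = dict(loaded)      # 1. configs read on entry form the base
    merged.update(profiled)    # 2. new results win on duplicate keys

    if os.path.exists(path):   # 3. pick up entries saved by other sessions
        with open(path) as f:
            on_disk = json.load(f)
        for key, value in on_disk.items():
            merged.setdefault(key, value)  # in-memory results still win

    # 4. Write to a temp file in the same directory, then atomically replace.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(merged, f, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return merged
```

Creating the temporary file in the destination directory keeps both paths on the same filesystem, which os.replace needs in order to be atomic.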
Cache Hit Behavior During Tuning¶
When autotune(True, cache=path) is active and a matching config is found in
the cache file, the autotuner uses it directly without re-profiling. This
means:
Shapes that were already tuned are skipped, saving time.
Only truly new shapes trigger profiling.
A log message is printed once per (operation, runner) pair when a cache hit is detected.
Caching in the Benchmark Harness¶
The benchmark harness supports config caching via the --autotune_cache flag.
Tune and save during benchmarking:
python3 flashinfer_benchmark.py \
--routine mm_fp4 --m 4 --n 7168 --k 4608 \
--out_dtype bfloat16 --backends cudnn cutlass trtllm \
--use_128x4_sf_layout --use_nvfp4 \
--autotune --autotune_cache my_configs.json
Run with cached configs (no profiling):
python3 flashinfer_benchmark.py \
--routine mm_fp4 --m 4 --n 7168 --k 4608 \
--out_dtype bfloat16 --backends cudnn cutlass trtllm \
--use_128x4_sf_layout --use_nvfp4 \
--autotune_cache my_configs.json
Cache File Format¶
The cache file is a plain JSON dictionary. Each key is a string representation
of (custom_op, runner_class_name, optimization_profile) and each value is
[runner_class_name, tactic]:
{
  "_metadata": {
    "flashinfer_version": "0.6.3",
    "cuda_version": "13.0",
    "cublas_version": "13.2.1",
    "cudnn_version": "91900",
    "gpu": "NVIDIA B200"
  },
  "('fp4_gemm', 'CudnnFp4GemmRunner', ((4, 7168), (7168, 4608), ...))": [
    "CudnnFp4GemmRunner",
    3
  ],
  "('flashinfer::trtllm_fp4_block_scale_moe', 'MoERunner', ((1, 7168), (1, 256), (1, 8), (1, 8), (1, 3584), (1, 448)))": [
    "MoERunner",
    [
      8,
      34
    ]
  ]
}
The _metadata key records the environment that created the cache file
(FlashInfer version, CUDA, cuBLAS, cuDNN, and GPU).
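To inspect a cache file by hand, the keys can be parsed back into tuples. The sketch below assumes each key is a Python tuple literal that ast.literal_eval accepts, as the example above suggests; list_entries is a hypothetical helper, not part of FlashInfer.

```python
import ast
import json

def list_entries(cache_text):
    """Return (custom_op, runner, profile, value) for every non-metadata entry."""
    data = json.loads(cache_text)
    entries = []
    for key, value in data.items():
        if key == "_metadata":
            continue  # environment record, not a tuned config
        # Assumption: the key is the string form of a 3-tuple.
        custom_op, runner, profile = ast.literal_eval(key)
        entries.append((custom_op, runner, profile, value))
    return entries
```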
On load, _metadata is compared against the current environment. If any
field differs (e.g. different GPU, FlashInfer version, or cuBLAS version),
the entire cache is skipped: no configs are loaded, and the file is not
overwritten on exit. The autotuner then behaves as if no cache file had been
provided (cache=None). This prevents silently using invalid
tactics and avoids destroying configs tuned for a different environment. A
warning is logged with the mismatch details and a suggestion to use a
different cache path for the current environment.
Advanced users can bypass individual checks by manually editing the JSON file
and setting a metadata field to "*". For example, setting
"cudnn_version": "*" in _metadata will skip the cuDNN version check
while still enforcing all other fields.
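The field-by-field check with the "*" wildcard can be sketched as below. The function name and the choice to iterate over the cached fields are assumptions made for illustration; the real check may differ in detail.

```python
def metadata_mismatches(cached, current):
    """Return {field: (cached_value, current_value)} for every field that differs."""
    mismatches = {}
    for field, expected in cached.items():
        if expected == "*":
            continue  # wildcard: skip the check for this field
        actual = current.get(field)
        if actual != expected:
            mismatches[field] = (expected, actual)
    return mismatches
```

An empty result means the cache would be accepted; any entry means the whole cache would be skipped.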
Tactics are typically integers, but some runners use compound tactics (e.g.
(tile_size, gemm1_tactic, gemm2_tactic)). These are serialized as nested
JSON arrays and restored to tuples on load.
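JSON has no tuple type, so a round trip turns tuples into lists. A minimal sketch of the restoration step, with to_tuple as a hypothetical helper name:

```python
import json

def to_tuple(value):
    """Recursively convert nested lists (as produced by JSON) back into tuples."""
    if isinstance(value, list):
        return tuple(to_tuple(v) for v in value)
    return value

# A cache value as it appears in the file: [runner_class_name, tactic].
runner, tactic = json.loads('["MoERunner", [8, 34]]')
restored = to_tuple(tactic)
```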
The file is human-readable but not portable. Config ordering is not guaranteed to be stable across FlashInfer, CUDA, cuDNN, or cuBLAS versions.
API Reference¶
flashinfer.autotune¶
flashinfer.autotune(tune_mode: bool = True, cache: str | None = None)
Context manager for autotuning with optional file-based caching.
Parameters:
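| Parameter | Type | Description |
|---|---|---|
| tune_mode | bool | If True, profile shapes that have no cached config. If False, do not profile; use only cached or loaded configs. |
| cache | str or None | Optional path to a JSON config file. On entry, configs are loaded from this file if it exists. On exit, configs are saved back to this file when tune_mode=True. |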
Behavior matrix:
| tune_mode | cache | Load on entry? | Save on exit? | Use case |
|---|---|---|---|---|
| True | path | Yes (if file exists) | Yes (incremental) | Cache hits skip profiling; misses are tuned and merged back |
| True | None | No | No | Tune in-memory only (results lost on exit) |
| False | path | Yes (if file exists) | No | Inference with pre-tuned configs |
| False | None | No | No | No-op (default behavior) |
Multi-Thread / Multi-Process Considerations¶
Quick Reference¶
| Environment | Safe? | Notes |
|---|---|---|
| Single process | Yes | Fully safe. |
| Multi-threaded (single process) | Yes | All state is lock-protected. |
| Multi-process, each with its own cache file | Yes | No shared state. |
| Multi-process, shared file, reading only | Yes | Readers never see partial files. |
| Multi-process, shared file, all writing | Best-effort | Works under low contention. Under high contention the last writer can overwrite another’s results. Use per-rank files for guaranteed correctness. |
Thread Safety¶
The AutoTuner singleton is protected by a reentrant lock
(threading.RLock). All state-mutating operations – search_cache,
choose_one, save_configs, load_configs, clear_cache, and the
mode-flag save/restore in autotune() – acquire this lock, so multiple
threads can safely share the same autotuner instance.
In tuning mode, the lock also serializes GPU profiling within a process. This is intentional: concurrent kernel measurements would interfere with each other.
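A reentrant lock matters when one locked method calls another. The toy class below is a sketch under assumptions and is unrelated to the real AutoTuner internals: TinyTuner and the profile callback are hypothetical, and only the method names mirror those listed above.

```python
import threading

class TinyTuner:
    """Toy stand-in for a lock-protected tuner with an in-memory cache."""

    def __init__(self):
        self._lock = threading.RLock()
        self._cache = {}

    def search_cache(self, key):
        with self._lock:
            return self._cache.get(key)

    def choose_one(self, key, profile):
        with self._lock:
            # Re-acquires the lock already held by this thread. A plain
            # threading.Lock would deadlock here; an RLock does not.
            hit = self.search_cache(key)
            if hit is None:
                hit = profile(key)  # profiling runs under the lock, one thread at a time
                self._cache[key] = hit
            return hit
```

Because profiling happens while the lock is held, threads that ask for the same key wait for the first result instead of profiling again.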
Multi-Process¶
Each process has its own AutoTuner singleton (separate address space), so
in-memory state is fully isolated. The only shared resource is the cache
file on disk.
Reads are safe. Writes use os.replace (atomic on local filesystems), so a concurrent reader always sees either the old or new complete file, never a partial one.

Concurrent writes are best-effort. Before writing, save_configs re-reads the file from disk and merges any new entries from other processes (in-memory results win on overlap). This significantly reduces the lost-update window. However, the read-merge-write sequence is not itself atomic, so two truly simultaneous writers can still race:

    Process A                    Process B
    ─────────                    ─────────
    1. Read file {X, Y}
                                 2. Read file {X, Y}
    3. Merge → {X, Y, Z}
    4. Write {X, Y, Z}
                                 5. Merge → {X, Y, W}  (stale: doesn't see Z)
                                 6. Write {X, Y, W}    ← Z is lost
If you are tuning with multiple processes (e.g. multi-GPU
with torchrun), use a separate output file per rank and merge the files afterwards:
import json

merged = {}
for path in ["configs_rank0.json", "configs_rank1.json"]:
    with open(path) as f:
        merged.update(json.load(f))

with open("configs_merged.json", "w") as f:
    json.dump(merged, f, indent=2, sort_keys=True)
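A slightly more defensive variant derives the per-rank file name from the RANK environment variable that torchrun sets, and refuses to merge files whose _metadata differs. Both helper names are hypothetical, and the strict metadata comparison is an assumption about what is desirable rather than something FlashInfer enforces at merge time.

```python
import json
import os

def rank_cache_path(prefix="configs"):
    """Build a per-rank cache file name; falls back to rank 0 outside torchrun."""
    rank = int(os.environ.get("RANK", "0"))
    return f"{prefix}_rank{rank}.json"

def merge_caches(caches):
    """Merge already-loaded cache dicts, requiring identical _metadata in all of them."""
    caches = list(caches)
    metas = [c.get("_metadata") for c in caches]
    if any(m != metas[0] for m in metas):
        raise ValueError("cache files were tuned in different environments")
    merged = {}
    for cache in caches:
        merged.update(cache)  # later files win on duplicate keys
    return merged
```

Each rank would pass rank_cache_path() as the cache argument while tuning. After all ranks finish, load each file with json.load and pass the resulting dictionaries to merge_caches.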
Note
Atomic file writes rely on os.replace(), which maps to the POSIX
rename() syscall. This is atomic on all local filesystems and is
expected to be atomic on most network filesystems (NFS, Lustre) per POSIX
semantics. FlashInfer’s cubin caching also relies on this guarantee.