flashinfer.testing.bench_gpu_time
- flashinfer.testing.bench_gpu_time(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, enable_cupti: bool = False, use_cuda_graph: bool = False, num_iters_within_graph: int = 10, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)
Unified GPU benchmarking interface with configurable timing backends.
This is the recommended entry point for GPU kernel benchmarking. It provides a single interface that dispatches to the appropriate timing implementation based on the configuration flags.
Timing Backends (in order of precedence):
1. CUPTI (enable_cupti=True): Most accurate; measures pure GPU kernel time via hardware profiling. Requires cupti-python >= 13.
2. CUDA Graphs (use_cuda_graph=True): Amortizes launch overhead by capturing and replaying multiple kernel calls. Good balance of accuracy and availability.
3. CUDA Events (default): Simplest method; measures launch plus execution. Available everywhere, but includes CPU overhead.
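The precedence rule above can be illustrated with a minimal sketch. The helper name `select_backend` and the returned labels are hypothetical and are not part of FlashInfer; the sketch only mirrors the documented ordering (CUPTI first, then CUDA graphs, then CUDA events) and does not model any fallback the library may perform when a backend is unavailable.

```python
def select_backend(enable_cupti: bool = False, use_cuda_graph: bool = False) -> str:
    """Hypothetical helper: return the label of the backend that wins
    under the documented precedence order."""
    if enable_cupti:
        # CUPTI has the highest precedence, even if use_cuda_graph is also set.
        return "cupti"
    if use_cuda_graph:
        return "cuda_graph"
    # Neither flag set: the default CUDA event timing is used.
    return "cuda_event"


# Both flags set: CUPTI takes precedence over CUDA graphs.
print(select_backend(enable_cupti=True, use_cuda_graph=True))
print(select_backend(use_cuda_graph=True))
print(select_backend())
```

Under this reading, setting both flags is not an error; the higher-precedence flag simply decides the timing method.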
Cold-L2 Strategy (automatically selected based on timing backend):
- Returns:
Per-iteration execution times in milliseconds.
- Return type:
List[float]
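Because the return value is a plain list of per-iteration times, it can be summarized with the standard library alone. The sketch below uses a hypothetical helper, `summarize_times`, and a hand-written sample list in place of real measurements; neither comes from FlashInfer.

```python
import statistics
from typing import Dict, List


def summarize_times(times_ms: List[float]) -> Dict[str, float]:
    """Hypothetical helper: summarize per-iteration times given in milliseconds."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    return {
        "median_ms": statistics.median(times_ms),
        "mean_ms": statistics.fmean(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
        # The sample standard deviation needs at least two data points.
        "stdev_ms": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
    }


# Stand-in for the list returned by a benchmark run (not real measurements).
sample = [1.0, 2.0, 3.0, 4.0]
summary = summarize_times(sample)
print(f"Median: {summary['median_ms']:.3f} ms")
```

The median is usually a more robust headline number than the mean, since occasional slow iterations skew the mean upward.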
Example
Simple benchmarking with CUDA events (default):
>>> import numpy as np
>>> from flashinfer.testing import bench_gpu_time
>>> times = bench_gpu_time(fn=lambda: my_kernel())
>>> print(f"Median: {np.median(times):.3f} ms")
Example
CUDA graph benchmarking for reduced launch overhead:
>>> def run_kernel(x, y, out):
...     my_memory_bound_kernel(x, y, out)
>>> times = bench_gpu_time(
...     fn=run_kernel,
...     input_args=(x, y, out),
...     use_cuda_graph=True,
... )
Example
CUPTI benchmarking for most accurate GPU kernel time:
>>> times = bench_gpu_time(
...     fn=run_kernel,
...     input_args=(x, y, out),
...     enable_cupti=True,
... )
See also
bench_gpu_time_with_cuda_event: Direct CUDA event timing.
bench_gpu_time_with_cudagraph: Direct CUDA graph timing.
bench_gpu_time_with_cupti: Direct CUPTI timing.
Deprecated:
The l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.