flashinfer.testing.bench_gpu_time

flashinfer.testing.bench_gpu_time(fn, dry_run_iters: int | None = None, repeat_iters: int | None = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, enable_cupti: bool = False, use_cuda_graph: bool = False, num_iters_within_graph: int = 10, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)

Unified GPU benchmarking interface with configurable timing backends.

This is the recommended entry point for GPU kernel benchmarking. It provides a single interface that dispatches to the appropriate timing implementation based on the configuration flags.

Timing Backends (in order of precedence):

  1. CUPTI (enable_cupti=True): Most accurate, measures pure GPU kernel time via hardware profiling. Requires cupti-python >= 13.

  2. CUDA Graphs (use_cuda_graph=True): Amortizes launch overhead by capturing and replaying multiple kernel calls. Good balance of accuracy and availability.

  3. CUDA Events (default): Simplest method, measures launch + execution. Available everywhere but includes CPU overhead.
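The precedence above can be sketched as a small dispatch function. This is an illustrative sketch only; the helper name and return values are hypothetical, not flashinfer's actual internals.

```python
# Hypothetical sketch of the backend-precedence dispatch described above.
# Names are illustrative; flashinfer's internal dispatch may differ.
def pick_backend(enable_cupti: bool = False, use_cuda_graph: bool = False) -> str:
    if enable_cupti:       # 1. CUPTI: hardware-level kernel timing
        return "cupti"
    if use_cuda_graph:     # 2. CUDA graphs: amortized launch overhead
        return "cudagraph"
    return "cuda_events"   # 3. CUDA events: always available
```

Note that `enable_cupti=True` wins even when `use_cuda_graph=True` is also set.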

Cold-L2 Strategy (automatically selected based on timing backend):

When cold_l2_cache=True, each timed iteration runs with a cold L2 cache, so results are not inflated by data left in cache by the previous iteration. The mechanism used to achieve this depends on the selected timing backend.
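To illustrate the cold-cache idea in a runnable, GPU-free form, the sketch below benchmarks a function on the CPU while overwriting a buffer larger than a typical last-level cache between iterations. This is a conceptual analogue only; flashinfer's GPU implementation flushes L2 by its own mechanism, and the helper below is hypothetical.

```python
import time

# Buffer larger than a typical CPU last-level cache; overwriting it between
# iterations evicts the benchmarked function's data, mimicking a cold cache.
FLUSH_BYTES = 64 * 1024 * 1024
flush_buf = bytearray(FLUSH_BYTES)

def bench_cold(fn, iters: int = 5) -> list[float]:
    times_ms = []
    for _ in range(iters):
        # Touch one byte per page to force the buffer through the cache.
        for i in range(0, FLUSH_BYTES, 4096):
            flush_buf[i] = 0
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return times_ms
```

As with bench_gpu_time, the result is a list of per-iteration times in milliseconds rather than a single aggregate, so callers choose their own summary statistic.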

Returns:

Per-iteration execution times in milliseconds.

Return type:

List[float]
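Because the return value is a plain List[float], standard-library statistics apply directly. A short sketch (the timing values below are made up for illustration):

```python
import statistics

# Per-iteration times in milliseconds, as returned by bench_gpu_time.
times = [0.212, 0.208, 0.215, 0.209, 0.211]

median_ms = statistics.median(times)                  # robust central tendency
p99_ms = sorted(times)[int(0.99 * (len(times) - 1))]  # crude tail estimate
```

Reporting the median rather than the mean is common for GPU microbenchmarks, since occasional scheduling hiccups skew the mean upward.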

Example

Simple benchmarking with CUDA events (default):

>>> times = bench_gpu_time(fn=lambda: my_kernel())
>>> print(f"Median: {np.median(times):.3f} ms")

Example

CUDA graph benchmarking for reduced launch overhead:

>>> def run_kernel(x, y, out):
...     my_memory_bound_kernel(x, y, out)
>>> times = bench_gpu_time(
...     fn=run_kernel,
...     input_args=(x, y, out),
...     use_cuda_graph=True,
... )

Example

CUPTI benchmarking for most accurate GPU kernel time:

>>> times = bench_gpu_time(
...     fn=run_kernel,
...     input_args=(x, y, out),
...     enable_cupti=True,
... )

See also

  • bench_gpu_time_with_cuda_event: Direct CUDA event timing.

  • bench_gpu_time_with_cudagraph: Direct CUDA graph timing.

  • bench_gpu_time_with_cupti: Direct CUPTI timing.

Deprecated: The l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.