flashinfer.testing.bench_gpu_time_with_cupti

flashinfer.testing.bench_gpu_time_with_cupti(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, use_cuda_graph: bool = False, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)

Benchmark GPU time using CUPTI activity tracing for precise kernel timing.

CUPTI (CUDA Profiling Tools Interface) provides hardware-level profiling that measures actual GPU kernel execution time, excluding CPU-side launch overhead. This gives the most accurate kernel performance measurements.

Cold L2 cache is achieved via L2 flush between iterations. CUPTI measures per-iteration, so L2 flush works correctly regardless of use_cuda_graph.

Behavior:
  • Uses CUPTI (requires cupti-python >= 13, i.e., CUDA 13+) to trace kernel activities and compute per-iteration GPU time from recorded start/end timestamps.

  • Optionally captures operations in a CUDA graph (use_cuda_graph=True) to reduce launch overhead during measurement.

  • If CUPTI is unavailable, falls back to:
      • bench_gpu_time_with_cudagraph if use_cuda_graph=True (uses rotating buffers for cold L2)
      • bench_gpu_time_with_cuda_event otherwise (uses L2 flush for cold L2)
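
The fallback selection above can be sketched as a small dispatch helper. This is a simplified illustration of the documented behavior, not flashinfer's actual implementation; the backend names are the functions listed above:

```python
def select_timing_backend(cupti_available: bool, use_cuda_graph: bool) -> str:
    """Pick the timing backend per the documented fallback rules (sketch only)."""
    if cupti_available:
        # Preferred path: precise per-kernel timestamps from CUPTI tracing.
        return "bench_gpu_time_with_cupti"
    if use_cuda_graph:
        # Graph replay path: rotating buffers keep L2 cold between replays.
        return "bench_gpu_time_with_cudagraph"
    # Default fallback: CUDA events, with an explicit L2 flush for cold cache.
    return "bench_gpu_time_with_cuda_event"
```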

Parameters:
  • fn (Callable) – The kernel function to benchmark.

  • dry_run_iters (int, optional) – Number of warmup iterations (not timed). If None, computed from dry_run_time_ms.

  • repeat_iters (int, optional) – Number of measured iterations. If None, computed from repeat_time_ms.

  • dry_run_time_ms (int) – Target warmup duration in ms (default: 25).

  • repeat_time_ms (int) – Target measurement duration in ms (default: 100).

  • sleep_after_run (bool) – If True, sleep briefly after each iteration (default: False).

  • use_cuda_graph (bool) – If True, capture and replay a CUDA graph (default: False).

  • input_args (tuple) – Positional arguments to pass to fn.

  • input_kwargs (dict, optional) – Keyword arguments to pass to fn.

  • cold_l2_cache (bool) – If True, flush L2 cache before each iteration to ensure cold-cache performance measurements (default: True).
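
When dry_run_iters or repeat_iters is None, the iteration count is derived from the corresponding time target. A plausible sketch of that derivation (an assumption for illustration, not flashinfer's exact formula): time one iteration, then divide the target budget by it, flooring at one iteration.

```python
def estimate_iters(target_time_ms: float, per_iter_ms: float) -> int:
    """Estimate how many iterations fit in a target time budget (ms).

    Hypothetical helper showing how dry_run_time_ms / repeat_time_ms could be
    converted to iteration counts; the library's real formula may differ.
    """
    # Guard against a zero/near-zero measured iteration time.
    return max(1, int(target_time_ms / max(per_iter_ms, 1e-6)))
```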

Returns:

Per-iteration GPU kernel execution times in milliseconds.

Return type:

List[float]

Example

Basic CUPTI benchmarking (requires cupti-python >= 13):

>>> import torch
>>> import numpy as np
>>> from flashinfer.testing import bench_gpu_time_with_cupti
>>> def my_kernel(a, b):
...     return torch.matmul(a, b.T)
>>> q = torch.randn(1024, 128, device="cuda")
>>> k = torch.randn(1024, 128, device="cuda")
>>> times = bench_gpu_time_with_cupti(
...     fn=my_kernel,
...     input_args=(q, k),
... )
>>> print(f"Median GPU time: {np.median(times):.3f} ms")
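
Because the function returns one measurement per iteration, the list can be reduced to any summary statistic, not just the median. A minimal stdlib-only sketch (the input values below are synthetic, not real measurements):

```python
import statistics

def summarize_times(times_ms):
    """Reduce per-iteration GPU kernel times (ms) to a few summary statistics."""
    return {
        "min_ms": min(times_ms),                 # best-case iteration
        "median_ms": statistics.median(times_ms),  # robust central tendency
        "mean_ms": statistics.fmean(times_ms),   # sensitive to outlier iterations
    }

# Synthetic per-iteration times, e.g. with one slow outlier iteration:
stats = summarize_times([1.0, 2.0, 3.0, 4.0, 10.0])
```

The median is usually preferred over the mean here, since occasional slow iterations (clock ramp-up, interference) skew the mean upward.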

Note

Requires cupti-python package version >= 13.0.0: pip install -U cupti-python

If CUPTI is not available, a warning is issued and the function automatically falls back to CUDA event or CUDA graph timing.

Deprecated: the l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.