flashinfer.testing.bench_gpu_time_with_cupti¶
- flashinfer.testing.bench_gpu_time_with_cupti(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, use_cuda_graph: bool = False, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)¶
Benchmark GPU time using CUPTI activity tracing for precise kernel timing.
CUPTI (CUDA Profiling Tools Interface) provides hardware-level profiling that measures actual GPU kernel execution time, excluding CPU-side launch overhead. This gives the most accurate kernel performance measurements.
Cold L2 cache is achieved via an L2 flush between iterations. CUPTI measures per iteration, so the L2 flush works correctly regardless of use_cuda_graph.
Behavior:
- Uses CUPTI (requires version >= 13, i.e., CUDA 13+) to trace kernel activities and compute per-iteration GPU time from recorded start/end timestamps.
- Optionally captures operations in a CUDA graph (use_cuda_graph=True) for reduced launch overhead during measurement.
- If CUPTI is unavailable, falls back to:
  - bench_gpu_time_with_cudagraph if use_cuda_graph=True (uses rotating buffers for cold L2)
  - bench_gpu_time_with_cuda_event otherwise (uses an L2 flush for cold L2)
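The fallback order above can be sketched as a small pure-Python dispatcher. This helper is illustrative only: the returned names mirror the flashinfer timers mentioned above, but select_gpu_timer itself is not part of the library.

```python
def select_gpu_timer(cupti_available: bool, use_cuda_graph: bool) -> str:
    """Pick a timing backend, mirroring the documented fallback order."""
    if cupti_available:
        # Preferred path: CUPTI activity tracing (requires CUDA 13+).
        return "bench_gpu_time_with_cupti"
    if use_cuda_graph:
        # CUDA-graph replay timing; uses rotating buffers for cold L2.
        return "bench_gpu_time_with_cudagraph"
    # CUDA-event timing; uses an explicit L2 flush for cold L2.
    return "bench_gpu_time_with_cuda_event"
```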
- Parameters:
fn (Callable) – The kernel function to benchmark.
dry_run_iters (int, optional) – Number of warmup iterations (not timed). If None, computed from dry_run_time_ms.
repeat_iters (int, optional) – Number of measured iterations. If None, computed from repeat_time_ms.
dry_run_time_ms (int) – Target warmup duration in ms (default: 25).
repeat_time_ms (int) – Target measurement duration in ms (default: 100).
sleep_after_run (bool) – If True, sleep briefly after each iteration (default: False).
use_cuda_graph (bool) – If True, capture and replay a CUDA graph (default: False).
input_args (tuple) – Positional arguments to pass to fn.
input_kwargs (dict, optional) – Keyword arguments to pass to fn.
cold_l2_cache (bool) – If True, flush L2 cache before each iteration to ensure cold-cache performance measurements (default: True).
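When dry_run_iters and repeat_iters are None, the counts are derived from the target durations. One plausible derivation is sketched below; derive_iters is a hypothetical helper, and flashinfer's actual heuristic may differ.

```python
def derive_iters(target_ms: float, estimated_iter_ms: float, min_iters: int = 1) -> int:
    # Hypothetical sketch: run enough iterations to roughly fill the
    # target duration, given a per-iteration estimate from a probe run.
    if estimated_iter_ms <= 0:
        return min_iters
    return max(min_iters, int(target_ms / estimated_iter_ms))

# With the defaults dry_run_time_ms=25 and repeat_time_ms=100, a kernel
# estimated at 0.5 ms/iteration would warm up for 50 iterations and
# measure for 200 under this sketch.
```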
- Returns:
Per-iteration GPU kernel execution times in milliseconds.
- Return type:
List[float]
Example
Basic CUPTI benchmarking (requires cupti-python >= 13):
>>> def my_kernel(a, b):
...     return torch.matmul(a, b.T)
>>> q = torch.randn(1024, 128, device="cuda")
>>> k = torch.randn(1024, 128, device="cuda")
>>> times = bench_gpu_time_with_cupti(
...     fn=my_kernel,
...     input_args=(q, k),
... )
>>> print(f"Median GPU time: {np.median(times):.3f} ms")
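Because the return value is a plain list of per-iteration times in milliseconds, standard NumPy reductions summarize it directly. The times list below is made up for illustration, standing in for a real result from the benchmark:

```python
import numpy as np

# Stand-in for the list returned by bench_gpu_time_with_cupti().
times = [0.102, 0.098, 0.101, 0.097, 0.150, 0.099]

# Median is robust to stragglers like the 0.150 ms outlier above,
# which is why it is usually preferred over the mean for kernel timing.
median_ms = float(np.median(times))
p90_ms = float(np.percentile(times, 90))
print(f"median={median_ms:.3f} ms, p90={p90_ms:.3f} ms")
```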
Note
Requires the cupti-python package, version >= 13.0.0: pip install -U cupti-python. If CUPTI is not available, a warning is issued and the function automatically falls back to CUDA event or CUDA graph timing.
Deprecated
The l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.