flashinfer.testing.bench_gpu_time_with_cuda_event

flashinfer.testing.bench_gpu_time_with_cuda_event(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool | None = None, l2_flush_size_mb: int | None = None, l2_flush_device: str | None = None, sleep_after_run: bool = False, input_args: Tuple = (), input_kwargs: dict | None = None, cold_l2_cache: bool = True)

Benchmark kernel execution time using CUDA events (no CUDA graphs).

This is the simplest benchmarking method and is best suited for kernels whose launch overhead is negligible compared to their execution time.

The function performs:

1. A quick estimation phase (5 iterations) to determine iteration counts.
2. Dry-run warmup iterations (not measured).
3. Measured iterations with per-iteration timing via CUDA events.

Iteration counts can be specified directly or derived from target durations:

- If dry_run_iters/repeat_iters are provided, those counts are used directly.
- Otherwise, counts are computed from dry_run_time_ms/repeat_time_ms.
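As a rough illustration (not the library's actual implementation), the derivation from target durations amounts to the sketch below, where estimation_phase_total_ms is a hypothetical name for the wall time measured during the 5-iteration estimation phase:

>>> # Illustrative sketch only; internal details of
>>> # bench_gpu_time_with_cuda_event may differ.
>>> estimated_ms_per_iter = estimation_phase_total_ms / 5
>>> if dry_run_iters is None:
...     dry_run_iters = max(1, int(dry_run_time_ms / estimated_ms_per_iter))
>>> if repeat_iters is None:
...     repeat_iters = max(1, int(repeat_time_ms / estimated_ms_per_iter))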

Parameters:
  • fn (Callable) – The kernel function to benchmark.

  • dry_run_iters (int, optional) – Number of warmup iterations (not timed). If None, computed from dry_run_time_ms.

  • repeat_iters (int, optional) – Number of measured iterations. If None, computed from repeat_time_ms.

  • dry_run_time_ms (int) – Target warmup duration in ms (default: 25).

  • repeat_time_ms (int) – Target measurement duration in ms (default: 100).

  • sleep_after_run (bool) – If True, sleep briefly after each iteration to reduce thermal throttling (default: False).

  • input_args (tuple) – Positional arguments to pass to fn.

  • input_kwargs (dict, optional) – Keyword arguments to pass to fn.

  • cold_l2_cache (bool) – If True, flush L2 cache before each iteration to ensure cold-cache performance measurements (default: True).

Returns:

Per-iteration execution times in milliseconds.

Return type:

List[float]

Example

Basic usage:

>>> import torch
>>> import numpy as np
>>> from flashinfer.testing import bench_gpu_time_with_cuda_event
>>> def my_kernel(a, b):
...     return torch.matmul(a, b.T)
>>> q = torch.randn(1024, 128, device="cuda")
>>> k = torch.randn(1024, 128, device="cuda")
>>> times = bench_gpu_time_with_cuda_event(
...     fn=my_kernel,
...     input_args=(q, k),
... )
>>> print(f"Median time: {np.median(times):.3f} ms")

Note

This method does NOT use CUDA graphs, so each iteration incurs kernel launch overhead. For microbenchmarking where launch latency matters, consider using bench_gpu_time_with_cudagraph instead.
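For very small kernels, the impact of launch overhead can be gauged by benchmarking the same function with both helpers. The sketch below assumes bench_gpu_time_with_cudagraph accepts a similar fn/input_args interface; check its own documentation for the exact signature:

>>> from flashinfer.testing import bench_gpu_time_with_cudagraph
>>> times_event = bench_gpu_time_with_cuda_event(fn=my_kernel, input_args=(q, k))
>>> times_graph = bench_gpu_time_with_cudagraph(fn=my_kernel, input_args=(q, k))
>>> # With CUDA graphs, per-iteration launch overhead is amortized, so the
>>> # graph-based median is typically at or below the event-based median.
>>> print(f"{np.median(times_event):.3f} ms vs {np.median(times_graph):.3f} ms")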

Deprecated: the l2_flush, l2_flush_size_mb, and l2_flush_device parameters are deprecated. Use cold_l2_cache instead.