flashinfer.testing.bench_gpu_time_with_cupti
- flashinfer.testing.bench_gpu_time_with_cupti(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool = True, l2_flush_size_mb: int = 256, l2_flush_device: str = 'cuda', sleep_after_run: bool = False, use_cuda_graph: bool = False)
Benchmark GPU time using CUPTI activity tracing to measure kernel execution time.
Behavior:
- Uses CUPTI (>=13) to capture runtime launches and concurrent kernel activities, and computes per-iteration GPU time from the recorded activities.
- Supports optional CUDA Graph capture (use_cuda_graph=True). In this mode, a single replay of the captured graph is timed per iteration.
- If CUPTI is unavailable or <13, falls back to CUDA events or CUDA graphs, depending on use_cuda_graph.
- Dry-run and repeat iterations can be specified directly or derived from target times (dry_run_time_ms/repeat_time_ms) using a short estimate phase.
- Optionally flushes L2 and sleeps after runs to reduce throttling.
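The estimate phase is not described in detail here. As a rough sketch only, assuming the iteration count is the target time divided by an estimated per-call time and rounded up, the inference could look like the following. The helper `infer_iters` and the 0.5 ms per-call estimate are hypothetical and not part of FlashInfer.

```python
import math


def infer_iters(target_time_ms: float, estimated_iter_ms: float) -> int:
    """Hypothetical helper: iterations needed to fill target_time_ms."""
    if estimated_iter_ms <= 0:
        raise ValueError("estimated_iter_ms must be positive")
    # Run at least once, even if a single call already exceeds the target.
    return max(1, math.ceil(target_time_ms / estimated_iter_ms))


# Default targets (25 ms dry run, 100 ms measurement) with an assumed
# per-call estimate of 0.5 ms.
dry_run_iters = infer_iters(25, 0.5)
repeat_iters = infer_iters(100, 0.5)
```

Passing dry_run_iters or repeat_iters explicitly makes the iteration counts fixed, which may be preferable when runs need to be directly comparable.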
- Parameters:
fn – Callable to benchmark.
dry_run_iters – Dry-run iterations; if None, inferred from dry_run_time_ms.
repeat_iters – Measurement iterations; if None, inferred from repeat_time_ms.
dry_run_time_ms – Target dry-run duration in ms when inferring iterations.
repeat_time_ms – Target measurement duration in ms when inferring iterations.
l2_flush – Whether to flush L2 before each iteration.
l2_flush_size_mb – Size, in MB, of the buffer used for the L2 flush.
l2_flush_device – Device for the flush buffer.
sleep_after_run – Whether to sleep briefly after each iteration.
use_cuda_graph – If True, capture and replay a CUDA graph during timing.
- Returns:
Measured times in milliseconds per iteration.
- Return type:
List[float]
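A minimal usage sketch, assuming fn is invoked with no arguments and that PyTorch and a CUDA device are available. The matrix-multiply workload and the helpers `summarize_times` and `run_benchmark` are illustrative and not part of FlashInfer. Only the returned list of per-iteration times in milliseconds comes from the documentation above.

```python
import statistics
from typing import Dict, List


def summarize_times(times_ms: List[float]) -> Dict[str, float]:
    """Reduce per-iteration times (ms) to a few summary statistics."""
    return {
        "median_ms": statistics.median(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
    }


def run_benchmark() -> Dict[str, float]:
    # Imported lazily so summarize_times works on machines without a GPU.
    import torch
    from flashinfer.testing import bench_gpu_time_with_cupti

    # Hypothetical workload: a half-precision matrix multiply.
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    times_ms = bench_gpu_time_with_cupti(
        lambda: torch.matmul(a, b),  # assumed to be called with no arguments
        dry_run_time_ms=25,
        repeat_time_ms=100,
        l2_flush=True,
    )
    return summarize_times(times_ms)
```

Reporting the median rather than the mean is a common choice here, since it is less sensitive to occasional slow iterations.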