flashinfer.testing.bench_gpu_time_with_cupti

flashinfer.testing.bench_gpu_time_with_cupti(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, l2_flush: bool = True, l2_flush_size_mb: int = 256, l2_flush_device: str = 'cuda', sleep_after_run: bool = False, use_cuda_graph: bool = False)

Benchmark GPU time using CUPTI activity tracing to measure kernel execution time.

Behavior:
  • Uses CUPTI (>=13) to capture runtime launches and concurrent kernel activities, and computes per-iteration GPU time from the recorded activities.

  • Supports optional CUDA Graph capture (use_cuda_graph=True). In this mode, a single replay of the captured graph is timed per iteration.

  • If CUPTI is unavailable or older than version 13, falls back to timing with CUDA events or CUDA graphs, depending on use_cuda_graph.

  • Dry run and repeat iterations can be specified directly or derived from target times (dry_run_time_ms/repeat_time_ms) using a short estimate phase.

  • Optionally flushes the L2 cache before each iteration (l2_flush) and sleeps briefly after each iteration (sleep_after_run) to reduce throttling.
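
The derivation of iteration counts from a target time can be pictured with the sketch below. It is an illustration only: the helper name infer_iters, the rounding rule, and the minimum of one iteration are assumptions, and the library's actual estimate phase may differ.

```python
import math


def infer_iters(target_time_ms: float, estimated_iter_ms: float, min_iters: int = 1) -> int:
    """Hypothetical helper: how many iterations fit in a target duration.

    estimated_iter_ms stands in for the per-iteration time that a short
    estimate phase would produce. Rounding up and clamping to min_iters
    are assumptions made for this sketch, not documented behavior.
    """
    if estimated_iter_ms <= 0:
        raise ValueError("estimated_iter_ms must be positive")
    return max(min_iters, math.ceil(target_time_ms / estimated_iter_ms))


# With the default targets and an estimated 0.5 ms per iteration:
dry_run_iters = infer_iters(25, 0.5)   # dry_run_time_ms=25
repeat_iters = infer_iters(100, 0.5)   # repeat_time_ms=100
```

Passing dry_run_iters or repeat_iters explicitly skips this inference for the corresponding phase, which may be preferable when runs must be comparable across machines.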

Parameters:
  • fn – Callable to benchmark.

  • dry_run_iters – Dry-run iterations; if None, inferred from dry_run_time_ms.

  • repeat_iters – Measurement iterations; if None, inferred from repeat_time_ms.

  • dry_run_time_ms – Target dry-run duration in ms when inferring iterations.

  • repeat_time_ms – Target measurement duration in ms when inferring iterations.

  • l2_flush – Whether to flush L2 before each iteration.

  • l2_flush_size_mb – Size in MB of the buffer used for the L2 flush.

  • l2_flush_device – Device for the flush buffer.

  • sleep_after_run – Whether to sleep briefly after each iteration.

  • use_cuda_graph – If True, capture and replay a CUDA graph during timing.
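
A minimal usage sketch follows, assuming PyTorch and flashinfer are installed and a CUDA device is present. The function name bench_matmul, the matrix sizes, and the choice of torch.matmul as the workload are illustrative; the sketch also assumes fn is invoked with no arguments. The keyword values simply restate the defaults from the signature above.

```python
# Keyword arguments mirroring the documented defaults.
BENCH_KWARGS = dict(
    dry_run_time_ms=25,      # target warm-up duration
    repeat_time_ms=100,      # target measurement duration
    l2_flush=True,           # flush L2 before each iteration
    l2_flush_size_mb=256,
    l2_flush_device="cuda",
    sleep_after_run=False,
    use_cuda_graph=False,    # set True to time one graph replay per iteration
)


def bench_matmul(m: int = 4096, n: int = 4096, k: int = 4096):
    """Illustrative wrapper: time a half-precision matmul on the GPU."""
    # Imports are kept local so the module can be loaded without a GPU stack.
    import torch
    from flashinfer.testing import bench_gpu_time_with_cupti

    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)

    # fn is assumed to be a zero-argument callable; a lambda closes over inputs.
    return bench_gpu_time_with_cupti(lambda: torch.matmul(a, b), **BENCH_KWARGS)
```

Calling bench_matmul() on a machine with a CUDA device should return the per-iteration times described under Returns.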

Returns:

Measured times in milliseconds per iteration.

Return type:

List[float]
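
Because the result is a plain list of per-iteration times, it is usually reduced to a few summary statistics before being reported. The sketch below uses only the standard library; the helper name summarize_times and the sample values are hypothetical and not real measurements.

```python
import statistics
from typing import Dict, List


def summarize_times(times_ms: List[float]) -> Dict[str, float]:
    """Reduce per-iteration times (ms) to a few robust summary statistics."""
    if not times_ms:
        raise ValueError("times_ms must contain at least one measurement")
    return {
        "median_ms": statistics.median(times_ms),
        "mean_ms": statistics.fmean(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
        # Sample standard deviation needs at least two points.
        "stdev_ms": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
    }


# Hypothetical values standing in for a returned List[float].
sample = [1.0, 2.0, 3.0, 4.0]
summary = summarize_times(sample)
```

The median is often preferred over the mean for GPU timings, since occasional slow iterations skew the mean more strongly.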