flashinfer.testing.bench_gpu_time_with_cudagraph¶
- flashinfer.testing.bench_gpu_time_with_cudagraph(fn, dry_run_iters: int = None, repeat_iters: int = None, dry_run_time_ms: int = 25, repeat_time_ms: int = 100, num_iters_within_graph: int = 10, l2_flush: bool = True, l2_flush_size_mb: int = 256, l2_flush_device: str = 'cuda', sleep_after_run: bool = False)¶
Benchmark GPU time by constructing a CUDA graph that launches the kernel, then replaying the graph. Increasing the number of iterations within the graph amortizes kernel launch latency, helping obtain measurements close to the pure GPU kernel time of fn(). Can optionally flush the L2 cache and sleep after each run.
The number of dry-run and measured iterations can be set either by iteration count or by time budget:
- If dry_run_iters and repeat_iters are provided, those iteration counts are used.
- Otherwise, dry_run_time_ms and repeat_time_ms are used to derive the counts.
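The selection rule above can be sketched as follows; `resolve_iters` and `est_iter_ms` are hypothetical names for illustration, not part of flashinfer:

```python
def resolve_iters(explicit_iters, time_budget_ms, est_iter_ms):
    """Pick an iteration count: an explicit count wins; otherwise derive
    one from a time budget and an estimated per-iteration time.
    (Hypothetical sketch, not flashinfer's actual implementation.)"""
    if explicit_iters is not None:
        return explicit_iters
    return max(1, int(time_budget_ms / est_iter_ms))

# An explicit count takes precedence over the time budget:
print(resolve_iters(30, 100, 5.0))    # -> 30
# Without an explicit count, the budget determines the count:
print(resolve_iters(None, 100, 5.0))  # -> 20
```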
Returns an array of measured times so that the caller can compute statistics.
Uses PyTorch’s API to construct and replay CUDA Graphs. See also PyTorch’s post on CUDA Graphs: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
- Parameters:
fn – Function to benchmark.
dry_run_iters – Number of warm-up iterations whose times are not counted. If not provided, dry_run_time_ms is used.
repeat_iters – Number of measured iterations. If not provided, repeat_time_ms is used.
dry_run_time_ms – Time budget for the dry-run phase, in milliseconds.
repeat_time_ms – Time budget for the measured phase, in milliseconds.
num_iters_within_graph – Number of iterations to run within the graph.
l2_flush – Whether to flush L2 cache.
l2_flush_size_mb – Size in MB of the buffer written to flush the L2 cache.
l2_flush_device – Device on which to flush the L2 cache.
sleep_after_run – Whether to sleep after the run. Sleep time is dynamically set.
- Returns:
List of measured times.
- Return type:
list[float]
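Since the function returns the raw measured times, the caller computes any statistics it needs. A minimal sketch of such post-processing; the actual call is shown commented out because it requires flashinfer and a CUDA device, so placeholder values stand in for the returned list:

```python
import statistics

# On a machine with a CUDA device, you would collect real measurements:
#   from flashinfer.testing import bench_gpu_time_with_cudagraph
#   times_ms = bench_gpu_time_with_cudagraph(fn, num_iters_within_graph=10)
# Placeholder measurements stand in for the returned list here:
times_ms = [0.101, 0.098, 0.105, 0.099, 0.102]

print(f"min    = {min(times_ms):.3f} ms")
print(f"median = {statistics.median(times_ms):.3f} ms")
print(f"mean   = {statistics.mean(times_ms):.3f} ms")
```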