flashinfer.comm

This module provides communication primitives and utilities for distributed computing, including CUDA IPC, AllReduce operations, and memory management utilities.

CUDA IPC Utilities

`CudaRTLibrary`([so_file])
`create_shared_buffer`(size_in_bytes[, group])	Creates a shared buffer and returns a list of pointers representing the buffer on all processes in the group.
`free_shared_buffer`(pointers[, group])	Frees a shared buffer.

DLPack Utilities

`pack_strided_memory`(ptr, segment_size, ...)	Pack GPU memory into a PyTorch tensor with specified stride.

Mapping Utilities

`Mapping`([world_size, rank, gpus_per_node, ...])	Example configuration: a node with 8 GPUs, tp_size = 4, cp_size = 1, pp_size = 2.
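The example configuration above (8 GPUs, tp_size = 4, cp_size = 1, pp_size = 2) can be illustrated with a small, library-independent sketch of how a global rank might decompose into parallel coordinates. This does not call `Mapping`; the helper names (`RankCoords`, `decompose_rank`, `tp_group`, `pp_group`) are hypothetical, and the rank ordering (tensor parallel innermost, pipeline parallel outermost) is an assumption borrowed from the usual TensorRT-LLM convention rather than something this page states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RankCoords:
    """Position of one global rank in the (pp, cp, tp) grid."""
    pp_rank: int
    cp_rank: int
    tp_rank: int


def decompose_rank(rank: int, tp_size: int, cp_size: int, pp_size: int) -> RankCoords:
    """Split a global rank into pipeline, context, and tensor coordinates."""
    world_size = tp_size * cp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world of size {world_size}")
    # ASSUMPTION: TP varies fastest, then CP, then PP (not confirmed by this page).
    tp_rank = rank % tp_size
    cp_rank = (rank // tp_size) % cp_size
    pp_rank = rank // (tp_size * cp_size)
    return RankCoords(pp_rank=pp_rank, cp_rank=cp_rank, tp_rank=tp_rank)


def tp_group(rank: int, tp_size: int) -> List[int]:
    """Global ranks that share a tensor-parallel group with `rank`."""
    start = (rank // tp_size) * tp_size
    return list(range(start, start + tp_size))


def pp_group(rank: int, tp_size: int, cp_size: int, pp_size: int) -> List[int]:
    """Global ranks that form a pipeline with `rank` (one per stage)."""
    stride = tp_size * cp_size
    offset = rank % stride
    return [offset + stage * stride for stage in range(pp_size)]


# 8 GPUs, tp_size = 4, cp_size = 1, pp_size = 2
coords = decompose_rank(5, tp_size=4, cp_size=1, pp_size=2)
```

Under these assumptions, rank 5 sits in the second pipeline stage, shares a tensor-parallel group with ranks 4 through 7, and is paired with rank 1 for pipeline communication.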

TensorRT-LLM AllReduce

Types and Enums

`AllReduceFusionOp`()
`AllReduceFusionPattern`()
`AllReduceStrategyConfig`()
`AllReduceStrategyType`()

Core Operations

`trtllm_allreduce_fusion`(allreduce_in, ...[, ...])	Parameters:
- allreduce_in: the input tensor. [token_num, hidden_dim]
- world_size: the size of the process group.
- world_rank: the rank of the current process.
- token_num: the number of tokens in the sequence.
- hidden_dim: the dimension of the hidden states.
- workspace_ptrs: the workspace pointers.
- launch_with_pdl: whether to launch with PDL (programmatic dependent launch).
- use_oneshot: whether to use the one-shot strategy. If None, internal heuristics are used.
- trigger_completion_at_end: whether to trigger completion at the end.
- fp32_acc: whether to use fp32 accumulation.
- pattern_code: the pattern code.
- allreduce_out: the output tensor. [token_num, hidden_dim]
- residual_in: the residual input tensor. [token_num, hidden_dim]
- residual_out: the residual output tensor. [token_num, hidden_dim]
- norm_out: the norm output tensor. [token_num, hidden_dim]
- quant_out: the quant output tensor. [token_num, hidden_dim]
- scale_out: the scale output tensor. Initialization reference: tests/comm/test_trtllm_allreduce_fusion.py
- rms_gamma: the RMS gamma tensor. [hidden_dim]
- rms_eps: the RMS epsilon value.
- scale_factor: the scale factor. For CUDA graph safety, it should be a tensor.
- layout_code: the layout code.
- metadata: optional workspace metadata dict from `trtllm_create_ipc_workspace_for_all_reduce_fusion`. If provided, validates that token_num <= max_token_num, world_size == tp_size, and hidden_dim == workspace hidden_dim. Raises ValueError if validation fails.
`trtllm_custom_all_reduce`(inp, out, tp_size, ...)	Parameters: - inp: the input tensor.
`trtllm_moe_allreduce_fusion`(world_size, ...)	Parameters: - world_size: the size of the process group.
`trtllm_moe_finalize_allreduce_fusion`(...)	Parameters: - allreduce_in: the input tensor.
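The `metadata` check described for `trtllm_allreduce_fusion` (token_num <= max_token_num, world_size == tp_size, hidden_dim matching the workspace) can be sketched in plain Python. This is an illustration of the documented rule only, not the library's code: the function name `validate_workspace_metadata` and the dictionary keys `max_token_num`, `tp_size`, and `hidden_dim` are assumptions chosen to mirror the wording above.

```python
from typing import Mapping


def validate_workspace_metadata(
    metadata: Mapping[str, int],
    token_num: int,
    world_size: int,
    hidden_dim: int,
) -> None:
    """Raise ValueError if a call does not fit the workspace it was given.

    ASSUMPTION: the key names below are hypothetical; the real metadata
    dict may use different keys.
    """
    if token_num > metadata["max_token_num"]:
        raise ValueError(
            f"token_num ({token_num}) exceeds workspace max_token_num "
            f"({metadata['max_token_num']})"
        )
    if world_size != metadata["tp_size"]:
        raise ValueError(
            f"world_size ({world_size}) does not match workspace tp_size "
            f"({metadata['tp_size']})"
        )
    if hidden_dim != metadata["hidden_dim"]:
        raise ValueError(
            f"hidden_dim ({hidden_dim}) does not match workspace hidden_dim "
            f"({metadata['hidden_dim']})"
        )


# A workspace sized for up to 128 tokens, 4 ranks, hidden size 1024.
example_metadata = {"max_token_num": 128, "tp_size": 4, "hidden_dim": 1024}
validate_workspace_metadata(example_metadata, token_num=64, world_size=4, hidden_dim=1024)
```

Checking these three values before launch is cheap and turns a silent out-of-bounds write into an immediate, descriptive error, which is presumably why the function accepts the optional dict.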

Workspace Management

`trtllm_create_ipc_workspace_for_all_reduce`(...)	Parameters: - rank: the rank of the current process.
`trtllm_create_ipc_workspace_for_all_reduce_fusion`(...)	Parameters: - tp_rank: the rank of the current process.
`trtllm_destroy_ipc_workspace_for_all_reduce`(...)	Destroys a workspace for all-reduce.
`trtllm_destroy_ipc_workspace_for_all_reduce_fusion`(...)	Parameters: - workspace: the workspace to destroy.

Initialization and Utilities

`trtllm_lamport_initialize`(buffer_ptr, size, ...)
`trtllm_lamport_initialize_all`(buffer_0_ptr, ...)	Initializes three Lamport buffers with negative zero.
`compute_fp4_swizzled_layout_sf_size`(...)	Helper function to compute the padded size of the fp4 swizzled layout.
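Initializing Lamport buffers with negative zero makes sense once the bit pattern is visible: in IEEE 754 single precision, -0.0 compares equal to 0.0 numerically but has a distinct encoding (only the sign bit set), so it can act as a "not yet written" sentinel inside the data buffer itself. The sketch below shows that idea on a Python list standing in for a GPU buffer. The helper names are hypothetical, and the polling logic is a simplified assumption about how such a sentinel is typically consumed, not a description of the kernels behind `trtllm_lamport_initialize`.

```python
import struct
from typing import List

# IEEE 754 binary32 encoding of -0.0: sign bit set, everything else clear.
NEG_ZERO_BITS = 0x80000000


def f32_bits(value: float) -> int:
    """Return the 32-bit pattern of `value` stored as a float32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def is_sentinel(value: float) -> bool:
    """True if the slot still holds the negative-zero sentinel."""
    return f32_bits(value) == NEG_ZERO_BITS


def lamport_init(num_elements: int) -> List[float]:
    """Fill a buffer with the sentinel, marking every slot as unwritten."""
    return [-0.0] * num_elements


def all_written(buffer: List[float]) -> bool:
    """A reader may consume the buffer once no slot holds the sentinel."""
    return not any(is_sentinel(v) for v in buffer)


buf = lamport_init(4)
ready_before = all_written(buf)      # nothing has arrived yet
buf[:] = [1.5, 0.0, -2.0, 3.25]      # a peer writes its payload (+0.0 is valid data)
ready_after = all_written(buf)
```

One consequence of this scheme is that a writer must never store -0.0 as real payload, since the reader would mistake it for an unwritten slot.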

vLLM AllReduce

`vllm_all_reduce`(fa, inp, out, reg_buffer, ...)	Performs an out-of-place all reduce.
`vllm_dispose`(fa)
`vllm_init_custom_ar`(ipc_tensors, rank_data, ...)
`vllm_register_buffer`(fa, fake_ipc_ptrs)
`vllm_register_graph_buffers`(fa, handles, offsets)
`vllm_get_graph_buffer_ipc_meta`(fa)
`vllm_meta_size`()
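For readers unfamiliar with the term, "out-of-place all reduce" means every rank ends up with the element-wise sum of all ranks' inputs in a separate output buffer, while the inputs stay untouched. The sketch below simulates that semantics on plain Python lists, one list per rank. It is a reference for the expected result only; `all_reduce_out_of_place` is a hypothetical name and does not reflect the arguments or behavior of `vllm_all_reduce` beyond the summary above.

```python
from typing import List


def all_reduce_out_of_place(inputs: List[List[float]]) -> List[List[float]]:
    """Return one output buffer per rank, each holding the element-wise sum.

    `inputs[r]` is the buffer contributed by rank r; it is never modified.
    """
    if not inputs:
        raise ValueError("at least one rank is required")
    length = len(inputs[0])
    if any(len(buf) != length for buf in inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    # Sum position i across every rank's input.
    total = [sum(buf[i] for buf in inputs) for i in range(length)]
    # Each rank receives its own copy of the reduced result.
    return [list(total) for _ in inputs]


rank_inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rank_outputs = all_reduce_out_of_place(rank_inputs)
```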

MNNVL (Multi-Node NVLink)

Core Classes

`MnnvlMemory`(mapping, size)
`McastGPUBuffer`(buf_size, group_size, ...[, ...])	Wrapper class for McastDeviceMemory to facilitate PyTorch tensor creation.

Utility Functions

`create_tensor_from_cuda_memory`(ptr, shape, ...)	Create a PyTorch tensor from a CUDA memory pointer using DLPack.
`alloc_and_copy_to_cuda`(host_ptr_array)	A helper function that allocates CUDA device memory and copies the data from the host to the device.

TensorRT-LLM MNNVL AllReduce

`trtllm_mnnvl_all_reduce`(inp, ...[, out])	Perform a multi-node NVLink all-reduce operation across multiple GPUs.
`trtllm_mnnvl_fused_allreduce_rmsnorm`(...)	Performs MNNVL TwoShot Allreduce + RMSNorm.
`mpi_barrier`()
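The "TwoShot Allreduce + RMSNorm" combination can be read as two steps: an all-reduce done in two phases (reduce-scatter, then all-gather), followed by RMS normalization of the reduced vector. The following is a minimal CPU reference in plain Python, assuming the common meaning of "two-shot" and the standard RMSNorm formula x / sqrt(mean(x^2) + eps) * gamma. The function names are hypothetical, and any residual addition or other fusion the real kernel performs is deliberately left out.

```python
import math
from typing import List


def two_shot_all_reduce(inputs: List[List[float]]) -> List[List[float]]:
    """Simulate a two-shot all-reduce over `inputs[r]`, one buffer per rank."""
    world_size = len(inputs)
    length = len(inputs[0])
    if length % world_size != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // world_size
    # Shot 1 (reduce-scatter): rank r sums only its own chunk across all ranks.
    owned = [
        [sum(buf[r * chunk + j] for buf in inputs) for j in range(chunk)]
        for r in range(world_size)
    ]
    # Shot 2 (all-gather): every rank collects all reduced chunks in order.
    gathered = [value for part in owned for value in part]
    return [list(gathered) for _ in range(world_size)]


def rms_norm(x: List[float], gamma: List[float], eps: float) -> List[float]:
    """Scale x to unit root-mean-square, then apply per-element weights."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


per_rank = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
reduced = two_shot_all_reduce(per_rank)
normed = rms_norm(reduced[0], gamma=[1.0] * 4, eps=0.0)
```

A reference like this is mainly useful for checking a fused kernel's output numerically on small inputs.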