flashinfer.comm

This module provides communication primitives and utilities for distributed inference, including CUDA IPC helpers, AllReduce operations (TensorRT-LLM, vLLM, and multi-node NVLink backends), and memory management utilities.

CUDA IPC Utilities

CudaRTLibrary([so_file])

A ctypes-based wrapper around the CUDA runtime shared library (libcudart).

create_shared_buffer(size_in_bytes[, group])

Creates a shared buffer via CUDA IPC and returns a list of pointers to the buffer, one for each process in the group.

free_shared_buffer(pointers[, group])

Frees a shared buffer previously created with create_shared_buffer.

DLPack Utilities

pack_strided_memory(ptr, segment_size, ...)

Packs GPU memory at ptr into a PyTorch tensor with the specified segment size and stride.
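The arithmetic behind strided packing is straightforward: segment i of the resulting view starts at base + i * stride bytes, with the stride at least as large as the segment (any excess is padding, e.g. for alignment). A minimal pure-Python sketch of that offset computation; the helper name below is illustrative, not part of the flashinfer API.

```python
# Illustrative offset math for a strided-packing helper like
# pack_strided_memory: segment i lives at base_ptr + i * segment_stride.
# segment_offsets is a hypothetical name for this sketch.

def segment_offsets(base_ptr: int, segment_size: int, segment_stride: int,
                    num_segments: int) -> list:
    """Byte offset of each packed segment inside the underlying buffer."""
    assert segment_stride >= segment_size, "segments must not overlap"
    return [base_ptr + i * segment_stride for i in range(num_segments)]

# Four 256-byte segments padded out to a 512-byte stride:
offsets = segment_offsets(base_ptr=0, segment_size=256,
                          segment_stride=512, num_segments=4)
print(offsets)  # [0, 512, 1024, 1536]
```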

Mapping Utilities

Mapping([world_size, rank, gpus_per_node, ...])

Describes how ranks map onto parallel dimensions, e.g. a node with 8 GPUs split as tp_size = 4, cp_size = 1, pp_size = 2.
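To make the example concrete, here is how a global rank can be decomposed into per-dimension ranks for the 8-GPU configuration above. The layout chosen here (tensor-parallel rank fastest-varying, then context-parallel, then pipeline-parallel) is an assumption for illustration; consult Mapping for the actual rank ordering.

```python
# Hypothetical rank decomposition, assuming tp varies fastest, then cp,
# then pp. Not the flashinfer implementation.

def decompose_rank(rank, tp_size, cp_size, pp_size):
    tp_rank = rank % tp_size
    cp_rank = (rank // tp_size) % cp_size
    pp_rank = rank // (tp_size * cp_size)
    return tp_rank, cp_rank, pp_rank

# tp_size=4, cp_size=1, pp_size=2: ranks 0-3 form pipeline stage 0,
# ranks 4-7 form pipeline stage 1.
for rank in range(8):
    tp, cp, pp = decompose_rank(rank, tp_size=4, cp_size=1, pp_size=2)
    print(rank, tp, cp, pp)
```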

TensorRT-LLM AllReduce

Types and Enums

Core Operations

trtllm_allreduce_fusion(allreduce_in, ...)

Parameters: - allreduce_in: the input tensor.

trtllm_custom_all_reduce(inp, out, tp_size, ...)

Parameters: - inp: the input tensor.
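Whatever the backend, the contract of a sum all-reduce is the same: every rank ends up with the elementwise sum of all ranks' inputs. A pure-Python reference of that semantics (not the flashinfer kernel, which performs the exchange over CUDA IPC buffers):

```python
# Reference semantics of an out-of-place sum all-reduce across tp_size ranks.

def all_reduce_sum(per_rank_inputs):
    """per_rank_inputs: one equal-length list per rank."""
    reduced = [sum(vals) for vals in zip(*per_rank_inputs)]
    # Out-of-place: every rank receives its own copy of the reduced result.
    return [list(reduced) for _ in per_rank_inputs]

outs = all_reduce_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # tp_size = 3
print(outs[0])  # every rank sees [9.0, 12.0]
```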

trtllm_moe_allreduce_fusion(world_size, ...)

Parameters: - world_size: the size of the process group.

trtllm_moe_finalize_allreduce_fusion(...)

Parameters: - allreduce_in: the input tensor.

Workspace Management

trtllm_create_ipc_workspace_for_all_reduce(...)

Parameters: - rank: the rank of the current process.

trtllm_create_ipc_workspace_for_all_reduce_fusion(...)

Parameters: - tp_rank: the rank of the current process.

trtllm_destroy_ipc_workspace_for_all_reduce(...)

Destroys an IPC workspace previously created by trtllm_create_ipc_workspace_for_all_reduce.

trtllm_destroy_ipc_workspace_for_all_reduce_fusion(...)

Parameters: - workspace: the workspace to destroy.

Initialization and Utilities

trtllm_lamport_initialize(buffer_ptr, size, ...)

trtllm_lamport_initialize_all(buffer_0_ptr, ...)

Initializes the three Lamport buffers by filling them with negative zero.
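Negative zero is a convenient "not yet written" sentinel for Lamport-style synchronization buffers: it has a bit pattern distinct from ordinary data (0x80000000 in fp32), so a kernel can poll for it bitwise, yet it compares equal to 0.0 and is harmless if it leaks into a sum. This rationale is an inference about the design, sketched below:

```python
# -0.0 as a sentinel: distinct bit pattern, arithmetically a plain zero.
import struct

bits = struct.unpack('<I', struct.pack('<f', -0.0))[0]
print(hex(bits))    # 0x80000000 -- detectable by bitwise comparison
print(-0.0 == 0.0)  # True -- indistinguishable in ordinary comparisons
print(0.0 + -0.0)   # 0.0 -- adding the sentinel is a no-op
```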

compute_fp4_swizzled_layout_sf_size(...)

Helper function to compute the padded size of the FP4 swizzled scale-factor layout.
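As an illustration of the kind of padding such a helper performs, the sketch below pads rows to a multiple of 128 and scale-factor columns to a multiple of 4, with one scale per 16-element block. These constants are assumptions based on common NVFP4 scale-factor layouts; compute_fp4_swizzled_layout_sf_size is the authoritative implementation.

```python
# Hypothetical padding arithmetic for an FP4 swizzled scale-factor layout.
# Constants (128-row tiles, 4-column groups, 16-element scale blocks) are
# assumptions for illustration only.

def pad_up(x: int, multiple: int) -> int:
    return (x + multiple - 1) // multiple * multiple

def swizzled_sf_size(rows: int, cols: int, sf_vec_size: int = 16) -> int:
    sf_cols = pad_up(cols // sf_vec_size, 4)  # one scale per 16 values
    return pad_up(rows, 128) * sf_cols

# A 100 x 64 matrix: rows pad up to 128, its 4 scale columns need no padding.
print(swizzled_sf_size(100, 64))  # 512
```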

vLLM AllReduce

vllm_all_reduce(fa, inp, out, reg_buffer, ...)

Performs an out-of-place all-reduce.

vllm_dispose(fa)

vllm_init_custom_ar(ipc_tensors, rank_data, ...)

vllm_register_buffer(fa, fake_ipc_ptrs)

vllm_register_graph_buffers(fa, handles, offsets)

vllm_get_graph_buffer_ipc_meta(fa)

vllm_meta_size()

TensorRT-LLM MNNVL AllReduce

trtllm_mnnvl_all_reduce(inp, ...[, out])

Performs a multi-node NVLink all-reduce operation across multiple GPUs.

trtllm_mnnvl_fused_allreduce_rmsnorm(...)

Performs a fused MNNVL two-shot AllReduce followed by RMSNorm.

mpi_barrier()

Synchronizes all processes with an MPI barrier.