flashinfer.comm¶
This module provides communication primitives for distributed computing, including CUDA IPC helpers, AllReduce operations, and memory management utilities.
CUDA IPC Utilities¶
- Creates a shared buffer and returns a list of pointers representing the buffer on all processes in the group.
- Frees a shared buffer.
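A minimal usage sketch of the create/free pair summarized above. The names create_shared_buffer and free_shared_buffer and their size-in-bytes/group arguments follow these summaries but should be treated as assumptions; consult the per-function signatures.

```python
import torch.distributed as dist
import flashinfer.comm as comm

# Hedged sketch: every rank in the group creates its slice of the IPC-shared
# allocation, receives the full list of per-rank pointers, and later frees it.
dist.init_process_group(backend="nccl")
group = dist.group.WORLD

ptrs = comm.create_shared_buffer(8 * 1024 * 1024, group=group)  # assumed signature
# ptrs[i] is rank i's buffer mapped into this process's address space.

# ... hand `ptrs` to a kernel that needs peer access ...

comm.free_shared_buffer(ptrs, group=group)  # assumed signature
dist.destroy_process_group()
```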
DLPack Utilities¶
- Pack GPU memory into a PyTorch tensor with specified stride.
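The strided-packing idea can be sketched without the library: given one flat GPU allocation, expose fixed-size segments spaced `stride` elements apart as rows of a single tensor. This is a conceptual stand-in using torch.as_strided, not the DLPack-based helper itself.

```python
import torch

num_segments, segment_size, stride = 4, 8, 16  # stride >= segment_size, in elements
flat = torch.zeros(num_segments * stride, dtype=torch.float16, device="cuda")

# View every segment as one row; the gap between rows (stride - segment_size
# elements) is skipped without copying. The flashinfer helper builds the same
# kind of view over a raw CUDA pointer via DLPack.
packed = flat.as_strided(size=(num_segments, segment_size), stride=(stride, 1))
print(packed.shape)  # torch.Size([4, 8])
```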
Mapping Utilities¶
- Describes a parallelism mapping; for example, a node with 8 GPUs with tp_size = 4, cp_size = 1, pp_size = 2.
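As an illustration of the example configuration above, here is a plain-Python sketch of how a global rank can decompose into tp/cp/pp coordinates. The tp-fastest rank layout is an assumption for the sketch, not necessarily the order the Mapping class uses.

```python
tp_size, cp_size, pp_size = 4, 1, 2
world_size = tp_size * cp_size * pp_size  # 8 GPUs on the node

for rank in range(world_size):
    tp_rank = rank % tp_size                # position inside the tensor-parallel group
    cp_rank = (rank // tp_size) % cp_size   # context-parallel position
    pp_rank = rank // (tp_size * cp_size)   # pipeline stage
    print(f"rank {rank}: tp={tp_rank} cp={cp_rank} pp={pp_rank}")
```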
TensorRT-LLM AllReduce¶
Types and Enums¶
Core Operations¶
- Parameters: allreduce_in: the input tensor.
- Parameters: inp: the input tensor.
- Parameters: world_size: the size of the process group.
- Parameters: allreduce_in: the input tensor.
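For reference, the reduction these operations implement has the same semantics as a standard all-reduce over the tensor-parallel group; a sketch with torch.distributed rather than the custom IPC kernels:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank holds a partial result (e.g. a shard of a matmul); after the
# all-reduce every rank holds the elementwise sum across the whole group.
allreduce_in = torch.full((1024,), float(rank), device="cuda", dtype=torch.float16)
dist.all_reduce(allreduce_in, op=dist.ReduceOp.SUM)
```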
Workspace Management¶
- Parameters: rank: the rank of the current process.
- Parameters: tp_rank: the tensor-parallel rank of the current process.
- Note: used to destroy a workspace for all reduce.
- Parameters: workspace: the workspace to destroy.
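A hedged lifecycle sketch for the workspace helpers in this group. The create/destroy names and keyword arguments below are inferred from the parameter summaries above and may not match the released signatures exactly.

```python
import torch.distributed as dist
import flashinfer.comm as comm

dist.init_process_group(backend="nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()

# 1. Every rank participates in creating the shared IPC workspace.
workspace = comm.trtllm_create_ipc_workspace_for_all_reduce(  # assumed name/arguments
    rank=rank, tp_size=world_size, max_token_num=1024, hidden_dim=4096
)

# 2. ... run the custom all-reduce kernels against `workspace` ...

# 3. Destroy the workspace collectively before tearing down the process group.
comm.trtllm_destroy_ipc_workspace_for_all_reduce(workspace)  # assumed name/arguments
dist.destroy_process_group()
```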
Initialization and Utilities¶
- Initializes the three Lamport buffers to negative zero.
- Helper function to compute the padded size of the fp4 swizzled layout.
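The negative-zero initialization is worth a note: -0.0 compares equal to 0.0 but carries a distinct bit pattern (only the sign bit set), which is commonly used in Lamport-style buffers as a cheap "slot not yet written" sentinel. A small sketch of that property:

```python
import struct
import torch

print(struct.pack("<e", -0.0).hex())    # '0080': fp16 value with only the sign bit set

buf = torch.full((8,), -0.0, dtype=torch.float16)
print(buf.eq(0.0).all().item())         # True: reads as zero in arithmetic
print(torch.signbit(buf).all().item())  # True: still distinguishable as "unwritten"
```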
vLLM AllReduce¶
- Performs an out-of-place all reduce.
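Out of place here means the input tensor is preserved and the reduced result is written to a separate buffer. A reference sketch of that contract using torch.distributed, not the custom kernel itself:

```python
import torch
import torch.distributed as dist

def all_reduce_out_of_place(inp: torch.Tensor) -> torch.Tensor:
    """Return the group-wide sum of `inp` without modifying `inp`."""
    out = inp.clone()                            # keep the input untouched
    dist.all_reduce(out, op=dist.ReduceOp.SUM)   # reduce into the copy
    return out
```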
MNNVL (Multi-Node NVLink)¶
Core Classes¶
- Wrapper class for McastDeviceMemory to facilitate PyTorch tensor creation.
Utility Functions¶
- Create a PyTorch tensor from a CUDA memory pointer using DLPack.
- A helper function that allocates memory on the device and copies data from the host to the device.
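A conceptual sketch of both helpers above, assuming CuPy is available to own a raw device allocation; the flashinfer versions work on bare CUDA pointers, but the zero-copy DLPack hand-off and the host-to-device copy are the same ideas.

```python
import cupy as cp
import torch

# (1) Wrap device memory owned elsewhere as a torch tensor via DLPack, zero-copy.
raw = cp.cuda.alloc(256 * 4)                             # raw CUDA allocation (1 KiB)
arr = cp.ndarray((256,), dtype=cp.float32, memptr=raw)   # typed view over the pointer
t = torch.from_dlpack(arr)                               # no copy; shares the memory

# (2) Allocate device memory and copy host data into it.
host = torch.arange(256, dtype=torch.float32)
dev = host.to("cuda")                                    # allocation + H2D memcpy
```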
TensorRT-LLM MNNVL AllReduce¶
- Performs a multi-node NVLink all-reduce operation across multiple GPUs.
- Performs an MNNVL two-shot all-reduce fused with RMSNorm.
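An unfused reference for what "all-reduce + RMSNorm" computes, as a hedged sketch: the fused MNNVL kernel produces the same result in a single pass over the data. The residual-add placement follows common fusion patterns and is an assumption here; check the per-function docs.

```python
import torch
import torch.distributed as dist

def allreduce_rmsnorm_reference(x, residual, weight, eps=1e-6):
    dist.all_reduce(x, op=dist.ReduceOp.SUM)          # 1. sum partial results across ranks
    h = x + residual                                  # 2. add the residual branch (assumed order)
    rms = h.float().pow(2).mean(-1, keepdim=True).add(eps).rsqrt()
    return (h.float() * rms).to(h.dtype) * weight     # 3. RMSNorm with learned scale
```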