flashinfer.sampling

Kernels for LLM sampling.

sampling_from_probs(probs, uniform_samples)

Fused GPU kernel for categorical sampling from probabilities.
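As a rough illustration of what drawing a token from `probs` with a caller-supplied uniform sample can look like, here is a CPU-side sketch using inverse-transform sampling. The helper name `sample_from_probs` and the single-row list inputs are assumptions made for this example; the fused kernel's internals are not described on this page and may differ.

```python
from itertools import accumulate


def sample_from_probs(probs, u):
    """Hypothetical reference: return the first index whose running CDF exceeds u."""
    for idx, cdf in enumerate(accumulate(probs)):
        if u < cdf:
            return idx
    # Rounding guard: fall back to the last token that has nonzero mass.
    return max(i for i, p in enumerate(probs) if p > 0)


probs = [0.1, 0.2, 0.7]
# One uniform sample in [0, 1) per draw, mirroring the uniform_samples argument.
tokens = [sample_from_probs(probs, u) for u in (0.05, 0.25, 0.95)]
```

Passing the uniform samples in explicitly keeps the draw deterministic for a given input, which is convenient when comparing a reference like this against a GPU result.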

top_p_sampling_from_probs(probs, ...[, ...])

Fused GPU kernel for top-p (nucleus) sampling from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.

top_k_sampling_from_probs(probs, ...[, ...])

Fused GPU kernel for top-k sampling from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.

min_p_sampling_from_probs(probs, ...[, ...])

Fused GPU kernel for min_p sampling from probabilities.
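For readers unfamiliar with min_p sampling, the sketch below shows the filtering step as it is commonly defined: tokens whose probability falls below `min_p` times the largest probability are dropped and the remainder is renormalized before sampling. The helper `min_p_filter` is hypothetical and works on a single Python list; it is an assumption about the semantics, not the library's kernel.

```python
def min_p_filter(probs, min_p):
    """Hypothetical reference: zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token passes the threshold
    return [p / total for p in kept]


# With min_p = 0.2 the threshold is 0.2 * 0.5 = 0.1, so only the last token is dropped.
filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], min_p=0.2)
```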

top_k_top_p_sampling_from_logits(probs, ...)

Fused GPU kernel for top-k and top-p sampling from pre-softmax logits.

top_k_top_p_sampling_from_probs(probs, ...)

Fused GPU kernel for top-k and top-p sampling from probabilities.

top_p_renorm_probs(probs, top_p)

Fused GPU kernel for renormalizing probabilities by top-p thresholding.
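To make top-p renormalization concrete, the following is a small sorting-based reference, assuming the usual nucleus definition: keep the highest-probability tokens until their cumulative mass reaches `top_p`, zero the rest, and rescale the kept mass to sum to 1. The function `top_p_renorm` is a hypothetical stand-in, and its tie-breaking and boundary handling may not match the GPU kernel.

```python
def top_p_renorm(probs, top_p):
    """Hypothetical reference: keep the smallest top set with mass >= top_p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [p / mass if i in kept else 0.0 for i, p in enumerate(probs)]


# Tokens 1 and 2 together carry about 0.9 >= 0.8, so token 0 is zeroed out.
renormed = top_p_renorm([0.1, 0.6, 0.3], top_p=0.8)
```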

top_k_renorm_probs(probs, top_k)

Fused GPU kernel for renormalizing probabilities by top-k thresholding.
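A minimal CPU sketch of top-k renormalization, assuming it means keeping the `top_k` most probable tokens and rescaling them to sum to 1. The name `top_k_renorm` is hypothetical, and ties at the cutoff are resolved here by sort order, which may differ from the kernel's behavior.

```python
def top_k_renorm(probs, top_k):
    """Hypothetical reference: keep the top_k largest entries and renormalize them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:top_k])
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]


# Only the two largest entries (indices 1 and 3) survive.
renormed_k = top_k_renorm([0.1, 0.4, 0.2, 0.3], top_k=2)
```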

top_k_mask_logits(logits, top_k)

Fused GPU kernel for masking logits by top-k thresholding.
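The logits variant can be pictured the same way, except that excluded entries are set to negative infinity so that a later softmax assigns them zero probability. The sketch below is an assumption about the semantics; `top_k_mask` is a hypothetical helper, and with tied values at the cutoff it keeps all of them.

```python
import math


def top_k_mask(logits, top_k):
    """Hypothetical reference: replace every logit below the k-th largest with -inf."""
    kth_largest = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= kth_largest else -math.inf for x in logits]


# The two largest logits (3.0 and 2.0) are kept; the others are masked.
masked = top_k_mask([2.0, -1.0, 0.5, 3.0], top_k=2)
```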

chain_speculative_sampling(draft_probs, ...)

Fused GPU kernel for speculative sampling for sequence generation (proposed in the paper Accelerating Large Language Model Decoding with Speculative Sampling), where the draft model generates a sequence (chain) of tokens for each request.
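The accept/reject rule from the cited paper can be sketched on the CPU for a single request as follows. Each draft token is accepted with probability min(1, target / draft); on the first rejection a replacement is drawn from the normalized positive part of (target - draft) and the chain stops; if every draft token is accepted, one extra token is drawn from the target model's next distribution. Apart from `draft_probs`, all names here (`target_probs`, `draft_tokens`, `accept_u`, `sample_u`) are assumptions made for this example and are not taken from the library's signature.

```python
from itertools import accumulate


def _sample(probs, u):
    """Inverse-transform draw from a list of probabilities using uniform u."""
    for idx, cdf in enumerate(accumulate(probs)):
        if u < cdf:
            return idx
    return len(probs) - 1


def chain_speculative_sample(draft_probs, draft_tokens, target_probs, accept_u, sample_u):
    """Hypothetical reference for one request.

    draft_probs:  n draft distributions, one per draft token
    target_probs: n + 1 target distributions (one extra for the bonus token)
    accept_u:     n uniform samples for the accept/reject tests
    sample_u:     one uniform sample for the final resampled or bonus token
    """
    output = []
    for t, token in enumerate(draft_tokens):
        ratio = target_probs[t][token] / draft_probs[t][token]
        if accept_u[t] < min(1.0, ratio):
            output.append(token)  # draft token accepted
            continue
        # Rejected: resample from the normalized residual max(0, target - draft).
        residual = [max(0.0, q - p) for q, p in zip(target_probs[t], draft_probs[t])]
        total = sum(residual)
        output.append(_sample([r / total for r in residual], sample_u))
        return output
    # All draft tokens accepted: draw one bonus token from the target model.
    output.append(_sample(target_probs[len(draft_tokens)], sample_u))
    return output


draft_probs = [[0.5, 0.5], [0.9, 0.1]]
target_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
rejected = chain_speculative_sample(draft_probs, [0, 0], target_probs, [0.9, 0.5], 0.5)
accepted = chain_speculative_sample(draft_probs, [0, 0], target_probs, [0.9, 0.1], 0.5)
```

In the first call the second draft token is rejected and replaced; in the second call both draft tokens pass and a bonus token is appended, so the output is one token longer than the draft chain.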