flashinfer.sampling

Kernels for LLM sampling.

`sampling_from_probs`(probs, uniform_samples)	Fused GPU kernel for categorical sampling from probabilities.
`top_p_sampling_from_probs`(probs, ...[, ...])	Fused GPU kernel for top-p sampling (nucleus sampling) from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.
`top_k_sampling_from_probs`(probs, ...[, ...])	Fused GPU kernel for top-k sampling from probabilities. This operator implements GPU-based rejection sampling without explicit sorting.
`min_p_sampling_from_probs`(probs, ...[, ...])	Fused GPU kernel for min-p sampling from probabilities.
`top_k_top_p_sampling_from_logits`(probs, ...)	Fused GPU kernel for top-k and top-p sampling from pre-softmax logits.
`top_k_top_p_sampling_from_probs`(probs, ...)	Fused GPU kernel for top-k and top-p sampling from probabilities.
`top_p_renorm_probs`(probs, top_p)	Fused GPU kernel for renormalizing probabilities by top-p thresholding.
`top_k_renorm_probs`(probs, top_k)	Fused GPU kernel for renormalizing probabilities by top-k thresholding.
`top_k_mask_logits`(logits, top_k)	Fused GPU kernel for masking logits by top-k thresholding.
`chain_speculative_sampling`(draft_probs, ...)	Fused GPU kernel for speculative sampling for sequence generation (proposed in the paper Accelerating Large Language Model Decoding with Speculative Sampling), where the draft model generates a sequence (chain) of tokens for each request.
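
The semantics behind the renormalization and sampling entries above can be illustrated with a small pure-Python reference for a single probability vector. This is only a sketch for reasoning about expected results, not flashinfer's implementation: the helper names `top_k_renorm`, `top_p_renorm`, and `sample_from_probs` are hypothetical, the code sorts explicitly (the fused kernels are described as avoiding explicit sorting), it assumes ties are broken by index, and it runs on the CPU over plain lists rather than batched GPU tensors.

```python
from typing import List


def _descending_order(probs: List[float]) -> List[int]:
    # Indices sorted by probability, largest first (ties keep index order).
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def _renormalize(probs: List[float], keep: set) -> List[float]:
    # Zero out everything outside `keep`, then rescale so the result sums to 1.
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_renorm(probs: List[float], top_k: int) -> List[float]:
    """Hypothetical reference: keep the top_k largest entries, renormalize."""
    keep = set(_descending_order(probs)[:top_k])
    return _renormalize(probs, keep)


def top_p_renorm(probs: List[float], top_p: float) -> List[float]:
    """Hypothetical reference: keep the smallest prefix of the sorted
    entries whose cumulative mass reaches top_p, then renormalize."""
    keep = set()
    cumulative = 0.0
    for i in _descending_order(probs):
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return _renormalize(probs, keep)


def sample_from_probs(probs: List[float], uniform_sample: float) -> int:
    """Hypothetical reference: inverse-CDF categorical sampling driven by
    one uniform sample in [0, 1)."""
    cumulative = 0.0
    last_nonzero = 0
    for i, p in enumerate(probs):
        if p > 0.0:
            last_nonzero = i
        cumulative += p
        if uniform_sample < cumulative:
            return i
    # Guard against floating-point rounding when the sum falls just short of 1.
    return last_nonzero


probs = [0.5, 0.25, 0.125, 0.125]
top_k_result = top_k_renorm(probs, 2)
top_p_result = top_p_renorm(probs, 0.7)
sampled = sample_from_probs(probs, 0.6)
```

With this input, both filters keep only the first two tokens, so their probabilities are rescaled by 1/0.75 and the remaining entries become zero. The uniform sample 0.6 falls in the second token's cumulative interval [0.5, 0.75), so `sampled` is index 1.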