Releases · ggml-org/llama.cpp

14 minutes ago

common: add download cancellation and temp file cleanup (#21813)

Signed-off-by: Adrien Gallouët angt@huggingface.co

macOS/iOS:

Linux:

Windows:

openEuler:

1 hour ago

server: Expose build_info in router mode (#21835)

1 hour ago

CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)

  • CUDA: Limit DeviceSegmentedSort to immediate mode

DeviceSegmentedSort is currently not capturable in a CUDA graph, so in
that case we have to fall back to the slower DeviceSegmentedRadixSort.
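The dispatch described above could be sketched roughly as follows (a minimal illustration with made-up names, not the actual CUDA backend code):

```cpp
// Illustrative dispatch sketch: while a stream is being captured into a
// CUDA graph, fall back to the graph-capturable DeviceSegmentedRadixSort;
// otherwise take the faster DeviceSegmentedSort in immediate mode.
enum class sort_path { segmented_sort, segmented_radix_sort };

sort_path pick_sort_path(bool stream_is_capturing) {
    // cub::DeviceSegmentedSort is not capturable in a CUDA graph,
    // so capture forces the radix-sort fallback.
    return stream_is_capturing ? sort_path::segmented_radix_sort
                               : sort_path::segmented_sort;
}
```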

Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s

DeviceSegmentedSort in immediate mode

ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s

  • Add test case for dispatch to DeviceSegmentedRadixSort

We currently lack a way to force graph mode in CUDA, so the callback is
patched to invoke ggml_backend_compare_graph_backend twice, ensuring each
test also runs in graph mode.
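The run-twice workaround can be sketched as a wrapper over the compare callback (illustrative names and signature, not the real test harness):

```cpp
#include <functional>

// With no direct flag to force graph mode, wrap the compare callback so
// every test executes twice; the second pass is expected to take the
// graph-mode path.
using compare_fn = std::function<bool(bool /*use_graph*/)>;

bool compare_both_modes(const compare_fn & cmp) {
    // first pass: immediate mode; second pass: graph mode
    return cmp(false) && cmp(true);
}
```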

3 hours ago

mtmd: use causal attn for gemma 4 audio (#21824)

8 hours ago

Remove extra conditional check on debug mode. (#21798)

10 hours ago

sycl: disable Q1_0 in backend and cleanup unused variables (#21807)

13 hours ago

mtmd: fix crash when sending image under 2x2 pixels (#21711)

14 hours ago

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (#19441)

  • add qwen3a
  • wip
  • vision ok
  • no more deepstack for audio
  • convert ASR model ok
  • qwen3 asr working
  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • nits
  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • fix bad merge
  • fix multi inheritance

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

mtmd: add Gemma 4 audio conformer encoder support (#21421)

  • mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:

  • 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Full self-attention with sinusoidal RPE and sliding window mask (24)
  • Logit softcapping at 50.0, ClippableLinear clamping
  • Output: 1024 → 1536 → RMSNorm → multimodal embedder
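The logit softcapping listed above is commonly implemented as cap * tanh(x / cap); a minimal sketch, assuming that standard formulation with the stated cap of 50.0:

```cpp
#include <cmath>

// Logit softcapping: smoothly bounds x to [-cap, cap] and is a
// near-identity for |x| << cap. The release notes state cap = 50.0.
float softcap(float x, float cap = 50.0f) {
    return cap * std::tanh(x / cap);
}
```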

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • Semicausal left-padding (frame_length/2 samples)
  • Frame count matched to PyTorch (unfold formula)
  • No pre-emphasis, no Whisper-style normalization
  • Mel cosine similarity vs PyTorch: 0.9998
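Two of the bullets above have well-known closed forms: the HTK mel scale is mel = 2595 * log10(1 + f / 700), and PyTorch's unfold yields 1 + (n - frame_length) / hop frames. A sketch under the assumption that these standard formulas are the ones used:

```cpp
#include <cmath>

// HTK mel scale conversion, as used for the 128 mel bins above.
float hz_to_mel_htk(float hz) {
    return 2595.0f * std::log10(1.0f + hz / 700.0f);
}

// PyTorch unfold-style frame count (hop size is a parameter here; the
// actual hop used by the preprocessor is not stated in the notes).
int frame_count(int n_samples, int frame_length, int hop) {
    return n_samples < frame_length ? 0 : 1 + (n_samples - frame_length) / hop;
}
```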

Key fixes:

  • Tensor loading dedup: prevent get_tensor() from creating duplicate
    entries in ctx_data. Fixed with a std::set guard.
  • ClippableLinear clamp_info loading moved after per-layer tensors.
  • Sliding window mask (24 positions) matching PyTorch context_size.
  • Skip Whisper normalization for Gemma4 mel output.
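The std::set guard from the first fix above can be sketched like this (container and struct names are illustrative, not the actual clip context):

```cpp
#include <set>
#include <string>
#include <vector>

// Record each tensor name at most once, so repeated get_tensor() calls
// cannot create duplicate entries in the backing store.
struct tensor_registry {
    std::vector<std::string> entries;   // stands in for ctx_data
    std::set<std::string>    seen;      // the dedup guard

    void record(const std::string & name) {
        // std::set::insert reports via .second whether the name was new
        if (seen.insert(name).second) {
            entries.push_back(name);
        }
    }
};
```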

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325

2 days ago

CUDA: skip compilation of superfluous FA kernels (#21768)
