Releases · ggml-org/llama.cpp

14 minutes ago

common: add download cancellation and temp file cleanup (#21813)

Signed-off-by: Adrien Gallouët angt@huggingface.co

macOS/iOS:

Linux:

Windows:

openEuler:

1 hour ago

server: Expose build_info in router mode (#21835)

1 hour ago

CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)

  • CUDA: Limit DeviceSegmentedSort to immediate mode

DeviceSegmentedSort is currently not capturable in a CUDA graph, so in
that case we have to fall back to the slower DeviceSegmentedRadixSort.
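The dispatch described above could be sketched roughly as follows (a minimal illustration with made-up names, not the actual CUDA backend code):

```cpp
// Illustrative dispatch sketch: while a stream is being captured into a
// CUDA graph, fall back to the graph-capturable DeviceSegmentedRadixSort;
// otherwise take the faster DeviceSegmentedSort in immediate mode.
enum class sort_path { segmented_sort, segmented_radix_sort };

sort_path pick_sort_path(bool stream_is_capturing) {
    // cub::DeviceSegmentedSort is not capturable in a CUDA graph,
    // so capture forces the radix-sort fallback.
    return stream_is_capturing ? sort_path::segmented_radix_sort
                               : sort_path::segmented_sort;
}
```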

Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s

DeviceSegmentedSort in immediate mode

ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s

  • Add test case for dispatch to DeviceSegmentedRadixSort

We currently lack a way to force graph mode in CUDA, so the callback is
patched to invoke ggml_backend_compare_graph_backend twice, ensuring each
test also runs in graph mode.
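The run-twice workaround can be sketched as a wrapper over the compare callback (illustrative names and signature, not the real test harness):

```cpp
#include <functional>

// With no direct flag to force graph mode, wrap the compare callback so
// every test executes twice; the second pass is expected to take the
// graph-mode path.
using compare_fn = std::function<bool(bool /*use_graph*/)>;

bool compare_both_modes(const compare_fn & cmp) {
    // first pass: immediate mode; second pass: graph mode
    return cmp(false) && cmp(true);
}
```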

3 hours ago

mtmd: use causal attn for gemma 4 audio (#21824)

8 hours ago

Remove extra conditional check on debug mode. (#21798)

10 hours ago

sycl: disable Q1_0 in backend and cleanup unused variables (#21807)

13 hours ago

mtmd: fix crash when sending image under 2x2 pixels (#21711)

14 hours ago

mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (#19441)

  • add qwen3a
  • wip
  • vision ok
  • no more deepstack for audio
  • convert ASR model ok
  • qwen3 asr working
  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • nits
  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • fix bad merge
  • fix multi inheritance

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

mtmd: add Gemma 4 audio conformer encoder support (#21421)

  • mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:

  • 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
  • Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
  • Full self-attention with sinusoidal RPE and sliding window mask (24)
  • Logit softcapping at 50.0, ClippableLinear clamping
  • Output: 1024 → 1536 → RMSNorm → multimodal embedder
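The logit softcapping listed above is commonly implemented as cap * tanh(x / cap); a minimal sketch, assuming that standard formulation with the stated cap of 50.0:

```cpp
#include <cmath>

// Logit softcapping: smoothly bounds x to [-cap, cap] and is a
// near-identity for |x| << cap. The release notes state cap = 50.0.
float softcap(float x, float cap = 50.0f) {
    return cap * std::tanh(x / cap);
}
```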

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):

  • HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  • Standard periodic Hann window (320 samples), zero-padded to FFT size
  • Semicausal left-padding (frame_length/2 samples)
  • Frame count matched to PyTorch (unfold formula)
  • No pre-emphasis, no Whisper-style normalization
  • Mel cosine similarity vs PyTorch: 0.9998
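Two of the bullets above have well-known closed forms: the HTK mel scale is mel = 2595 * log10(1 + f / 700), and PyTorch's unfold yields 1 + (n - frame_length) / hop frames. A sketch under the assumption that these standard formulas are the ones used:

```cpp
#include <cmath>

// HTK mel scale conversion, as used for the 128 mel bins above.
float hz_to_mel_htk(float hz) {
    return 2595.0f * std::log10(1.0f + hz / 700.0f);
}

// PyTorch unfold-style frame count (hop size is a parameter here; the
// actual hop used by the preprocessor is not stated in the notes).
int frame_count(int n_samples, int frame_length, int hop) {
    return n_samples < frame_length ? 0 : 1 + (n_samples - frame_length) / hop;
}
```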

Key fixes:

  • Tensor loading dedup: prevent get_tensor() from creating duplicate
    entries in ctx_data. Fixed with a std::set guard.
  • ClippableLinear clamp_info loading moved after per-layer tensors.
  • Sliding window mask (24 positions) matching PyTorch context_size.
  • Skip Whisper normalization for Gemma4 mel output.
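The std::set guard from the first fix above can be sketched like this (container and struct names are illustrative, not the actual clip context):

```cpp
#include <set>
#include <string>
#include <vector>

// Record each tensor name at most once, so repeated get_tensor() calls
// cannot create duplicate entries in the backing store.
struct tensor_registry {
    std::vector<std::string> entries;   // stands in for ctx_data
    std::set<std::string>    seen;      // the dedup guard

    void record(const std::string & name) {
        // std::set::insert reports via .second whether the name was new
        if (seen.insert(name).second) {
            entries.push_back(name);
        }
    }
};
```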

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325

2 days ago

CUDA: skip compilation of superfluous FA kernels (#21768)
