Releases · ggml-org/llama.cpp
14 minutes ago
common: add download cancellation and temp file cleanup (#21813)
Signed-off-by: Adrien Gallouët angt@huggingface.co
macOS/iOS:
Linux:
Windows:
openEuler:
1 hour ago
server: Expose build_info in router mode (#21835)
1 hour ago
CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)
- CUDA: Limit DeviceSegmentedSort to immediate mode
DeviceSegmentedSort is currently not capturable in a CUDA graph, so we
have to fall back to the slower DeviceSegmentedRadixSort in that case.
Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s
DeviceSegmentedSort in immediate mode
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s
- Add test case for dispatch to DeviceSegmentedRadixSort
We currently lack a way to force graph mode in CUDA, so the test
callback is patched to invoke ggml_backend_compare_graph_backend twice,
forcing each test to also run in graph mode.
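The throughput column above can be reproduced from the kB/run and us/run columns. A minimal sketch, assuming the table's "kB" and "GB/s" are binary units (KiB and GiB/s), which is what makes the numbers line up:

```python
def gb_per_s(kb_per_run: float, us_per_run: float) -> float:
    """Convert the table's kB/run and us/run columns to GB/s (binary units)."""
    bytes_per_run = kb_per_run * 1024       # KiB -> bytes
    seconds = us_per_run * 1e-6             # us -> s
    return bytes_per_run / seconds / 2**30  # bytes/s -> GiB/s

# First row of the DeviceSegmentedRadixSort table: 8192 kB in 105.94 us
print(f"{gb_per_s(8192, 105.94):.2f} GB/s")  # close to the reported 73.75 GB/s
```

The small residual difference from the printed table comes from the us/run values themselves being rounded.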
3 hours ago
mtmd: use causal attn for gemma 4 audio (#21824)
8 hours ago
Remove extra conditional check on debug mode. (#21798)
10 hours ago
sycl: disable Q1_0 in backend and cleanup unused variables (#21807)
13 hours ago
mtmd: fix crash when sending image under 2x2 pixels (#21711)
14 hours ago
mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (#19441)
- add qwen3a
- wip
- vision ok
- no more deepstack for audio
- convert ASR model ok
- qwen3 asr working
- Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- nits
- Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- fix bad merge
- fix multi inheritance
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
mtmd: add Gemma 4 audio conformer encoder support (#21421)
- mtmd: add Gemma 4 audio conformer encoder support
Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.
Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder
Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
entries in ctx_data; fixed with a std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.
Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).
Ref: #21325
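The mel front end described above can be sketched in isolation. The constants 2595 and 700 are the standard HTK mel formula, and the frame count follows PyTorch's unfold semantics as the commit states; the 160-sample hop used in the example is an assumption not spelled out here:

```python
import math

def hz_to_mel_htk(hz: float) -> float:
    """HTK mel scale: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def num_frames_unfold(num_samples: int, frame_length: int, hop: int) -> int:
    """Frame count per PyTorch's unfold: floor((n - frame_length)/hop) + 1,
    after the semicausal left-padding of frame_length // 2 samples."""
    padded = num_samples + frame_length // 2
    return (padded - frame_length) // hop + 1

print(round(hz_to_mel_htk(700.0), 2))          # 2595 * log10(2)
print(num_frames_unfold(16000, 320, 160))      # 1 s of 16 kHz audio, 320-sample window
```

This only covers the mel-scale mapping and frame accounting; the magnitude STFT, mel_floor, and the 128-bin filter bank are as described in the bullet list above.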
2 days ago
CUDA: skip compilation of superfluous FA kernels (#21768)