whisper.cpp

Author	SHA1	Message	Date
HimariO	e22d38e4f2	llama : add Qwen2VL support + multimodal RoPE (llama/10361) * Barebone Qwen2VL LLM converter * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, outdated func args * update `llama_hparams` * update to keep upstream changes * resolve linter, test errors * add makefile entry, update special image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixes * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix trailing whitespace * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remove old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-18 12:52:16 +02:00
Daniel Bevenius	e0be0de1ee	ggml : add check for grad_accs (ggml/1046) * ggml : add check for grad_accs This commit adds a check for grad_accs in ggml_graph_get_grad and ggml_graph_get_grad_acc functions. This is necessary to avoid segfaults when grad_accs is not initialized. The motivation for this change is that I find it nice to be able to print out a computation graph using ggml_graph_print but this function segfaults when grad_accs is not initialized: ```console (gdb) p g1 $2 = (ggml_cgraph *) 0x7ffff66004b0 (gdb) p *g1 $3 = {size = 2048, n_nodes = 1, n_leafs = 2, nodes = 0x7ffff6600500, grads = 0x0, grad_accs = 0x0, leafs = 0x7ffff6604500, visited_hash_set = {size = 4099, used = 0x7ffff6610518, keys = 0x7ffff6608500}, order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT} (gdb) p ggml_graph_print(g1) === GRAPH === n_nodes = 1 Program received signal SIGSEGV, Segmentation fault. 0x0000555555579775 in ggml_graph_get_grad (cgraph=0x7ffff66004b0,node=0x7ffff6600340) at /ggml/ggml/src/ggml.c:5990 5990 return igrad != GGML_HASHSET_FULL && ggml_bitset_get(cgraph->visited_hash_set.used, igrad) ? cgraph->grads[igrad] : NULL; ``` * squash! ggml : add check for grad_accs Fix the check in ggml_graph_get_grad. The check was incorrectly using cgraph->grad_accs instead of cgraph->grads.	2024-12-18 12:52:16 +02:00
Djip007	e990d1b791	ggml : refactor online repacking (llama/10446) * rename ggml-cpu-aarch64.c to .cpp * reformat extra cpu backend. - clean Q4_0_N_M and IQ4_0_N_M - remove from "file" tensor type - allow only with dynamic repack - extract cpu extra bufts and convert to C++ - hbm - "aarch64" - more generic use of extra buffer - generalise extra_supports_op - new API for "cpu-accel": - amx - aarch64 * clang-format * Clean Q4_0_N_M ref Enable restrict on C++ * add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack * added/corrected control on tensor size for Q4 repacking. * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add debug logs on repacks. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-18 12:52:16 +02:00
PAB	7895d39508	ggml : add `GGML_PAD_REFLECT_1D` operation (ggml/1034) * ggml_pad_reflect_1d defined in header * implemented on CPU * called the forward pass * impl Metal kernel * added Metal kernel * added OP_PAD_REFLECT_1D in test-backend-ops.cpp * add test-pad-reflect-1d test case * test case support multiple backend	2024-12-08 20:14:35 +02:00
Shupei Fan	330273901f	ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541) * ggml-cpu: support IQ4_NL_4_4 by runtime repack * ggml-cpu: add __ARM_FEATURE_DOTPROD guard	2024-12-08 20:14:35 +02:00
Diego Devesa	77e3e4a090	ggml : add support for dynamic loading of backends (llama/10469) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-08 20:14:35 +02:00
Diego Devesa	2a4b5c9d7e	cuda : optimize argmax (llama/10441) * cuda : optimize argmax * remove unused parameter ggml-ci * fixup : use full warps ggml-ci * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * fix ub * ggml : check ne00 <= INT32_MAX in argmax and argsort --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-12-08 20:14:35 +02:00
Johannes Gäßler	98f9916c9f	ggml-opt: fix data corruption (ggml/1022)	2024-12-08 20:14:35 +02:00
Georgi Gerganov	75670ae673	ggml : fix compile warnings (llama/0) ggml-ci	2024-11-20 21:00:08 +02:00
Johannes Gäßler	c9541741e6	ggml: new optimization interface (ggml/988) * ggml: new optimization interface remove test2.c, test3.c store adamw params in tensor move grads from tensor to graph * avoid segfault upon API misuse * add ggml-opt.h to public headers * remove dependence of ggml-opt.cpp on ggml-cpu.h	2024-11-20 21:00:08 +02:00
slaren	7e86030d4d	ggml : fix some build issues	2024-11-20 21:00:08 +02:00
Diego Devesa	746bf2596f	ggml : build backends as libraries (llama/10256) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-20 21:00:08 +02:00
Georgi Gerganov	d0b8335789	metal : optimize FA kernels (llama/10171) * ggml : add ggml_flash_attn_ext_get_prec * metal : use F16 precision in FA kernels ggml-ci * metal : minor clean-up * metal : compile-guard bf16 FA kernels ggml-ci * build : remove obsolete compile flag [no ci] * metal : prevent int overflows [no ci] * cuda : disable BF16 FA ggml-ci * metal : fix BF16 requirement for FA kernels ggml-ci * make : clean-up [no ci]	2024-11-15 15:21:04 +02:00
Zhiyuan Li	42398f13b0	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/SYCL Acceleration (llama/10133) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and transfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more inline with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-15 15:21:04 +02:00
Georgi Gerganov	d111a0987e	ggml : adjust is_first_call init value (llama/10193) ggml-ci	2024-11-15 15:21:04 +02:00
Diego Devesa	f69c8b6f1b	ggml : fix arch check in bf16_to_fp32 (llama/10164)	2024-11-15 15:21:04 +02:00
Diego Devesa	25da30bd60	ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (llama/10167)	2024-11-15 15:21:04 +02:00
Diego Devesa	9c817edb48	ggml : move CPU backend to a separate file (llama/10144)	2024-11-15 15:21:04 +02:00
Georgi Gerganov	0665168ef3	ggml : remove ggml_scratch (llama/10121) ggml-ci	2024-11-15 15:21:04 +02:00
Diego Devesa	3e231ab9cc	llama : fix buffer checks for mamba and rwk (llama/10111) * llama : fix buffer checks for mamba and rwk * llama : fix missing worst case flag during reserve * cuda : fix supports_op for norm * disable sched SET_CAUSE	2024-11-15 15:21:04 +02:00
Diego Devesa	371bfaca8c	ggml : check tensor name lengths in gguf files (llama/10100)	2024-11-15 15:21:04 +02:00
Diego Devesa	63a4e09a0f	ggml : fix memory leaks when loading invalid gguf files (llama/10094) * ggml : fix gguf string leak when reading kv pairs fails * ggml : avoid crashing with GGML_ABORT when the KV has an invalid type * ggml : avoid crashing on failed memory allocations when loading a gguf file	2024-11-15 15:21:04 +02:00
Diego Devesa	1d48457aa6	llama : refactor model loader with backend registry (llama/10026)	2024-11-15 15:21:04 +02:00
Johannes Gäßler	ab0385f43b	CUDA: fix MMQ for non-contiguous src0, add tests (llama/10021) * CUDA: fix MMQ for non-contiguous src0, add tests * revise test code	2024-11-01 10:19:05 +02:00
Georgi Gerganov	741c138aa1	ggml : add asserts for type conversion in fattn kernels (llama/9971) ggml-ci	2024-11-01 10:19:05 +02:00
Ma Mingfei	e1936eb2a5	add amx kernel for gemm (llama/8998) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-11-01 10:19:05 +02:00
Gilad S	ff5a838099	fix: use `vm_allocate` to allocate CPU backend buffer on macOS (llama/9875) * fix: use `vm_allocate` to allocate CPU backend buffer on macOS * fix: switch to `posix_memalign` to keep existing `free()` usages work * feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS * style: formatting * fix: move const outside of `#ifndef` * style: formatting * fix: unused var * fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h` * fix: unused var * fix: page align to `GGUF_DEFAULT_ALIGNMENT` * fix: page align to `TENSOR_ALIGNMENT` * fix: convert `TENSOR_ALIGNMENT` to a macro * fix: increase page size to `32` on iOS * fix: iOS page size * fix: `hbw_posix_memalign` alignment	2024-11-01 10:19:05 +02:00
Diego Devesa	1531259b2c	ggml : fix BLAS with unsupported types (llama/9775) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-11-01 10:19:05 +02:00
Georgi Gerganov	aa037a60f3	ggml : alloc ggml_contexts on the heap (#2525) * whisper : reduce ggml_context usage * ggml : allocate contexts on the heap (v2) * ggml : aligned malloc -> malloc	2024-10-31 22:00:09 +02:00
Diego Devesa	cf977670e6	ggml-backend : add device and backend reg interfaces (llama/9707) Also: - metal : fix compute pass descriptor autorelease crash - ggml-backend : add device description to CPU backend - ggml: unify backend logging mechanism	2024-10-05 15:23:51 +03:00
Diego Devesa	1acfadb721	ggml-backend : add device and backend reg interfaces (llama/9707) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-05 15:23:51 +03:00
Johannes Gäßler	936cf3beb7	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	bc92c2f8f0	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	5e9d6baa48	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-03 12:22:17 +03:00
Borislav Stanimirov	31fdf05fda	ggml : fix ggml_cast (ggml/973)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	0ac6666cd2	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-03 12:22:17 +03:00
Georgi Gerganov	6c91da80b8	ggml : define missing HWCAP flags (llama/9684) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-10-03 12:22:17 +03:00
Dan Johansson	c245168ba3	ggml : add run-time detection of neon, i8mm and sve (llama/9331) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-10-03 12:22:17 +03:00
Max Krasnyansky	02285dff81	threads: fix msvc build without openmp (llama/9615) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-24 19:45:08 +03:00
Max Krasnyansky	08e8414f27	threads: improve ggml_barrier scaling with large number of threads (llama/9598) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended.	2024-09-24 19:45:08 +03:00
Georgi Gerganov	54e5095765	examples : adapt to ggml.h changes (ggml/0) ggml-ci	2024-09-24 19:45:08 +03:00
Georgi Gerganov	34291099fb	ggml : refactoring (llama/#0) - d6a04f87 - 23e0d70b	2024-09-24 19:45:08 +03:00
slaren	138e20b697	ggml : fix n_threads_cur initialization with one thread (llama/9538) * ggml : fix n_threads_cur initialization with one thread * Update ggml/src/ggml.c --------- Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-09-24 19:45:08 +03:00
Max Krasnyansky	a8d9abfa22	threadpool : skip polling for unused threads (llama/9461) * threadpool: skip polling for unused threads Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written from one thread and read from other threads (not a race condition). * threadpool: further simplify and improve ggml_barrier Avoid using strict memory order while polling, yet make sure that all threads go through full memory barrier (memory fence) on ggml_barrier entrance and exit. * threads: add simple barrier test This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead. * threadpool: improve thread sync for new-graphs Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient, once the new graph is detected we do full fence using read-modify-write with strict memory order. * threadpool: improve abort handling Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it. Instead introduce atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it add an explicit atomic_load for n_threads_cur for consistency. * test-barrier: release threadpool before releasing the context fixes use-after-free detected by gcc thread-sanitizer on x86-64 for some reason llvm sanitizer is not detecting this issue.	2024-09-24 19:45:08 +03:00
Yuri Khrustalev	4f4687cb74	ggml : ggml_type_name return "NONE" for invalid values (llama/9458) When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.	2024-09-24 19:45:08 +03:00
Ahmad Tameem	3f8f8a78a2	riscv : modify Makefile and add a RISCV_VECT to print log info (llama/9442) - Added ggml_cpu_has_riscv_v() in GGML to print system info in log - Modified Makefile to only use flag when cross compiling for RISC-V	2024-09-24 19:45:08 +03:00
Radoslav Gerganov	0677293503	rpc : fix segfault with nkvo (llama/9389) * rpc : fix nkvo * rpc : buf_size must not be static ref: #9337 --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
Johannes Gäßler	c7515b0995	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	253ce30004	examples : add null threadpool args where needed (ggml/0) ggml-ci	2024-09-24 19:45:08 +03:00
slaren	d37fd275fd	ggml : always check bounds on get_rows operations (llama/9354)	2024-09-24 19:45:08 +03:00
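
Note on the `ggml_type_name` entry above (llama/9458): the message says the function now returns "NONE" for invalid values, which presumably amounts to bounds-checking the type value before it indexes a name table. A minimal sketch of that pattern in C follows. The `demo_type` enum, the `demo_type_names` table and the `demo_type_name` function are hypothetical stand-ins for illustration, not the actual ggml definitions or the code from that commit.

```c
/* Hypothetical stand-in for a type enum; NOT the real ggml_type. */
enum demo_type {
    DEMO_TYPE_F32 = 0,
    DEMO_TYPE_F16 = 1,
    DEMO_TYPE_COUNT /* number of valid types; itself not a valid type */
};

/* Name table indexed by demo_type (assumed layout for this sketch). */
static const char * const demo_type_names[DEMO_TYPE_COUNT] = {
    "f32",
    "f16",
};

/* Return a printable name, or "NONE" when the value is out of range.
 * Taking an int lets callers pass unset or corrupted values safely. */
const char * demo_type_name(int type) {
    if (type < 0 || type >= DEMO_TYPE_COUNT) {
        return "NONE";
    }
    return demo_type_names[type];
}
```

With this shape, a caller that prints an unset or out-of-range type gets a fixed string back instead of reading past the end of the table.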

91 Commits