whisper.cpp/ggml-cuda
slaren 1dce94cf26
ggml : mul_mat_id use the same tensor for all the experts (llama/6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-07 16:15:57 +03:00
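
The "non-pow-2" items in the commit message above concern argsort, which a bitonic sorting network only handles directly when the row length is a power of two. One common workaround, shown here as a minimal CPU-side sketch rather than the actual kernel in argsort.cu, is to pad the index array up to the next power of two and order the padding slots after every real element. The names `next_pow2` and `bitonic_argsort` are hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Smallest power of two that is >= n (for n >= 1).
static std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Ascending argsort of x using a bitonic network over a padded index array.
// Illustrative only: a GPU kernel would run the inner loop across threads.
std::vector<int> bitonic_argsort(const std::vector<float>& x) {
    const std::size_t n = x.size();
    if (n == 0) {
        return {};
    }
    const std::size_t n_pad = next_pow2(n);

    std::vector<int> idx(n_pad);
    for (std::size_t i = 0; i < n_pad; ++i) {
        idx[i] = static_cast<int>(i);
    }

    // True if index a must come after index b in ascending order.
    // Indices >= n are padding: they compare greater than every real
    // element and equal to each other.
    auto after = [&](int a, int b) {
        const bool a_pad = static_cast<std::size_t>(a) >= n;
        const bool b_pad = static_cast<std::size_t>(b) >= n;
        if (a_pad) return !b_pad;
        if (b_pad) return false;
        return x[a] > x[b];
    };

    for (std::size_t k = 2; k <= n_pad; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < n_pad; ++i) {
                const std::size_t l = i ^ j;
                if (l <= i) continue;                // visit each pair once
                const bool ascending = (i & k) == 0; // direction of this block
                const bool out_of_order = ascending ? after(idx[i], idx[l])
                                                    : after(idx[l], idx[i]);
                if (out_of_order) {
                    std::swap(idx[i], idx[l]);
                }
            }
        }
    }

    idx.resize(n); // padding indices all sorted to the tail; drop them
    return idx;
}
```

Calling `bitonic_argsort` on a row of five values pads the index array to eight slots internally and returns five indices. The sketch assumes the input contains no NaN values, since the comparator would not define a consistent ordering for them.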
acc.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
acc.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
alibi.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
alibi.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
arange.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
arange.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
argsort.cu ggml : mul_mat_id use the same tensor for all the experts (llama/6387) 2024-04-07 16:15:57 +03:00
argsort.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
binbcast.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
binbcast.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
clamp.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
clamp.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
common.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
concat.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
concat.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
convert.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
convert.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
cpy.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
cpy.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
dequantize.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
diagmask.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
diagmask.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
dmmv.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
dmmv.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
getrows.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
getrows.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
im2col.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
im2col.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
mmq.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
mmq.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
mmvq.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
mmvq.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
norm.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
norm.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
pad.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
pad.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
pool2d.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
pool2d.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
quantize.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
quantize.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
rope.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
rope.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
scale.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
scale.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
softmax.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
softmax.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
sumrows.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
sumrows.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
tsembd.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
tsembd.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
unary.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
unary.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
upscale.cu sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
upscale.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00
vecdotq.cuh sync : ggml (#2001) 2024-03-27 18:55:10 +02:00