Johannes Gäßler | b17ba2815b | 2024-06-16 18:19:48 +03:00
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes

Johannes Gäßler | a99e213a82 | 2024-06-16 18:19:48 +03:00
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)

Johannes Gäßler | 7483d2b61c | 2024-06-16 18:19:48 +03:00
CUDA: use tensor cores for MMQ (llama/7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early

Johannes Gäßler | 760497e1ab | 2024-06-16 18:19:48 +03:00
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)

Johannes Gäßler | e08c62149b | 2024-06-16 18:19:48 +03:00
CUDA: refactor mmq, dmmv, mmvq (llama/7716)
* CUDA: refactor mmq, dmmv, mmvq
* fix out-of-bounds write
* struct for qk, qr, qi
* fix cmake build
* mmq_type_traits

Georgi Gerganov | 2948c740a2 | 2024-03-27 18:55:10 +02:00
sync : ggml (#2001)
* sync : update scripts
* sync : ggml
* talk-llama : sync llama.cpp
* make : WHISPER_CUBLAS -> WHISPER_CUDA
* ci : try to fix sycl build
* talk-llama : fix make build