fft_out needs to be twice the frame_size, not the frame_step. It is resized in fft() anyway, but this change prevents an unnecessary reallocation.
n_fft must match the mel filter size, so it is best not to calculate it from the frame size.
We only need to get the magnitudes for half the spectrum since the other half is a mirror and not used in the mel filter loop later.
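
A minimal sketch of the two points above, not the code from this change. It assumes the transform output is stored as interleaved (re, im) pairs, which is why the buffer holds twice as many floats as the frame has samples, and it assumes squared magnitudes are what gets passed on to the mel filters. The names dft_interleaved and half_power_spectrum are hypothetical, and a naive DFT stands in for the real fft().

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT of a real frame, written as interleaved (re, im) pairs.
// The output therefore holds 2 * frame.size() floats.
std::vector<float> dft_interleaved(const std::vector<float> & frame) {
    const std::size_t n = frame.size();
    const double pi = 3.14159265358979323846;
    std::vector<float> out(2 * n, 0.0f);
    for (std::size_t k = 0; k < n; ++k) {
        double re = 0.0;
        double im = 0.0;
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = 2.0 * pi * double(k) * double(t) / double(n);
            re += frame[t] * std::cos(angle);
            im -= frame[t] * std::sin(angle);
        }
        out[2 * k + 0] = float(re);
        out[2 * k + 1] = float(im);
    }
    return out;
}

// Squared magnitudes for bins 0 .. N/2 only. For real input the remaining
// bins mirror these, so they carry no extra information.
std::vector<float> half_power_spectrum(const std::vector<float> & fft_out) {
    assert(fft_out.size() % 2 == 0);
    const std::size_t n      = fft_out.size() / 2; // number of complex bins
    const std::size_t n_bins = n / 2 + 1;          // non-mirrored half
    std::vector<float> power(n_bins);
    for (std::size_t j = 0; j < n_bins; ++j) {
        const float re = fft_out[2 * j + 0];
        const float im = fft_out[2 * j + 1];
        power[j] = re * re + im * im;
    }
    return power;
}
```

For a frame of N samples this yields N/2 + 1 values, which is the count the filter bank would have to agree with under these assumptions.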
* Allow a regular expression to describe tokens to suppress.
Example: --suppress-tokens-re "[,\.]|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens.
Technique inspired by https://github.com/openai/whisper/discussions/1041
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Blind change to fix Java test.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
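
A hedged sketch of how regex-based suppression like --suppress-tokens-re could work, assuming the pattern must match a token's entire text and that suppression means forcing the logit to negative infinity. The function name and the toy vocabulary are hypothetical; this is not the code from this change.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <regex>
#include <string>
#include <vector>

// Set the logit of every token whose full text matches `re` to -inf,
// so that token can never be sampled.
// Assumption: vocab[id] is the text of token `id`.
void suppress_matching_tokens(const std::vector<std::string> & vocab,
                              const std::regex & re,
                              std::vector<float> & logits) {
    assert(vocab.size() == logits.size());
    for (std::size_t id = 0; id < vocab.size(); ++id) {
        // regex_match requires the whole token text to match,
        // unlike regex_search which accepts a match anywhere.
        if (std::regex_match(vocab[id], re)) {
            logits[id] = -std::numeric_limits<float>::infinity();
        }
    }
}
```

With the example pattern, tokens such as ",", "." and " 42" would be suppressed, while " hello" or "a1" would be left alone because only part of "a1" is numeric.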
* whisper.cpp: impl dtw algo
* WIP: producing and placing DTW timestamps on tokens
* Fix compile and assertion errors. Attempt to DTW timestamp with single_segment=false.
* Fix mistake causing incorrect alignment of dtw timestamps
* implement N_TOP_MOST and CUSTOM alignment heads setting
* whisper: fix typo on alignment heads enum
* Fix issues related to changes in whisper.cpp
* Fixed excessive memory use when using DTW timestamps. Other minor fixes to DTW timestamping function
* decoder: save cross QKs only if requested
* Calling median filter with ggml_map_custom1
* Reimpl aheads n_top_most and custom. Sanity checks on chosen aheads
* Copying cross QKs from decoder backend correctly
* dtw: cleanup
* Fix incorrect n_frames passed to dtw when near end of audio
* Fix aheads_masks_init for backend != CPU
* whisper : minor style
* main : add dtw (wip)
* whisper: fix invalid memory access in aheads_masks_init
* main : add dtw (cont)
* whisper : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
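
For reference, a minimal sketch of textbook dynamic time warping over a cost matrix. It is not the implementation added here. The assumption is that rows stand for tokens and columns for audio frames, with the cost derived from the cross-attention weights of the alignment heads; the name dtw_path and the Matrix alias are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Return the minimum-cost monotonic path from (0, 0) to (rows-1, cols-1)
// as (row, col) pairs, allowing diagonal, vertical and horizontal steps.
std::vector<std::pair<int, int>> dtw_path(const Matrix & cost) {
    assert(!cost.empty() && !cost[0].empty());
    const int n = (int) cost.size();
    const int m = (int) cost[0].size();
    const float inf = std::numeric_limits<float>::infinity();

    // acc[i][j]: cheapest accumulated cost of a path ending at cell (i-1, j-1).
    // Row 0 and column 0 are an infinite border, except for the start cell.
    Matrix acc(n + 1, std::vector<float>(m + 1, inf));
    acc[0][0] = 0.0f;
    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            const float best = std::min({acc[i-1][j-1], acc[i-1][j], acc[i][j-1]});
            acc[i][j] = cost[i-1][j-1] + best;
        }
    }

    // Walk back from the end, always moving to the cheapest predecessor.
    std::vector<std::pair<int, int>> path;
    int i = n;
    int j = m;
    while (i > 0 && j > 0) {
        path.emplace_back(i - 1, j - 1);
        const float diag = acc[i-1][j-1];
        const float up   = acc[i-1][j];
        const float left = acc[i][j-1];
        if (diag <= up && diag <= left) { --i; --j; }
        else if (up <= left)            { --i; }
        else                            { --j; }
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

Under the stated assumption, the column at which the path first reaches a given row would give the frame where that token starts, which can then be converted to a timestamp.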
As of #1486, whisper.cpp uses a unified KV cache with KQ masking.
As a result, depending on their location in the batch,
identical sequences in a batch can have slightly different outputs
due to floating point rounding errors during reduction.
See the discussion in #1941 for more details.
The beam search code used "has identical sum of log probabilities"
as a shorthand for "is an identical token sequence". However, per above,
identical tokens do not necessarily result in identical probabilities.
Instead, explicitly compare on sequences.
The comparison is linear in the sequence length when the sequences are identical,
but the lengths are always small and the comparisons are cheap.
This increases diversity during beam search.
This improves output quality for some short samples I've been working
with, at no detectable performance cost.
I haven't checked against larger corpuses.
Fixes #1941
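
A small illustration of the difference between the two checks, using a hypothetical beam_candidate struct rather than the actual whisper.cpp types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a beam search candidate.
struct beam_candidate {
    std::vector<int> tokens;       // token ids decoded so far
    double           sum_logprobs; // accumulated log probability
};

// Old shorthand: treat equal scores as "same sequence".
// Rounding differences between batch slots can make this miss duplicates.
bool same_by_score(const beam_candidate & a, const beam_candidate & b) {
    return a.sum_logprobs == b.sum_logprobs;
}

// Explicit check: compare the token ids themselves.
// std::vector::operator== compares sizes first, then elements in order.
bool same_by_tokens(const beam_candidate & a, const beam_candidate & b) {
    return a.tokens == b.tokens;
}
```

Two candidates with tokens {1, 2, 3} and scores -1.25 and -1.2500001 count as different under the score check but as identical under the token check, which is the case described above.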
All else being equal, this encourages the beam candidate
selection to re-use the same decoder, which slightly
reduces the cache size.
I wouldn't expect it to make much of a performance difference,
but it helps when debug printing the cache and beam.
Added as part of understanding #1941.
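
One way such a preference could be expressed is a tie-break on the decoder index when ranking candidates. This is only a sketch: the message above does not spell out the mechanism, and the struct and function names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a beam search candidate.
struct beam_candidate {
    int    decoder_idx;  // decoder that produced this candidate
    double sum_logprobs; // accumulated log probability
};

// Rank by score, best first. When scores tie, prefer the lower decoder
// index so the ordering is deterministic and stays with the same decoder.
void rank_candidates(std::vector<beam_candidate> & cands) {
    for (const beam_candidate & c : cands) {
        assert(c.decoder_idx >= 0);
    }
    std::sort(cands.begin(), cands.end(),
              [](const beam_candidate & a, const beam_candidate & b) {
                  if (a.sum_logprobs != b.sum_logprobs) {
                      return a.sum_logprobs > b.sum_logprobs;
                  }
                  return a.decoder_idx < b.decoder_idx;
              });
}
```

A stable ordering like this also makes debug printouts of the cache and beam easier to compare between runs.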
* Update whisper.h
add whisper_lang_fullstr to retrieve the full language name
* Update whisper.cpp
add whisper_lang_fullstr to return the full language name
* fullstr -> str_full
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>