feat!: major performance & accuracy improvements in speech-to-text module #1132
IgorSwat wants to merge 8 commits into
Conversation
…ware-mansion/react-native-executorch into @is/speech-to-text-ultimate
| WHISPER_SMALL_EN, | ||
| TranscriptionResult, | ||
| SpeechToTextProps, | ||
| WHISPER_SMALL_EN_COREML, |
Why is this added after TranscriptionResult and SpeechToTextProps? ;p
| "react": "19.2.5", | ||
| "react-native": "0.83.4", | ||
| "react-native-audio-api": "0.12.0", | ||
| "react-native-audio-api": "0.11.5", |
hey, why is that? We virtually never want to downgrade packages in demo apps.
audio-api 0.12.0 causes build failures on iOS, and I think it's the same issue @benITo47 had when testing the 1.2.0 binaries some time ago.
Could you please describe when you get these failures? I don't get any on the iOS simulator.
Yeah, I'm even on 26.4.
| namespace rnexecutorch::models::speech_to_text { | ||
| /** | ||
| * Basically a different representation of token, |
| * Basically a different representation of token, | |
| * Different representation of token, |
| for (size_t i = 1; i < sequenceIds.size(); ++i) { | ||
| std::span<uint64_t> single(sequenceIds.data() + i, 1); | ||
| logitsTensor = this->decode(single, encoderFeatures, startPos); | ||
| ++startPos; |
| for (size_t i = 1; i < sequenceIds.size(); ++i) { | |
| std::span<uint64_t> single(sequenceIds.data() + i, 1); | |
| logitsTensor = this->decode(single, encoderFeatures, startPos); | |
| ++startPos; | |
| for (size_t i = 1; i < sequenceIds.size(); ++i, ++startPos) { | |
| std::span<uint64_t> single(sequenceIds.data() + i, 1); | |
| logitsTensor = this->decode(single, encoderFeatures, startPos); |
| return {.committed = move_to_vector(committed), | ||
| .nonCommitted = move_to_vector(nonCommitted)}; | ||
| // Return the results |
| // Return the results |
| // Because of step 1, we know that if the last EOS exist in eos_, | ||
| // then it must be the last entry. | ||
| if (eos_.empty() || eos_.back().position != lastEosIndex) { | ||
| // Register last EOS entry |
| // Register last EOS entry |
| std::vector<Segment> transcriptions = asr_->transcribe(input, options); | ||
| // Flatten segments into a single word sequence. | ||
| // This is basically our 'nonCommitted' part for now. |
| // This is basically our 'nonCommitted' part for now. | |
| // This is our 'nonCommitted' part for now. |
| return std::vector<Word>(std::make_move_iterator(container.begin()), | ||
| std::make_move_iterator(container.end())); | ||
| OnlineASR::OnlineASR(const ASR *asr) : asr_(asr) { | ||
| // Reserve an expected amount of memory for audio buffer. |
| // Reserve an expected amount of memory for audio buffer. |
| // Last-tick committed delta + whatever never made it past the commit | ||
| // threshold. | ||
| std::vector<Word> residual = std::move(result.committed); |
| std::vector<Word> residual = std::move(result.committed); | |
| std::vector<Word> residual{std::move(result.committed)}; |
| @@ -1325,14 +1338,17 @@ | |||
| STYLE_TRANSFER_UDNIE, | |||
OK, so from 0.9 we will effectively drop support for the original models at our URLs (neither XNNPACK nor Core ML), right?
I don't get it - the original models are XNNPACK ones, so they will still be available.
WHISPER_TINY_EN_QUANTIZED is quantized XNNPACK, and WHISPER_TINY_EN is, I guess, full-precision XNNPACK; since there is no WHISPER_TINY_EN_QUANTIZED, we dropped something - what exactly?
Well, I just think the quantized models are pointless - they weigh only a little less than the standard float32 models, they do not bring any significant inference speed-up compared to baseline, and no one really downloads them on HF. I believe their existence just introduces unnecessary noise to the module.
I see, I'm OK with removing some of those. Now the only question is what we should remove: quantized or non-quantized. If they are just a bit smaller and just a bit faster, they are still better than the original ones, aren't they?
Well, the float32 baseline models are well tested and surely at least as accurate as the quantized ones (and probably more accurate). If the performance difference is minimal (or frankly nonexistent), then I don't like the idea of risking accuracy drops for some types of inputs.
Sure thing, that explanation is absolutely fine for me, I mostly asked because I wanted to be on the same page :))
Also, if this PR adds a breaking change, please describe it directly below.
| std::span<uint64_t> firstToken(sequenceIds.data(), 1); | ||
| executorch::aten::Tensor logitsTensor = | ||
| this->decode(firstToken, encoderFeatures, startPos); | ||
| ++startPos; |
Please abstract it into a for loop, something like this:
executorch::aten::Tensor logitsTensor = nullptr;
for (size_t i = 0; i < sequenceIds.size(); ++i, ++startPos) {
...
}
| audioBuffer_.reserve(static_cast<size_t>(2 * params::kStreamChunkThreshold * | ||
| constants::kSamplingRate)); | ||
| bool OnlineASR::isReady() const { | ||
| std::scoped_lock<std::mutex> lock(streamingMutex); |
std::scoped_lock generally doesn't need to be explicitly templated with the mutex type - CTAD deduces it - so you can drop the template argument. Please apply the same to the rest of the places.
| for (auto &segment : transcriptions) { | ||
| words.insert(words.end(), std::make_move_iterator(segment.words.begin()), | ||
| std::make_move_iterator(segment.words.end())); | ||
| std::move(segment.words.begin(), segment.words.end(), |
std::ranges::move with back_inserter
| for (size_t i = 0; i < memory_.eos.size(); i++) { | ||
| const auto &eos = memory_.eos[i]; | ||
| if (eos.position >= words.size() || !utils::isEos(words[eos.position]) || | ||
| (eos.position > 0 && | ||
| eos.preceeding != words[eos.position - 1].content)) { | ||
| memory_.eos.erase(memory_.eos.begin() + i, memory_.eos.end()); |
A for loop with iterators is probably more appropriate here.
| // in a 'good' spot - where it will remove a significant audio chunk, yet | ||
| // won't affect most recent, unfinished speech samples. | ||
| size_t bufferSize = audioBuffer_.size(); | ||
| if (bufferSize > static_cast<size_t>(params::kStreamSafeBufferDuration * |
Use std::cmp_greater instead.
| std::vector<Word> OnlineASR::commitAndClean(std::vector<Word> &transcript) { | ||
| const size_t bufferSize = audioBuffer_.size(); | ||
| const float midBufferThreshold = params::kStreamMaxDuration / 2.0F; |
| // recorded any speech. In this case we can safely cut the maximum amount of | ||
| // audio data. | ||
| if (memory_.eos.empty()) { | ||
| size_t cut = |
| } | ||
| return 0; | ||
| constexpr inline bool isEos(const Word &word) { |
Will this ever be used in a compile-time context to justify the constexpr here? I don't think so.

Description
This PR introduces several changes to the speech-to-text module based on Whisper models:
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Run demo app to test the live streaming mode.
Screenshots
Related issues
#1124
Checklist
Additional notes
I am still trying to figure out a way to export Whisper efficiently to the Vulkan backend, after some initial failures, so that Android devices are covered as well.