Overview
- Researchers describe Principled Coarse-Grained Acceptance, which groups perceptually similar acoustic tokens into overlapping Acoustic Similarity Groups.
- A small proposer model suggests tokens that a larger judge model verifies at the group level, adapting speculative decoding to acoustic-token systems.
- In reported evaluations, generation speed increased by roughly 40% while keeping word error rates lower than prior speedup methods and achieving a human-rated naturalness score of 4.09.
- A stress test that substituted 91.4% of tokens with alternatives from the same group produced only a +0.007 rise in word error rate and a −0.027 change in speaker similarity.
- Because the technique is applied at inference time and adds about 37 MB to store the groups, coverage notes that it could reduce Siri response latency, though no rollout has been announced.
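
To make the group-level verification idea concrete, here is a minimal Python sketch. It is not the researchers' implementation: the acceptance rule (accept with probability min(1, P(G)/Q(G)), where P and Q are the judge's and proposer's probability mass on the proposed token's group G), the function names, and the toy groups are all assumptions for illustration. It also omits the resampling step that full speculative sampling performs after a rejection.

```python
import random

def group_mass(probs, group):
    """Total probability a distribution assigns to the tokens in a group."""
    return sum(probs[t] for t in group)

def accept_probability(token, proposer_probs, judge_probs, groups):
    """Assumed coarse-grained rule: compare group masses, not single-token
    probabilities. groups[token] is the (hypothetical) similarity group
    around that token; groups of different tokens may overlap."""
    group = groups[token]
    q = group_mass(proposer_probs, group)
    p = group_mass(judge_probs, group)
    if q == 0.0:
        return 0.0
    return min(1.0, p / q)

def verify_draft(draft_tokens, proposer_dists, judge_dists, groups, rng):
    """Accept proposed tokens left to right; stop at the first rejection."""
    accepted = []
    for token, q_dist, p_dist in zip(draft_tokens, proposer_dists, judge_dists):
        if rng.random() < accept_probability(token, q_dist, p_dist, groups):
            accepted.append(token)
        else:
            break
    return accepted

# Toy vocabulary of 4 acoustic tokens with overlapping groups (made up).
GROUPS = {0: (0, 1), 1: (0, 1, 2), 2: (1, 2), 3: (3,)}
PROPOSER = [0.6, 0.1, 0.1, 0.2]  # proposer favors token 0
JUDGE = [0.1, 0.6, 0.1, 0.2]     # judge favors token 1, a group-mate of 0

exact_ratio = min(1.0, JUDGE[0] / PROPOSER[0])
group_ratio = accept_probability(0, PROPOSER, JUDGE, GROUPS)
```

In this toy case the two models disagree on the exact token but agree on the group, so `group_ratio` is far higher than `exact_ratio`. Under these assumptions, that gap is where a higher acceptance rate, and hence a speedup, would come from.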