The Hidden Limitations of CTC in SOTA Brain-to-Text BCI Systems

Abstract

State-of-the-art (SOTA) brain-to-text brain-computer interface (BTT-BCI) systems have shown promise in supporting daily communication across various control modes (attempted speech, handwriting, and typing) and recording modalities, with intracortical electrodes offering the highest performance, followed by ECoG and MEG. Despite recent advances, significant error-rate disparities persist across words in decoded test sentences. In our previous work, we found that infrequent words, those occurring with less than 10% probability in a sentence, exhibit error rates three times greater than those of frequent words, leading to semantically significant decoding failures. BTT-BCIs translate continuous neural activity into discrete behavioral states whose timing is variable, and typically unobserved, across trials. A key component of SOTA BTT-BCIs is the Connectionist Temporal Classification (CTC) loss, which addresses classification and alignment uncertainty by marginalizing over all valid alignments and introducing a blank token. While blank tokens stabilize learning by absorbing uncertain timepoints, they also induce peaky alignments in which each non-blank prediction collapses to a single time point. In this condition, the gradient at the surrounding blank frames approaches zero, preventing the model from extending predictions across the full span and causing premature learning plateaus. We found that in intracortical BCI datasets for both handwriting (Willett et al. 2021) and speech (Willett et al. 2023; Card et al. 2024), the predicted state spans only a fraction of the true movement-production time. For instance, phonemes are predicted over a single time bin (40 or 80 ms), whereas typical speech phonemes last ~100 ms for speakers at 120 words per minute (wpm), and the BCI users spoke at only 32–62 wpm, implying even longer phoneme durations. This discrepancy strongly suggests a temporal misalignment between model predictions and the full extent of the neural population response that unfolds throughout movement production. Our comparisons across model architectures further confirm that learning halts prematurely: both MLPs (timepoint-independent) and RNNs (with temporal context) exhibit similar alignment patterns. RNNs demonstrate 5.17% and 7.26% gains over MLPs for speech and handwriting, respectively, primarily from improved decoding of frequent words. Consistent with recent benchmark results, increased model capacity yields limited gains in raw behavioral-state decoding, as models remain constrained by CTC's reliance on localized alignments. Consequently, future modeling strategies that account for the full temporal extent of neural activity are likely to yield more robust classifiers and improve the generalizability of BTT-BCI decoders.
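
For readers unfamiliar with the loss, below is a minimal, self-contained sketch of how CTC is typically applied in this setting, using PyTorch's nn.CTCLoss. The 41-class phoneme inventory, 40 ms bins, sequence lengths, and all shapes are illustrative assumptions, not the exact configuration of the systems cited above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_classes = 41        # e.g., 39 phonemes + silence + blank (assumed inventory)
blank_id = 0          # index reserved for the CTC blank token
T, B = 150, 4         # 150 time bins (e.g., 40 ms each) per trial, batch of 4

# Stand-in decoder output: per-bin log-probabilities over phonemes + blank.
log_probs = torch.randn(T, B, n_classes).log_softmax(dim=-1).requires_grad_()

# Target phoneme sequences are far shorter than the input and carry no
# timing information; CTC marginalizes over every valid alignment.
targets = torch.randint(1, n_classes, (B, 20), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients reach every time bin via the alignment lattice
print(f"CTC loss: {loss.item():.3f}")
```

Here zero_infinity=True is a common safeguard against targets that no valid alignment can fit; it is our assumption rather than a documented choice of the cited systems.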
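
The peaky-alignment plateau can also be inspected directly in the CTC gradient. The toy sketch below, with all sizes and values chosen by us for illustration, hand-constructs a peaky prediction (confident blank everywhere except one spike frame per label, each label truly spanning 20 frames) and differentiates the loss with respect to the label logits. The gradient is clearly negative (reinforcing) only at the spike, while it is near zero or slightly positive at the surrounding blank frames, so optimization has no force that widens the prediction toward the label's true extent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, C, blank_id = 60, 6, 0            # 60 frames, 5 labels + blank (assumed)
target = torch.tensor([[1, 2, 3]])   # each label truly spans 20 frames

# Hand-built "peaky" predictions: confident blank everywhere, with a single
# confident spike for each label in the middle of its true segment.
logits = torch.zeros(T, 1, C)
logits[:, 0, blank_id] = 5.0
for i, lab in enumerate(target[0]):
    spike = i * 20 + 10
    logits[spike, 0, blank_id] = 0.0
    logits[spike, 0, lab] = 5.0
logits.requires_grad_()

ctc = nn.CTCLoss(blank=blank_id)
loss = ctc(logits.log_softmax(-1), target,
           torch.tensor([T]), torch.tensor([3]))
loss.backward()

# d(loss)/d(logit) for label 1 around its spike at frame 10. A negative value
# means gradient descent strengthens that prediction; near-zero or positive
# values at the surrounding blank frames mean nothing pushes the model to
# extend label 1 across its true 20-frame span.
g = logits.grad[:, 0, 1]
for t in (6, 8, 9, 10, 11, 12, 14):
    print(f"frame {t:2d}: grad = {g[t].item():+.4f}")
```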
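
The duration argument can be checked with back-of-the-envelope arithmetic; the figure of roughly five phonemes per English word is our assumption, used only to make the speaking rates comparable.

```python
# Assumed average of ~5 phonemes per English word, for illustration only.
PHONEMES_PER_WORD = 5

def mean_phoneme_ms(wpm: float) -> float:
    """Approximate mean phoneme duration in ms at a given speaking rate."""
    return 60_000 / (wpm * PHONEMES_PER_WORD)

for wpm in (120, 62, 32):
    print(f"{wpm:>3} wpm -> ~{mean_phoneme_ms(wpm):.0f} ms per phoneme")
# 120 wpm -> ~100 ms, matching the cited figure; at the BCI users' 32-62 wpm,
# expected durations are ~190-375 ms, so a single 40-80 ms prediction bin
# covers only a small fraction of the movement-production time.
```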

Date
Nov 16, 2025
Event
SfN 2025
Location
San Diego, USA