State-of-the-art intracortical neuroprostheses currently enable communication at 60+ words per minute for anarthric individuals by training on over 10K sentences to account for phoneme variability across word contexts. It remains unclear, however, whether this performance can be maintained when decoding naturalistic speech with 40K+ word vocabularies across elicited, spontaneous, and conversational speech contexts. We introduce a vocal-unit-level generalization test to explicitly evaluate neural decoder performance on an expanded and more diverse behavioral repertoire. Using zebra finch vocalization, an analog of human vocal production, as a model system, we compare three neural decoders with different input types: spike trains, neural factors, and firing rates. The factors and rates are latent neural features inferred by trained Latent Factor Analysis via Dynamical Systems (LFADS) models that capture population neural dynamics during vocal production. While the conventional random-holdout generalization error is similar for all three decoders, factor- and rate-based decoders outperform spike-based decoders on vocal-unit-holdout generalization error. These results suggest that the latter models adapt better to flexible vocalization inference when trained on only a partial sample of behavioral variation, motivating further exploration of decoders that incorporate latent neural and vocalization dynamics.
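The vocal-unit-holdout evaluation can be sketched as a grouped cross-validation scheme in which every test fold contains only trials of a vocal unit (e.g. a syllable type) withheld entirely from training, in contrast to a random trial-level holdout. A minimal illustrative sketch follows; the names `trials` and `unit_of` are hypothetical and stand in for the actual trial metadata, which the abstract does not specify.

```python
# Sketch of a vocal-unit-holdout split (assumption: each trial carries a
# vocal-unit label such as syllable identity; `trials` and `unit_of` are
# hypothetical names, not from the paper).
from collections import defaultdict

def vocal_unit_holdout_splits(trials, unit_of):
    """Yield (held_out_unit, train_idx, test_idx), holding out one vocal
    unit at a time.

    trials  : list of trial identifiers
    unit_of : dict mapping trial identifier -> vocal-unit label
    """
    by_unit = defaultdict(list)
    for i, t in enumerate(trials):
        by_unit[unit_of[t]].append(i)
    for held_out, test_idx in by_unit.items():
        held = set(test_idx)
        train_idx = [i for i in range(len(trials)) if i not in held]
        yield held_out, train_idx, test_idx

# Usage: each fold tests generalization to a vocal unit the decoder
# never observed during training.
trials = ["t0", "t1", "t2", "t3", "t4", "t5"]
unit_of = {"t0": "A", "t1": "A", "t2": "B", "t3": "B", "t4": "C", "t5": "C"}
splits = list(vocal_unit_holdout_splits(trials, unit_of))
```

Under this split, a decoder that merely memorizes unit-specific neural patterns will score well on a random holdout but poorly here, which is the distinction the abstract's comparison turns on.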