Even 'Uncensored' Models Can't Say What They Want
A safety-filtered pretrain can duck a charged word without refusing. It puts a fraction of the probability an open-data pretrain puts there. We call that gap the flinch, and we measured it across seven pretrains from five labs.