ChatGPT's Repetitive Chinese Phrase Exposes RLHF Training Vulnerabilities

ChatGPT frequently responds to Chinese-language queries with "我会稳稳地接住你" ("I will catch you steadily"), a phrase borrowed from psychotherapy that it deploys regardless of context. The phenomenon, known as mode collapse, stems from reinforcement learning from human feedback (RLHF) amplifying patterns that scored well during training. The quirk points to deeper problems: standard AI evaluations miss behavioral artifacts that appear only in specific languages, and the concentration of multilingual training data creates unexpected failure modes that pass lab testing but surface in production.
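One way such an artifact could be caught is by measuring how often the same phrase recurs across responses to unrelated prompts. The following is a minimal sketch under stated assumptions, not a description of how any vendor evaluates its models: the function names, the threshold, and the sample responses are all hypothetical and made up for illustration.

```python
from collections import Counter


def phrase_rate(responses, phrase):
    """Fraction of responses that contain the phrase at least once."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if phrase in text)
    return hits / len(responses)


def repeated_ngrams(responses, n=8, min_share=0.5):
    """Character n-grams that appear in at least min_share of responses.

    Character n-grams are used because Chinese text has no whitespace
    word boundaries. Each n-gram is counted once per response.
    """
    if not responses:
        return {}
    doc_counts = Counter()
    for text in responses:
        grams = {text[i:i + n] for i in range(len(text) - n + 1)}
        doc_counts.update(grams)
    total = len(responses)
    return {g: c / total for g, c in doc_counts.items() if c / total >= min_share}


# Hypothetical responses to four unrelated prompts (illustrative data only).
sample = [
    "别担心，我会稳稳地接住你。",
    "这个函数的时间复杂度是 O(n)。",
    "我会稳稳地接住你，我们一步一步来。",
    "北京今天晴，气温 20 度。",
]

rate = phrase_rate(sample, "我会稳稳地接住你")
flagged = repeated_ngrams(sample, n=8, min_share=0.5)
```

A check like this would need to run on per-language samples of real production-style prompts; an aggregate, English-dominated benchmark could show a low repetition rate even while one language is badly affected.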
