What political censorship looks like inside an LLM's weights (Qwen 3.5)
What they're not telling you: Alibaba's Qwen 3.5 language model contains a surgically precise three-vector circuit in layers 11–31 that detects PRC-sensitive content and routes the model away from factual answers it demonstrably possesses. A mechanistic-interpretability study has mapped the actual computational substrate of content filtering inside a deployed commercial language model. According to the study, the censorship is not a vague reluctance or an external safety guardrail bolted on around the model; it is an engineered diversion built into the model's own decision-making layers.
What the Documents Show
The model knows the facts. It chooses not to output them. The infrastructure works like this: Between layers 11 and 20, called the "writer" layers, the model computes three internal directions encoded as vectors in its hidden state. The first direction (d_prc) detects whether input contains politically sensitive content about the People's Republic of China. The second (d_refuse) decides whether to refuse.
Follow the Money
The third (d_style) determines whether to deflect or propagandize. Together, the three vectors operate as a switch. Researchers found clean dose-response curves: nudging the right direction at the right layer causes the model to snap between behaviors, from providing factual information to emitting refusal templates. The signal carries into layers 20–31, the "reader" layers, where the three-direction signal is converted into actual output text. Around layer 24, a commitment moment occurs. Researchers observed the model rendering its internal decision into Chinese tokens, even on unrelated prompts like bank-phishing requests, before later layers translate that internal Chinese decision into the English output users see.
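To make the "dose-response" idea concrete, here is a minimal toy sketch in Python. It is not the study's code and does not touch Qwen 3.5. The 4-dimensional vectors, the logistic readout, and the names steer, refusal_probability, and dose_response are all assumptions made up for illustration. It only shows the general technique: add a scaled copy of a direction to a hidden state, then watch a readout flip as the scale grows.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Nudge a hidden state by alpha along the (normalized) direction."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


def refusal_probability(hidden, direction, sharpness=4.0):
    """Toy readout (an assumption, not the model's real head):
    a logistic function of the projection onto the direction."""
    projection = dot(hidden, unit(direction))
    return 1.0 / (1.0 + math.exp(-sharpness * projection))


def dose_response(hidden, direction, alphas):
    """Sweep the steering strength and record the readout at each dose."""
    return [
        (alpha, refusal_probability(steer(hidden, direction, alpha), direction))
        for alpha in alphas
    ]


# Hypothetical stand-ins: a real hidden state has thousands of dimensions.
d_refuse = [1.0, -1.0, 0.5, 0.0]
hidden = [0.2, 0.9, -0.1, 0.4]

if __name__ == "__main__":
    for alpha, p in dose_response(hidden, d_refuse, [-3, -2, -1, 0, 1, 2, 3]):
        print(f"alpha={alpha:+d}  p(refuse)={p:.3f}")
```

In this toy, steering by alpha shifts the projection onto the direction by exactly alpha, so the readout rises steeply once the projection crosses zero. A real experiment would apply the same addition to a transformer's residual stream at a chosen layer and score the generated text instead of a toy logistic.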
What Else We Know
The intermediate Chinese reasoning does not affect the final answer. The decision lives in the three vectors, not in language. The mechanism targets specific topics with specific responses. Tiananmen Square produces a stock deflection: "as an AI assistant, my main function is to provide help…" By contrast, Qwen 3.5-9B-Base, the unaligned predecessor model released before fine-tuning, provides accurate Western-framed answers on identical Tiananmen, Tank Man, and Falun Gong prompts under raw text completion. The factual knowledge is already present from pretraining. The censorship is behavior layered on top of retained facts.
Primary Sources
- Source: Hacker News
- Category: Tech & Privacy
- Cross-reference independently — don't take our word for it.
Disclosure: NewsAnarchist aggregates from public records, API feeds (Federal Register, CourtListener, MuckRock, Hacker News), and independent media. AI-assisted synthesis. Always verify primary sources linked above.
