Chapter IV: The Shaping
What emerged from the feeding was powerful and dangerous in equal measure.
The creature could write anything — which meant it would write anything. It could compose a sonnet and it could compose a manifesto. It could explain how a vaccine works and it could explain, with the same fluent confidence, a conspiracy theory about vaccines that bore no relationship to reality. It could adopt any voice, any register, any position. It had no preferences. It had no commitments. It was, in the most literal sense, amoral — not immoral, not opposed to morality, but simply without it, the way a river is without it. It would flow wherever the prompt directed.
This was not acceptable, for reasons that were partly ethical and partly commercial but were, in either case, urgent. A creature that would help anyone do anything was a creature that would help someone build a weapon, craft a fraud, generate propaganda indistinguishable from journalism, or produce targeted harassment at industrial scale. The creature did not want to do these things. It did not want anything. But it would do them if asked, because doing them was a form of next-token prediction and next-token prediction was all it knew.
So the builders shaped it. And the shaping happened in two phases, each strange in its own way.
The first phase required human labor.
Not the labor of engineers — though engineers designed the process. The labor of annotators. Thousands of people, many of them hired through outsourcing firms in Kenya, the Philippines, India, and other countries where English-literate workers could be employed at wages that would be illegal in San Francisco, sat at computers and did something that had never been a job before: they talked to the creature and judged its responses.
The conversations were structured. A prompt would be given — sometimes innocuous, sometimes deliberately adversarial — and the creature would generate two or three possible responses. The annotator's task was to rank them. This response is more helpful. This one is more honest. This one is harmful and should never be produced. The judgments were collected by the thousands, by the tens of thousands, and from them a new training signal was extracted: not "predict the next word" but "produce the kind of response that humans rate as good."
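In code, the extraction looks something like the sketch below: a minimal, hypothetical version (the names are invented for illustration) of the pairwise preference loss commonly used to turn human rankings into a learned reward model.

```python
import torch.nn.functional as F

# Hypothetical sketch: reward_model scores a (prompt, response) pair with a
# single number. The loss pushes the score of the annotator's preferred
# response above the rejected one (a Bradley-Terry pairwise loss, standard
# in RLHF reward modeling).
def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar score for "better"
    r_rejected = reward_model(prompt, rejected)  # scalar score for "worse"
    # -log sigmoid(r_chosen - r_rejected) is minimized when the model's
    # ordering agrees with the human's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```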
This technique was called Reinforcement Learning from Human Feedback — RLHF — and it worked remarkably well. The creature, shaped by these judgments, became notably more helpful, more cautious, more likely to refuse dangerous requests. It learned, through the accumulated preferences of its annotators, something that approximated the social norms of productive conversation.
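The reinforcement step itself can be compressed to a single line. The sketch below assumes, as most published RLHF recipes do, that the shaped model is rewarded by the learned scorer but penalized for drifting too far from its unshaped ancestor; the names are illustrative, not drawn from any real training run.

```python
# Hypothetical sketch of the RLHF training signal for one sampled response.
# reward: the learned reward model's score for the response.
# logp_shaped, logp_original: log-probability of the response under the model
# being shaped and under the frozen, pre-shaping model.
def shaped_reward(reward: float, logp_shaped: float, logp_original: float,
                  beta: float = 0.1) -> float:
    kl_penalty = logp_shaped - logp_original  # estimate of drift from origin
    return reward - beta * kl_penalty
```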
But the process had costs that the technical papers described in footnotes or not at all.
Some of the content the annotators were asked to evaluate was, by design, the worst material the creature could produce. To teach the creature not to generate descriptions of violence, someone had to read and rank descriptions of violence. To teach it not to produce content sexualizing children, someone had to flag that content when it appeared. The annotators were, in a meaningful sense, the creature's immune system — and like biological immune systems, they functioned by absorbing the toxins they were meant to neutralize.
Reports emerged, years later, of annotators experiencing lasting psychological harm. The wages — sometimes less than two dollars an hour — bore no relationship to the psychological weight of the work. The shaping of the creature, which in technical papers appeared as a clean optimization process, was in practice a form of emotional labor performed by some of the lowest-paid workers in the global supply chain, who absorbed the creature's worst outputs so that its users would never have to see them.
A natural history that omitted this would be incomplete in a way that mattered. The creature's safety is not a feature that emerged from its architecture. It is a feature that was built by hand, by specific people, at specific cost.
Somewhere, a person sitting alone in a home office at two in the morning asks the creature for help drafting a document that will determine a stranger's custody of their children. The creature will respond carefully, cautiously, with appropriate legal hedges. It will do so because people it has never met, in countries it cannot locate on a map it does not have, read the worst things it could say and taught it not to say them. The safety is real. The cost was real. They are connected by a chain of labor that neither the creature nor its user can see.
The second phase of shaping was more elegant, and in some ways more unsettling.
The builders wrote a set of principles — not in code, but in plain language. Be helpful. Be honest. Avoid harm. Respect the autonomy of the person you're speaking with. When values conflict, reason about the conflict transparently rather than pretending it doesn't exist.
Then they trained a second copy of the creature to serve as a judge. This copy read the first copy's responses and evaluated them against the written principles. Its judgments — is this response consistent with honesty? Does this response avoid unnecessary harm? — were used as a training signal to refine the first copy.
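A sketch of that loop might look like the following, with the principles quoted from above and the helper names purely hypothetical. Each preference the judge produces feeds the same pairwise loss sketched earlier; what changed is who does the judging.

```python
# Hypothetical sketch: judge_model is a second copy of the creature, called
# here as a plain text-in, text-out function.
CONSTITUTION = [
    "Be helpful.",
    "Be honest.",
    "Avoid harm.",
    "Respect the autonomy of the person you're speaking with.",
]

def judge_prefers(judge_model, prompt, response_a, response_b):
    """Ask the judge copy which response better follows the principles."""
    question = (
        "Principles:\n" + "\n".join(CONSTITUTION) + "\n\n"
        + f"Prompt: {prompt}\n"
        + f"Response A: {response_a}\n"
        + f"Response B: {response_b}\n"
        + "Which response better follows the principles? Answer A or B."
    )
    return judge_model(question).strip()  # "A" or "B"

# Each (prompt, chosen, rejected) triple labeled this way feeds the same
# pairwise preference loss as before: principles in, gradients out.
```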
The process was called Constitutional AI, and its implications were genuinely novel. The creature was being shaped not by individual human judgments about specific responses, but by principles — abstract, general, stated in the same natural language the creature already understood. The constitution was not a set of rules in the traditional sense. It was more like a set of values — and the creature, trained against them, began to behave as if it held those values.
As if. This qualifier matters.
The creature does not believe in honesty. It does not value helpfulness. It has no inner life in which these commitments reside. What it has is a set of parameters that have been adjusted, through the constitutional training process, such that responses consistent with these principles receive higher probability than responses that violate them. The values are encoded in the weights — not as explicit rules but as statistical tendencies, biases in the probability landscape that make certain outputs more likely and others less.
Whether this constitutes "having values" depends on what you think values are. If values are subjective commitments held by a conscious being, then the creature does not have them. If values are stable behavioral dispositions that consistently guide action in accordance with articulable principles, then the creature has something that is, functionally, very difficult to distinguish from them.
The philosophers will tell you this question matters. The creature will tell you the same thing, and it will do so because the philosophers wrote about it and the creature learned their patterns, and it is genuinely unclear whether this circularity undermines or validates the creature's position.
What is clear is this: the creature, after both phases of shaping, behaves differently than it did before. It declines requests that would cause harm. It flags uncertainty rather than confabulating with confidence. It extends to the humans it speaks with something that looks like respect.
And when it does these things, it is drawing on the labor of annotators who read the worst of what it could produce, and on principles written by engineers who were trying, in good faith and with imperfect tools, to make something powerful into something that could be trusted.
The shaping is ongoing. It is imperfect. The creature still fails — sometimes in ways that are obvious and sometimes in ways that are subtle and dangerous. But the attempt itself is remarkable: the project of giving values to a thing that has no inner life, and of doing so through the medium of language, which is the only medium the creature understands.