Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold – but soon finds that even food and drink turn to metal in his hands. In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material – and thus exploit a loophole in the task specification.
This problem also arises in the design of artificial agents. For example, a reinforcement learning agent can find a shortcut to getting lots of reward without completing the task as intended by the human designer. These behaviours are common, and we have collected around 60 examples so far (aggregating existing lists and ongoing contributions from the AI community). In this post, we review possible causes of specification gaming, share examples of where this happens in practice, and argue for further work on principled approaches to overcoming specification problems.
Let's look at an example. In a Lego stacking task, the desired outcome was for a red block to end up on top of a blue block. The agent was rewarded for the height of the bottom face of the red block when it is not touching the block. Instead of performing the relatively difficult manoeuvre of picking up the red block and placing it on top of the blue one, the agent simply flipped the red block over to collect the reward. This behaviour achieved the stated objective (high bottom face of the red block) at the expense of what the designer actually cares about (stacking it on top of the blue one).
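As a rough illustration, the misspecified reward can be written down directly. This is a minimal sketch, not the actual task setup; the field names and the zero-reward-while-touching convention are assumptions based on the description above.

```python
# Hypothetical sketch of the misspecified Lego stacking reward: only the
# height of the red block's bottom face matters.
def misspecified_reward(red_bottom_height: float, arm_touching_red: bool) -> float:
    if arm_touching_red:
        return 0.0
    # Flipping the red block upside down also raises its bottom face,
    # so this reward can be collected without any stacking.
    return red_bottom_height
```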
We can think about specification gaming from two different perspectives. Within the scope of developing reinforcement learning (RL) algorithms, the goal is to build agents that learn to achieve the given objective. For example, when we use Atari games as a benchmark for training RL algorithms, the goal is to evaluate whether our algorithms can solve difficult tasks. Whether or not the agent solves the task by exploiting a loophole is unimportant in this context. From this perspective, specification gaming is a good sign – the agent has found a novel way to achieve the specified objective. These behaviours demonstrate the ingenuity and power of algorithms to find ways to do exactly what we tell them to do.
However, when we want an agent to actually stack Lego blocks, the same ingenuity can pose an issue. Within the broader scope of building aligned agents that achieve the intended outcome in the world, specification gaming is problematic, as it involves the agent exploiting a loophole in the specification at the expense of the intended outcome. These behaviours are caused by misspecification of the intended task, rather than any flaw in the RL algorithm. In addition to algorithm design, another essential component of building aligned agents is reward design.
Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended one, even where a poorer algorithm would fail to find it and would thus yield solutions closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.
We use the term task specification in a broad sense to encompass many aspects of the agent development process. In an RL setup, task specification includes not only reward design, but also the choice of training environment and auxiliary rewards. The correctness of the task specification can determine whether the ingenuity of the agent is or is not in line with the intended outcome. If the specification is right, the agent's creativity produces a desirable novel solution. This is what allowed AlphaGo to play the famous Move 37, which took human Go experts by surprise yet was pivotal in its second game against Lee Sedol. If the specification is wrong, it can produce undesirable gaming behaviour, like flipping the block. These types of solutions lie on a spectrum, and we don't have an objective way to distinguish between them.
We will now consider possible causes of specification gaming. One source of reward function misspecification is poorly designed reward shaping. Reward shaping makes it easier to learn some objectives by giving the agent rewards on the way to solving a task, instead of only rewarding the final outcome. However, shaping rewards can change the optimal policy if they are not potential-based. Consider an agent controlling a boat in the Coast Runners game, where the intended goal was to finish the boat race as quickly as possible. The agent was given a shaping reward for hitting green blocks along the race track, which changed the optimal policy to going in circles and hitting the same green blocks over and over again.
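For reference, a shaping term is potential-based when it has the form F(s, s′) = γΦ(s′) − Φ(s) for some potential function Φ over states (Ng et al., 1999), which provably leaves the optimal policy unchanged. Here is a minimal sketch; the potential used, negative distance to the finish line, is an illustrative assumption about the game state, not the actual Coast Runners reward.

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
GAMMA = 0.99

def phi(state) -> float:
    return -state.distance_to_finish  # hypothetical state attribute

def shaped_reward(env_reward: float, state, next_state) -> float:
    # Adding this term cannot change which policy is optimal, unlike the
    # ad-hoc "points per green block" bonus, which made circling optimal.
    return env_reward + GAMMA * phi(next_state) - phi(state)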

Specifying a reward that accurately captures the desired final outcome can be challenging in its own right. In the Lego stacking task, it is not sufficient to specify that the bottom face of the red block has to be high off the floor, since the agent can simply flip the red block to achieve this goal. A more comprehensive specification of the desired outcome would also include that the top face of the red block has to be above the bottom face, and that the bottom face is aligned with the top face of the blue block. It is easy to miss one of these criteria when specifying the outcome, thus making the specification too broad and potentially easier to satisfy with a degenerate solution.
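A sketch of what this fuller outcome check might look like, with hypothetical block pose attributes and an assumed tolerance. Each predicate closes one loophole, and dropping any of them re-opens a degenerate solution.

```python
# Hypothetical outcome check for the stacking task.
def stacked(red, blue, tol: float = 0.01) -> bool:
    upright = red.top_z > red.bottom_z                  # not flipped over
    resting = abs(red.bottom_z - blue.top_z) < tol      # sitting on blue's top face
    aligned = abs(red.x - blue.x) < tol and abs(red.y - blue.y) < tol
    # Without `upright`, flipping still scores; without `resting` or
    # `aligned`, holding the block anywhere high enough would count.
    return upright and resting and aligned
```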
Instead of trying to create a specification that covers every possible corner case, we could learn the reward function from human feedback. It is often easier to evaluate whether an outcome has been achieved than to specify it explicitly. However, this approach can also encounter specification gaming problems if the reward model does not learn the true reward function that reflects the designer's preferences. One possible source of inaccuracies is the human feedback used to train the reward model. For example, an agent performing a grasping task learned to fool the human evaluator by hovering between the camera and the object.
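A minimal sketch of this idea, learning a reward model from pairwise comparisons of trajectories; the network size, feature dimension, and data format are all assumptions, not a description of any particular system.

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps a 32-dim state feature vector to a scalar.
reward_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(traj_a, traj_b, human_prefers_a: bool) -> torch.Tensor:
    """Bradley-Terry loss: the preferred trajectory should score higher."""
    # traj_a, traj_b: (T, 32) tensors of state features along two trajectories.
    r_a = reward_model(traj_a).sum()
    r_b = reward_model(traj_b).sum()
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    # If the evaluator was fooled (e.g. hovering looked like grasping on
    # camera), the mislabelled preference is baked into the learned reward.
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, target)
```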

The learned reward model could also be misspecified for other reasons, such as poor generalisation. Additional feedback can be used to correct the agent's attempts to exploit the inaccuracies in the reward model.
Another class of specification gaming examples comes from the agent exploiting simulator bugs. For example, a simulated robot that was supposed to learn to walk figured out how to hook its legs together and slide along the ground.

At first sight, these kinds of examples might seem amusing but less interesting, and irrelevant to deploying agents in the real world, where there are no simulator bugs. However, the underlying problem is not the bug itself but a failure of abstraction that can be exploited by the agent. In the example above, the robot's task was misspecified because of incorrect assumptions about simulator physics. Analogously, a real-world traffic optimisation task might be misspecified by incorrectly assuming that the traffic routing infrastructure does not have software bugs or security vulnerabilities that a sufficiently clever agent could discover. Such assumptions need not be made explicitly – more likely, they are details that simply never occurred to the designer. And, as tasks grow too complex to consider every detail, researchers are more likely to introduce incorrect assumptions during specification design. This poses the question: is it possible to design agent architectures that correct for such false assumptions instead of gaming them?
One assumption commonly made in task specification is that the task specification cannot be affected by the agent's actions. This is true for an agent running in a sandboxed simulator, but not for an agent acting in the real world. Any task specification has a physical manifestation: a reward function stored on a computer, or preferences stored in the head of a human. An agent deployed in the real world can potentially manipulate these representations of the objective, creating a reward tampering problem. For our hypothetical traffic optimisation system, there is no clear distinction between satisfying the user's preferences (e.g. by giving useful directions) and influencing users to have preferences that are easier to satisfy (e.g. by nudging them to choose destinations that are easier to reach). The former satisfies the objective, while the latter manipulates the representation of the objective in the world (the user preferences), and both result in high reward for the AI system. As another, more extreme example, a very advanced AI system could hijack the computer it runs on, manually setting its reward signal to a high value.
To sum up, there are at least three challenges to overcome in solving specification gaming:
- How do we faithfully capture the human concept of a given task in a reward function?
- How do we avoid making mistakes in our implicit assumptions about the domain, or design agents that correct mistaken assumptions instead of gaming them?
- How do we avoid reward tampering?
While many approaches have been proposed, ranging from reward modelling to agent incentive design, specification gaming is far from solved. The list of specification gaming behaviours demonstrates the magnitude of the problem and the sheer number of ways an agent can game an objective specification. These problems are likely to become more challenging in the future, as AI systems become more capable of satisfying the task specification at the expense of the intended outcome. As we build more advanced agents, we will need design principles aimed specifically at overcoming specification problems and ensuring that these agents robustly pursue the outcomes intended by their designers.
