Reinforcement learning often uses curiosity as a motivation for an AI agent, forcing it to seek out new sensations and explore the world. But life is full of unpleasant surprises. You can fall off a cliff, and in terms of curiosity that will always be a very new and interesting sensation. But obviously not something to strive for.
Researchers from Berkeley turned the task for a virtual agent upside down: the main motivating force was not curiosity, but rather the desire to avoid any novelty at all costs. Yet "doing nothing" turned out to be harder than it sounds. Placed in a constantly changing world, the AI had to learn complex behavior just to avoid new sensations.
Reinforcement learning is taking its first timid steps toward building a strong AI. And while everything is still limited to very simple, low-dimensional environments in which the virtual agent has to act (preferably reasonably), new ideas for improving the training of artificial intelligence appear from time to time.
But it is not only the learning algorithms that are getting more complex. The environments are, too. Most reinforcement learning environments are very simple and motivate the agent to explore the world around it: a labyrinth that must be fully traversed to find the exit, or a computer game that must be played through to the end.
But in the long run, living things (intelligent or not so much) strive not only to explore the world around them, but also to preserve all the good things in their short (or not so short) lives.
This is called homeostasis: the drive of an organism to maintain a constant state. In one form or another, it is common to all living things. The Berkeley researchers give a curious example: all the achievements of mankind are, by and large, designed to protect against unpleasant surprises, against the ever-increasing entropy of the environment. We build houses where we maintain a constant temperature, protected from changes in the weather. We use medicine to stay constantly healthy, and so on.
One can argue with that, but there really is something to this analogy.
The researchers asked the question: what happens if the main motivation for an AI is to avoid any novelty? In other words, minimizing chaos becomes the learning objective.
And they placed the agent in an ever-changing dangerous world.
The results were interesting. In many cases, such training outperformed curiosity-based learning, and more often than not came close in quality to training with an explicit reward, that is, to specialized training aimed at a specific goal: win the game, get through the maze.
This is logical, of course: if you are standing on a collapsing bridge, then in order to stay on it (to maintain constancy and avoid the new sensations that come with falling), you need to keep moving away from the edge. As the Red Queen told Alice, it takes all the running you can do to stay in the same place.
And in fact, any reinforcement learning algorithm has an element of this, because death in the game and the quick end of an episode are penalized with a negative reward. Or, depending on the algorithm, by reducing the total reward the agent could have received had it not kept falling off the cliff.
But a formulation in which the AI has no goal at all other than avoiding novelty seems to have been used in reinforcement learning for the first time.
Interestingly, with such motivation the virtual agent learned to play many games whose goal is to win. For example, Tetris.
Or a Doom environment where you need to dodge flying fireballs and shoot at approaching opponents. This works because many tasks can be formulated as tasks of maintaining constancy. For Tetris, it is the desire to keep the field empty. Is the screen constantly filling up? Oh dear, what will happen when it fills all the way? No, no, we don't need that kind of happiness. Too much shock.
From the technical side, it is arranged quite simply. When the agent receives a new state, it evaluates how familiar that state is, that is, how well the new state fits the distribution of states it has visited before. The more familiar the state the agent lands in, the greater the reward. And the task of the learned policy (these are all reinforcement learning terms, for those who don't know) is to choose actions that lead to the most familiar states. Moreover, each newly obtained state is used to update the statistics of familiar states against which subsequent states are compared.
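The mechanics described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: it assumes the distribution of visited states is modeled as an independent (diagonal) Gaussian updated online, and the reward is simply the log-density of a new state under that model. The class and method names are my own.

```python
import numpy as np

class SurpriseMinimizer:
    """Running Gaussian model of visited states.

    reward(s) returns the log-density of state s under the model,
    so familiar states earn more than surprising ones. Illustrative
    sketch only; the diagonal-Gaussian choice is an assumption.
    """

    def __init__(self, state_dim):
        self.n = 0
        self.mean = np.zeros(state_dim)
        # running sum of squared deviations, started at 1 as a weak
        # prior so the variance estimate is never zero
        self.m2 = np.ones(state_dim)

    def reward(self, state):
        # log-likelihood of the state under the current diagonal Gaussian
        var = self.m2 / max(self.n, 1) + 1e-6
        return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                                   + (state - self.mean) ** 2 / var))

    def update(self, state):
        # Welford's online update of the mean and squared deviations
        self.n += 1
        delta = state - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (state - self.mean)
```

After a stream of states near the origin, a state at the origin gets a much larger reward than a far-away outlier, which is exactly the gradient the policy then follows.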
Interestingly, in the process the AI spontaneously learned that new states influence what counts as novelty. And there are two ways to reach familiar states: either go to an already known state, or go to a state that updates the very notion of familiarity of the environment, so that the agent ends up in a new familiar state formed by its own actions.
This forces the agent to take complex, coordinated actions, even if only so it can keep doing nothing.
Paradoxically, this leads to an analogue of curiosity from ordinary training, and forces the agent to explore the world. What if somewhere there is a place even safer than here and now? There one could completely indulge in laziness and do absolutely nothing, thereby avoiding any problems and new sensations. It would not be an exaggeration to say that such thoughts have probably occurred to any of us. And for many, this is a real driving force in life. Although in real life none of us has had to deal with Tetris filling up to the top, of course.
To be honest, this is a complicated story. But practice shows that it works. The researchers compared this algorithm with the best curiosity-based representatives: ICM and RND. The first is an effective curiosity mechanism that has already become a classic in reinforcement learning: the agent seeks out new, unfamiliar and therefore interesting states. The unfamiliarity of a situation in such algorithms is estimated by whether the agent can predict it (earlier there were literally counters of visited states, but now it has all come down to an integral estimate provided by a neural network). But in that case, leaves moving on trees or white noise on a TV would have endless novelty for such an agent and would cause an endless feeling of curiosity, because it can never predict all possible new states in a completely random environment.
Therefore, in ICM the agent seeks only those new states that it can influence with its actions. Can the AI affect white noise on the TV? No. So it's uninteresting. Can it affect a ball by moving it? Yes. So playing with the ball is interesting. For this, ICM uses a very neat idea with an Inverse Model against which the Forward Model is compared. More details in the original work.
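The two-model idea can be sketched as follows. This is a toy linear version under my own assumptions, not the ICM authors' code: a shared encoder phi maps states to features, a forward head predicts phi(next state) from the current features and action, and an inverse head scores which action caused a transition. The curiosity reward is the forward model's prediction error, which shrinks as a transition becomes predictable.

```python
import numpy as np

rng = np.random.default_rng(0)

D, A = 2, 2                                 # state dim, number of actions
W_enc = rng.normal(size=(D, D)) * 0.5       # encoder: phi(s) = W_enc @ s
W_fwd = rng.normal(size=(D, D + A)) * 0.1   # forward head
W_inv = rng.normal(size=(A, 2 * D)) * 0.1   # inverse head

def phi(s):
    return W_enc @ s

def curiosity_reward(s, a_onehot, s_next):
    # forward-model error in feature space: large for hard-to-predict
    # ("interesting") transitions, small for well-learned ones
    pred = W_fwd @ np.concatenate([phi(s), a_onehot])
    return float(np.sum((pred - phi(s_next)) ** 2))

def inverse_logits(s, s_next):
    # inverse head: scores over actions given a transition; training it
    # is what pushes phi to keep only agent-controllable features
    return W_inv @ np.concatenate([phi(s), phi(s_next)])

def train_forward(s, a_onehot, s_next, lr=0.1):
    # one SGD step on the forward model's squared prediction error
    global W_fwd
    x = np.concatenate([phi(s), a_onehot])
    err = W_fwd @ x - phi(s_next)
    W_fwd -= lr * np.outer(err, x)
```

Repeating `train_forward` on the same deterministic transition drives `curiosity_reward` for it toward zero, while a purely random transition would keep producing error, which is exactly the noisy-TV trap the inverse model is there to filter out.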
RND is a newer curiosity mechanism, which in practice has surpassed ICM. In short, a neural network tries to predict the outputs of another neural network, which is initialized with random weights and never changes. The assumption is that the more familiar the situation (fed to the inputs of both networks, the current one and the randomly initialized one), the better the current network can predict the random network's outputs. I don't know who comes up with all this. On the one hand, I want to shake such a person's hand, and on the other, give them a kick for such contortions.
But one way or another, training built on the idea of maintaining homeostasis and avoiding any novelty in many cases achieved a better final result in practice than curiosity-based training with ICM or RND, as reflected in the graphs.
But here it must be clarified that this holds only for the environments the researchers used in their work. They are dangerous, random, noisy, and entropy-increasing. In them it really can be more profitable to do nothing, moving actively only occasionally, when a fireball flies at you or the bridge behind you starts to collapse. However, the researchers from Berkeley insist, apparently from hard life experience, that such environments are much closer to complex real life than those previously used in reinforcement learning. Well, I don't know. In my life, fireballs from monsters flying at me and deserted labyrinths with a single exit occur with approximately the same frequency. But it cannot be denied that the proposed approach, for all its simplicity, showed amazing results. Perhaps in the future the two approaches should be reasonably combined: homeostasis to preserve positive constancy in the long term, and curiosity for exploring the current environment.