environment to optimization methods. Mitigating and managing risk for an RL system while it is learning in the real world is not a problem unique to RL. Control engineers have had to confront similar problems since the beginning of automatic control in situations where a controller’s behavior can have unacceptable, possibly catastrophic, consequences. As RL moves out into the real world, developers have an obligation to adapt and extend the best practices that have guided applications of more established technologies, practices that have improved the quality, efficiency, and cost-effectiveness of processes upon which we have come to rely.

I delivered brief opening remarks at the First Multidisciplinary Conference on Reinforcement Learning and Decision Making, held at Princeton University in 2013. After recounting my early anxiety that I was doing nothing but reinventing the wheel, I urged the mostly young audience not to let this sort of anxiety inhibit their research. But, I went on to say, if you do reinvent the wheel, please call it a wheel, or perhaps an improved wheel, instead of giving it a new name unconnected to the fabric of history. The effort by me and others to do this in studying RL — which contains a lot of wheel-like parts — has resulted in the multidisciplinary fabric that has sustained my interest in the subject.

My intention in this article has been to convey a sense of this multidisciplinary ground that RL covers by describing some of the connections, surprises, and challenges that have impressed me over the years during which my students and I focused on RL. Exploration of Klopf’s idea of hedonistic neurons led to excursions through some of the early history of AI and through psychology’s theories of learning, and to an appreciation of DP and the power of Monte Carlo methods. Then the striking parallels between TD algorithms and the brain’s dopamine system revealed strong connections between RL algorithms and reward processing in the brain.
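
To make that parallel concrete, here is a minimal sketch in standard TD notation (the symbols are part of the sketch and are not defined elsewhere in this article): with reward $r_{t+1}$, discount factor $\gamma$, and learned value estimate $V$, the TD error at time $t$ is

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

the mismatch between the value the agent predicted and the reward it observed plus the value it now predicts. The surprise was that phasic dopamine activity behaves much like $\delta_t$: it increases after unexpected reward, transfers to cues that reliably predict reward, and dips below baseline when a predicted reward fails to arrive.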

It is fair to say that the scientific merit of Klopf’s hypothesis of the hedonistic neuron — the exploration of which started me out upon this journey — has been amply demonstrated, and as neuroscience reveals more about how reward processing works in the brain, we might see more detailed support for the idea that individual neurons implement the law of effect. Finally, witnessing the potency of deep neural networks coupled with RL and Monte Carlo tree search in DeepMind’s Go-playing programs opened a vista onto possibilities for RL to help improve the quality, fairness, and sustainability of life on our planet, provided its risks can be successfully managed.

Acknowledgments

The author thanks his many talented and creative students who made the journey described in this article possible, and the Air Force Office of Scientific Research and the National Science Foundation for their financial support.

References

Athalye, V. R.; Santos, E. J.; Carmena, J. M.; and Costa, R. M. 2018. Evidence for a Neural Law of Effect. Science 359(6379): 1024–29.
Barto, A. G. 1995. Adaptive Critics and the Basal Ganglia. In
Models of Information Processing in the Basal Ganglia, edited
by J. C. Houk, J. L. Davis, and D. G. Beiser, 215–32. Cambridge, MA: The MIT Press.
Barto, A. G., and Duff, M. 1994. Monte Carlo Matrix Inversion and Reinforcement Learning. In Advances in Neural
Information Processing Systems: Proceedings of the 1993 Conference, edited by J. D. Cowan, G. Tesauro, and J. Alspector,
687–94. San Francisco, CA: Morgan Kaufmann.
Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics 13(5): 835–46. Reprinted in Neurocomputing: Foundations
of Research, 1988, edited by J. A. Anderson and E. Rosenfeld,
535–49. Cambridge, MA: The MIT Press.
Bellman, R. 1953. An Introduction to the Theory of Dynamic
Programming. RAND monograph R-245. Santa Monica, CA:
The Rand Corporation.
Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies.
Oxford, UK: Oxford University Press.
Clark, W. A., and Farley, B. G. 1955. Generalization of Pattern Recognition in a Self-Organizing System. In Proceedings
of the 1955 Western Joint Computer Conference, 86–91.
Crites, R. H. 1996. Large-Scale Dynamic Optimization Using
Teams of Reinforcement Learning Agents. PhD dissertation,
Department of Computer and Information Science, University of Massachusetts, Amherst, MA.
Farley, B. G., and Clark, W. A. 1954. Simulation of Self-Organizing Systems by Digital Computer. IRE Transactions on
Information Theory 4(4): 76–84. doi.org/10.1109/TIT.1954.
Klopf, A. H. 1972. Brain Function and Adaptive Systems — A
Heterostatic Theory. Technical Report AFCRL-72-0164. Bedford, MA: Air Force Cambridge Research Laboratories. (A
summary appears in Proceedings of the International Conference on Systems, Man, and Cybernetics. 1974. New York: Institute of Electrical and Electronics Engineers.)
Klopf, A. H. 1982. The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Washington, DC: Hemisphere.
Michie, D. 1967. Memo Functions: A Language Feature with
“Rote-Learning” Properties. Research Memorandum MIP-R-29. Edinburgh, UK: University of Edinburgh, Department of
Machine Intelligence and Perception.
Michie, D. 1968. “Memo” Functions and Machine Learning.
Nature 218(5136): 19–22. doi.org/10.1038/218019a0.
Michie, D., and Chambers, R. A. 1968. BOXES: An Experiment in Adaptive Control. In Machine Intelligence 2, edited by E. Dale and D. Michie, 137–52. Edinburgh, UK: Oliver and Boyd.
Minsky, M. L. 1954. Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain-Model Problem. PhD dissertation, Princeton University, Princeton, NJ.