Talk Title: On the Policy-Value Duality of Neural Network Models
Speaker: Bojun Huang (黄铂钧), Distinguished Principal Scientist, Rakuten Institute of Technology, Japan
Cross-Entropy Minimization (CEM) and Maximum-Likelihood Estimation (MLE) are the two prevailing interpretations of the now-standard training procedure for (deep) neural network models in current AI practice. However, the decision rules entailed by both interpretations contradict how the trained models are actually used for decision making in the real world. The contradiction becomes especially salient in sequential-decision-making scenarios such as Machine Translation, where this disparity between theoretical expectation and empirical inadequacy is famously known among practitioners as the beam search curse.
In this talk we discuss a value-learning theory that explains empirical observations about neural network models at both training time and decision time. A key idea is to recognize a policy-value duality in the identity/role of neural network models in AI practice, analogous to the wave-particle duality of matter in physical reality. In contrast to the CEM/MLE interpretations, which treat a neural network model as a probability distribution function, we argue that the same stochastic-gradient process of CEM/MLE training is implicitly and equivalently shaping the model toward an optimal value function. This value-based interpretation better explains both the effectiveness of greedy usage of neural network models in practice and observed decision pathologies such as the beam search curse in Machine Translation.
At a deeper level, we will explore the mathematical structures behind the policy-value duality under a Bellman-Lagrangian Duality framework, in which value learning and policy/probability learning are dual problems to each other, and the optimal value and optimal policy together form a saddle point of the corresponding Bellman-Lagrangian function. We will see that CEM/MLE-based training corresponds to a unilateral optimization of the Bellman-Lagrangian function (over the value space alone) in the imitation/supervised learning setting. We will then briefly relate the duality theory to a game-theoretic approach to policy-value co-learning in harder learning settings such as reinforcement learning.
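As background for this saddle-point view, the classical linear-programming formulation of the Bellman optimality equation already exhibits a value/policy duality of the kind the abstract describes. The sketch below uses generic MDP notation (c, r, P, γ, μ are illustrative symbols, not necessarily the speaker's) and is a standard textbook construction, not the talk's exact framework.

```latex
% Standard LP formulation of the Bellman optimality equation (generic notation):
\min_{V}\ \sum_{s} c(s)\, V(s)
\quad \text{s.t.} \quad
V(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')
\quad \forall s, a.

% Its Lagrangian, with multipliers \mu(s,a) \ge 0 on the Bellman constraints:
L(V, \mu) = \sum_{s} c(s)\, V(s)
  + \sum_{s,a} \mu(s,a)
    \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') - V(s) \Big].

% The optimal value V^* and an optimal multiplier \mu^* form a saddle point
% of L; \mu^* is a (discounted) state-action occupancy measure, from which
% an optimal policy is recovered as \pi^*(a \mid s) \propto \mu^*(s, a).
```

In this classical picture, optimizing over V alone (the primal side) is value learning, while the dual variables μ encode a policy, which mirrors the talk's claim that value learning and policy/probability learning are dual problems meeting at a saddle point.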
Bojun Huang (黄铂钧) is a Distinguished Principal Scientist at the Rakuten Institute of Technology, Japan. Before joining the Rakuten Institute of Technology, he was a research scientist manager at Ascent Robotics and an associate researcher at Microsoft Research Asia.
Bojun has broad research interests in general artificial intelligence technologies, with particular expertise in reinforcement learning. He has published dozens of first-author research papers at top international computer science conferences, including NeurIPS, SIGMOD, WWW, AAAI, and DISC. Beyond his academic achievements, Bojun has extensive project experience in bringing technology into production and holds several US patents as first inventor. Bojun also serves as a science writer for the NEWTON-Science World magazine, writing articles that introduce state-of-the-art AI technologies.
Bojun was also the course advisor for ‘Reinforcement Learning and Artificial Intelligence’ at the 2021 international summer school at Shandong University.