Robot learning, model-based RL, and related control/optimization at NeurIPS 2020
I tried my best to absorb a lot of content at NeurIPS 2020, and it was just as overwhelming as ever. Everyone makes decisions about which content to focus on, and it is always an exploration (learn new things) versus exploitation (further mastering material in your area of expertise) tradeoff. I chose to focus on my areas of expertise: model-based learning, RL, and robotics (I also spent a good bit of time networking, but that happened between the lines of these notes).
Some of my mentors have said the workshops are the best parts of conferences, and I am beginning to agree. They are where you see the newest work, honest opinions, and familiar faces (or avatars). The panels this year were where the state of each field was discussed most openly, so I have compiled my takeaways below.
Robot Learning Panel
Link to robot learning workshop. The panelists were Peter Stone (UT Austin), Jeannette Bohg (Stanford University), Dorsa Sadigh (Stanford University), Pete Florence (Google Research, Mountain View), Carolina Parada (Google Research, Mountain View), Jemin Hwangbo (Korea Advanced Institute of Science and Technology), and Fabio Ramos (University of Sydney and NVIDIA).
Practical robotics: The panel began with a discussion of why we (as academics) have not seen much penetration of learning-based robotic systems in the real world. This is something I have been thinking (and blogging) about more frequently — are we actually at an inflection point of robotics in the real world? It’s hard to say. The panelists listed some minor applications of learning in real-world robotics, but settled on an interesting discussion of the need to distinguish between industrial applications and consumer products. I find this interesting — companies like Skydio are showing there are things you can do in the consumer regime. I would add that the incentives of entrenched public companies may make the capital cost of adopting robots too hard to justify on balance sheets and quarterly reports — if not now, when would robotic cashiers and checkouts become a thing?
End-to-end systems: There seemed to be a consensus on the robot learning panel that “end-to-end systems aren't practical.” They spent time discussing the example of navigation (pointing to recent work by Grace Gao) and how classical methods are still far better. Carolina from Google advocated for learning-based systems in uncertain environments, but using learning for the logistics equivalent of “last mile” delivery seems much more tractable: make a system that works, then see if continual learning can optimize it over time. The big limits to adoption in industry applications are robustness and safety.
Roboticist mentality: I would note from the panel that roboticists seem very fond of their work, their networks, and their robots. There was a lengthy discussion of international collaborations and new ways of doing robot experiments during lockdown. This is something I greatly appreciate about the field — people genuinely want to interact with physical systems and show that things work (personally, I am sad I don’t have good systems set up for robotic experiments — this is something I will look for when I go on the job market).
Simulators: The final point I will make is on a discussion of simulators, broadly encompassing how to define simulators, where to use sim2real, how the definition of a simulator affects the task, and more. Robot learning circles have a very astute sense of models and how they impact results. With no simulator being perfect, it’s interesting to hear about simulator-task matching: do you spend time making contact forces more accurate, or parallelizing the simulator so that you can get more samples?
Deep RL Panel
Link to deep reinforcement learning workshop (there was also an offline RL workshop, which seemed interesting). The panelists were Marc Bellemare, Matt Botvinick, Ashley Edwards, Karen Liu, Susan Murphy, Anusha Nagabandi, Pierre-Yves Oudeyer, and Peter Stone.
Pace of RL: The Deep RL panel was very reflective. One of the first questions tried to tease out whether there is a “general slowdown of the pace of the field” and what it means for researchers. Personally, I haven’t seen this. The panelists described it as a slowdown in the pace of breakthroughs, maybe because we have fewer new simulated tasks to solve.
Reproducibility: Importantly, there was a discussion of hidden elements of papers blocking reproducibility. Essentially, many RL projects need many code-level tricks to make them converge, and these don’t end up in papers (at best they are in the appendices). This hurts reproducibility, but is it a symptom of the competitive race for state-of-the-art? Does it normally matter in practice if our algorithm takes 2x the number of samples to converge? This led into a discussion of incentives.
Methods v. Insights: To quote Anusha, “a good method is fine.” There is an infatuation in the RL community with having insights and good results. Sometimes in research, especially toward applications, a good method is enough of a contribution, and making up reasons after the fact for why it is insightful may not be the best practice. Anusha commented on how she didn’t have time for insights when she was rushing to get the awesome results she did in her PhD. That perspective of just diving into the work and not embellishing things may be what the field needs. Ashley Edwards took this one step further, commenting that it may be okay if fewer people enter the field and fewer eyes follow all the work: it removes some of the intensity and may foster creative thinking. I definitely agree — I am trying to come up with problems I think matter, not focusing on citations and paper counts, but it is exhausting.
Data v. model structure: the question that started this discussion was “why has deep RL not had the equivalent of LSTMs, transformers, and CNNs for our field?” The answer, broadly, is that RL has an inconsistent data structure and inherently leverages other types of supervised learning. It was interesting to hear them discuss how RL doesn’t have the equivalent of an “ImageNet” challenge (MuJoCo does a poor job of filling this role because it is hard to work with and expensive), so maybe we aren’t optimizing for general, structural breakthroughs (people make their own problem spaces). An analogy I liked: one of the panelists suggested researchers should be looking for something like a structured exploration method that could generalize across domains. I am not sure what that looks like, but the ad-hoc-ness of RL is definitely real. The RL framework sort of creates this challenge, but these uncertainties are also why it is so fascinating (it is harder to define the optimization problem).
Where to learn RL: Interestingly, no one jumped to answer “where should people go to learn RL?” That is clearly a problem for the field if there are no good resources (I have tried to help!). Eventually they referenced the RL book and discussed how the differing prerequisites make it a hard field to start in. For example, many people start from optimal control and Bellman’s principle, but there are also many software-focused CS undergrads trying to dive in and make things work. The differing backgrounds and lack of a core “curriculum” make intra-disciplinary discourse challenging.
Papers and Leftovers
My paper on long-term prediction in robots (or the video) was well received. Most everyone agrees the current prediction mechanisms are not fantastic, but interestingly most discussions evolve into talking about planning (where sample-based methods are the most common tool). I am glad that I ended up at NeurIPS again, and this was the first year I felt like I was interacting with many people in my field whom I had read and heard of but not really met. Shout out to some colleagues I enjoyed chatting with: Michael Zhang (Toronto), Oleh Rybkin (Penn), and Thomas Moerland (Delft). I would be happy to collaborate with some of them in the future; there are a lot of overlapping ideas in the community about what should work, but not a lot of certainty about why things do not work yet.
I had flagged some papers before the conference as related to my work, and found others by networking that should be highlighted. A quick note: multiple papers appeared at both the robot learning and deep RL workshops, which disappointed me a little. I guess workshops are not monitored and don’t actively prevent that, but it feels slightly disingenuous.
Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models — talk: Hierarchical models (perception and dynamics) share a perception model across robots with similar video feeds while keeping separate low-level dynamics models per robot. This paper was cool because they actually used data from multiple classes of robots.
Model-based Navigation in Environments with Novel Layouts Using Abstract 2-D Maps — paper, talk: I found this because we are talking about navigation and multi-agent control a lot more in my group. It’s a little different from what I first thought, but interesting nonetheless.
Model-Based Reinforcement Learning via Latent-Space Collocation — paper, talk: By co-optimizing the states and the actions, the agent improves on (visual) model-based RL. The question is: why hasn’t this worked in state-based RL yet?
Continual Model-Based Reinforcement Learning with Hypernetworks — paper, talk: Did not have time to read in detail.
Accelerating Reinforcement Learning with Learned Skill Priors — paper, talk: This is one of the papers that appeared in both the Deep RL and Robot Learning workshops.
Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization — paper: This paper uses autoregressive models (predicting each state dimension one at a time, allowing the state dimensions to influence each other and hopefully improving the correlation of model accuracy with policy improvement). I was super happy to hear from the author that it was heavily inspired by some of my past work.
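As a hedged sketch of that factorization (the function names and the linear stub are my own illustration, not the paper's architecture): the model predicts p(s' | s, a) one dimension at a time, feeding each predicted dimension into the predictions for later ones.

```python
# Autoregressive dynamics factorization: p(s' | s, a) = prod_i p(s'_i | s, a, s'_{<i}).
# predict_dim stands in for a learned per-dimension conditional model; here it
# is a hypothetical linear stub just to show the data flow.
def predict_dim(i, state, action, preceding):
    return state[i] + action + 0.1 * sum(preceding)

def autoregressive_step(state, action):
    next_state = []
    for i in range(len(state)):
        # each dimension conditions on the dimensions already predicted this step
        next_state.append(predict_dim(i, state, action, next_state))
    return next_state

print(autoregressive_step([1.0, 2.0, 3.0], 0.5))
```

The point of the ordering is that later dimensions see earlier ones within the same step, which a standard "predict the whole next state at once" model cannot capture.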
I put talks last because the screenshots take a lot of space.
This talk is about the scale of the problems we are solving and why they matter. We are compiler hackers, and as a community we need to be software engineers, ethnographers, and language nerds. Charles draws many analogies to how we are making hierarchical design decisions, each of which can influence the data flow and bias (much like historical applications of engineering). We need more diverse backgrounds in the loop!
There was an interesting analogy to photography, where different products optimized for different things in ways that ended up being racist: film was optimized for white skin tones, and this continues down the line. Software engineering is the practice of turning code into software according to principles, and it is not devoid of bias and other potential shortcomings. Honestly, just go watch this talk.
This is a long-form tutorial on calculus and linear algebra from the perspective of machine learning. Honestly, I loved it just because the authors put in so much effort creating a meme and a story set in Middle-earth.
Otherwise, there were some other talks that were interesting, but not “must watch” territory.
This talk focused on engineering tools to help minimize data and algorithmic bias. I liked parts of it because they actually detailed different design decisions that could be made in realistic scenarios, rather than broadly discussing the problems of AI bias. They contacted ML engineers rather than just researchers, as engineers implement everything discussed. The relevant paper and code are linked.
Martha White’s talk on sources of uncertainty in policy-gradient methods
An interesting talk discussing the three sources of variance in policy-gradient methods of RL: state sampling, action sampling, and reward sampling. It is a very good review of policy gradients, and a lesson in how to investigate a problem space. For state sampling, people use mini-batches when computing gradients. The other two sources, actions and returns, involve more detailed solutions. The simplest way to reduce action variance is to look at all possible actions, but in practice this doesn't work because it can be expensive and we don't know Q^pi. The practical solution is to use a baseline, which acts as a control variate: because it is uncorrelated with the score under the expectation over actions, subtracting it leaves the gradient estimate unbiased while reducing variance.
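To make the baseline trick concrete, here is a minimal sketch on a toy one-step bandit (my own example, not from the talk): a Gaussian policy over a 1-D action, a made-up quadratic reward, and the REINFORCE estimator with and without a constant baseline.

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0  # Gaussian policy N(mu, sigma) over a 1-D action

def reward(a):
    # toy reward, peaked at a = 1; true gradient d/dmu E[reward] = 2 at mu = 0
    return -(a - 1.0) ** 2 + 10.0

actions = [random.gauss(mu, sigma) for _ in range(100_000)]
rewards = [reward(a) for a in actions]
# score function: d/dmu log N(a; mu, sigma)
scores = [(a - mu) / sigma ** 2 for a in actions]

# REINFORCE estimator without a baseline
plain = [r * s for r, s in zip(rewards, scores)]
# subtracting a constant baseline keeps the estimator unbiased
b = statistics.mean(rewards)
with_baseline = [(r - b) * s for r, s in zip(rewards, scores)]

print(statistics.mean(plain), statistics.mean(with_baseline))          # both near 2.0
print(statistics.variance(plain), statistics.variance(with_baseline))  # baseline variance is much smaller
```

Both estimators target the same gradient, but the baseline strips out the large constant offset in the reward that was multiplying the score. In real policy-gradient methods the baseline is usually a learned value function V(s) rather than a constant.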
Reward variance is reduced via an estimator of the return, which doesn't seem to solve the whole problem (estimators aren’t perfect).
Why did I include this talk? It is important to be able to reason about your problem space and go into abundant detail about where your implementation may be imperfect. Variance comes up in every numerical/data-driven method because we do not have infinite data. I was making connections during this talk to the problems of model-based learning and how hard it is to disambiguate uncertainty introduced by the model. In a way, MBRL could be diagnosed like this talk with a fourth source of uncertainty: structural, i.e., variance from model and controller formulation decisions (this talk focuses on policy gradients specifically).
America’s Cup boat design via RL
This expo talk from QuantumBlack caught my attention because it combined the America's Cup with RL. I was impressed by the level of implementation detail they discussed — items such as starting and stopping a simulation (where PPO is easier to work with than SAC), and how to deploy many components and integrate high-fidelity simulations with RL. The crucial question is whether the RL actually helps, or whether the design is so under-optimized that any offline method would likely give substantial gains.
Jeff Shamma: Perspectives from Feedback Control on ML talk
I included this talk because I think it covers an area a lot of people could benefit from, but it requires a lot of skills (machine learning and control theory expertise) to follow. It goes through a running example of attitude control of an aircraft and the takeaways from applying learning to classic controls problems.
Stabilize and shape behavior -- higher order learning
Gradient play (individualized learning in a competitive problem) cannot converge in zero-sum games, as its dynamics become an unstable system (zeros on the diagonal of the A matrix).
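A quick numerical sketch of that instability (my own toy example, not from the talk): running simultaneous gradient play on the bilinear zero-sum game u(x, y) = x * y, the iterates spiral outward instead of converging to the equilibrium at the origin.

```python
# Gradient play on u(x, y) = x * y: player 1 ascends in x, player 2 descends in y.
#   x <- x + eta * du/dx = x + eta * y
#   y <- y - eta * du/dy = y - eta * x
# The continuous-time A matrix [[0, 1], [-1, 0]] has zeros on its diagonal
# (purely imaginary eigenvalues), and the Euler discretization expands the
# squared norm by a factor of (1 + eta^2) every step.
eta = 0.1
x, y = 1.0, 0.0
for _ in range(1000):
    x, y = x + eta * y, y - eta * x

print(x * x + y * y)  # far larger than the starting value of 1.0
```

This is exactly the failure mode that motivates adding auxiliary states (e.g., tracking a history of the opponent's play) to stabilize the learning dynamics.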
Key idea: see if we can work in a different space of information (such as the history of information) by adding auxiliary states, a common practice in control theory to stabilize a system (like an integrator). Also see anticipatory learning — adding anticipation allows convergence to Nash equilibria (which makes sense conceptually).
Robustness to variation -- passive / monotone learning
External and internal parameters can change within the dynamics (parametric variation), and the dynamics themselves can change (e.g., fluids). These dynamic variations introduce new states, potentially an infinite-order system. This introduces robust analysis — how the family of systems performs in the context of the controller.
Example: contractive games, where the inner product of the change in strategy and the change in payoff is negative (an interesting notion of direction). We want the complementary property on the learning side: a passive learning rule, where the payoff and an impartial pairwise comparison correlate. We can then do robustness analysis for families of systems by abstracting away from a specific controller.
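As a small numerical illustration of that inner-product condition (my own example, not from the talk): for the bilinear zero-sum game u(x, y) = x * y, the joint payoff field is skew-symmetric, so the inner product of a change in strategy with the corresponding change in payoff is exactly zero, the boundary (monotone) case of a contractive game.

```python
# Joint payoff field for u(x, y) = x * y: each player's ascent direction.
def payoff_field(z):
    x, y = z
    return (y, -x)  # (du/dx for player 1, d(-u)/dy for player 2)

def inner(a, b):
    return sum(p * q for p, q in zip(a, b))

z1, z2 = (1.0, 2.0), (-0.5, 3.0)
dz = tuple(a - b for a, b in zip(z1, z2))
dF = tuple(a - b for a, b in zip(payoff_field(z1), payoff_field(z2)))

print(inner(dz, dF))  # 0.0 for any pair of points in this game
```

A strictly contractive game would make this inner product strictly negative, which is what gives the pairing with a passive learning rule its stability guarantee.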
Track command signals -- forecasting and no regret learning
Jeff breezed through this section, so I didn’t get as much out of it, but it reasons about the need for time-scale separation (and stability with respect to a delay) when one signal tracks another.