STA 4273 / CSC 2547 Spring 2018:
Learning Discrete Latent Structure
Overview
New inference methods allow us to train learn generative latentvariable models. These models can generate novel images and text, find meaningful latent representations of data, take advantage of large unlabeled datasets, and even let us do analogical reasoning automatically. However, most generative models such as GANs and variational autoencoders currently have prespecified model structure, and represent data using fixeddimensional continuous vectors. This seminar course will develop extensions to these approaches to learn model structure, and represent data using mixed discrete and continuous data structures such as lists of vectors, graphs, or even programs. The class will have a major project component, and will be run in a similar manner to Differentiable Inference and Generative Models
Prerequisites:
This course is designed to bring students to the current state of the art, so that ideally, their course projects can make a novel contribution. A previous course in machine learning such as CSC321, CSC411, CSC412, STA414, or ECE521 is strongly recommended. However, the only hard requirements are linear algebra, basic multivariate calculus, basics of working with probability, and basic programming skills.
To check if you have the background for this course, try taking this Quiz. If more than half the questions are too difficult, you might want to put some extra work into preparation.
Where and When
 Spring term, 2018
 Instructor: David Duvenaud
 Email: duvenaud@cs.toronto.edu (put “STA4273” in the subject)
 Location, office hours, teaching assistants: TBD
What is discete latent structure?
Loosely speaking, it referes to any discrete quantity that we wish to estimate or optimize. Concretely, in this course we’ll consider using gradientbased stochastic optimization to train models like:
 Variational autoencoders with latent binary vectors, mixture models, or lists of vectors
 Differentiable versions of stacks, deques, and Turing machines
 Generative models of text, graphs, and programs
 Treestructured recursive neural networks
Why discrete latent struture?
 Computational efficency  Making models fully differentiable sometimes requires us to sum over all possiblities to compute gradients, for instance in soft attention models. Making hard choices about which computation to perform breaks differentiability, but is faster and requires less memory.
 Reinforcement learning  In many domains, the set of possible actions is discrete. Planning and learning in these domains requires integrating over possible future actions.
 Interpretability and Communication  Models with millions of continuous parameters, or vectorvalued latent states, are usually hard to interpret. Discrete structure is easier to communicate using language. Conversely, communicating using words is an example of learning and planning in a discrete domain.
Why not discrete latent struture?
 It’s hard to compute gradients  It’s hard to estimate gradients through functions of discrete random variables. It is so difficult that much of this course will be dedicated to investigating different techniques for doing so. Developing these techniques are an active research area, with several large developments in the last few years.
Course Structure
Aside from the first two and last two lectures, each week a different group of students will present on a set of related papers covering an aspect of these methods. I’ll provide guidance to each group about the content of these presentations.
Inclass discussion will center around understanding the strengths and weaknesses of these methods, their relationships, possible extensions, and experiments that might better illuminate their properties.
The hope is that these discussions will lead to actual research papers, or resources that will help others understand these approaches.
Grades will be based on:
 One assignment, designed to help you become familiar with unbiased gradient estimators, such as REINFORCE (also known as the scorefunction estimator) and REBAR
 Class presentations
 Project proposal
 Project presentation
 Project report and code
Project
Students can work on projects individually,in pairs, or even in triplets. The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and wellsupported your conclusions are. Full marks will require a novel contribution.
Each group of students will write a short (around 2 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nicetohaves if time allows, and a short review of related work. You don’t have to do what your project proposal says  the point of the proposal is mainly to have a plan and to make it easy for me to give you feedback.
Towards the end of the course everyone will present their project in a short, roughly 5 minute, presentation.
At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NIPS.
Projects will be graded according to an updated version of Last year’s project report grading rubric
Tentative Schedule (Work in progress)
 The reparameterization trick  As a warmup, we’ll understand why and how people are moving away from the REINFORCE estimator.
 Dealing with nondifferentiability  Discrete variables makes gradient estimation hard, but there has been a lot of recent progress on developing unbiased gradient estimators.
 The original REINFORCE paper.
 Gradient Estimation Using Stochastic Computation Graphs
 REBAR: Lowvariance, unbiased gradient estimates for discrete latent variable models
 Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
 MuProp: Unbiased Backpropagation for Stochastic Neural Networks
 Neural Variational Inference and Learning in Belief Networks
 Variational inference for Monte Carlo objectives
 Categorical Reparameterization with GumbelSoftmax
 The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
 Differentiable Data Structures
 Discrete latent structures  Variational autoencoders and GANs typically use continuous latent variables, but there is recent work on getting them to use discrete random variables.
 The Helmholtz Machine  The forerunner of VAEs used binary latent variables.
 Attend, Infer, Repeat: Fast Scene Understanding with Generative Models  The latent variables can be a list or set of vectors.
 Composing graphical models with neural networks for structured representations and fast inference  the prior on latent variables can be any tractable graphical model, and we can use this inference as part of the recognition step.
 Learning Hard Alignments with Variational Inference
 Reinforcement learning
 Connecting Generative Adversarial Networks and ActorCritic Methods
 Evolution Strategies as a Scalable Alternative to Reinforcement Learning
 Emergence of Grounded Compositional Language in MultiAgent Populations
 ModelBased Planning in Discrete Action Spaces  “it is in fact possible to effectively perform planning via backprop in discrete action spaces”
 Adversarial training
 Adversarial Autoencoders  One surprisingly effective hack for training discrete random variables is to let them be continuous, and have a discriminator check if they’re discrete.
 Adversarially Regularized Autoencoders for Generating Discrete Structures
 GANS for Sequences of Discrete Elements with the Gumbelsoftmax Distribution
 Bayesian nonparametrics  models of infinitelylarge discrete objects.
 Learning model structure
 The discovery of structural form  put a grammar on model structures and built a different model for each dataset automatically.
 Exploiting compositionality to explore a large space of model structures  another systematic search through model structure using a grammar.
 Bayesian Compression for Deep Learning  putting a sparse prior on a neural network’s weights is a principled way to learn its structure.
 Latentvariable language models
 Breaking Sticks and Ambiguities with Adaptive Skipgram  word2vec with multiple meanings for each word.
 Program Synthesis for CharacterLevel Language Modeling
 Hierarchical Multiscale Recurrent Neural Networks
 Program induction
 Solomonoff Induction
 Probabilistic programming  Automatic inference in arbitary models specified by a program.
 Sampling for Bayesian Program Learning
 Programming with a Differentiable Forth Interpreter

Project presentations I
 Project presentations II