STA 4273 / CSC 2547 Spring 2018:

Learning Discrete Latent Structure

Overview

New inference methods allow us to train generative latent-variable models. These models can generate novel images and text, find meaningful latent representations of data, take advantage of large unlabeled datasets, and even let us do analogical reasoning automatically. However, most generative models, such as GANs and variational autoencoders, currently have a pre-specified model structure and represent data using fixed-dimensional continuous vectors. This seminar course will develop extensions to these approaches to learn model structure, and represent data using mixed discrete and continuous data structures such as lists of vectors, graphs, or even programs. The class will have a major project component, and will be run in a similar manner to Differentiable Inference and Generative Models.

Prerequisites:

This course is designed to bring students to the current state of the art, so that ideally, their course projects can make a novel contribution. A previous course in machine learning such as CSC321, CSC411, CSC412, STA414, or ECE521 is strongly recommended. However, the only hard requirements are linear algebra, basic multivariate calculus, basics of working with probability, and basic programming skills.

To check if you have the background for this course, try taking this Quiz. If more than half the questions are too difficult, you might want to put some extra work into preparation.

Where and When

Spring term, 2018
Instructor: David Duvenaud
Email: duvenaud@cs.toronto.edu (put “STA4273” in the subject)
Location: Galbraith 119
Time: Fridays, 2-4pm
Office hours: Mondays, 3:30-4:30pm, in 384 Pratt
Piazza: https://piazza.com/utoronto.ca/winter2018/csc2547/

What is discrete latent structure?

Loosely speaking, it refers to any discrete quantity that we wish to estimate or optimize. Concretely, in this course we’ll consider using gradient-based stochastic optimization to train models like:

Variational autoencoders with latent binary vectors, mixture models, or lists of vectors
Differentiable versions of stacks, deques, and Turing machines
Generative models of text, graphs, and programs
Tree-structured recursive neural networks

Why discrete latent structure?

Computational efficiency - Making models fully differentiable sometimes requires us to sum over all possibilities to compute gradients, for instance in soft attention models. Making hard choices about which computation to perform breaks differentiability, but is faster and requires less memory.
Reinforcement learning - In many domains, the set of possible actions is discrete. Planning and learning in these domains requires integrating over possible future actions.
Interpretability and Communication - Models with millions of continuous parameters, or vector-valued latent states, are usually hard to interpret. Discrete structure is easier to communicate using language. Conversely, communicating using words is an example of learning and planning in a discrete domain.

Why not discrete latent structure?

It’s hard to compute gradients - It’s hard to estimate gradients through functions of discrete random variables. It is so difficult that much of this course will be dedicated to investigating different techniques for doing so. Developing these techniques is an active research area, with several large developments in the last few years.
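
To make the difficulty concrete, here is a minimal sketch of the score-function (REINFORCE) estimator for a single Bernoulli variable. The loss f, the parameter value 0.3, and the sample count are illustrative assumptions, not course material. Because b is discrete, f cannot be differentiated with respect to b, so the estimator weights f(b) by the gradient of log p(b | theta) instead.

```python
import random


def f(b):
    # Hypothetical downstream loss; any function of the discrete sample works.
    return (b - 0.45) ** 2


def exact_gradient(theta):
    # E[f(b)] = theta * f(1) + (1 - theta) * f(0), so d/dtheta = f(1) - f(0).
    # For this one-variable case the derivative does not depend on theta.
    return f(1) - f(0)


def score_function_gradient(theta, num_samples, rng):
    # Monte Carlo average of f(b) * d/dtheta log p(b | theta).
    total = 0.0
    for _ in range(num_samples):
        b = 1 if rng.random() < theta else 0
        # Derivative of log Bernoulli(b | theta) with respect to theta.
        score = b / theta - (1 - b) / (1 - theta)
        total += f(b) * score
    return total / num_samples


rng = random.Random(0)
theta = 0.3
estimate = score_function_gradient(theta, 200_000, rng)
print("exact:", exact_gradient(theta), "estimate:", estimate)
```

The estimate is unbiased, but it needs many samples to get close to the exact value even in this one-variable problem, which is the variance issue that the estimators studied in the course try to reduce.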

Course Structure

Aside from the first two and last two lectures, each week a different group of students will present on a set of related papers covering an aspect of these methods. I’ll provide guidance to each group about the content of these presentations.

In-class discussion will center around understanding the strengths and weaknesses of these methods, their relationships, possible extensions, and experiments that might better illuminate their properties.

The hope is that these discussions will lead to actual research papers, or resources that will help others understand these approaches.

Grades will be based on:

[15%] One assignment due Feb 4th.
[15%] Class presentations
[15%] Project proposal, due Feb 13th.
[15%] Project presentations, March 16th and 23rd. Rubric
[40%] Project report and code, due April 10th. Rubric

Submit assignments through Markus.

Project

Students can work on projects individually, in pairs, or even in triplets. The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and how well-supported your conclusions are. Full marks will require a novel contribution.

Each group of students will write a short (around 2 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work. You don’t have to do what your project proposal says - the point of the proposal is mainly to have a plan and to make it easy for me to give you feedback.

Towards the end of the course, everyone will present their project in a short, roughly 5-minute presentation.

At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NIPS. Rubric

Tentative Schedule

Week 1 - Jan 12th - Optimization, integration, and the reparameterization trick

This lecture will set the scope of the course, the different settings where discrete structure must be estimated or chosen, and the main existing approaches. As a warm-up, we’ll look at the REINFORCE and reparameterization gradient estimators.
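
As a rough illustration of the two warm-up estimators, the sketch below compares them on a toy problem: the gradient of E[x^2] with respect to mu, where x ~ N(mu, sigma^2), whose true value is 2 * mu. The test function, the values of mu and sigma, and the sample count are assumptions chosen for illustration.

```python
import random


def f(x):
    return x * x


def df(x):
    # Derivative of f, needed only by the reparameterization estimator.
    return 2.0 * x


def reinforce_samples(mu, sigma, n, rng):
    # Score-function estimator: f(x) * d/dmu log N(x | mu, sigma^2).
    out = []
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        out.append(f(x) * (x - mu) / sigma ** 2)
    return out


def reparam_samples(mu, sigma, n, rng):
    # Reparameterization: x = mu + sigma * eps, so dx/dmu = 1.
    out = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        out.append(df(x))
    return out


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v


rng = random.Random(0)
mu, sigma, n = 1.5, 1.0, 100_000
reinforce_mean, reinforce_var = mean_and_variance(reinforce_samples(mu, sigma, n, rng))
reparam_mean, reparam_var = mean_and_variance(reparam_samples(mu, sigma, n, rng))
print("true gradient:", 2 * mu)
print("REINFORCE mean/variance:", reinforce_mean, reinforce_var)
print("reparameterization mean/variance:", reparam_mean, reparam_var)
```

Both estimators are unbiased here, but the reparameterization estimator uses the derivative of f and has much lower per-sample variance on this problem. It also requires x to be a differentiable function of the parameters, which is exactly what fails for discrete variables.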

Lecture 1 slides

Week 2 - Jan 19th - Gradient estimators for non-differentiable computation graphs

Lecture 2 slides

Discrete variables make gradient estimation hard, but there has been a lot of recent progress on developing unbiased gradient estimators.

Recommended reading:

Material that will be covered:

The original REINFORCE paper.
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables - a simple trick: turn all the step functions into sigmoids, and use backprop to get a biased gradient estimate.
Categorical Reparameterization with Gumbel-Softmax - the exact same idea as the Concrete distribution, published simultaneously.
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models - fixes the concrete estimator to make it unbiased, and also gives a way to tune the temperature automatically.
Stochastic Backpropagation and Approximate Inference in Deep Generative Models - one of the modern explanations of the reparameterization trick.
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
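
The relaxation used by the Concrete and Gumbel-Softmax papers listed above can be sketched in a few lines. This is a minimal standard-library version under stated assumptions: the logits, temperatures, and helper names are illustrative, and no gradients are computed. Adding Gumbel noise to the logits and taking an argmax gives an exact categorical sample; replacing the argmax with a temperature-scaled softmax gives a continuous, differentiable approximation.

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) noise via inverse transform; clamp to avoid log(0).
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def relaxed_categorical_sample(logits, temperature, rng):
    # Softmax of (logits + Gumbel noise) / temperature.
    # As temperature -> 0 the output approaches a one-hot vector.
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    return softmax(noisy)


rng = random.Random(0)
logits = [0.0, 1.0, 2.0]

soft = relaxed_categorical_sample(logits, temperature=1.0, rng=rng)
sharp = relaxed_categorical_sample(logits, temperature=0.1, rng=rng)
print("temperature 1.0:", soft)
print("temperature 0.1:", sharp)

# The argmax of a relaxed sample is distributed as softmax(logits).
counts = [0, 0, 0]
num_draws = 20_000
for _ in range(num_draws):
    y = relaxed_categorical_sample(logits, temperature=0.5, rng=rng)
    counts[y.index(max(y))] += 1
frequencies = [c / num_draws for c in counts]
print("argmax frequencies:", frequencies)
print("softmax(logits):   ", softmax(logits))
```

Lower temperatures give samples closer to one-hot vectors, at the cost of higher-variance gradients when the relaxation is used inside backprop.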

Related work:

MuProp: Unbiased Backpropagation for Stochastic Neural Networks - another unbiased gradient estimator based on a Taylor expansion.
The Generalized Reparameterization Gradient - shows how to partially reparameterize some otherwise un-reparameterizable distributions.
Developing Bug-Free Machine Learning Systems With Formal Mathematics - shows how to use formal tools to verify that a gradient estimator is unbiased.

Week 3 - Jan 26th - Deep Reinforcement Learning and Evolution Strategies

Slides:

Recommended reading:

A Visual Guide to Evolution Strategies
Evolution Strategies as a Scalable Alternative to Reinforcement Learning - replaces the exact gradient inside of REINFORCE with another call to REINFORCE.

Material that will be covered:

Optimization by Variational Bounding
Natural Evolution Strategies
On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent - shows that ES might work in high dimensions because most of the dimensions don’t usually matter.
Model-Based Planning in Discrete Action Spaces - “it is in fact possible to effectively perform planning via backprop in discrete action spaces”
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic - learns a linear surrogate function off-policy.
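
The evolution-strategies readings above estimate a search gradient from function evaluations alone. The following is a minimal sketch of that idea, assuming a toy quadratic objective, a fixed perturbation scale sigma, and antithetic (mirrored) perturbation pairs as a variance-reduction choice. The function names, target, and hyperparameters are all hypothetical.

```python
import random


def objective(theta):
    # Black-box objective to maximize; ES only ever calls it, never differentiates it.
    target = [1.0, -2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def es_gradient(theta, sigma, num_pairs, rng):
    # Estimates the gradient of E_eps[objective(theta + sigma * eps)],
    # eps ~ N(0, I), using mirrored perturbations +eps and -eps.
    grad = [0.0] * len(theta)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = objective([t + sigma * e for t, e in zip(theta, eps)])
        minus = objective([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * num_pairs)
    return grad


rng = random.Random(0)
theta = [0.0, 0.0]
learning_rate = 0.05
for _ in range(200):
    g = es_gradient(theta, sigma=0.1, num_pairs=50, rng=rng)
    # Gradient ascent step on the smoothed objective.
    theta = [t + learning_rate * gi for t, gi in zip(theta, g)]

print("final theta:", theta)
```

Since only objective values are used, the same loop would run unchanged if the objective contained discrete or otherwise non-differentiable steps, although the estimate can become much noisier.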

Week 4 - Feb 2nd - Differentiable Data Structures and Adaptive Computation

Attempts to learn programs using gradient-based methods, and program induction in general.

Slides:

Recommended reading:

Other material:

Pointer Networks
Reinforcement Learning Neural Turing Machines - attempts to train the NTM with REINFORCE.
Recurrent Models of Visual Attention - trains a hard attention model inside an RNN.
Programming with a Differentiable Forth Interpreter
Sampling for Bayesian Program Learning
Neural Sketch Learning for Conditional Program Generation
Adaptive Computation Time for Recurrent Neural Networks
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
Divide and Conquer Networks
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Week 5 - Feb 9th - Discrete latent structure

Variational autoencoders and GANs typically use continuous latent variables, but there is recent work on getting them to use discrete random variables.

Slides:

Recommended reading: