STA 4273 / CSC 2547 Spring 2018:

# Learning Discrete Latent Structure

## Overview

New inference methods allow us to train learn generative latent-variable models. These models can generate novel images and text, find meaningful latent representations of data, take advantage of large unlabeled datasets, and even let us do analogical reasoning automatically. However, most generative models such as GANs and variational autoencoders currently have pre-specified model structure, and represent data using fixed-dimensional continuous vectors. This seminar course will develop extensions to these approaches to learn model structure, and represent data using mixed discrete and continuous data structures such as lists of vectors, graphs, or even programs. The class will have a major project component, and will be run in a similar manner to Differentiable Inference and Generative Models

## Prerequisites:

This course is designed to bring students to the current state of the art, so that ideally, their course projects can make a novel contribution. A previous course in machine learning such as CSC321, CSC411, CSC412, STA414, or ECE521 is strongly recommended. However, the only hard requirements are linear algebra, basic multivariate calculus, basics of working with probability, and basic programming skills.

To check if you have the background for this course, try taking this Quiz. If more than half the questions are too difficult, you might want to put some extra work into preparation.

### Where and When

- Spring term, 2018
- Instructor: David Duvenaud
- Email: duvenaud@cs.toronto.edu (put “STA4273” in the subject)
- Location: Galbraith 119
- Time: Fridays, 2-4pm
- Office hours: Mondays, 3:30-4:30pm, in 384 Pratt
- Piazza: https://piazza.com/utoronto.ca/winter2018/csc2547/

## What is discrete latent structure?

Loosely speaking, it referes to any discrete quantity that we wish to estimate or optimize. Concretely, in this course we’ll consider using gradient-based stochastic optimization to train models like:

- Variational autoencoders with latent binary vectors, mixture models, or lists of vectors
- Differentiable versions of stacks, deques, and Turing machines
- Generative models of text, graphs, and programs
- Tree-structured recursive neural networks

## Why discrete latent struture?

**Computational efficency**- Making models fully differentiable sometimes requires us to sum over all possiblities to compute gradients, for instance in soft attention models. Making hard choices about which computation to perform breaks differentiability, but is faster and requires less memory.**Reinforcement learning**- In many domains, the set of possible actions is discrete. Planning and learning in these domains requires integrating over possible future actions.**Interpretability and Communication**- Models with millions of continuous parameters, or vector-valued latent states, are usually hard to interpret. Discrete structure is easier to communicate using language. Conversely, communicating using words is an example of learning and planning in a discrete domain.

## Why not discrete latent struture?

**It’s hard to compute gradients**- It’s hard to estimate gradients through functions of discrete random variables. It is so difficult that much of this course will be dedicated to investigating different techniques for doing so. Developing these techniques are an active research area, with several large developments in the last few years.

## Course Structure

Aside from the first two and last two lectures, each week a different group of students will present on a set of related papers covering an aspect of these methods. I’ll provide guidance to each group about the content of these presentations.

In-class discussion will center around understanding the strengths and weaknesses of these methods, their relationships, possible extensions, and experiments that might better illuminate their properties.

The hope is that these discussions will lead to actual research papers, or resources that will help others understand these approaches.

Grades will be based on:

- [15%] One assignment due Feb 4th.
- [15%] Class presentations
- [15%] Project proposal, due Feb 13th.
- [15%] Project presentations, March 16th and 23rd. Rubric
- [40%] Project report and code, due April 10th. Rubric

Submit assignments through Markus.

### Project

Students can work on projects individually,in pairs, or even in triplets. The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and well-supported your conclusions are. Full marks will require a novel contribution.

Each group of students will write a short (around 2 pages) research project proposal, which ideally will be structured similarly to a standard paper.
It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work.
You don’t have to do what your project proposal says - the point of the proposal is mainly to have *a* plan and to make it easy for me to give you feedback.

Towards the end of the course everyone will present their project in a short, roughly 5 minute, presentation.

At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NIPS. Rubric

## Tentative Schedule

### Week 1 - Jan 12th - Optimization, integration, and the reparameterization trick

This lecture will set the scope of the course, the different settings where discrete structure must be estimated or chosen, and the main existing approaches. As a warm-up, we’ll look at the REINFORCE and reparameterization gradient estimators.

### Week 2 - Jan 19th - Gradient estimators for non-differentiable computation graphs

Discrete variables makes gradient estimation hard, but there has been a lot of recent progress on developing unbiased gradient estimators.

Recommended reading:

- Gradient Estimation Using Stochastic Computation Graphs
- Backpropagation through the Void: Optimizing control variates for black-box gradient estimation code

Material that will be covered:

- The original REINFORCE paper.
- The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables - a simple trick: turn all the step functions into sigmoids, and use backprop to get a biased gradient estimate.
- Categorical Reparameterization with Gumbel-Softmax - the exact same idea as the Concrete distribution, published simultaneously.
- REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models - fixes the concrete estimator to make it unbiased, and also gives a way to tune the temperature automatically.
- Stochastic Backpropagation and Approximate Inference in Deep Generative Models - one of the modern explanations of the reparameterization trick.
- Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Related work:

- MuProp: Unbiased Backpropagation for Stochastic Neural Networks - another unbiased gradient estimator based on a Taylor expansion.
- The Generalized Reparameterization Gradient - shows how to partially reparameterize some otherwise un-reparameterizable distributions.
- Developing Bug-Free Machine Learning Systems With Formal Mathematics - shows how to use formal tools to verify that a gradient estimator is unbiased.

### Week 3 - Jan 26th - Deep Reinforcement learning and Evolution Strategies

Slides:

- Introduction to Evolutionary Strategy Algorithms
- Variational Optimization and Natural Evolution Strategies
- Variational Optimization Examples
- Evolution Strategies vs SGD
- Intro to Deep Reinforcement Learning slides

Recommended reading:

- A Visual Guide to Evolution Strategies
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning - replaces the exact gradient inside of REINFORCE with another call to REINFORCE.

Material that will be covered:

- Optimization by Variational Bounding
- Natural Evolution Strategies
- On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent - shows that ES might work in high dimensions because most of the dimensions don’t usually matter.
- Model-Based Planning in Discrete Action Spaces - “it is in fact possible to effectively perform planning via backprop in discrete action spaces”
- Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic - learns a linear surrogate function off-policy.

### Week 4 - Feb 2nd - Differentiable Data Structures and Adaptive Computation

Attempts learn programs using gradient-based methods, and program induction in general.

Slides:

Recommended reading:

Other material:

- Pointer Networks
- Reinforcement Learning Neural Turing Machines - attempts to train the NTM with REINFORCE.
- Recurrent Models of Visual Attention - trains a hard attention model inside an RNN.
- Programming with a Differentiable Forth Interpreter
- Sampling for Bayesian Program Learning
- Neural Sketch Learning for Conditional Program Generation
- Adaptive Computation Time for Recurrent Neural Networks
- Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
- Divide and Conquer Networks
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

### Week 5 - Feb 9th - Discrete latent structure

Variational autoencoders and GANs typically use continuous latent variables, but there is recent work on getting them to use discrete random variables.

Slides:

Recommended reading:

- Attend, Infer, Repeat: Fast Scene Understanding with Generative Models - The latent variables can be a list or set of vectors.
- Thinking Fast and Slow with Deep Learning and Tree Search
- Efficient Nonmyopic Active Search

Other material:

- The Helmholtz Machine - The forerunner of VAEs used binary latent variables.
- Composing graphical models with neural networks for structured representations and fast inference the prior on latent variables can be any tractable graphical model, and we can use this inference as part of the recognition step.
- Learning Hard Alignments with Variational Inference - in machine translation, the alignment between input and output words can be treated as a discrete latent variable.
- Neural Discrete Representation Learning - trains an RNN with discrete hidden units, using the straigh-through estimator.
- Neural Variational Inference and Learning in Belief Networks
- Variational inference for Monte Carlo objectives

### Week 6 - Feb 16th - Adversarial training and text models

It’s not obvious how to train GANs to produce discrete structures, because this cuts off the gradient to the discriminator.

Slides:

- MaskGAN
- Adversarial autoencoders
- Adversarially Regularized Autoencoders
- Generating and designing DNA with deep generative models
- Hierarchical Multiscale Recurrent Neural Networks

Recommended reading:

- Connecting Generative Adversarial Networks and Actor-Critic Methods
- Hierarchical Multiscale Recurrent Neural Networks

Other material:

- Adversarial Autoencoders - One surprisingly effective hack for training discrete random variables is to let them be continuous, and have a discriminator check if they’re discrete.
- Adversarially Regularized Autoencoders for Generating Discrete Structures
- GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution
- Program Synthesis for Character-Level Language Modeling
- Generating and designing DNA with deep generative models
- Emergence of Grounded Compositional Language in Multi-Agent Populations

### Week 7 - Feb 23rd - Bayesian nonparametrics

Models of infinitely-large discrete objects.

Slides:

- Introduction to Bayesian Nonparametrics
- Introduction to Probabilistic Programming
- Latent LSTM Allocation
- Nonparametric Variational Auto-encoders for Hierarchical Representation Learning

Recommended reading:

- Slides on Bayesian nonparametrics
- Lecture notes on Bayesian nonparametrics
- Breaking Sticks and Ambiguities with Adaptive Skip-gram - word2vec with multiple meanings for each word.

Related material:

- Warped Mixtures for Nonparametric Cluster Shapes
- Structure Discovery in Nonparametric Regression through Compositional Kernel Search
- Learning the Structure of Deep Sparse Graphical Models
- Probabilistic programming - Automatic inference in arbitary models specified by a program.
- Latent LSTM Allocation: Joint Clustering and Non-Linear Dynamic Modeling of Sequential Data

### Week 8 - March 2nd - Learning model structure

Slides:

- Additive Structure for Bayesian Optimization
- Program Synthesis for Character Level Language Modelling
- Learning a SAT Solver from Single-Bit Supervision
- Efficient Neural Architecture Search via Parameter Sharing
- Learning Sparse Neural Networks through L0 Regularization
- SMASH: One-Shot Model Architecture Search through HyperNetworks

Recommended reading:

- Learning Sparse Neural Networks through L0 Regularization
- Efficient Neural Architecture Search via Parameter Sharing
- SparseMAP: Differentiable Sparse Structured Inference
- The discovery of structural form - put a grammar on model structures and built a different model for each dataset automatically.
- Exploiting compositionality to explore a large space of model structures - another systematic search through model structure using a grammar.

Related material:

- The Emergence of Organizing Structure in Conceptual Representation - an updated version of “the discovery of structural form”.
- Bayesian Compression for Deep Learning - putting a sparse prior on a neural network’s weights is a principled way to learn its structure.
- SMASH: One-Shot Model Architecture Search through HyperNetworks
- Learning a SAT Solver from Single-Bit Supervision

### Week 9 - March 9th - Graphs, permutations and parse trees

Slides:

- Thinking Fast and Slow with Deep Learning and Tree Search
- GraphRNN: A Deep Generative Model for Graphs
- Grammar VAE and Syntax-Directed VAE
- Learning Latent Permutations with Gumbel-Sinkhorn Networks Demo notebook!

Recommended reading:

- Learning Latent Permutations with Gumbel-Sinkhorn Networks
- Grammar Variational Autoencoder
- Junction Tree Variational Autoencoder for Molecular Graph Generation

Related material:

- Reparameterizing the Birkhoff Polytope for Variational Permutation Inference
- Syntax-Directed Variational Autoencoder for Structured Data - a followup to the Grammar VAE
- Automatic chemical design using a data-driven continuous representation of molecules
- Learning to Compose Words into Sentences with Reinforcement Learning
- Learning to Represent Programs with Graphs

### Weeks 10 and 11 - March 16th and 23rd - Project presentations

The last two weeks were project presentations, 38 in total. A few students were brave enough to have their slides online:

- Character-level Language Models With Word-level Learning
- Discretizing Neural Turing Machines
- GANs for Word Embeddings

### Epilogue: Some projects that developed out of this course

Some of the course projects have turned into papers:

- Adversarial Contrastive Estimation, published in Association for Computational Linguistics, 2018
- Discretely Relaxing Continuous Variables for tractable Variational Inference, in NIPS 2018
- Towards Understanding Linear Word Analogies, under review.