Research notes

posted by Deepak
on Apr 13, 2018

Training RNNs (recurrent neural networks) on long sequences of data is usually done via a technique called truncated backpropagation through time (bptt). Standard backpropagation (backprop), for non-recurrent/sequential neural networks, is when we update (train) the parameters of the model by taking the derivative of the cost function applied to the outputs of the network (for a batch of training data) w.r.t. the parameters (outermost layer first and then the inner layers one-by-one using the chain rule—hence *backpropagation*), and multiplying it by a small negative step-size (so the new parameters yield a smaller cost). For RNNs, since the output at any point in a sequence is dependent on the prior points in a sequence, backprop doesn’t stop with just the present step but goes all the way back to the start of the sequence (hence *bptt*). If the sequences are long, we can run the backprop every `k`

steps, setting the initial hidden state of the RNN at what it was before the `k`

steps.

The problem is, while we want to train with the outputs of a whole batch at a time (e.g., 100 training data samples at a time), in general the batch samples are unlikely to all have the same sequence length. The canonical solution to this is to add dummy blank entries to the shorter sequences, padding them so that all sequences in the batch end up being as long as the longest one, and have the cost computed in a way so the blanks don’t matter. See `torch.nn.utils.rnn.pack_padded_sequence`

in PyTorch, for example.

**I present a more efficient approach: don’t pad sequences; just dynamically swap out sequences as they end, replacing them in their batch slot with the next training data sample.** This is tricky but doable in PyTorch, whose imperative approach makes the triple-nested `for`

loop at the heart of the algorithm possible.

Here is the code:

```
def train_bptt(decoder, contexts, targets, visible_seed, hidden_seed,
optimizer, batch_size, bptt, cost_fn=nn.PairwiseDistance()):
"""
Train a nn.Module object using given optimizer and training data sequences,
using backpropagation through time (bptt), <bptt> sequence steps per
optimization step.
Unlike typical implementations, sequences in each batch are not required
to have the same length, and no <blank> padding is done to enforce that.
Shorter sequences are swapped out of the batch dynamically, replaced with
next sequence in the data. This makes training more efficient.
Trains one epoch, returns computed loss at each step, as a list.
decoder:
nn.Module object that implements a
.forward(visible, context, hidden)
method which takes in current visible features, a context vector, and
current hidden state, and returns the next visible output features.
i.e. Like a standard RNN which takes in an additional context vector at
each step.
contexts, targets:
lists of sequence data; containing context vectors and target output values.
contexts[i] must have same length as targets[i].
i.e. input sequences must have same length as output sequences.
visible_seed, hidden_seed:
initial feature vector to jumpstart the sequence, and initial hidden state
for start of the sequence. Of shape [<features>] and
[<num_layers>, <n_hidden>] respectively.
optimizer:
a torch.optim object, e.g. initialized by torch.optim.Adam(...).
batch_size, bptt:
Number of sequences to train concurrently, and number of steps to proceed
in sequence before calling an optimization step.
"""
# indices holds sequence index in each slot of the batch.
indices = np.arange(batch_size)
# positions holds, for each sequence in the batch, the position where
# we are currently at.
positions = [0 for i in range(batch_size)]
visible = torch.stack([visible_seed for i in range(batch_size)]).unsqueeze(0)
hiddens = torch.stack([hidden_seed for i in range(batch_size)], 1)
losses = []
marker = batch_size
while marker < len(targets):
optimizer.zero_grad()
# The following two lists hold output of the decoder, and
# the values they are to be costed against.
predicted, actual = [], []
for counter in range(bptt):
inputs = torch.stack([contexts[i][p]
for i, p in zip(indices, positions)])
outputs, hiddens = decoder(visible, inputs, hiddens)
predicted.append(outputs[0])
visible = outputs.clone() # can implement teacher forcing here.
actual.append(torch.stack([targets[i][p]
for i, p in zip(indices, positions)]))
for b, index, position in zip(list(range(batch_size)),
indices, positions):
if len(targets[index]) > position + 1:
positions[b] += 1
else: #load next sequence
marker += 1
# we wrap around to start of dataset, if some long
# seqence near end of dataset isn't done yet.
indices[b] = marker % len(targets)
positions[b] = 0
visible[0, b] = visible_seed
hiddens[:, b] = hidden_seed
loss = torch.mean(cost_fn(torch.cat(predicted), torch.cat(actual)))
loss.backward()
torch.nn.utils.clip_grad_norm(decoder.parameters(), 1.)
optimizer.step()
losses.append(loss.data[0])
# The following frees up memory, discarding computation graph
# in between optimization steps.
visible = visible.detach()
hiddens = hiddens.detach()
return losses
```

To describe in summary the three nested-loops that make up the algorithm: The code calls for an optimization every `bptt`

steps, on the `batch_size`

of data samples being processed. Whenever a data sample runs out (its sequence ended), it gets swapped out for the next one—this gets checked and done every step. Steps proceed until the very last sequence in the training data ends^{1}.

This code is available on one of my github pages. Cheers.

^{1} So data samples at the start of the training data list do get sampled twice, to keep the batch slots filled until the last sequence is exhausted; shuffle your training data between epochs (which is advisable in any case) to avoid a bias.

posted by Deepak
on Jul 19, 2017

A Restricted Boltzmann Machine (RBM) is a fully-connected layer in a neural network that is trained as an autoencoder via a special method called contrastive divergence. See the RBM section at deeplearning.net for the theory. Unlike a simple feed-forward network trained as an autoencoder, an RBM is a generative model that can be used to a) generate samples from the data distribution, and b) produce log-likelihood values for a given input sample. I have implemented a variety of RBM types in my TensorFlow based `comprehend`

package on github. Here I test them on the Labelled Faces in the Wild (LFW) dataset^{1}. They do learn what typical faces look like, generating samples of them from scratch. They are also able to move a given actual face towards a more typical representation.

*Fig. 1: 64 random faces from the LFW dataset, after preprocessing*

Figure 1 shows a sampling from the LFW dataset, after they have been frontalized^{2}, converted to grayscale, resized to 64x64 pixels, with pixel values normalized to have mean `0.0`

, standard deviation `0.3`

and clipped to range `[-1.0, 1.0]`

. These processed images, 13200 of them, form our input data. For the purpose of dimension reduction, and to take advantage of the local structure in the data, I first use a convolutional neural network (CNN) before adding an RBM layer on top. The CNN has five layers, with the first four having a convolution stride of 2 doing the bulk of the dimension reduction. The input layer, i.e. the images, with its dimension of `64x64x1=4096`

is reduced to `4x4x32=512`

at the output layer of the CNN. I train the CNN layer by layer as an autoencoder with L2 cost, using standard back propagation and the Adam optimizer.

The output of the CNN, i.e. the hidden values of the top layer, are then saved to disk and serve as the input data for a single layer RBM, of 500 hidden units. RBMs use a sigmoid activation function, and so values in the range `[0.0, 1.0]`

. This is easily addressed by a simple linear transformation of the input data. The harder issue to deal with is the fact that vanilla RBMs model probabilities of binary input values: the data is presumed to be either very close to 0.0 or very close to 1.0. When training an RBM, values are sampled to be either 0 or 1. This works great for the MNIST dataset, but not here where the input data is continuous. In fact, this is the main reason for this post: I wanted to test my modifications to vanilla RBMs to handle continuous-valued input (I had already tested vanilla RBMs with MNIST, where they work great).

Literature on continuous-valued RBMs is sparse; what I came across suggested sampling using a standard Gaussian with some standard deviation, e.g. 0.1, centered at the values output by the sigmoid activation function. This seemed not quite right to me as the resulting sampled values will require clipping to stay in range `[0.0, 1.0]`

. So I am using sampling from a beta distribution, whose range is naturally `[0.0, 1.0]`

, instead. See the `RRBM`

and `R3BM`

classes in the `networks.py`

module of my `comprehend`

package on github. There was some adaptation involved, as the beta distribution is not normally parameterized by its mean and standard deviation.

*Fig. 2: Faces from Figure 1 after autoencoding by the CNN-RBM network*

After the RBM was trained, it was stacked on top of the CNN (with the simple linear transformation in between the two to translate the ranges of values). Figure 2 shows the results when the faces from Figure 1 are fed as input to this CNN-RBM network (encoding), and output is fed back down (decoding) from the RBM on top to perform an autoencoding. The faces look regularized, towards an average-looking face—glasses and moustaches go missing. We can control the extent this happens by mixing the hidden values in the upward pass with those on the downward pass at some intermediate layer (not shown).

*Fig. 3: Faces generated from scratch by the RBM*

Now RBMs can also be used to generate samples. Starting with random input values, after 1000 iterations of Gibbs sampling, I obtain the faces in Figure 3 (when the RBM samples are reversed using the CNN). They look like real faces, though smoother and closer to the canonical face than the actual samples. To get more resolution, we would likely need a deep multi-layered RBM^{3}.

Now for some fun: let’s try interpolating faces. My initial intention was to interpolate between faces at the RBM hidden layer level. But as the RBM likes to typicalize faces, losing distinctiveness, I here do it at the top CNN layer. Figure 4 below shows the result of mixing George W. Bush’s faces with Tony Blair’s. I believe we would get better results if we use a deep RBM and mix hidden values at the top of that—will leave that for another day.

*Fig. 4: Top row faces mixed with bottom row faces to get middle row faces*

^{1} Erik Learned-Miller, Gary B. Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua.
*Labeled Faces in the Wild: A Survey.* Advances in Face Detection and Facial Image Analysis, 2016.

^{2} Tal Hassner, Shai Harel, Eran Paz and Roee Enbar. *Effective Face Frontalization in Unconstrained Images.* IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.

^{3} G. E. Hinton and R. R. Salakhutdinov. *Reducing the Dimensionality of
Data with Neural Networks.* Science, Vol. 313, 2006.

posted by Deepak
on Aug 05, 2016

How do we go about understanding what each hidden unit in a RNN is capturing? We are all familiar with the visualization of the first hidden layer weights of a vanilla feed-forward network, such as the header image of this blog. Such a visualization shows which feature of an input image activates the hidden unit. Here I present an analogous procedure for RNN weights.

For the RNN case, we seek to understand what sequence of inputs drives a hidden unit to activate. Say we have the RNN described in my earlier post,
with sequential rows of MNIST images as inputs, each input being a row of 28 pixel values (and the output is the predicted next row of 28 pixel values). For each hidden unit, can we figure out the sequence of **L** rows, for some **L**, say `L = 10`

, that best activate it?

The space of all possible sequences of 10 input rows is quite large. Furthermore, we are concerned mainly with sequences that appear in our data, i.e. the input distribution. Therefore, it makes sense to just sample a sufficient number of **L** row sequences from the input data itself, and see which of those activate the hidden unit in question. For our sequential MNIST RNN example, we can generate these samples by taking each MNIST image in the training set, and select **L** sequential rows starting from a random row index between `0`

and `28 - L`

.

Now we can construct a feature for each hidden unit in a number of ways. We can simply find the single sequence from the samples generated as described above that maximally activates the hidden unit, and display that as the visualization for the hidden unit. This approach has the problem that such a maximal sequence may have multiple features, some which are not relevant to the hidden unit. Another approach is to take a weighted average of all sequences, with the weights being the activation of the hidden unit after the corresponding sequence. Then the irrelevant features will get diluted out. This is the approach I implemented. I used the pre-sigmoid activation, floored at `0.0`

, as weights, to get a sharper average. See the figure below, for `L = 10`

.

There are 169 cells in the grid above, one for each hidden unit. In each cell, the top half is the pre-sigmoid-activation-weighted average `L = 10`

sequence; in other words, the 10-row height image the hidden unit likes to see. The bottom half shows the subsequent 10 rows that result when the RNN is rolled-forward after the input sequence, being fed the output it predicts after each row. This indicates what the RNN expects to see after the input image the hidden unit likes to see.

We can see that many of the hidden units activate after seeing the top half of the digit 0, and go on to predict the bottom half correctly. 9, 8, 7, 3 and 2 are well represented as well. The digit 1 appears to be missing; perhaps because the parameter **L** is not discriminating for it, and that affects results somehow. That leaves the digits 4, 5 and 6; suggesting that no single hidden unit by itself has captured the structure in any of these digits.

This visualization has now been implemented in my `comprehend`

package on github. Comments welcome.

posted by Deepak
on Jun 16, 2016

Pretty much everyone in the field has had a shot at training some artificial neural net on the MNIST database of handwritten digits—the 60,000 samples of 28x28 pixel grayscale images. The canonical task with this dataset is to predict the class label associated with each sample, one of `{0, 1, ..., 9}`

. Here, I present something a bit more novel: a recurrent neural net (RNN) that learns an autoencoding of the MNIST data, and is thereby able to complete a truncated version of the input.

RNNs operate on sequential input. So I convert each MNIST image into a sequence of 28 inputs, each input being a row of 28 pixel values. The RNN state (i.e. the hidden values) is reset to **0** before processing each image, otherwise changing and carrying over values between each row of inputs. The output of the RNN has the same dimension as that of the input—28—and is interpreted as the predicted next row of the input. So, when each MNIST image is processed, we have a sequence of **x**_{0…27} rows as input and a sequence of **y**_{0…27} rows as output. The cost minimized for training is then the difference between **x**_{1…27} and **y**_{0…26}. I used the cross entropy loss function here.

The above image shows test input and output of the RNN after training. The RNN had 169 nodes (13x13) in a single hidden layer, and used tied weights between the output layer and input layer. i.e. *W _{hy}* is set to

The image above shows results when the trained RNN is fed only the first half of test images, and is asked to extrapolate the rest. Extrapolation is by means of feeding the RNN successive rows of predicted values. i.e. **x**_{i+1} is set to **y**_{i} and we proceed until we get to 28 rows. We now have to conclude that the network did learn actual patterns in the data, and is able to complete a half-seen pattern. Of course, it makes mistakes. But when we look at the corresponding test input, we find that there is true ambiguity.

You can get some of the above results by running the command-line below, after installing my `comprehend`

package on github.

```
$ python test.py --model RNN --epochs 16 --hidden 169 --batch 20 \
--mosaic --verbose
```

posted by Deepak
on Jun 15, 2016

It is often useful to visualize the weights learnt by the first hidden layer of an artificial neural net. We can then see which features are being picked up by each node in the layer. The following image is such a visualization. It is a heatmap of the weights of each node in the first hidden layer of an RBM trained on the MNIST dataset. There are 450 nodes, tiled into 25 columns by 18 rows. We can see the nodes picking up patterns in the data; and we notice similarities between some of these features. Can we structure the presentation better to highlight these similarities?

It would be nice if we could cluster the 450 tiles such that similar tiles appear near each other in the visualization. We can pull out some heavy-duty clustering algo that works in the high dimensional space of the features (28x28 = 784 dimensions), and present the features one cluster at a time, as perhaps one row per cluster. This would work, but would probably need some bespoke attention every time we have a different set of tiles, as it is not quite a push-button procedure. Is there a simpler way?

We can compute the 450x450 pairwise correlation coefficients between the 450 features, and just arrange the features in a single line such that adjacent features have high correlation. We will then get sequences of clusters, that we can break at cluster boundaries where the running correlation has a dip. Formally:

Find an ordering

`O[i]`

for features`F[i]`

,`1 < i <= N`

features, such that the sum of correlation coefficients,`corr_coef(F[O[j+1]], F[O[j]])`

for`1 < j < N`

, is maximized.

This sounds familiar, doesn’t it? Because, it is equivalent to the Traveling Salesman Problem! The N features are the N cities the salesman needs to visit, the distances between the cities being the negative of the correlation coefficient between the features (so minimizing the distance traveled is the same as maximizing the sum correlations between the features in order).

I used a TSP solver on the weights shown above, and then wrapped the resulting ordering by row to generate the image below. Much neater, isn’t it?

The TSP solver, which gives an approximate solution, has *O(N^4)* computational complexity and uses *O(N^2)* memory. It took a minute to run. I have implemented a much simpler greedy algorithm, which is less optimal but runs in *O(N^2)* time and *O(N^2)* memory (i.e., same as what’s required for computing the correlation matrix), and runs with no noticeable delay. See `comprehend.features.corrsort()`

in my github project.

For further improvement, one can segment the ordered sequence of features into subsequences, i.e. clusters with high correlation within and lower correlation without. Then, add blank tiles to use as delimiters between subsequences, taking care to avoid linebreaking a subsequence. In other words, treat each subsequence like one would treat a word when typesetting. Center justification with ragged edges would probably look nicest. Any takers?