Multimodal Neural Language Models

Kiros, Salakhutdinov and Zemel,

Department of Computer Science,

University of Toronto

Paper review for CS671A

Amlan Kar

Neural Language Models

Aim to create distributed representations of tokens using neural networks

Distributed representation

Why ?

No hand-engineered features
Reduces the curse of dimensionality
Possibility of distance metrics that entail semantic similarity
Recent advances outperform n-gram models

Log-Bilinear models (Mnih & Hinton, 2007)

Log-Bilinear Model (LBL)

Feed-forward neural network with 1 hidden layer.
Works on distributed word models (word vectors)
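A minimal sketch of the log-bilinear forward pass, assuming the standard formulation of Mnih & Hinton (2007): the context word vectors are linearly combined into a predicted representation, which is then scored against every word vector. The names `R`, `C`, `b` and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def lbl_next_word_probs(context_ids, R, C, b):
    """Next-word distribution of a log-bilinear model (illustrative sketch).

    R: (K, D) word representation matrix, one row per vocabulary word
    C: (n_ctx, D, D) one context matrix per context position
    b: (K,) per-word bias
    """
    # Predicted representation: sum of linearly transformed context word vectors.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # Score every vocabulary word by its similarity to the prediction.
    scores = R @ r_hat + b
    scores = scores - scores.max()  # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy sizes (hypothetical): vocabulary K, embedding size D, context length n_ctx.
rng = np.random.default_rng(0)
K, D, n_ctx = 10, 4, 3
R = rng.normal(scale=0.1, size=(K, D))
C = rng.normal(scale=0.1, size=(n_ctx, D, D))
b = np.zeros(K)

probs = lbl_next_word_probs([1, 5, 7], R, C, b)
```

The single hidden layer is linear, so the only non-linearity is the softmax over the vocabulary.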

Bias Vector

Modality-Biased Log-Bilinear Model (MLBL-B)

For each training tuple of words, there is an associated feature vector x (from the added modality, e.g. an image)
x is treated as a simple additive bias on the predicted word representation
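The additive bias can be sketched as one extra term in the log-bilinear prediction. This assumes the modality vector is mapped into the word-vector space by a matrix, called `C_m` here; that name, like the toy sizes, is a label chosen for this sketch.

```python
import numpy as np

def mlbl_b_next_word_probs(context_ids, x, R, C, C_m, b):
    """Log-bilinear prediction with an additive modality bias (illustrative sketch).

    x:   (M,) feature vector of the added modality (e.g. image features)
    C_m: (D, M) projection of x into the word-vector space (assumed name)
    """
    # Usual log-bilinear prediction from the context words ...
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    # ... shifted by the projected modality vector.
    r_hat = r_hat + C_m @ x
    scores = R @ r_hat + b
    scores = scores - scores.max()  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
K, D, M, n_ctx = 10, 4, 6, 3
R = rng.normal(scale=0.5, size=(K, D))
C = rng.normal(scale=0.5, size=(n_ctx, D, D))
C_m = rng.normal(scale=0.5, size=(D, M))
b = np.zeros(K)
context = [1, 5, 7]

# A zero modality vector leaves the plain log-bilinear prediction unchanged.
p_no_image = mlbl_b_next_word_probs(context, np.zeros(M), R, C, C_m, b)
p_image = mlbl_b_next_word_probs(context, rng.normal(size=M), R, C, C_m, b)
```

The same context yields a different distribution once x is non-zero, which is all the "bias" amounts to in this model.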

Bias Vector

Factored 3-way Log-Bilinear Model (MLBL-F)

Uses a word representation matrix for each component of the added modality
But this is in 3D (a tensor) ! How do we get the final hidden-to-output layer weights ?

Weights determined by modality vector x

Intuition ?

Think of it as a weight matrix conditioned on the image input it is getting
Effectively biases context words dynamically based on added modality input
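A small sketch of this gating, assuming the weights are formed as a weighted sum of the per-component word matrices, with the components of x as the weights. The tensor name `R_tensor` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M = 6, 4, 3  # toy vocabulary size, word-vector size, modality size

# One K x D word representation matrix per component of the modality vector.
R_tensor = rng.normal(size=(M, K, D))
x = rng.normal(size=M)

# Weights selected by x: R_x = sum_i x[i] * R_tensor[i]
R_x = np.einsum('m,mkd->kd', x, R_tensor)

# Same computation written as an explicit loop, for comparison.
R_x_loop = sum(x[i] * R_tensor[i] for i in range(M))
```

Each image therefore gets its own K x D weight matrix, at the cost of storing all M slices.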

Problems ?

Using a 3-way tensor requires a cubic number of weights ( K x D x M )
M and D are generally high dimensional !
Puts constraints on vocabulary size K !

Solution ?

Factor R into three lower-rank matrices

where,
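A hedged numerical sketch of the factorization, assuming the form used in the paper: three matrices with F rows each, combined as W_fh^T diag(W_fx x) W_fr. The variable names and toy sizes below are assumptions; the check only shows that the factored form is a special case of the 3-way tensor and needs far fewer weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M, F = 50, 8, 6, 5  # toy sizes; F is the number of factors

W_fh = rng.normal(size=(F, K))  # factors <-> vocabulary
W_fx = rng.normal(size=(F, M))  # modality features -> factors
W_fr = rng.normal(size=(F, D))  # predicted representation -> factors
x = rng.normal(size=M)

# Gated weights from the factored form, shape (K, D).
R_x = W_fh.T @ np.diag(W_fx @ x) @ W_fr

# The equivalent tensor slices: R^(i) = W_fh^T diag(W_fx[:, i]) W_fr.
R_slices = np.stack([W_fh.T @ np.diag(W_fx[:, i]) @ W_fr for i in range(M)])
R_x_from_tensor = np.einsum('m,mkd->kd', x, R_slices)

# Weight counts: full tensor versus the three factor matrices.
full_params = K * D * M
factored_params = F * (K + D + M)
```

With these toy sizes the full tensor needs 2400 weights and the factored form 320; the gap widens as K, D and M grow.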


Tensor Factorization ? What is F ?

Image from Kohei Hayashi's talk on Generalization of Tensor Factorization and Applications

This E can now be treated as our word embedding space !

This is different as it now incorporates multimodal information !

Factor Outputs - Putting in the multimodality

This is basically the element-wise product of the word representation vector and the image representation vector, after each is projected to an F x 1 vector using the tensor factorization matrices.

We shall finally use f to generate the next word just like in the Log-Bilinear model

How ?

Use each column of
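A minimal sketch of the two steps above, under the assumption that the matrix meant here is the F x K factor-to-vocabulary matrix (called `W_fh` in this sketch, following the paper's notation). All names, the folded embedding matrix `E`, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def mlbl_f_next_word_probs(context_ids, x, W_fh, W_fx, W_fr, C, C_m, b):
    """Factored 3-way prediction (illustrative sketch, assumed notation)."""
    # Word embeddings folded out of the factor matrices, shape (D, K).
    E = W_fr.T @ W_fh
    # Predicted representation from the context words plus the modality bias.
    r_hat = sum(C[i] @ E[:, w] for i, w in enumerate(context_ids)) + C_m @ x
    # Factor outputs: element-wise product of the two F-dimensional projections.
    f = (W_fr @ r_hat) * (W_fx @ x)
    # Column i of W_fh scores vocabulary word i against the factor outputs.
    scores = W_fh.T @ f + b
    scores = scores - scores.max()  # numerical stability
    p = np.exp(scores)
    return f, p / p.sum()

rng = np.random.default_rng(0)
K, D, M, F, n_ctx = 12, 5, 4, 3, 2
W_fh = rng.normal(scale=0.5, size=(F, K))
W_fx = rng.normal(scale=0.5, size=(F, M))
W_fr = rng.normal(scale=0.5, size=(F, D))
C = rng.normal(scale=0.5, size=(n_ctx, D, D))
C_m = rng.normal(scale=0.5, size=(D, M))
b = np.zeros(K)
x = rng.normal(size=M)

f, probs = mlbl_f_next_word_probs([2, 9], x, W_fh, W_fx, W_fr, C, C_m, b)
```

Note that f is a vector with one entry per factor, not a scalar, and the softmax over the vocabulary mirrors the log-bilinear model.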

Joint Image-Text Feature Learning

Very difficult to train on each image-word pair
Given training images, learn a dictionary of features using spherical k-means on random rxr patches of the images
Convolve each image with the k_f learned features to obtain a 3-D matrix (one slice per feature)
On each slice of the matrix, perform max-pooling to obtain a GxG grid of most relevant features
This leaves us with a GxGxk_f input matrix for the CNN
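The steps above can be sketched on a single grayscale image. This is a simplified illustration, not the authors' pipeline: it assumes one channel, no whitening or other preprocessing, and a plain spherical k-means loop; all function names and sizes are hypothetical.

```python
import numpy as np

def extract_patches(image, r, n, rng):
    """Sample n random r x r patches and flatten each to a vector."""
    H, W = image.shape
    ys = rng.integers(0, H - r + 1, size=n)
    xs = rng.integers(0, W - r + 1, size=n)
    return np.stack([image[y:y + r, x:x + r].ravel() for y, x in zip(ys, xs)])

def spherical_kmeans(patches, k_f, n_iter=10, seed=0):
    """Learn k_f unit-norm dictionary features by spherical k-means."""
    rng = np.random.default_rng(seed)
    X = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    dictionary = X[rng.choice(len(X), size=k_f, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each patch to the feature with the largest cosine similarity.
        assign = np.argmax(X @ dictionary.T, axis=1)
        for j in range(k_f):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                dictionary[j] = c / (np.linalg.norm(c) + 1e-8)
    return dictionary

def feature_maps(image, dictionary, r):
    """Slide every feature over the image: output shape (H-r+1, W-r+1, k_f)."""
    windows = np.lib.stride_tricks.sliding_window_view(image, (r, r))
    flat = windows.reshape(windows.shape[0], windows.shape[1], r * r)
    return flat @ dictionary.T

def max_pool_grid(maps, G):
    """Max-pool each feature slice over a G x G grid of regions."""
    Hp, Wp, k_f = maps.shape
    ys = np.linspace(0, Hp, G + 1).astype(int)
    xs = np.linspace(0, Wp, G + 1).astype(int)
    pooled = np.empty((G, G, k_f))
    for i in range(G):
        for j in range(G):
            region = maps[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            pooled[i, j] = region.max(axis=(0, 1))
    return pooled

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # stand-in for a real training image
r, k_f, G = 6, 8, 3

patches = extract_patches(image, r, n=200, rng=rng)
dictionary = spherical_kmeans(patches, k_f)
pooled = max_pool_grid(feature_maps(image, dictionary, r), G)
```

The resulting `pooled` array has shape G x G x k_f, matching the input described in the last step.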

The final output of the CNN is used as the bias vector in the computation shown before.