Improving Language Understanding in Machines through Anticipation

Cornille, Nathan and Collell, Guillem and Moens, Marie-Francine


Abstract

Note: this abstract was written many months before the poster, due to a Covid-related delay of the conference. The poster presents conclusions, while this abstract presents the idea for an ongoing research project rather than experimental results.

Neuroscientific insights have inspired advances in machine learning on a number of occasions, such as hierarchical convolutional neural networks with increasingly large receptive fields, the idea of attention, and the idea of predictive coding. The latter inspires the model presented in this abstract, nicknamed Pervasive Internal Regression (PIR). PIR is an unsupervised representation learning method in which, in addition to learning to model the input, the model learns to predict the internal activations within _each_ layer at a future timestep. It is inspired by the observation that prediction-error signals in biological brains likewise occur at multiple processing levels. This is similar to Gradient-Isolated Learning, where Contrastive Predictive Coding (CPC) is applied at multiple blocks along the depth of the model. Unlike CPC, where the model only learns to distinguish activations at future timesteps from activations at unrelated timesteps, the predictive component in PIR learns to actually _generate_ the activations at future timesteps. To prevent the generation of degenerate, trivial-to-predict activations (e.g. all zeros), a loss is used that contrasts the distance to the activation at a future timestep with the distances to activations at the same layer but at unrelated timesteps, in a manner similar to CPC.

There are several hypothesized advantages to this approach. Firstly, at training time, the distributed loss offers the same parallelization advantage as Gradient-Isolated Learning: because different layers of the model receive separate loss signals, backpropagation does not need to be end-to-end. Secondly, at inference time, the predicted and actual activations can be merged. If the predictions are sufficiently accurate, they can be useful in several ways. First, in real-time settings, the predicted activations can be 'assumed', which can improve speed in a way similar to branch prediction in processors. Second, the size of the difference between predicted and actual activations can be an informative feature in itself.

Practically, the approach will be tested on several modalities with a sequential nature. Firstly, it will be tested on written language, where the SuperGLUE benchmark, which consists of NLP tasks deemed easy for humans but still hard for computers, will be used to evaluate performance; large datasets such as the BooksCorpus and the C4 corpus will serve as sources for unsupervised pretraining. Secondly, the LibriSpeech dataset can serve to evaluate performance on speech-related tasks, such as classifying speaker identity and predicting phone labels. Finally, the Penn Action Dataset can be used to evaluate performance on video prediction.
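
To make the per-layer objective concrete, the sketch below shows one way the contrastive prediction loss could look in PyTorch. It is a minimal illustration under stated assumptions, not the project's actual implementation: the names (PIRPredictor, pir_contrastive_loss), the linear predictor, and the use of other timesteps in the batch as negatives are all assumptions made for exposition.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PIRPredictor(nn.Module):
    """Predicts a layer's activation at a future timestep from the current one
    (assumed form: a simple linear map; the real predictor could be anything)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        return self.proj(h_t)


def pir_contrastive_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss: the predicted activation for timestep t+1 should be
    closer to the true activation at t+1 than to activations at the same layer
    but at unrelated timesteps (here, the other rows in the batch)."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t()            # (B, B) similarity matrix
    labels = torch.arange(pred.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


# Usage: for each layer l, predict h_l at t+1 from h_l at t. The loss stays
# local to that layer, so backpropagation does not need to be end-to-end.
if __name__ == "__main__":
    dim, batch = 64, 8
    predictor = PIRPredictor(dim)
    h_t = torch.randn(batch, dim)    # activations at layer l, timestep t
    h_tp1 = torch.randn(batch, dim)  # activations at layer l, timestep t+1
    loss = pir_contrastive_loss(predictor(h_t), h_tp1.detach())
    loss.backward()

Treating activations at unrelated timesteps as in-batch negatives mirrors the CPC-style contrast mentioned above; the difference is that the predictor regresses the future activation directly rather than only scoring it against negatives.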


Info

Publication Date: November 2020
URL: https://lirias.kuleuven.be/2996624?limo=0