WP3: Learning of Representations from Language and Visual Data

WP3 focuses on developing innovative algorithms that capture the intricate relational structure of language and the geometric and appearance facets of visual data. The emphasis is on anticipatory representations inspired by how the brain works. The objective is to create more robust and informative representations for both visual and textual data.

Text representation learning has a long research history; with CALCULUS we add to the discussion in this line of research [6,10], contribute evaluation benchmarks [13], and make algorithmic contributions, especially with regard to anticipatory models inspired by the brain. Specifically, several of our publications integrate predictive coding theory into the pretraining of neural networks [1, 4, 9, 14], with the aim of making representations predictive of future stimuli. While these works concern textual stimuli only, we argue that a powerful language representation also needs to be competent at predicting visual stimuli and to learn to navigate the environment of a confined world [3]. For example, our autonomous agent LAD (Layout-aware Dreamer) [15] imagines its goal destination in order to decide on the next action.
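As a minimal illustration of the kind of anticipatory objective these works build on (a generic sketch, not the exact losses of [1, 4, 9, 14]), the following scores how well a model's representation of sentence t predicts the representation of sentence t+1 with an InfoNCE-style contrastive loss. All variable names, shapes, and the toy data are illustrative assumptions:

```python
import numpy as np

def info_nce_loss(pred, targets, temperature=0.1):
    """Contrastive next-sentence objective: each row of `pred` should be
    closest to the matching row of `targets`; the other rows in the batch
    act as negatives."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targets = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = pred @ targets.T / temperature        # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # true next sentence on the diagonal

# Toy check: a predictor that nearly recovers the next-sentence embedding
# scores much better than one whose predictions are misaligned.
rng = np.random.default_rng(0)
next_sentence_emb = rng.normal(size=(8, 16))                     # hypothetical encoder outputs
good_pred = next_sentence_emb + 0.05 * rng.normal(size=(8, 16))  # near-perfect anticipation
aligned_loss = info_nce_loss(good_pred, next_sentence_emb)
shuffled_loss = info_nce_loss(good_pred, next_sentence_emb[::-1])
```

In an actual pretraining setup this loss would be added as an auxiliary term next to the language-modeling objective, so that the encoder is rewarded for representations that anticipate upcoming stimuli.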

Beyond this work, visual-linguistic representation learning is central to the goals of CALCULUS: a burgeoning area of research focuses on the intersection of text and image data, aiming to ground text representations in visual stimuli. In this paradigm, textual semantics are not only inferred from text but also anchored in associated images, creating a more holistic understanding. This research aligns directly with the overarching aim of WP3: robust, contextually aware representations of both language and visual data that ultimately yield more intelligent and predictive systems. Indeed, we show that learning visually grounded representations improves performance even on tasks that involve only text [7]. Furthermore, we investigate whether multimodal representations learn visual structures analogous to linguistic structures [12], discover the underlying causal structure of the data [11], and learn to map verbal descriptions of objects’ spatial relationships onto the image [5], with practical applications such as self-driving cars [2,8].

A third research path, closely aligned with neuroscientific methods and insights, explores the connection between human brain activity and machine-learned, anticipatory language representations. These works use neural encoding techniques to measure how well the activations of artificial neural networks correspond to human neural processes during language processing. Various sentence embedding models and fine-tuning approaches are investigated for their effectiveness in replicating patterns found in human brain activity [16, 17, 18].

| # | Year | Title | Authors | Venue | Description |
|---|------|-------|---------|-------|-------------|
| 1 | 2019 | Improving Natural Language Understanding through Anticipation-Enriched Representations | Cornille, Nathan and Moens, Marie-Francine | HBP 2019 | Poster with the first idea for an internal self-prediction objective for BERT, presented at the Human Brain Project workshop in Glasgow. |
| 2 | 2020 | Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | Deruyttere, Thierry and Collell, Guillem and Moens, Marie-Francine | | A new spatial memory module and spatial reasoner for the visual grounding task, integrating the regions of a Region Proposal Network into a new multi-step reasoning model. |
| 3 | 2020 | Learning Grammar in Confined Worlds | Spinks, Graham and Cartuyvels, Ruben and Moens, Marie-Francine | LNEE | A position paper arguing that modern machine learning approaches fail to adequately address how grammar and common sense should be learned, and advocating for experiments in abstract, confined world environments where agents interact, with an emphasis on learning world models. |
| 4 | 2020 | Improving Language Understanding in Machines through Anticipation | Cornille, Nathan and Collell, Guillem and Moens, Marie-Francine | NAISys 2020 | Poster reflecting on issues with an internal contrastive objective that aims to improve representation learning. |
| 5 | 2020 | Decoding Language Spatial Relations to 2D Spatial Arrangements | Radevski, Gorjan and Collell, Guillem and Moens, Marie-Francine and Tuytelaars, Tinne | EMNLP 2020 | Proposes Spatial-Reasoning BERT (SR-BERT) for multimodal spatial understanding, decoding a set of language-expressed spatial relations into 2D spatial arrangements in a multi-object, multi-relationship setting. |
| 6 | 2021 | Discrete and Continuous Representations and Processing in Deep Learning: Looking Forward | Cartuyvels, Ruben and Spinks, Graham and Moens, Marie-Francine | AI Open | A position paper reflecting on the role of discrete and continuous representations and processing in the deep learning era. |
| 7 | 2021 | Visual Grounding Strategies for Text-Only Natural Language Processing | Sileo, Damien | | Conception, categorization, and strategies for leveraging multimodal pretraining in text-only tasks. |
| 8 | 2021 | Giving Commands to a Self-Driving Car: How to Deal with Uncertain Situations? | Deruyttere, Thierry and Milewski, Victor and Moens, Marie-Francine | | Commands given to a self-driving car can be ambiguous; a method to resolve this ambiguity through visual and textual means is proposed. |
| 9 | 2021 | Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations | Araujo, Vladimir and Villa, Andres and Mendoza, Marcelo and Moens, Marie-Francine and Soto, Alvaro | EMNLP 2021 | Uses ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. |
| 10 | 2023 | A Brief Overview of Universal Sentence Representation Methods: A Linguistic View | Li, Ruiqi and Moens, Marie-Francine | Accepted for an upcoming issue | |
| 11 | 2022 | Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models | Cornille, Nathan and Laenen, Katrien and Moens, Marie-Francine | | Critically analyzes a recent technique that uses the toolbox of causality to improve OOD performance, elucidating to what extent it actually finds confounders, under what assumptions it performs deconfounding, and whether the reported OOD performance is actually linked to the causal tools. |
| 12 | 2022 | Finding Structural Knowledge in Multimodal-BERT | Milewski, Victor and de Lhoneux, Miryam and Moens, Marie-Francine | ACL 2022 | Introduces scene trees, mapping the linguistic dependency tree on top of image regions, to investigate whether BERT learns structures over the image regions. |
| 13 | 2022 | Evaluation Benchmarks for Spanish Sentence Representations | Araujo, Vladimir and Carvallo, Andrés and Kundu, Souvik and Cañete, José and Mendoza, Marcelo and Mercer, Robert E. and Bravo-Marquez, Felipe and Moens, Marie-Francine and Soto, Alvaro | LREC 2022 | A new benchmark for Spanish sentence representations. |
| 14 | 2023 | Learning Sentence-Level Representations with Predictive Coding | Araujo, Vladimir and Moens, Marie-Francine and Soto, Alvaro | | Explores how to improve sentence-level representations of pre-trained models by borrowing ideas from predictive coding theory. |
| 15 | 2023 | Layout-aware Dreamer for Embodied Visual Referring Expression Grounding | Li, Mingxiao and Wang, Zehao and Tuytelaars, Tinne and Moens, Marie-Francine | AAAI-23 | Designs an autonomous agent called Layout-aware Dreamer (LAD), including two novel modules, the Layout Learner and the Goal Dreamer, to mimic a human's cognitive decision process. |
| 16 | 2023 | Fine-tuned vs. Prompt-tuned Supervised Representations: Which Better Account for Brain Language Representations? | Sun, Jingyuan and Moens, Marie-Francine | IJCAI 2023 | Investigates various supervised methods and their correlation with how brains represent language. |
| 17 | 2023 | Investigating Neural Fit Approaches for Sentence Embedding Model Paradigms | Balabin, Helena and Liuzzi, Antonietta Gabriella and Sun, Jingyuan and Dupont, Patrick and Vandenberghe, Rik and Moens, Marie-Francine | ECAI 2023 | Analyzes the link (i.e., neural fit) between functional MRI data and pre-trained language models using different brain networks, neural fit approaches, and sentence modeling paradigms. |
| 18 | 2023 | Tuning In to Neural Encoding: Linking Human Brain and Artificial Supervised Representations of Language | Sun, Jingyuan and Zhang, Xiaohan and Moens, Marie-Francine | ECAI 2023 | Links human brain and supervised ANN representations of the Chinese language. |
| 19 | 2023 | Causal Factor Disentanglement for Few-Shot Domain Adaptation in Video Prediction | Cornille, Nathan and Sun, Jingyuan and Laenen, Katrien and Moens, Marie-Francine | | Evaluates whether Causal Factor Disentanglement can isolate parameters that model different causal mechanisms, and thus adapt more quickly in response to a sparse mechanism shift. |