WP3: Learning of Representations from Language and Visual Data

WP3 focuses on developing innovative algorithms that capture the intricate relational structure of language and the geometric and appearance facets of visual data. The emphasis is on anticipatory representations inspired by how the brain works. The objective is to create more robust and informative representations for both visual and textual data.

Text representation learning has a long research history; with CALCULUS we add to the discussion in this line of research [6,10], contribute evaluation benchmarks [13], and make algorithmic contributions, especially with regard to anticipatory models inspired by the brain. Specifically, several of our publications integrate predictive coding theory into the pretraining of neural networks [1, 4, 9, 14], with the aim of making representations predictive of future stimuli. While these works concern textual stimuli only, we argue that a powerful language representation also needs to be competent at predicting visual stimuli and to learn to navigate the environment of a confined world [3]. For example, our autonomous agent LAD (Layout-aware Dreamer) [15] imagines its goal destination in order to decide on the next action.
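As a minimal illustration of the kind of anticipatory objective these works build on (a generic sketch, not the exact losses of [1, 4, 9, 14]), the following scores how well a model's representation of sentence t predicts the representation of sentence t+1 with an InfoNCE-style contrastive loss. All variable names, shapes, and the toy data are illustrative assumptions:

```python
import numpy as np

def info_nce_loss(pred, targets, temperature=0.1):
    """Contrastive next-sentence objective: each row of `pred` should be
    closest to the matching row of `targets`; the other rows in the batch
    act as negatives."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    targets = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = pred @ targets.T / temperature        # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # true next sentence on the diagonal

# Toy check: a predictor that nearly recovers the next-sentence embedding
# scores much better than one whose predictions are misaligned.
rng = np.random.default_rng(0)
next_sentence_emb = rng.normal(size=(8, 16))                     # hypothetical encoder outputs
good_pred = next_sentence_emb + 0.05 * rng.normal(size=(8, 16))  # near-perfect anticipation
aligned_loss = info_nce_loss(good_pred, next_sentence_emb)
shuffled_loss = info_nce_loss(good_pred, next_sentence_emb[::-1])
```

In an actual pretraining setup this loss would be added as an auxiliary term next to the language-modeling objective, so that the encoder is rewarded for representations that anticipate upcoming stimuli.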

Beyond this work, visual-linguistic representation learning is central to the goals of CALCULUS: a burgeoning area of research focuses on the intersection of text and image data, aiming to ground text representations in visual stimuli. In this paradigm, textual semantics are not only inferred from text but also anchored in associated images, creating a more holistic understanding. This research aligns directly with the overarching aim of WP3: robust, contextually aware representations of both language and visual data that ultimately yield more intelligent and predictive systems. Indeed, we show that learning visually grounded representations improves performance even on tasks that involve only text [7]. Furthermore, we investigate whether multimodal representations learn visual structures analogous to linguistic structures [12], discover the underlying causal structure of the data [11], and learn to map verbal descriptions of objects’ spatial relationships onto the image [5], with practical applications such as self-driving cars [2,8].

A third research path, closely aligned with neuroscientific methods and insights, explores the connection between human brain activity and machine-learned, anticipatory language representations. These works use neural encoding techniques to measure how well the activations of artificial neural networks correspond to human neural processes during language processing. Various sentence embedding models and fine-tuning approaches are investigated for their effectiveness in replicating patterns found in human brain activity [16, 17, 18].

| # | Year | Title | Authors | Venue | Description |
|---|------|-------|---------|-------|-------------|
| 1 | 2019 | Improving Natural Language Understanding through Anticipation-Enriched Representations | Cornille, Nathan and Moens, Marie-Francine | HBP 2019 | Poster with the first idea for an internal self-prediction objective for BERT, presented at the Human Brain Project workshop in Glasgow. |
| 2 | 2020 | Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | Deruyttere, Thierry and Collell, Guillem and Moens, Marie-Francine | | A new spatial memory module and spatial reasoner for the visual grounding task, integrating the regions of a Region Proposal Network into a new multi-step reasoning model. |
| 3 | 2020 | Learning Grammar in Confined Worlds | Spinks, Graham and Cartuyvels, Ruben and Moens, Marie-Francine | LNEE | A position paper arguing that modern machine learning approaches fail to adequately address how grammar and common sense should be learned, and advocating for experiments in abstract, confined world environments where agents interact, with an emphasis on learning world models. |
| 4 | 2020 | Improving Language Understanding in Machines through Anticipation | Cornille, Nathan and Collell, Guillem and Moens, Marie-Francine | NAISys 2020 | Poster reflecting on issues with an internal contrastive objective that aims to improve representation learning. |
| 5 | 2020 | Decoding Language Spatial Relations to 2D Spatial Arrangements | Radevski, Gorjan and Collell, Guillem and Moens, Marie-Francine and Tuytelaars, Tinne | EMNLP 2020 | Proposes Spatial-Reasoning BERT (SR-BERT) for multimodal spatial understanding, decoding a set of language-expressed spatial relations into 2D spatial arrangements in a multi-object, multi-relationship setting. |
| 6 | 2021 | Discrete and Continuous Representations and Processing in Deep Learning: Looking Forward | Cartuyvels, Ruben and Spinks, Graham and Moens, Marie-Francine | AI Open | A position paper reflecting on the role of discrete and continuous representations and processing in the deep learning era. |
| 7 | 2021 | Visual Grounding Strategies for Text-Only Natural Language Processing | Sileo, Damien | | Conception, categorization, and strategies for leveraging multimodal pretraining in text-only tasks. |
| 8 | 2021 | Giving Commands to a Self-Driving Car: How to Deal with Uncertain Situations? | Deruyttere, Thierry and Milewski, Victor and Moens, Marie-Francine | | Commands given to a self-driving car can be ambiguous; a method to resolve this ambiguity through visual and textual means is proposed. |
| 9 | 2021 | Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations | Araujo, Vladimir and Villa, Andres and Mendoza, Marcelo and Moens, Marie-Francine and Soto, Alvaro | EMNLP 2021 | Uses ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. |
| 10 | 2023 | A Brief Overview of Universal Sentence Representation Methods: A Linguistic View | Li, Ruiqi and Moens, Marie-Francine | Accepted for an upcoming issue | |
| 11 | 2022 | Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models | Cornille, Nathan and Laenen, Katrien and Moens, Marie-Francine | | Critically analyzes a recent technique that uses the toolbox of causality to improve OOD performance, elucidating to what extent it actually finds confounders, under what assumptions it performs deconfounding, and whether the reported OOD performance is actually linked to the causal tools. |
| 12 | 2022 | Finding Structural Knowledge in Multimodal-BERT | Milewski, Victor and de Lhoneux, Miryam and Moens, Marie-Francine | ACL 2022 | Introduces scene trees, mapping the linguistic dependency tree on top of image regions, to investigate whether BERT learns structures over the image regions. |
| 13 | 2022 | Evaluation Benchmarks for Spanish Sentence Representations | Araujo, Vladimir and Carvallo, Andrés and Kundu, Souvik and Cañete, José and Mendoza, Marcelo and Mercer, Robert E. and Bravo-Marquez, Felipe and Moens, Marie-Francine and Soto, Alvaro | LREC 2022 | A new benchmark for Spanish sentence representations. |
| 14 | 2023 | Learning Sentence-Level Representations with Predictive Coding | Araujo, Vladimir and Moens, Marie-Francine and Soto, Alvaro | | Explores how to improve sentence-level representations of pre-trained models by borrowing ideas from predictive coding theory. |
| 15 | 2023 | Layout-aware Dreamer for Embodied Visual Referring Expression Grounding | Li, Mingxiao and Wang, Zehao and Tuytelaars, Tinne and Moens, Marie-Francine | AAAI-23 | Designs an autonomous agent called Layout-aware Dreamer (LAD), including two novel modules, the Layout Learner and the Goal Dreamer, to mimic a human's cognitive decision process. |
| 16 | 2023 | Fine-tuned vs. Prompt-tuned Supervised Representations: Which Better Account for Brain Language Representations? | Sun, Jingyuan and Moens, Marie-Francine | IJCAI 2023 | Investigates various supervised methods and their correlation with how brains represent language. |
| 17 | 2023 | Investigating Neural Fit Approaches for Sentence Embedding Model Paradigms | Balabin, Helena and Liuzzi, Antonietta Gabriella and Sun, Jingyuan and Dupont, Patrick and Vandenberghe, Rik and Moens, Marie-Francine | ECAI 2023 | Analyzes the link (i.e., neural fit) between functional MRI data and pre-trained language models using different brain networks, neural fit approaches, and sentence modeling paradigms. |
| 18 | 2023 | Tuning In to Neural Encoding: Linking Human Brain and Artificial Supervised Representations of Language | Sun, Jingyuan and Zhang, Xiaohan and Moens, Marie-Francine | ECAI 2023 | Links human brain and supervised ANN representations of the Chinese language. |
| 19 | 2023 | Causal Factor Disentanglement for Few-Shot Domain Adaptation in Video Prediction | Cornille, Nathan and Sun, Jingyuan and Laenen, Katrien and Moens, Marie-Francine | | Evaluates whether Causal Factor Disentanglement can isolate parameters that model different causal mechanisms, and thus adapt more quickly in response to a sparse mechanism shift. |