In this WP, we use the anticipatory models and representations for natural language understanding tasks, such as the recognition of events and actions in text, their participating entities and coreferents, and their spatial and temporal relations.
A first challenge was the design and evaluation of efficient storage and retrieval of the enriched representations. For example, we propose an incremental retrieval scheme for continuous representations of textual facts, triggered by words and phrases in both input questions (in a question answering context) and previously gathered facts [5]. The retrieval is refined by jointly contextualizing and processing the initially retrieved continuous representations, where semantic roles and coreferences can play a role. Coreferences also play a role in another work, in which we define mechanisms inspired by human memory, rehearsal and anticipation, to improve the selection and storage of information in a memory network, so that it can later be retrieved more efficiently and effectively for answering questions [8]. Cosine similarity and other distance metrics are often used to retrieve semantically similar items (text or images) based on their continuous representations. In [6] we test the sensitivity of these metrics to common transformations such as centering, both for image and text embeddings produced by popular off-the-shelf feature extractors.
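The sensitivity of cosine similarity to a transformation like centering can be illustrated with a toy example. The 2-D "embeddings" below are invented for illustration; they are not the feature-extractor embeddings evaluated in [6]. The sketch shows that mean-centering a small collection of vectors can change which item is ranked most similar to a query.

```python
# Toy demonstration that cosine-based semantic match is sensitive to
# mean-centering. The 2-D vectors are illustrative assumptions only.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def center(vectors):
    """Subtract the per-dimension mean of the collection from each vector."""
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [[v[d] - mean[d] for d in range(dims)] for v in vectors]

query, a, b = [10.0, 11.0], [20.0, 22.0], [10.0, 12.0]

# Before centering: a is exactly parallel to the query and ranks first.
raw_a, raw_b = cosine(query, a), cosine(query, b)

# After centering the collection, the ranking flips: b ranks first.
qc, ac, bc = center([query, a, b])
cen_a, cen_b = cosine(qc, ac), cosine(qc, bc)

print(f"raw:      sim(q,a)={raw_a:.4f}  sim(q,b)={raw_b:.4f}")
print(f"centered: sim(q,a)={cen_a:.4f}  sim(q,b)={cen_b:.4f}")
```

Because cosine similarity ignores vector norms but not the shared offset of a collection, removing that offset can reorder nearest neighbours, which is the kind of effect [6] measures systematically.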
In [3] we learn matrix representations of objects in a novel way: taking inspiration from human memory, the representations exploit distances to different contextual reference frames. These representations can be used for highly efficient and configurable semantic retrieval, among other applications.
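A hedged sketch of the general idea follows: an object is represented by matrices of distances between its feature vectors and the anchor points of several contextual reference frames. The feature vectors, frame names, and the Euclidean distance choice are illustrative assumptions, not the exact construction of [3].

```python
# Illustrative sketch: matrix representations built from distances to
# reference frames. All vectors and frame names below are hypothetical.
from math import dist  # Euclidean distance (Python >= 3.8)

# Hypothetical 2-D feature vectors for one object.
object_features = [(0.2, 0.9), (0.7, 0.3)]

# Two hypothetical reference frames, each a small set of anchor points.
reference_frames = {
    "frame_colour": [(0.0, 1.0), (1.0, 0.0)],
    "frame_shape": [(0.5, 0.5), (0.9, 0.9)],
}

# One matrix per frame: rows index object features, columns index anchors,
# and each cell stores the distance of a feature to an anchor.
representation = {
    name: [[dist(f, anchor) for anchor in anchors] for f in object_features]
    for name, anchors in reference_frames.items()
}

for name, matrix in representation.items():
    print(name, [[round(x, 3) for x in row] for row in matrix])
```

Because each frame contributes a separate, interpretable block, such a representation can be decomposed or recombined per frame at retrieval time, which is one way to read the "configurable" retrieval claim above.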
The next line of work is concerned with using the enriched models to parse both spatial and temporal information. To start, we evaluate to what extent multimodal pretrained transformers, which can be seen as visually enriched language models, parse and retain the semantic roles and relations present in input text and/or images [7]. These roles and relations may be explicit or implicit in the input text. We find that current visually enriched language models do not yet adequately parse and retain the underlying semantic structures in their continuous representations.
We give an overview of current methods that parse temporal information from text and of methods that reason with temporal relations [1]. We propose and evaluate a method to parse probabilistic absolute event timelines, covering events whose temporal information is explicit as well as implicit in the text [2]. Finally, we also focus on parsing spatial relations instead of temporal ones [4]: we propose a method to decode 2D spatial arrangements from input sentences by finetuning pretrained text transformers, which works for both implicitly and explicitly mentioned arrangements.
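The notion of a probabilistic absolute timeline can be sketched minimally: each event carries lower and upper bounds on its start time, and events with only implicit temporal cues get wider bounds. The clinical events, dates, and bound values below are invented for illustration and are not taken from [2].

```python
# Minimal sketch of a probabilistic absolute event timeline with
# lower/upper information bounds. All events and dates are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class TimelineEvent:
    name: str
    earliest: date  # lower information bound on the start time
    latest: date    # upper information bound on the start time

    def midpoint(self) -> date:
        """Point estimate: the middle of the admissible interval."""
        return self.earliest + (self.latest - self.earliest) / 2

    def uncertainty_days(self) -> int:
        return (self.latest - self.earliest).days

# "admission" is explicitly dated in the (hypothetical) report; "fever onset"
# is only implicit ("a few days before admission"), so its bounds are wider.
events = [
    TimelineEvent("admission", date(2020, 3, 10), date(2020, 3, 10)),
    TimelineEvent("fever onset", date(2020, 3, 5), date(2020, 3, 9)),
]

timeline = sorted(events, key=lambda e: e.midpoint())
for e in timeline:
    print(f"{e.name:12s} ~{e.midpoint()}  ±{e.uncertainty_days()} days")
```

Ordering events by the midpoints of their intervals while keeping the interval widths around preserves the uncertainty that a single point-valued timeline would discard.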
We will integrate the methods developed in this WP with those of the next WP (WP 5, on inference and learning of new knowledge) into a story demonstrator that translates stories into events in a visual, virtual world.
# | Year | Title | Authors | Venue | Description |
---|---|---|---|---|---|
1 | 2019 | A survey on temporal reasoning for temporal information extraction from text | Leeuwenberg, Artuur and Moens, Marie-Francine | JAIR 2019 | This article presents a comprehensive survey of the research from the past decades on temporal reasoning for automatic temporal information extraction from text, providing a case study on the integration of symbolic reasoning with machine learning-based information extraction systems. |
2 | 2020 | Towards Extracting Absolute Event Timelines from English Clinical Reports | Leeuwenberg, Artuur and Moens, Marie-Francine | IEEE | An approach that extracts more complete temporal information for all events and obtains probabilistic absolute event timelines by modeling temporal uncertainty with information bounds. |
3 | 2020 | Structured (De)composable Representations Trained with Neural Networks | Spinks, Graham and Moens, Marie-Francine | | An end-to-end deep learning technique to learn structured and composable representations. |
4 | 2020 | Decoding Language Spatial Relations to 2D Spatial Arrangements | Radevski, Gorjan and Collell, Guillem and Moens, Marie-Francine and Tuytelaars, Tinne | EMNLP 2020 | We propose Spatial-Reasoning Bert (SR-Bert) for the problem of multimodal spatial understanding by decoding a set of language-expressed spatial relations to a set of 2D spatial arrangements in a multi-object and multi-relationship setting. |
5 | 2020 | Autoregressive Reasoning over Chains of Facts with Transformers | Ruben Cartuyvels, Graham Spinks, Marie-Francine Moens | COLING 2020 | An iterative inference algorithm for multi-hop explanation regeneration, that retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer. |
6 | 2021 | How Do Simple Transformations of Text and Image Features Impact Cosine-based Semantic Match | Collell, Guillem and Moens, Marie-Francine | ECIR 2021 | We investigate the impact of transformations on semantic distances between embeddings produced by common language models and image CNNs. |
7 | 2022 | Finding Structural Knowledge in Multimodal-BERT | Milewski, Victor and de Lhoneux, Miryam and Moens, Marie-Francine | ACL 2022 | We introduce scene trees, by mapping the linguistic dependency tree on top of image regions, to investigate whether BERT learns structures over the image regions. |
8 | 2023 | A Memory Model for Question Answering from Streaming Data Supported by Rehearsal and Anticipation of Coreference Information | Vladimir Araujo, Alvaro Soto, Marie-Francine Moens | ACL 2023 | Drawing inspiration from human mechanisms, we propose a memory model that performs rehearsal and anticipation while processing inputs to memorize important information for solving question answering tasks from streaming data. |