Data Collection and Preprocessing

| Name | Self-created? | Users | Work published? | Description | Link/reference |
|---|---|---|---|---|---|
| CoCa Causes | yes | Nathan Cornille | yes | Causal direction / confounder labels for objects in Conceptual Captions images. | |
| SMS-TRIS | yes | Nathan Cornille | no | Sparse Mechanism Shifted TempoRally Intervened Sequences: variants of the TRIS dataset in which one mechanism at a time is shifted. | |
| SuperGLUE | no | Nathan Cornille | yes | Benchmark of various difficult language understanding tasks. | https://super.gluebenchmark.com/ |
| Conceptual Captions | no | Nathan Cornille | yes | 3.3 million image-caption pairs. | |
| VQA v2 | no | Nathan Cornille | yes | 0.25M images with 0.76M questions and 10M answers. | |
| Flickr30k | no | Nathan Cornille | yes | 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators. | |
| TRIS | no | Nathan Cornille | yes | Synthetic video frames generated from unobserved causal factors (e.g., position, color). The factors evolve according to fixed causal mechanisms, except that they are sometimes intervened on (a toy generator sketch follows the table). | |
| MS-COCO captions | no | Victor Milewski | yes | 123k images, 5 captions each. Used for image captioning and as language input for scene graph generation (SGG). Has a 51k-image overlap with Visual Genome. | |
| Visual Genome | no | Victor Milewski & Wolf Nuyts | yes | 108k images, each with a scene graph, 50 region descriptions and region graphs, ~35 objects, and ~21 pairwise relationships. | |
| Flickr30k-entities | no | Victor Milewski | yes | Same as Flickr30k, but all objects are linked to the entities (often noun phrases) in the caption. | |
| Penn Treebank | no | Victor Milewski | yes | Tree-annotated sentences. Used to evaluate the dependency parsing capabilities of multimodal BERT models through probing. | |
| Scene Trees for Flickr30k-entities | yes | Victor Milewski | yes | A reduction of dependency trees to just the objects/head nouns, yielding a dependency structure over the image regions. | |
| MSCOCO | no | Ruben Cartuyvels & Wolf Nuyts | yes | Only the captions and bounding boxes (not the images) were used, to train models that predict the bounding boxes from the captions (see the pairing sketch after the table). | |
| WorldTree v2 | no | Ruben Cartuyvels | yes | Multiple-choice elementary science exam questions and their answers, each linked to a set of textual facts that together explain the answer. Used to train an autoregressive retrieval model that retrieves these facts given the question, for the TextGraphs workshop competition. | |
| BLLIP Wall Street Journal Corpus | no | Ruben Cartuyvels | yes | Sentences with constituency tree annotations, used to pretrain text encoders with next-token prediction on linearized parse trees. After pretraining on this dataset, the text encoders were used for text-to-layout prediction (on MSCOCO). | |
| USCOCO | no | Ruben Cartuyvels & Wolf Nuyts | no | Unexpected Situations with Common Objects in Context: an evaluation set for text-to-layout prediction, with grammatically correct sentences and layouts (sets of bounding boxes representing visually “imagined” situations) that describe compositions of entities and relations unlikely to be found in the COCO training data. | |
| CC-500 | no | Wolf Nuyts | yes | 500 sentences, each mentioning two objects with one color per object, used to test whether text-to-image models generate the correct colors. | |
| ABC-6K | no | Wolf Nuyts | yes | Subset of MS-COCO captions containing at least two color attributes. | |
| DAA-200 | yes | Wolf Nuyts | no | Difficult Adversarial Attributes: a dataset mined from Visual Genome, containing 100 graphs with images from the VSG dataset, where each graph contains two nodes with one attribute each. For each graph, an adversarial graph is generated by swapping the attributes of the two objects. For each of the 200 graphs, a sentence of the form “A 〈attribute 1〉 〈object 1〉 and A 〈attribute 2〉 〈object 2〉” is generated (see the construction sketch after the table). | |
| BISON-0.6 | yes | Maria Trusca | no | 1,437 images, each with two captions: the first describes the image and the second indicates how the image should be edited. A set of word alignments between each pair of captions is also specified. The images and captions are extracted from the BISON dataset. | https://arxiv.org/pdf/1901.06595.pdf |
| Dream | yes | Maria Trusca | no | 100 images. The dataset follows the same format as BISON-0.6; the images are generated using Wombo Dream. | https://dream.ai/create |
| SentEval and DiscoEval in Spanish | yes | Vladimir Araujo | yes | Benchmarks to evaluate representations from pretrained language models in Spanish. | |
| SentEval and DiscoEval | no | Vladimir Araujo | yes | Benchmarks to evaluate representations from pretrained language models in English. | |
| Pragmeval | no | Vladimir Araujo | yes | Benchmark to evaluate the pragmatic knowledge of pretrained language models. | |
| bAbI | no | Vladimir Araujo | yes | Synthetic dataset for story understanding through question answering. | |
| NarrativeQA | no | Vladimir Araujo | yes | Dataset for question answering over long, realistic texts. | |
| ActivityNet-QA | no | Vladimir Araujo | yes | Dataset for question answering over videos. | |
| MNIST | no | Aristotelis Chrysakis | yes | Grayscale images of handwritten digits. Each image is 28×28 and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images (a loading sketch for these image-classification benchmarks follows the table). | |
| FashionMNIST | no | Aristotelis Chrysakis | yes | Grayscale images of clothing items. Each image is 28×28 and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images. | |
| CIFAR-10 | no | Aristotelis Chrysakis | yes | A collection of 60,000 color images of size 32×32. Each image contains an object from one of 10 classes, such as airplanes, cars, and birds. | |
| CIFAR-100 | no | Aristotelis Chrysakis | yes | A collection of 60,000 color images of size 32×32. Each image contains an object from one of 100 classes. | |
| tinyImageNet | no | Aristotelis Chrysakis | yes | 100,000 color images of size 64×64 split into 200 classes. A subset of the original ImageNet dataset. | |
| BOLD5000 | no | Jingyuan Sun | yes | Public fMRI dataset recorded while participants viewed 5,000 visual images. | Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2019). BOLD5000, a public fMRI dataset while viewing 5000 visual images. Scientific Data, 6(1), 49. |
| | no | Jingyuan Sun | yes | fMRI dataset for generic decoding of seen and imagined objects using hierarchical visual features. | Horikawa, T., & Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications, 8(1), 15037. |
| | no | Jingyuan Sun | yes | fMRI dataset for decoding linguistic meaning from brain activation. | Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., ... & Fedorenko, E. (2018). Toward a universal decoder of linguistic meaning from brain activation. Nature Communications, 9(1), 963. |
| | no | Jingyuan Sun | yes | Synchronized multimodal neuroimaging dataset for studying brain language processing. | Wang, S., Zhang, X., Zhang, J., et al. (2022). A synchronized multimodal neuroimaging dataset for studying brain language processing. Scientific Data, 9, 590. https://doi.org/10.1038/s41597-022-01708-5 |
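
The TRIS and SMS-TRIS rows describe frames whose latent factors evolve under fixed causal mechanisms with occasional interventions. The following is a minimal toy sketch of that kind of generator, not the actual TRIS code; the two factors, their mechanisms, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mechanism_position(pos):
    """Assumed fixed causal mechanism: the object drifts rightward."""
    return pos + 0.1

def mechanism_color(hue, pos):
    """Assumed fixed causal mechanism: hue depends on the current position."""
    return 0.9 * hue + 0.1 * np.sin(pos)

def generate_sequence(length=20, p_intervene=0.1):
    """Roll out the latent factors; occasionally intervene on the position."""
    pos, hue = 0.0, 0.5
    factors = []
    for _ in range(length):
        pos = mechanism_position(pos)
        if rng.random() < p_intervene:
            # Intervention: overwrite the factor with a random value,
            # breaking its mechanism for this time step.
            pos = rng.uniform(-1.0, 1.0)
        hue = mechanism_color(hue, pos)
        factors.append((pos, hue))
    return factors  # rendering the factors to pixel frames is omitted
```

In SMS-TRIS terms, a variant would shift one mechanism at a time, e.g., replacing `mechanism_color` while keeping the others fixed.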
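
For the MSCOCO text-to-layout setup (captions and bounding boxes used without the images), the data pairing can be sketched with pycocotools. This is a hedged sketch, not the project's actual pipeline; the annotation file paths assume the standard COCO 2017 layout.

```python
from pycocotools.coco import COCO

# Assumed standard COCO 2017 annotation files.
captions = COCO("annotations/captions_train2017.json")
instances = COCO("annotations/instances_train2017.json")

pairs = []  # (caption, list of (category name, [x, y, width, height]))
for img_id in captions.getImgIds():
    caps = [ann["caption"]
            for ann in captions.loadAnns(captions.getAnnIds(imgIds=img_id))]
    boxes = [(instances.cats[ann["category_id"]]["name"], ann["bbox"])
             for ann in instances.loadAnns(instances.getAnnIds(imgIds=img_id))]
    for cap in caps:
        pairs.append((cap, boxes))
```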
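
The DAA-200 construction (swap the two objects' attributes, then verbalize both graphs with the template from the table) can be illustrated as follows; the example graph is made up, not an actual DAA-200 entry.

```python
def swap_attributes(graph):
    """Build the adversarial graph by swapping the two objects' attributes."""
    (obj1, attr1), (obj2, attr2) = graph
    return [(obj1, attr2), (obj2, attr1)]

def to_sentence(graph):
    """Verbalize a two-node graph with the template from the table."""
    (obj1, attr1), (obj2, attr2) = graph
    return f"A {attr1} {obj1} and A {attr2} {obj2}"

graph = [("horse", "brown"), ("car", "red")]
print(to_sentence(graph))                   # A brown horse and A red car
print(to_sentence(swap_attributes(graph)))  # A red horse and A brown car
```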
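
The image-classification benchmarks above (MNIST, FashionMNIST, CIFAR-10/100) have standard torchvision loaders; the following is a minimal loading sketch, assuming torchvision is installed and `data/` is used as the download root. tinyImageNet ships as a plain folder tree without a built-in torchvision loader, so it is omitted here.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Train/test splits match the sizes listed in the table.
mnist_train = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
mnist_test = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
fashion = datasets.FashionMNIST("data", train=True, download=True, transform=to_tensor)
cifar10 = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)
cifar100 = datasets.CIFAR100("data", train=True, download=True, transform=to_tensor)

assert len(mnist_train) == 60_000 and len(mnist_test) == 10_000
assert mnist_train[0][0].shape == (1, 28, 28)     # grayscale 28×28
assert cifar10.data.shape == (50_000, 32, 32, 3)  # color 32×32, train split
```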