CoCa Causes |
yes |
Nathan Cornille |
yes |
Causal direction / confounder labels for objects in Conceptual Captions images |
|
SMS-TRIS |
yes |
Nathan Cornille |
no |
Sparse Mechanism Shifted TempoRally Intervened Sequences: variants of the TRIS datasets in which one mechanism at a time is shifted |
|
SuperGLUE |
no |
Nathan Cornille |
yes |
Benchmark with various difficult language understanding tasks |
https://super.gluebenchmark.com/ |
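As an illustration only (not part of the benchmark release), a single SuperGLUE task can be loaded through the HuggingFace `datasets` library; the configuration and field names below follow that library's `super_glue` setup and are assumptions, not part of this entry.

```python
# Illustrative sketch: loading the BoolQ task of SuperGLUE via HuggingFace `datasets`.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")   # DatasetDict with train/validation/test splits
example = boolq["train"][0]                   # fields: question, passage, idx, label
print(example["question"])
print(example["label"])                       # 0 = false, 1 = true
```
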
Conceptual Captions |
no |
Nathan Cornille |
yes |
3.3 million image-caption pairs |
|
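A minimal sketch of reading the dataset, assuming the standard TSV distribution in which each row pairs a caption with an image URL; the filename is an assumption, and the images themselves have to be downloaded separately.

```python
# Hedged sketch: iterate over Conceptual Captions caption/URL pairs from the TSV release.
import csv

with open("Train_GCC-training.tsv", newline="", encoding="utf-8") as f:
    for caption, image_url in csv.reader(f, delimiter="\t"):
        print(caption, "->", image_url)
        break  # only show the first pair
```
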
VQA v2 |
no |
Nathan Cornille |
yes |
0.25M images with 0.76M questions and 10M answers |
|
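A hedged sketch of the VQA v2 annotation layout: questions and answers live in separate JSON files keyed by `question_id`; the filenames follow the official release but should be treated as assumptions.

```python
# Hedged sketch: pairing a VQA v2 question with its 10 human answers.
import json

with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = json.load(f)["questions"]        # {image_id, question_id, question}
with open("v2_mscoco_train2014_annotations.json") as f:
    annotations = {a["question_id"]: a for a in json.load(f)["annotations"]}

q = questions[0]
answers = [ans["answer"] for ans in annotations[q["question_id"]]["answers"]]
print(q["question"], "->", answers)
```
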
Flickr30k |
no |
Nathan Cornille |
yes |
31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators |
|
TRIS |
no |
Nathan Cornille |
yes |
Synthetic video frames generated from unobserved causal factors (such as position and color). The factors evolve according to fixed causal mechanisms, except that they are sometimes intervened on. |
|
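A toy illustration of the kind of generative process described above, not the actual TRIS generator: two latent factors evolve under fixed mechanisms and are occasionally overridden by random interventions; rendering the factors into frames is omitted.

```python
# Toy sketch (not the actual TRIS code): latent factors with fixed mechanisms and sparse interventions.
import random

def step(position, color, intervene_prob=0.1):
    position = position + 1                      # fixed mechanism for position
    color = (color + position) % 256             # fixed mechanism: color depends on position
    if random.random() < intervene_prob:         # occasional intervention on position
        position = random.randint(0, 10)
    if random.random() < intervene_prob:         # occasional intervention on color
        color = random.randint(0, 255)
    return position, color

factors = [(0, 0)]
for _ in range(9):
    factors.append(step(*factors[-1]))
print(factors)   # one (position, color) pair per frame
```
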
MS-COCO captions |
no |
Victor Milewski |
yes |
123k images, 5 captions each. Used for image captioning and as language input for scene graph generation (SGG). Has a 51k-image overlap with Visual Genome. |
|
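A minimal sketch of reading the captions with the pycocotools COCO API; the annotation path is an assumption and depends on which release was downloaded.

```python
# Hedged sketch: listing the (typically 5) captions of one MS-COCO image.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2017.json")
img_id = coco_caps.getImgIds()[0]
for ann in coco_caps.loadAnns(coco_caps.getAnnIds(imgIds=img_id)):
    print(ann["caption"])
```
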
Visual Genome |
no |
Victor Milewski & Wolf Nuyts |
yes |
108k images, each with a scene graph, 50 region descriptions and region graphs, ~35 objects, and ~21 pairwise relationships. |
|
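A hedged sketch of how the scene-graph annotations can be inspected, assuming the JSON files of the public v1.x release (`relationships.json` and friends); the exact field names should be verified against the downloaded copy.

```python
# Hedged sketch: peeking at the pairwise relationships of one Visual Genome image.
import json

with open("relationships.json") as f:
    per_image = json.load(f)                    # one entry per image

entry = per_image[0]
print("image_id:", entry["image_id"])
for rel in entry["relationships"][:5]:
    print(rel["subject"], rel["predicate"], rel["object"])   # subject/object are object dicts
```
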
Flickr30k-entities |
no |
Victor Milewski |
yes |
Same as Flickr30k, but all objects are linked to the entities (typically noun phrases) in the captions. |
|
Penn Treebank |
no |
Victor Milewski |
yes |
Tree-annotated sentences. Used to evaluate the dependency-tree parsing capabilities of multimodal BERT models through probing. |
|
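The native Penn Treebank annotation is bracketed constituency trees, typically converted to dependencies for the probing mentioned above; a minimal sketch of reading the bracketed format with NLTK, using a made-up example sentence.

```python
# Hedged sketch: parsing a PTB-style bracketed tree with NLTK (the string is a toy example).
from nltk import Tree

tree = Tree.fromstring("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))")
print(tree.leaves())   # ['The', 'dog', 'barks']
print(tree.pos())      # [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```
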
Scene Trees for Flickr30k-entities |
yes |
Victor Milewski |
yes |
A reduction of the dependency trees to describe just the objects/head nouns: a dependency structure over the image regions. |
|
MSCOCO |
no |
Ruben Cartuyvels & Wolf Nuyts |
yes |
The captions and bounding boxes were used, but not the images, to train models that predict the bounding boxes from the captions. |
|
WorldTree v2 |
no |
Ruben Cartuyvels |
yes |
Dataset of multiple-choice elementary science exam questions and their answers, each linked to a set of textual facts that together explain the answer. Used to train an autoregressive retrieval model that retrieves these facts when given the question, for the TextGraphs workshop competition. |
|
BLLIP Wall Street Journal Corpus |
no |
Ruben Cartuyvels |
yes |
Sentences and constituency tree annotations, used to pretrain text encoders with next-token prediction on linearized parse trees. After pretraining on this dataset, we used these text encoders for text-to-layout prediction (on MSCOCO). |
|
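A toy illustration of what "next-token prediction on linearized parse trees" can look like, not the project's actual preprocessing: the bracketed constituency tree is flattened into one token sequence that a language model reads left to right.

```python
# Toy sketch: depth-first linearization of a constituency tree into a token sequence.
from nltk import Tree

def linearize(node):
    if isinstance(node, str):                 # leaf token
        return [node]
    tokens = [f"({node.label()}"]             # opening non-terminal
    for child in node:
        tokens.extend(linearize(child))
    tokens.append(")")                        # closing bracket
    return tokens

t = Tree.fromstring("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))")
print(" ".join(linearize(t)))
# (S (NP (DT The ) (NN dog ) ) (VP (VBZ barks ) ) )
```
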
USCOCO |
no |
Ruben Cartuyvels & Wolf Nuyts |
no |
Unexpected Situations with Common Objects in Context: an evaluation set for text-to-layout prediction of grammatically correct sentences and layouts (sets of bounding boxes that represent visually "imagined" situations), describing compositions of entities and relations that are unlikely to be found in the COCO training data |
|
CC-500 |
no |
Wolf Nuyts |
yes |
Dataset containing 500 sentences, each mentioning 2 objects with a color for each object, used to test whether text-to-image models generate the correct colors. |
|
ABC-6K |
no |
Wolf Nuyts |
yes |
Subset of MS-COCO captions that contain at least 2 color attributes |
|
DAA-200 |
yes |
Wolf Nuyts |
no |
Difficult Adversarial Attributes: a dataset mined from Visual Genome, containing 100 graphs with images from the VSG dataset, where each graph contains two nodes with one attribute each. For each graph an adversarial graph is generated by swapping the attributes of the two objects. For each of the 200 graphs a sentence of the form "A 〈attribute 1〉 〈object 1〉 and a 〈attribute 2〉 〈object 2〉" is generated. |
|
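A toy illustration of the construction described above, not the actual mining code: the adversarial graph swaps the attributes of the two objects, and both graphs are realized with the sentence template.

```python
# Toy sketch: attribute swap and sentence templating for a DAA-200-style pair.
def to_sentence(attr1, obj1, attr2, obj2):
    return f"A {attr1} {obj1} and a {attr2} {obj2}"

(a1, o1), (a2, o2) = ("red", "car"), ("blue", "bicycle")   # made-up graph
print(to_sentence(a1, o1, a2, o2))   # original:    "A red car and a blue bicycle"
print(to_sentence(a2, o1, a1, o2))   # adversarial: "A blue car and a red bicycle"
```
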
BISON-0.6 |
yes |
Maria Trusca |
no |
1,437 images, with two captions allocated per image: the first caption describes the image and the second indicates how the image should be edited. A set of word alignments between each pair of captions is also specified. The images and the captions are extracted from the BISON dataset: https://arxiv.org/pdf/1901.06595.pdf |
|
Dream |
yes |
Maria Trusca |
no |
100 images. The dataset is defined in the same way as BISON-0.6. The images are generated using Wombo Dream (https://dream.ai/create). |
|
SentEval and DiscoEval in Spanish |
yes |
Vladimir Araujo |
yes |
Benchmarks to evaluate representations from pre-trained language models in Spanish. |
|
SentEval and DiscoEval |
no |
Vladimir Araujo |
yes |
Benchmarks to evaluate representations from pre-trained language models in English. |
|
Pragmeval |
no |
Vladimir Araujo |
yes |
Benchmark to evaluate the pragmatic knowledge of pretrained language models. |
|
bAbI |
no |
Vladimir Araujo |
yes |
Synthetic dataset for story understanding through question answering. |
|
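A hedged sketch of the bAbI plain-text format: line numbers reset at the start of each story, and question lines carry the answer and supporting-fact line numbers after tabs.

```python
# Hedged sketch: parsing a bAbI task file into stories of statements and questions.
def parse_babi(path):
    stories, story = [], []
    with open(path) as f:
        for line in f:
            idx, _, text = line.strip().partition(" ")
            if idx == "1" and story:                  # a new story begins
                stories.append(story)
                story = []
            if "\t" in text:                          # question line
                question, answer, supports = text.split("\t")
                story.append(("Q", question.strip(), answer, supports.split()))
            else:
                story.append(("S", text))
    if story:
        stories.append(story)
    return stories
```
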
NarrativeQA |
no |
Vladimir Araujo |
yes |
Dataset for question answering over long, realistic texts. |
|
ActivityNet-QA |
no |
Vladimir Araujo |
yes |
Dataset for question answering from videos. |
|
MNIST |
no |
Aristotelis Chrysakis |
yes |
Contains grayscale images of handwritten digits. Each image is 28×28 pixels and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images. |
|
FashionMNIST |
no |
Aristotelis Chrysakis |
yes |
Contains grayscale images of clothing items. Each image is 28×28 pixels and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images. |
|
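Both MNIST and FashionMNIST are available through torchvision with the 60,000/10,000 split described above; a minimal loading sketch (swap in `datasets.FashionMNIST` for the clothing variant).

```python
# Hedged sketch: loading MNIST via torchvision; FashionMNIST works identically.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
train = datasets.MNIST(root="data", train=True,  download=True, transform=to_tensor)
test  = datasets.MNIST(root="data", train=False, download=True, transform=to_tensor)

image, label = train[0]
print(image.shape, label)      # torch.Size([1, 28, 28]) and an integer class in 0..9
print(len(train), len(test))   # 60000 10000
```
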
CIFAR-10 |
no |
Aristotelis Chrysakis |
yes |
A collection of 60,000 colored images of dimension 32×32. Each image contains an object out of 10 different classes, such as airplanes, cars, birds, etc. |
|
CIFAR-100 |
no |
Aristotelis Chrysakis |
yes |
A collection of 60,000 colored images of dimension 32×32. Each image contains an object out of 100 different classes. |
|
tinyImageNet |
no |
Aristotelis Chrysakis |
yes |
It contains 100,000 colored images of dimension 64×64 split into 200 classes. It is a subset of the original ImageNet dataset. |
|
BOLD5000 |
no |
Jingyuan Sun |
yes |
|
Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2019). BOLD5000, a public fMRI dataset while viewing 5000 visual images. Scientific data, 6(1), 49. |
|
no |
Jingyuan Sun |
yes |
|
Horikawa, T., & Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. Nature communications, 8(1), 15037. |
|
no |
Jingyuan Sun |
yes |
|
Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., ... & Fedorenko, E. (2018). Toward a universal decoder of linguistic meaning from brain activation. Nature communications, 9(1), 963. |
|
no |
Jingyuan Sun |
yes |
|
Wang, S., Zhang, X., Zhang, J. et al. A synchronized multimodal neuroimaging dataset for studying brain language processing. Sci Data 9, 590 (2022). https://doi.org/10.1038/s41597-022-01708-5 |