Data Collection and Preprocessing

| Name | Self-created? | Users | Work published? | Description | Link/reference |
|---|---|---|---|---|---|
| CoCa Causes | yes | Nathan Cornille | yes | Causal direction / confounder labels for objects in Conceptual Captions images. | |
| SMS-TRIS | yes | Nathan Cornille | no | Sparse Mechanism Shifted TempoRally Intervened Sequences: variants of the TRIS dataset in which one mechanism at a time is shifted. | |
| SuperGLUE | no | Nathan Cornille | yes | Benchmark of various difficult language understanding tasks. | https://super.gluebenchmark.com/ |
| Conceptual Captions | no | Nathan Cornille | yes | 3.3 million image-caption pairs. | |
| VQA v2 | no | Nathan Cornille | yes | 0.25M images with 0.76M questions and 10M answers. | |
| Flickr30k | no | Nathan Cornille | yes | 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators. | |
| TRIS | no | Nathan Cornille | yes | Synthetic video frames generated from unobserved causal factors (e.g., position, color). The factors evolve according to fixed causal mechanisms, except that they are sometimes intervened on (a toy generator sketch follows the table). | |
| MS-COCO captions | no | Victor Milewski | yes | 123k images, 5 captions each. Used for image captioning and as language input for scene graph generation (SGG). Has a 51k-image overlap with Visual Genome. | |
| Visual Genome | no | Victor Milewski & Wolf Nuyts | yes | 108k images, each with a scene graph, 50 region descriptions and region graphs, ~35 objects, and ~21 pairwise relationships. | |
| Flickr30k-entities | no | Victor Milewski | yes | Same as Flickr30k, but all objects are linked to the entities (often noun phrases) in the caption. | |
| Penn Treebank | no | Victor Milewski | yes | Tree-annotated sentences. Used to evaluate the dependency parsing capabilities of multimodal BERT models through probing. | |
| Scene Trees for Flickr30k-entities | yes | Victor Milewski | yes | A reduction of dependency trees to just the objects/head nouns, yielding a dependency structure over the image regions. | |
| MSCOCO | no | Ruben Cartuyvels & Wolf Nuyts | yes | Only the captions and bounding boxes (not the images) were used, to train models that predict the bounding boxes from the captions (see the pairing sketch after the table). | |
| WorldTree v2 | no | Ruben Cartuyvels | yes | Multiple-choice elementary science exam questions and their answers, each linked to a set of textual facts that together explain the answer. Used to train an autoregressive retrieval model that retrieves these facts given the question, for the TextGraphs workshop competition. | |
| BLLIP Wall Street Journal Corpus | no | Ruben Cartuyvels | yes | Sentences with constituency tree annotations, used to pretrain text encoders with next-token prediction on linearized parse trees. After pretraining on this dataset, the text encoders were used for text-to-layout prediction (on MSCOCO). | |
| USCOCO | no | Ruben Cartuyvels & Wolf Nuyts | no | Unexpected Situations with Common Objects in Context: an evaluation set for text-to-layout prediction, with grammatically correct sentences and layouts (sets of bounding boxes representing visually “imagined” situations) that describe compositions of entities and relations unlikely to be found in the COCO training data. | |
| CC-500 | no | Wolf Nuyts | yes | 500 sentences, each mentioning two objects with one color per object, used to test whether text-to-image models generate the correct colors. | |
| ABC-6K | no | Wolf Nuyts | yes | Subset of MS-COCO captions containing at least two color attributes. | |
| DAA-200 | yes | Wolf Nuyts | no | Difficult Adversarial Attributes: a dataset mined from Visual Genome, containing 100 graphs with images from the VSG dataset, where each graph contains two nodes with one attribute each. For each graph, an adversarial graph is generated by swapping the attributes of the two objects. For each of the 200 graphs, a sentence of the form “A 〈attribute 1〉 〈object 1〉 and A 〈attribute 2〉 〈object 2〉” is generated (see the construction sketch after the table). | |
| BISON-0.6 | yes | Maria Trusca | no | 1,437 images, each with two captions: the first describes the image and the second indicates how the image should be edited. A set of word alignments between each pair of captions is also specified. The images and captions are extracted from the BISON dataset. | https://arxiv.org/pdf/1901.06595.pdf |
| Dream | yes | Maria Trusca | no | 100 images. The dataset follows the same format as BISON-0.6; the images are generated using Wombo Dream. | https://dream.ai/create |
| SentEval and DiscoEval in Spanish | yes | Vladimir Araujo | yes | Benchmarks to evaluate representations from pretrained language models in Spanish. | |
| SentEval and DiscoEval | no | Vladimir Araujo | yes | Benchmarks to evaluate representations from pretrained language models in English. | |
| Pragmeval | no | Vladimir Araujo | yes | Benchmark to evaluate the pragmatic knowledge of pretrained language models. | |
| bAbI | no | Vladimir Araujo | yes | Synthetic dataset for story understanding through question answering. | |
| NarrativeQA | no | Vladimir Araujo | yes | Dataset for question answering over long, realistic texts. | |
| ActivityNet-QA | no | Vladimir Araujo | yes | Dataset for question answering over videos. | |
| MNIST | no | Aristotelis Chrysakis | yes | Grayscale images of handwritten digits. Each image is 28×28 and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images (a loading sketch for these image-classification benchmarks follows the table). | |
| FashionMNIST | no | Aristotelis Chrysakis | yes | Grayscale images of clothing items. Each image is 28×28 and belongs to exactly one of 10 classes. The data is split into a training set of 60,000 images and a test set of 10,000 images. | |
| CIFAR-10 | no | Aristotelis Chrysakis | yes | A collection of 60,000 color images of size 32×32. Each image contains an object from one of 10 classes, such as airplanes, cars, and birds. | |
| CIFAR-100 | no | Aristotelis Chrysakis | yes | A collection of 60,000 color images of size 32×32. Each image contains an object from one of 100 classes. | |
| tinyImageNet | no | Aristotelis Chrysakis | yes | 100,000 color images of size 64×64 split into 200 classes. A subset of the original ImageNet dataset. | |
| BOLD5000 | no | Jingyuan Sun | yes | Public fMRI dataset recorded while participants viewed 5,000 visual images. | Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2019). BOLD5000, a public fMRI dataset while viewing 5000 visual images. Scientific Data, 6(1), 49. |
| | no | Jingyuan Sun | yes | fMRI dataset for generic decoding of seen and imagined objects using hierarchical visual features. | Horikawa, T., & Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications, 8(1), 15037. |
| | no | Jingyuan Sun | yes | fMRI dataset for decoding linguistic meaning from brain activation. | Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., ... & Fedorenko, E. (2018). Toward a universal decoder of linguistic meaning from brain activation. Nature Communications, 9(1), 963. |
| | no | Jingyuan Sun | yes | Synchronized multimodal neuroimaging dataset for studying brain language processing. | Wang, S., Zhang, X., Zhang, J., et al. (2022). A synchronized multimodal neuroimaging dataset for studying brain language processing. Scientific Data, 9, 590. https://doi.org/10.1038/s41597-022-01708-5 |
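
The TRIS and SMS-TRIS rows describe frames whose latent factors evolve under fixed causal mechanisms with occasional interventions. The following is a minimal toy sketch of that kind of generator, not the actual TRIS code; the two factors, their mechanisms, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mechanism_position(pos):
    """Assumed fixed causal mechanism: the object drifts rightward."""
    return pos + 0.1

def mechanism_color(hue, pos):
    """Assumed fixed causal mechanism: hue depends on the current position."""
    return 0.9 * hue + 0.1 * np.sin(pos)

def generate_sequence(length=20, p_intervene=0.1):
    """Roll out the latent factors; occasionally intervene on the position."""
    pos, hue = 0.0, 0.5
    factors = []
    for _ in range(length):
        pos = mechanism_position(pos)
        if rng.random() < p_intervene:
            # Intervention: overwrite the factor with a random value,
            # breaking its mechanism for this time step.
            pos = rng.uniform(-1.0, 1.0)
        hue = mechanism_color(hue, pos)
        factors.append((pos, hue))
    return factors  # rendering the factors to pixel frames is omitted
```

In SMS-TRIS terms, a variant would shift one mechanism at a time, e.g., replacing `mechanism_color` while keeping the others fixed.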
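
For the MSCOCO text-to-layout setup (captions and bounding boxes used without the images), the data pairing can be sketched with pycocotools. This is a hedged sketch, not the project's actual pipeline; the annotation file paths assume the standard COCO 2017 layout.

```python
from pycocotools.coco import COCO

# Assumed standard COCO 2017 annotation files.
captions = COCO("annotations/captions_train2017.json")
instances = COCO("annotations/instances_train2017.json")

pairs = []  # (caption, list of (category name, [x, y, width, height]))
for img_id in captions.getImgIds():
    caps = [ann["caption"]
            for ann in captions.loadAnns(captions.getAnnIds(imgIds=img_id))]
    boxes = [(instances.cats[ann["category_id"]]["name"], ann["bbox"])
             for ann in instances.loadAnns(instances.getAnnIds(imgIds=img_id))]
    for cap in caps:
        pairs.append((cap, boxes))
```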
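
The DAA-200 construction (swap the two objects' attributes, then verbalize both graphs with the template from the table) can be illustrated as follows; the example graph is made up, not an actual DAA-200 entry.

```python
def swap_attributes(graph):
    """Build the adversarial graph by swapping the two objects' attributes."""
    (obj1, attr1), (obj2, attr2) = graph
    return [(obj1, attr2), (obj2, attr1)]

def to_sentence(graph):
    """Verbalize a two-node graph with the template from the table."""
    (obj1, attr1), (obj2, attr2) = graph
    return f"A {attr1} {obj1} and A {attr2} {obj2}"

graph = [("horse", "brown"), ("car", "red")]
print(to_sentence(graph))                   # A brown horse and A red car
print(to_sentence(swap_attributes(graph)))  # A red horse and A brown car
```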
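
The image-classification benchmarks above (MNIST, FashionMNIST, CIFAR-10/100) have standard torchvision loaders; the following is a minimal loading sketch, assuming torchvision is installed and `data/` is used as the download root. tinyImageNet ships as a plain folder tree without a built-in torchvision loader, so it is omitted here.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# Train/test splits match the sizes listed in the table.
mnist_train = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
mnist_test = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
fashion = datasets.FashionMNIST("data", train=True, download=True, transform=to_tensor)
cifar10 = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)
cifar100 = datasets.CIFAR100("data", train=True, download=True, transform=to_tensor)

assert len(mnist_train) == 60_000 and len(mnist_test) == 10_000
assert mnist_train[0][0].shape == (1, 28, 28)     # grayscale 28×28
assert cifar10.data.shape == (50_000, 32, 32, 3)  # color 32×32, train split
```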