Datasets

= Datasets and Models, Ingredients and Recipes =

This is not just a list of random choices and tools. It is a critical consideration: our choices of datasets and models matter for what we make.

"All data are local. Indeed, data are cultural artifacts created by people, and their dutiful machines, at a time, in a place, and with the instruments at hand for audiences that are conditioned to receive them." [...] "[We must learn] to analyze data settings rather than data sets." (Loukissas)

Existing datasets perpetuate under-representation and "a range of harmful and problematic representation." They "use cheap tricks", "make ethically dubious questions seem answerable", and "strips away context" (Paullada et al. 2020).

Data: Every Input Was Someone Else's Output
Note: This curated list of datasets is a sketch in progress and is continually updated. It will include a discussion of what makes a dataset more ethical and more critical, and of whether such a thing is possible.

Text and Image

 * WIT, Wikipedia-based Image Text Dataset, Google's open-source multimodal scraping of Wikipedia
 * Conceptual Captions

Image Data

 * deepDataset Firefox extension, build your own
 * COCO, Microsoft
 * ImageNet, built on WordNet, Stanford Vision Lab
 * Open Images, Google
 * Better Images of AI, stock images of metaphors for AI

Text Data

 * CommonCrawl ?
 * C4: Colossal Cleaned Crawled Corpus, see T5 under Models below
 * Newsroom (https://paperswithcode.com/paper/newsroom-a-dataset-of-13-million-summaries), "1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction"
 * NYTimes Annotated Corpus, "over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata"

Audio Data

 * Mozilla CommonVoice, "open source, multi-language dataset of voices that anyone can use to train speech-enabled applications", 2021-07-21, 65 GB, 75,879 voices
 * WikiCommons

Models
"When assessing whether a task is solvable, we first need to ask: should it be solved? And if so, should it be solved by AI?" (Jacobsen et al. 2020)

Language Processing

 * DeepSpeech, Mozilla (created from CommonVoice)
 * Word2Vec
 * T5 using C4: Colossal Cleaned Crawled Corpus