Datasets

From Intersectional AI Toolkit


A Critical Field Guide to Working with Machine Learning Datasets is coming soon, in collaboration with the Knowing Machines research project.

Datasets and Models, Ingredients and Recipes

This is not just a list of random choices and tools; it is a critical consideration. Our choices matter for what we make.

"All data are local. Indeed, data are cultural artifacts created by people, and their dutiful machines, at a time, in a place, and with the instruments at hand for audiences that are conditioned to receive them." [...] "[We must learn] to analyze data settings rather than data sets." (Loukissas)
Existing datasets perpetuate under-representation and "a range of harmful and problematic representation." They "use cheap tricks", "make ethically dubious questions seem answerable", and "[strip] away context" (Paullada et al. 2020).

Data: Every Input Was Someone Else's Output

Note: This curated list of datasets is a sketch in progress and is continually updated. It will include a discussion of what makes a dataset more ethical and more critical, and of whether such a thing is possible.

Text and Image

Image Data

Text Data

Audio Data

  • Mozilla Common Voice, an "open source, multi-language dataset of voices that anyone can use to train speech-enabled applications" (as of 2021-07-21: 65 GB, 75,879 voices)
  • WikiCommons

Models

"When assessing whether a task is solvable, we first need to ask: should it be solved? And if so, should it be solved by AI?" (Jacobsen et al. 2020)

Language Processing