Datasets

From Intersectional AI Toolkit


A Critical Field Guide to Working with Machine Learning Datasets is now available in collaboration with the Knowing Machines research project. The field guide aims to help you practically navigate the complexities of datasets, and explore the implications of what you choose, build, and design. It invites you to mess with these messy forms and to approach any logic of classification with a critical eye.

Datasets and Models, Ingredients and Recipes

This is not just a list of random choices and tools; it is a critical consideration. Our choices matter for what we make.

"All data are local. Indeed, data are cultural artifacts created by people, and their dutiful machines, at a time, in a place, and with the instruments at hand for audiences that are conditioned to receive them." [...] "[We must learn] to analyze data settings rather than data sets." (Loukissas)

Existing datasets perpetuate under-representation and "a range of harmful and problematic representation." They "use cheap tricks," "make ethically dubious questions seem answerable," and "[strip] away context" (Paullada et al. 2020).

Data: Every Input Was Someone Else's Output

Note: This curation of datasets is a sketch in progress that is continually updated. It will include a discussion of what makes a more ethical, more critical dataset. Is such a thing possible?

Text and Image

Image Data

Text Data

Audio Data

  • Mozilla Common Voice, "open source, multi-language dataset of voices that anyone can use to train speech-enabled applications", 2021-07-21, 65 GB, 75,879 voices
  • WikiCommons

Models

"When assessing whether a task is solvable, we first need to ask: should it be solved? And if so, should it be solved by AI?" (Jacobsen et al. 2020)

Language Processing