Catherine Yeo

We Need to Change How Image Datasets are Curated

Why many gold-standard computer vision datasets, such as ImageNet, are flawed

ImageNet


Even though it was created in 2009, ImageNet remains the most influential dataset in computer vision and AI today. Consisting of more than 14 million human-annotated images, ImageNet has set the standard for large-scale datasets in AI. For years, it also hosted an annual competition (ILSVRC) to benchmark progress in the field.


There’s no denying ImageNet’s influence and importance in computer vision. However, with growing evidence of bias in AI models and datasets, future datasets must be curated with an awareness of ethics and social context.


A recent paper by Vinay Prabhu and Abeba Birhane identifies serious concerns in such large-scale datasets (primarily ImageNet, but also others, including 80 Million Tiny Images and CelebA). The authors also outline solutions to mitigate these concerns and call for mandatory Institutional Review Board (IRB) oversight of large-scale dataset curation.


This article summarizes their findings below. You can read their full preprint on arXiv here.


One Line Summary

Large-scale image datasets have issues that we must address and mitigate in future dataset curation processes.


Harms and Threats

1) Lack of Consent

Many of these large-scale datasets freely gather photos, including photos of real people, without considering consent. In the Open Images V4–V6 dataset, Prabhu and Birhane found “verifiably non-consensual images” of children scraped from the photo-sharing community Flickr.


Photographers aren’t supposed to publish your photos for the whole world to see without your consent, so why shouldn’t image datasets have to account for consent too?



2) Loss of Privacy

When ImageNet was published, reverse image search was not widely available. Now, image-scraping tools are widespread, and powerful reverse image search engines (e.g. Google Image Search, PimEyes) allow anyone to uncover the real identities of the people whose faces appear in a large image dataset.


With a simple reverse lookup, anyone could potentially find a person’s full name, social media accounts, occupation, home address, and many other data points that person never agreed to give away. (They may not have agreed to give their face to the dataset in the first place.)


3) Perpetuation of Harmful Stereotypes

How a dataset is labelled and curated can perpetuate notions of what, and who, is perceived as “desirable”, “normal”, and “acceptable”, while individuals and groups on the margins are cast as “outliers”.


For example, MIT’s 80 Million Tiny Images dataset contains harmful slurs, labeling images of women as “whores” or “bitches” and images of minority racial groups with offensive language.


Once trained on biased data, machine learning algorithms can not only normalize but also amplify these stereotypes.


Solutions

1) Remove and Replace

There is precedent: ImageNet removed photos within its “person” subtree once they were recognized to have “potentially offensive labels”. Similar action could be taken for other datasets with offensive labels and non-consensually captured photos: remove them, then replace them (where possible) with consensually shot, financially compensated images.


2) Differential Privacy

Another solution is to blur or obfuscate individuals’ identities using differential privacy. Differential privacy is a framework with quantifiable privacy guarantees: aggregate information about a dataset can be shared publicly while information about any single individual is withheld.
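As a rough illustration of what face obfuscation could look like in practice, here is a minimal Python sketch (using OpenCV) that detects faces and blurs them. This is my own illustrative example, not the method from the paper: blurring alone is plain obfuscation, and a formally differentially private approach would additionally add noise calibrated to a privacy budget. The function name and file paths are hypothetical.

```python
# Minimal face-obfuscation sketch. Blurring is not, by itself, differential
# privacy; a formal guarantee would require calibrated noise (e.g. the
# Laplace mechanism) on top of this kind of obfuscation.
import cv2

def blur_faces(image_path: str, output_path: str) -> None:
    image = cv2.imread(image_path)
    # Off-the-shelf Haar cascade face detector shipped with OpenCV.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Replace each detected face region with a heavily blurred version.
        image[y:y + h, x:x + w] = cv2.GaussianBlur(
            image[y:y + h, x:x + w], (51, 51), 0
        )
    cv2.imwrite(output_path, image)

# Hypothetical usage:
blur_faces("dataset_image.jpg", "dataset_image_blurred.jpg")
```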


3) Dataset Audit Cards

If the publication of a dataset were accompanied by an audit card, everyone using it would be aware of the dataset’s goals, curation process, limitations, and so on. This is quite similar to the concept of model cards accompanying machine learning models.
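To make the idea concrete, here is a minimal sketch of what a machine-readable dataset audit card might contain. The field names and example values below are my own assumptions for illustration, not the exact audit card template proposed by Prabhu and Birhane.

```python
# Illustrative dataset audit card structure; fields are assumptions, not the
# paper's exact template.
from dataclasses import dataclass, field

@dataclass
class DatasetAuditCard:
    name: str
    goals: str                  # intended uses of the dataset
    curation_process: str       # how images were sourced and labeled
    consent_policy: str         # how subject consent and privacy were handled
    known_limitations: list = field(default_factory=list)
    audit_findings: list = field(default_factory=list)

# Hypothetical example for a made-up dataset.
card = DatasetAuditCard(
    name="ExampleImages-1M",
    goals="Benchmarking object recognition research",
    curation_process="Web-scraped images with crowdsourced labels",
    consent_policy="Faces obfuscated; non-consensual images removed",
    known_limitations=["Geographic skew in image sources"],
    audit_findings=["Offensive label categories removed after review"],
)
print(card)
```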


Final Thoughts

This was a fascinating and timely read; it really made me question my own decision-making process when I gather and work with data. I sincerely hope that this paper (and other similar work) motivates researchers to rethink how large-scale datasets are curated so as to minimize and avoid these harms and threats.


For more information, check out the original paper on arXiv here.


Vinay Uday Prabhu and Abeba Birhane. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” arXiv preprint arXiv:2006.16923, 2020.
