Child sexual abuse and exploitation images identified in dataset used for developing AI moderation tools, analysis by Canadian child protection charity finds
The image dataset, commonly used to train AI nudity detection models, has been frequently downloaded and used by the research community
For Immediate Release
Winnipeg, Canada — A large image dataset used by researchers to develop AI tools for detecting sexually explicit content contains hundreds of child sexual abuse and exploitation images, an analysis by the Canadian Centre for Child Protection (C3P) has found. This finding raises a number of ethical concerns for those conducting research or investing in the field of artificial intelligence (AI).
The image collection, known as the NudeNet dataset, contains more than 700,000 images harvested from a number of online sources, including social media, image hosting services, and adult pornography websites.
An analysis of the NudeNet dataset, using a combination of image-matching technology and manual review, surfaced nearly 680 images that C3P knows or suspects to be child sexual abuse and exploitation material (CSAEM) or other harmful and abusive material involving minors. The review did not include a search for CSAEM not previously known to C3P.
Details of the findings include:
- More than 120 images of identified or known victims of CSAEM, including Canadian and American survivors;
- Nearly 70 images focused on the genital/anal area of children who are confirmed or appear to be pre-pubescent;
- Over 130 images focused on the genital/anal area of children who are confirmed or appear to be post-pubescent;
- Images that in some cases depict sexual or abusive acts involving children and teenagers, such as fellatio or penile-vaginal penetration.
C3P has since issued a removal notice to the administrators of the Academic Torrents website, which had been making the user-generated dataset available for download since June 2019. As of the time of publication, the identified images are no longer available from this web service.
These findings mirror the results of research conducted by Stanford University’s Cyber Policy Center in 2023, which found over a thousand CSAEM images in the open access LAION-5B image dataset — a massive media collection used to train popular image generation tools such as Stable Diffusion. C3P supported the Cyber Policy Center’s research by helping validate their findings through the use of its Canadian-built technology platform, Project Arachnid.
Ethical considerations for AI researchers and industry
In addition to the above findings, C3P found more than 250 academic works that cited the NudeNet dataset, used it directly, or used an AI image classifier trained on it. A non-exhaustive review of 50 of these academic projects found that 13 made use of the NudeNet dataset and 29 relied on the NudeNet classifier or model.
C3P identified at least one research group from institutions based in Canada that used the NudeNet dataset.
“As countries continue to invest in the development of AI technology, it’s crucial that researchers and industry consider the ethics of their work every step of the way,” says Lloyd Richardson, C3P's director of technology.
“Many of the AI models used to support features in applications and research initiatives have been trained on data that has been collected indiscriminately or in ethically questionable ways. This lack of due diligence has led to the appearance of known child sexual abuse and exploitation material in these types of datasets, something that is largely preventable,” says Richardson.
In light of these findings, C3P recommends the following measures to help prevent the inadvertent inclusion of CSAEM in AI research and to advance ethical standards and best practices in this field:
- Training dataset distributors and users of such resources should take appropriate steps to ensure CSAEM is not present in the datasets, which may include working with relevant authorities or organizations capable of verifying images;
- Training dataset and AI model distributors should ensure users have the ability to report abusive or illegal content discovered in a dataset;
- Academic Research Ethics Boards (REBs), and institutions that otherwise review and approve research projects, should closely examine research proposals related to AI, seek information about the source data to be used by research teams, and require teams to take reasonable steps to ensure datasets do not contain CSAEM or violate the privacy of children;
- Governments should implement laws and regulations governing the ethical development and use of AI technologies.
Canadian Centre for Child Protection
1 (204) 560-0723
communications@protectchildren.ca