The images may have helped AI systems produce realistic sexual imagery of fictitious children; the database was taken down in response.
Thousands of images of child sexual abuse are hidden in the foundations of popular artificial intelligence (AI) image generators, according to new research published on Wednesday. In response to the study, the operators of some of the largest and most widely used collections of images for training AI disabled access to them.
The Stanford Internet Observatory found more than 3,200 images of suspected child sexual abuse in the massive AI database LAION, an index of web images and captions used to train prominent AI image generators such as Stable Diffusion. The Stanford University-based watchdog group worked with the Canadian Centre for Child Protection and other anti-abuse organizations to identify the illegal material and report the original photo URLs to law enforcement. More than 1,000 of the suspected images were confirmed to be child sexual abuse material.
“We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images,” the study’s authors concluded.
The reaction was swift. On the eve of the Stanford Internet Observatory report’s release on Wednesday, LAION announced that it was temporarily withdrawing its datasets. LAION, a non-profit whose name stands for Large-scale Artificial Intelligence Open Network, said in a statement that it “has a zero tolerance policy for illegal content, and we have taken down the LAION datasets to ensure they are safe before republishing them.”
While the images make up only a small fraction of LAION’s database of roughly 5.8 billion images, the Stanford researchers say they are likely influencing the ability of AI tools to generate harmful outputs and compounding the prior abuse of real victims who appear in the dataset multiple times.
According to the researchers, those images have made it easier for AI systems to produce realistic and explicit imagery of fake children, and to transform social media photos of fully clothed real kids into nudes, alarming schools and law enforcement around the world. Until recently, anti-abuse researchers believed the only way some unchecked AI tools produced abusive images of children was by combining what they had learned from two separate categories of web images: adult pornography and benign photos of children.
Because it is not possible to clean up the data retroactively, the Stanford Internet Observatory is calling for more drastic measures. One is for anyone who has built training sets from LAION-5B, named for the more than 5 billion image-text pairs it contains, to “delete them or work with intermediaries to clean the material.” Another is to effectively make an older version of Stable Diffusion disappear from all but the darkest corners of the internet.
“Legitimate platforms can stop offering versions for download,” particularly if they are frequently used to generate abusive images and lack safeguards to block them, said David Thiel, chief technologist at the Stanford Internet Observatory and author of the report.
It is not an easy problem to solve, and Thiel said it stems from many generative AI projects being “effectively rushed to market” and made freely available because the field is so competitive.
“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” Thiel said in a recent interview.
Stability AI, maker of the Stable Diffusion text-to-image models, is a significant LAION user that helped shape the dataset’s development. Newer versions of Stable Diffusion have made it much harder to create harmful content, but an older version introduced last year, which Stability AI says it did not release, is still baked into other applications and tools and remains “the most popular model for generating explicit imagery,” according to the Stanford report.
“We can’t go back on that. Many people have that model on their local machines,” said Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection, which operates Canada’s hotline for reporting online sexual abuse.
Stability AI said on Wednesday that it only hosts filtered versions of Stable Diffusion and that “Stability AI has taken proactive steps to mitigate the risk of misuse since taking over the exclusive development of Stable Diffusion.”
“Those filters prevent unsafe content from reaching the models,” the company said in a prepared statement. “By removing that content before it ever reaches the model, we can help to prevent the model from generating unsafe content.”
LAION said last week that before distributing its datasets it had built “rigorous filters” to detect and remove illegal content, and that it was still working to strengthen those filters. The Stanford report noted that LAION’s developers attempted to filter out “underage” explicit content, but might have done a better job had they consulted child safety experts earlier.
Much of LAION’s data comes from another source, Common Crawl, a repository of data regularly crawled from the open internet. But Rich Skrenta, executive director of Common Crawl, said it was “incumbent on” LAION to scan and filter what it took before using it.
Many text-to-image generators are derived in some way from the LAION database, though it is not always clear which ones. OpenAI, maker of DALL-E and ChatGPT, said it does not use LAION and has fine-tuned its models to refuse requests for sexual content involving minors.
Google built its text-to-image Imagen model using a LAION dataset but decided against making it public in 2022 after an audit of the database “uncovered a wide range of inappropriate content, including pornographic imagery, racist slurs, and harmful social stereotypes.”
LAION was founded by Christoph Schuhmann, a German researcher and teacher, who said earlier this year that part of the reason for making such a huge visual database freely available was to ensure that the future of AI development is not controlled by a handful of powerful companies.