Artificial intelligence corporations are under increasing scrutiny for the information used to train their products.
As pressure mounts on artificial intelligence corporations over the content used to train their products, the developer OpenAI has stated that it would be impossible to produce tools like its pioneering chatbot ChatGPT without access to copyrighted material.
Chatbots like ChatGPT and image generators like Stable Diffusion are “trained” on vast amounts of data taken from the internet, much of it protected by copyright – a legal safeguard against someone’s work being used without permission.
The New York Times sued OpenAI and Microsoft last month, accusing them of “unlawful use” of its work in creating their products. Microsoft is a major investor in OpenAI and uses OpenAI’s tools in its own products.
In a submission to the House of Lords communications and digital select committee, OpenAI said that without access to copyrighted work it could not train large language models such as GPT-4, the technology powering ChatGPT.
“Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” according to OpenAI in its submission, which was first reported by the British newspaper the Telegraph.
It continued, “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”
In response to the New York Times lawsuit, OpenAI stated last month that it recognized “the rights of content creators and owners.” AI companies frequently defend their use of copyrighted material by invoking the legal doctrine of “fair use,” which permits the use of content in certain circumstances without seeking the owner’s permission.
OpenAI said in its submission that “legally, copyright law does not forbid training.”
The New York Times lawsuit follows a number of other legal complaints filed against OpenAI. In September, 17 authors, including John Grisham, Jodi Picoult, and George RR Martin, sued OpenAI, alleging “systematic theft on a massive scale.”
Getty Images, which owns one of the world’s largest photo libraries, is suing Stability AI, the creator of Stable Diffusion, in the United States and in England and Wales for alleged copyright infringement. In the United States, Anthropic, the Amazon-backed firm behind the Claude chatbot, is being sued by a group of music publishers, including Universal Music, for allegedly using “innumerable” copyrighted song lyrics to train its model.
Asked about AI safety by the House of Lords committee, OpenAI said it encouraged independent analysis of its security measures. The submission said it supported “red-teaming” of AI systems, in which third-party researchers test a product’s safety by emulating the behavior of rogue actors.
Following an agreement reached at a global safety summit in the United Kingdom last year, OpenAI is among the businesses that have promised to collaborate with governments on safety testing their most powerful models before and after deployment.