AI, Copyright and Data Rights: Why Microsoft and Google Are Facing Lawsuits

Over in the United States, Microsoft and its collaborators are defending the methods used to train their artificial intelligence (AI) systems.

Developers have come under fire for potential copyright breaches involving data taken from unwitting users.

OpenAI, the company responsible for smash-hit AI tool ChatGPT, teamed up with Microsoft in a multi-billion-dollar partnership. The chatbot was front and centre of the AI revolution that amazed users worldwide. OpenAI is now in the third phase of its long-term collaboration with Microsoft, with AI systems and large language models (LLMs) among the leading lights of the partnership.

Faced with the prospect of a raft of lawsuits, the companies have looked to nip legal issues in the bud, with rumours circulating that they will take responsibility for any copyright infringement arising from material their AI software generates. They have been targeted in the States in a proposed class action lawsuit alleging that the creation of an AI-powered program relies on “software piracy on an unprecedented scale.”

It’s a complex issue, and the tech giant has pledged to protect its paying customers. Last year, Microsoft laid off its entire ethics and society team, a unit responsible for overseeing the ethical outcomes of its AI development. It did, however, maintain an Office of Responsible AI, which sets the rules and principles for its AI initiatives.

Microsoft and its partners and associates have at least had a positive result in their battle against a class action lawsuit that alleged they violated copyright laws.

The presiding judge, Jon Tigar, dismissed most of the claims filed by developers. There is still a long way to go before the wranglings are completely put to bed, however, and the legal debate around AI, code copying, data use and ownership rights is set to rumble on.

Microsoft aren’t the only huge tech company facing legal action, however.

Slightly closer to home, Google’s possible misuse of data across Europe is the subject of an investigation by Ireland’s Data Protection Commission. The allegation is that the company used confidential data in the development of its AI chatbot Gemini, formerly called Bard.

Gemini has proved controversial in its brief lifespan. At the heart of the ongoing legal battle is the developers’ alleged unauthorised use of data protected by copyright. The company has already been fined €250m by regulators in France for copyright breaches. It was determined that Gemini had been, at least in part, trained using data from news companies, agencies and publishers that the AI tool had scoured and processed. No licences were obtained, no permission was granted and no payment was made for the copyrighted content.

Google were also fined €50m in 2019 by the country’s data protection regulator, CNIL, for failing to obtain valid consent from users before running targeted ad campaigns using their data. The regulator found that Google’s data consent policies were neither easily accessible nor transparent, and that users were not given adequate information about consent or how their data is used.

Legal challenges are piling up for big businesses, with stock image companies, publishers, artists and musicians making claims that their protected material has been used without their consent.

Any commercial use of content generated by a large language model may carry a copyright risk.

Remember – your personal, confidential data may have been used in the training of these AI chatbots, or LLMs, if you use, or have used, any of these companies’ products and services.

Let’s examine the legal issues and the defences that may be put forward as the case gathers pace.

The primary issue is whether code taken from public repositories has been used in a way that infringes copyright. There is an argument that training and development involve the reproduction of licensed code without permission and without attribution to the copyright owners.

Time will tell if the law has been violated. Currently the argument centres on copyrighted code and data being used without permission having been expressly given. Though the code and information in question were publicly available, they were never placed in the public domain as free to use. Chatbot creators and developers may contest the claim, arguing that the AI system uses the code to generate new code, rather than simply reproducing it. Whether this constitutes direct infringement is a debate that will roll on for the foreseeable future.

We must also consider the prospect of open-source licensing violations.

Open-source licences attach certain conditions to the use or distribution of the code they cover. The Microsoft lawsuit continuing in the USA alleges that the chatbot violates those terms and conditions by using existing code without preserving its licensing information.

Many open-source licences demand that any work derived from the licensed code must also be released under the same licence. Other licences require attribution, another requirement that the AI system allegedly ignores, potentially violating the licensing terms.

Legal teams may also look to use the ‘fair use doctrine’ defence. This is a claim that their use of public code in AI development is legal because copyright law allows limited use of copyrighted material without permission for purposes such as news reporting, research, teaching, commentary or criticism.

This case is likely to help clarify what is considered fair use and how the doctrine applies to AI training data and, in doing so, could set an important precedent for future AI development.

Also up for consideration is whether the results generated by an LLM can constitute a derivative work of the original content used to train it. Copyright law would suggest that a derivative work is a new and original product, albeit one that contains aspects of existing content.

Intellectual property rights are another area of legal argument in which precedents for future AI development are still to be set.

Applying existing intellectual property laws to AI training raises a host of questions relating to content and its ownership. For instance, who owns AI output and how can anyone look at AI content and definitively claim ownership when an AI assistant has millions of pieces of code, copy and content at its disposal? Given the questions it raises, can AI-generated content ever be accurately and correctly copyrighted?

Clearly, the key issues around the birth of AI chatbots are complex, and they will continue to evolve as LLMs evolve and their knowledge bases expand.

The initial dismissal of most of the claims by Judge Tigar is a battle won by Microsoft and its affiliates, but the war is still to be won.