The Evolving Landscape of Code Licensing: Protecting Your Work from LLM Training and Use
Many developers want to share code online while preventing its use for training or operating large language models (LLMs). The motivation is clear: protecting intellectual property and preventing unauthorized machine exploitation. The legal and practical pathways to achieve this, however, are fraught with complexity.
The Legal Landscape and Enforceability Challenges
One of the most significant hurdles for creators is the enforceability of a "no-LLM" license. The ability of individual developers to take legal action against well-funded tech giants, who may disregard such clauses, is widely questioned. Precedent from copyright battles involving major publishing firms against large technology companies suggests that even established rights holders face immense challenges.
A key legal argument often invoked by LLM developers is "fair use," which may allow them to use copyrighted material for training, provided the source material was legally obtained. If training is deemed fair use, traditional copyright-based licensing might not apply, diminishing the power of restrictive clauses. Furthermore, objectively determining whether an LLM has been trained on specific code is extremely difficult without access to the model's training data, although some suggest that "watermarks" surfacing in model output could provide evidence, similar to claims made about copyrighted art.
Collective action, a "death by a thousand paper cuts" approach, is suggested as a way to pool resources and increase legal leverage against LLM providers. Phrases like "Bots strictly prohibited" have been proposed, but their interpretation in court remains untested.
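Alongside license language, some developers pair the "no bots" stance with crawler directives. A minimal robots.txt might disallow publicly documented AI crawler user agents such as OpenAI's GPTBot, Common Crawl's CCBot, and Google-Extended. Note this is a sketch: the list of agents changes over time, and compliance with robots.txt is entirely voluntary, so it signals intent rather than enforcing it.

```
# Block known AI-training crawlers (voluntary compliance only)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow everything else (e.g. ordinary search indexing)
User-agent: *
Allow: /
```

Placing this file at the site root affects scraping, not licensing; a crawler that ignores it faces no technical barrier, which is precisely the enforceability gap discussed above.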
Impact on Open Source and Project Adoption
Creating a non-standard license, particularly one that restricts usage for LLM training, carries significant implications for a project's adoption and its status within the open-source ecosystem. The Open Source Definition (OSD) explicitly states that a license must not discriminate against fields of endeavor (Clause 6) or persons/groups (Clause 5). A "no-LLM" clause would likely violate these principles, meaning such a license would not be considered open source.
For projects aiming for widespread adoption, especially within corporate environments or Linux distribution repositories, a non-standard or restrictive license often acts as a deterrent. Legal departments are typically risk-averse and will reject projects with untested or complex licenses due to the review overhead and potential liabilities.
However, this "deterrent" effect isn't universally seen as negative. For personal projects, a creator might intentionally use a modified license to scare away commercial entities, or "vampires," who consume resources without contributing back. One creative suggestion is to publish under a modified MIT license that deters general use, while offering a "clean" MIT license to companies willing to pay a monthly fee or contribute engineering hours toward maintenance.
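A modified license also complicates machine-readable license declarations. SPDX reserves the LicenseRef- prefix for licenses not on its official list, so a hypothetical "MIT plus no-LLM clause" could at least be declared unambiguously in file headers. The identifier below is illustrative, not a recognized license, and assumes the custom license text is shipped alongside the code per the REUSE convention:

```
# SPDX-License-Identifier: LicenseRef-MIT-NoLLM
#
# LicenseRef- identifiers are user-defined; the full license text
# would live at LICENSES/LicenseRef-MIT-NoLLM.txt in the repository
# so that tooling can resolve it.
```

This does not make the clause enforceable, but it avoids the separate problem of scanners misidentifying the project as plain MIT.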
Practical Considerations and Alternatives
Beyond legal and definitional challenges, practicalities abound. The JSON license, with its famous "The Software shall be used for Good, not Evil" clause, serves as a historical example of how subjective terms can render a license problematic for objective compliance and enforcement. A "no-LLM" clause, while seemingly more objective, still faces the challenge of detection and proving infringement.
The most straightforward and universally acknowledged "no-LLM license" is simply to keep your code private and not share it publicly. This eliminates the possibility of it being scraped or used without explicit permission. Other, more radical ideas include designing an entirely new programming language where LLM training is inherently prohibited, making any LLM output in that language a violation.
Ultimately, developers are encouraged to consider their primary goals. Is it to prevent LLM training at all costs, even if it means sacrificing widespread adoption and open-source compatibility? Or is it to make a statement, knowing the legal path is uncertain? Focusing on code quality, as some veteran developers suggest, might not prevent LLM use but could differentiate human-crafted code from generic AI output.