Microsoft has struck a deal with HarperCollins Publishers that grants the tech giant the right to use the publisher's book catalogue to train its artificial intelligence models. The agreement, whose financial terms remain undisclosed, covers a substantial portion of HarperCollins' backlist, one of the largest English-language catalogues in existence.
A Licensed Data Strategy
The arrangement marks a turning point in how AI companies are approaching the thorny issue of copyright and training data. Rather than relying on internet-scraped material of questionable provenance, Microsoft is building a licensed data strategy — a model that OpenAI and Google are also quietly pursuing through their own publisher deals.
“The era of 'data as an afterthought' in AI development is over.”
— Priya Menon, EvoFutura
Per-Token Licensing Fees
For HarperCollins, the commercial logic is clear: books that earn negligible royalties from languishing backlist sales could now generate meaningful licensing revenue. The publisher has reportedly negotiated per-token usage fees, a novel licensing structure that attempts to tie compensation to the actual volume of AI training performed.
Critics, including the Authors Guild, argue that authors themselves should be direct beneficiaries of any such agreements, and that publisher-level deals do nothing to address the individual rights of the writers whose words power these models. The Guild has filed separate litigation against several AI companies, cases that remain pending.
The broader race for high-quality training data is intensifying. As frontier models hit diminishing returns on freely available web data, the ability to license premium corpora — books, scientific journals, legal records — is emerging as a key competitive moat. Microsoft's move signals that the era of 'data as an afterthought' in AI development is over.
