A new paper from the AI Disclosures Project accuses OpenAI of training its AI models, particularly GPT-4o, on paywalled books from O'Reilly Media without a licensing agreement. Using DE-COP, a membership inference method, the paper suggests that GPT-4o shows stronger recognition of paywalled O'Reilly book content than earlier models such as GPT-3.5 Turbo.
The co-authors, including Tim O'Reilly, probed OpenAI models' knowledge of O'Reilly Media books and found that GPT-4o recognized significantly more paywalled content than GPT-3.5 Turbo did. Although the findings are not definitive proof, they suggest that OpenAI may have used non-public books in its training data, raising concerns about copyright infringement.
OpenAI has previously been accused of training AI on copyrighted content without permission and is currently fighting several lawsuits over its training data practices. The company has licensing deals for some data and offers opt-out mechanisms for copyright owners, but the O'Reilly paper adds to the scrutiny of how it sources its data.