A recent investigation supports allegations that OpenAI trained its AI models on copyrighted material. OpenAI is currently embroiled in lawsuits from content creators, including authors and programmers, who claim their intellectual property, such as literary works and software code, was used without consent to develop these models. OpenAI defends its practices by invoking fair use, while the plaintiffs argue that U.S. copyright law contains no exemption for training data.
Researchers from the University of Washington, the University of Copenhagen, and Stanford developed a method to identify training data that AI models have “memorized.” These models operate as predictive systems, learning patterns from extensive datasets that let them generate content such as essays and images; because of how they learn, certain outputs can closely resemble the original training data.
The study pinpointed “high-surprisal” words, those that occur infrequently within a broader work, as potential markers of memorization. For example, the term “radar” in a particular context qualifies as high-surprisal. The researchers tested several OpenAI models, including GPT-4 and GPT-3.5, by removing high-surprisal words from literary and journalistic passages and evaluating each model’s ability to infer the omitted words. A model that accurately guesses those words likely “memorized” the passage during training.
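The probing procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the researchers’ implementation: the frequency threshold, the use of raw corpus counts as a stand-in for model-based surprisal, and the `guess_fn` interface standing in for a call to the model under test are all assumptions for the sake of the example.

```python
import re
from collections import Counter

def high_surprisal_words(passage, corpus_counts, total, threshold=1e-5):
    """Flag words whose relative frequency in a reference corpus falls
    below a threshold. Rare ("high-surprisal") words are the candidate
    probes for memorization. Using raw counts is a simplification; the
    study's notion of surprisal is more sophisticated.
    """
    words = re.findall(r"[a-zA-Z]+", passage.lower())
    return [w for w in words if corpus_counts.get(w, 0) / total < threshold]

def memorization_score(passage, probe_words, guess_fn):
    """Mask each probe word and ask the model (wrapped in guess_fn)
    to fill in the blank. Returns the fraction of probe words guessed
    exactly; a high score suggests the passage may have been seen,
    and memorized, during training.
    """
    hits = 0
    for w in probe_words:
        masked = re.sub(rf"\b{re.escape(w)}\b", "[MASK]", passage, flags=re.I)
        if guess_fn(masked).lower() == w.lower():
            hits += 1
    return hits / len(probe_words) if probe_words else 0.0
```

In the actual study, `guess_fn` would wrap a query to the model being audited (e.g. GPT-4), and a passage on which the model reliably recovers its rare words is treated as evidence of memorization.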
The findings revealed that GPT-4 likely retained segments of well-known fiction, including material from BookMIA, a dataset of copyrighted e-books. GPT-4 also displayed some recognition of New York Times articles, though to a lesser degree. A research spokesperson asserted that the results confirm contested data was used to train GPT-4 and underscore the importance of transparency in training and deploying large language models in order to maintain their reliability and legitimacy.
The ainewsarticles.com article you just read is a brief synopsis; the original article can be found here: Read the Full Article…