Fresh concerns raised over sources of training material for AI systems

by theguardian
24 Apr 2023
in technology

Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.

The white nationalist site VDARE is in the C4 database, one of the 1,000 largest sites, as is the far-right news site Breitbart. The Russian state-backed propaganda site RT is one of the hundred largest providers of training data to the corpus.

Few of the sites gave explicit consent to be included, although Common Crawl, the non-profit organisation that assembled the scraped data, says it respects requests to be left out of its search. Some, however, push the limits of fair use: b-ok.org, formerly known as Bookzz, was a vast repository of pirated ebooks, until it was seized by the FBI in 2022. Despite that, contents of the site remain in the C4 database.

Such vast collections of data are important to AI creation, because the large language models (LLMs) that underpin tools such as ChatGPT need huge datasets to improve.

Google was approached for comment.
