Friday, 08 Nov 2024

Fresh concerns raised over sources of training material for AI systems


Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.

The white nationalist site VDARE is in the C4 database as one of its 1,000 largest sites, as is the far-right news site Breitbart. The Russian state-backed propaganda site RT is one of the 100 largest providers of training data to the corpus.

Few of the sites gave explicit consent to be included, although Common Crawl, the non-profit organisation that assembled the scraped data, says it respects requests to be left out of its search. Some, however, push the limits of fair use: b-ok.org, formerly known as Bookzz, was a vast repository of pirated ebooks, until it was seized by the FBI in 2022. Despite that, contents of the site remain in the C4 database.

Such vast collections of data are important to AI development, because the large language models (LLMs) that underpin tools such as ChatGPT need huge datasets to improve.

Google was approached for comment.
