Amazon investigates Perplexity over allegations of scraping abuse

Amazon’s cloud division has launched an investigation into Perplexity AI, WIRED has learned. At issue is whether the AI search startup is violating Amazon Web Services’ rules by scraping websites that have tried to prevent it from doing so.

An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed the company’s investigation into Perplexity. WIRED previously found that the startup, which is backed by Jeff Bezos’ family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that have forbidden access through the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that involves placing a plaintext file (like wired.com/robots.txt) on a domain to specify which pages automated bots and crawlers are not allowed to access. While companies that use scrapers can ignore this protocol, most traditionally adhere to it. The Amazon spokesperson told WIRED that AWS customers must adhere to the robots.txt standard when crawling websites.
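
As a rough illustration of how a well-behaved crawler consults these rules, the sketch below uses Python’s standard `urllib.robotparser` module. The robots.txt contents, the bot names `ExampleBot` and `OtherBot`, and the example.com URLs are all made up for the example; they are not the rules published by any site named in this article, and the helper `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents for illustration only; not any real site's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "https://example.com/story/1"),
        ("OtherBot", "https://example.com/story/1"),
        ("OtherBot", "https://example.com/private/draft"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

Nothing in the protocol enforces the answer. A crawler that never performs this check, or that ignores the result, can still request the page, which is why compliance depends on the operator.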

“AWS’s Terms of Service prohibit customers from using our services for illegal activities, and our customers are responsible for complying with our terms and all applicable laws,” the spokesperson said in a statement.

The investigation into Perplexity’s practices follows a June 11 report by Forbes that accused the startup of stealing at least one of its articles. WIRED’s investigation confirmed this practice and found further evidence of scraping abuse and plagiarism by systems connected to Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, WIRED’s parent company, block Perplexity’s crawler on all of the company’s websites using a robots.txt file. However, WIRED found that the company had access to a server using an unpublished IP address, 44.221.181.252, which visited Condé Nast properties at least a hundred times over the past three months, apparently to scrape Condé Nast websites.

The machine connected to Perplexity appears to be doing a large-scale crawl of news websites that prohibit bots from accessing their content. Spokespeople for The Guardian, Forbes and The New York Times also say they have spotted the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine, known as an Elastic Compute Cloud (EC2) instance, hosted on AWS. AWS launched its investigation after we asked whether using its infrastructure to scrape websites that forbade it violated the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas first responded to WIRED’s investigation by saying that the questions we asked the company “reflect a deep and fundamental misunderstanding of how Perplexity and the internet work.” Srinivas then told Fast Company that the secret IP address WIRED observed scraping Condé Nast websites and a test site we created was operated by a third-party company that performs web crawling and indexing services. He declined to name the company, citing a nondisclosure agreement. When asked if he would tell the third-party company to stop crawling WIRED, Srinivas replied, “It’s complicated.”