Unpacking Archive.today's Retaliatory Behavior and Online Archiving Ethics

A recent incident involving the online archiving service archive.today (also known as archive.is) has drawn attention to potentially retaliatory behavior. Observers report that archive.today's CAPTCHA page has begun automatically sending high-frequency requests to gyrovague.com, a personal blog. The apparent motive is an article on that blog titled "archive.today: On the trail of the mysterious guerrilla archivist of the Internet," which investigates who is behind archive.today.

The Nature of the Attack and its Intent

The CAPTCHA page runs JavaScript that fetches gyrovague.com/?s=random_number every 300 milliseconds, which works out to roughly three requests per second from each visitor's browser for as long as the page stays open. The random value is a cache-busting technique: because every URL is unique, neither the browser nor a server-side cache can answer from a stored copy, so the server has to process each request. On a WordPress site like gyrovague.com, the ?s= parameter triggers a search query. Thousands of unique search queries can significantly increase CPU load, potentially exceeding resource limits and leading the hosting provider to suspend the blog automatically.
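To make the cache-busting effect concrete, the following is a minimal simulation that sends no network traffic. It assumes a toy page cache keyed by the full URL; the PageCache class, the simulate helper, and the example search term are illustrative names, not anything taken from archive.today or WordPress. It also works out the per-tab request rate implied by the 300 millisecond interval.

```python
import random

INTERVAL_MS = 300  # interval reported for the CAPTCHA page script


def requests_per_hour(interval_ms: int) -> int:
    """Requests a single open tab would issue per hour at a fixed interval."""
    return (60 * 60 * 1000) // interval_ms


class PageCache:
    """Toy full-page cache keyed by the complete URL, query string included.

    This is an assumption for illustration, not a model of any real host.
    """

    def __init__(self) -> None:
        self.store: dict[str, str] = {}
        self.backend_hits = 0

    def get(self, url: str) -> str:
        if url not in self.store:
            self.backend_hits += 1  # cache miss: the backend must render the page
            self.store[url] = f"rendered:{url}"
        return self.store[url]


def simulate(urls: list[str]) -> int:
    """Return how many requests reached the backend for a sequence of URLs."""
    cache = PageCache()
    for url in urls:
        cache.get(url)
    return cache.backend_hits


rng = random.Random(0)
same_url = ["/?s=cats"] * 1000
unique_urls = [f"/?s={rng.random()}" for _ in range(1000)]

print(requests_per_hour(INTERVAL_MS))  # 12000
print(simulate(same_url))              # 1: only the first request misses
print(simulate(unique_urls))           # one backend render per distinct URL
```

A repeated URL costs the backend a single render, while randomized search terms turn nearly every request into backend work. That asymmetry is what makes the ?s= pattern expensive for the target and cheap for the sender.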

Because gyrovague.com is hosted on Automattic's WordPress.com, which is known for robust infrastructure and DDoS protection, the requests are unlikely to take the site down directly. Such persistent, resource-intensive traffic is nonetheless clearly malicious. The incident also raises questions about the timing (2.5 years after the article's publication) and about whether the owner is aware of the Streisand effect, where attempts to suppress information often lead to its wider dissemination.

Historical Context and Allegations Against Archive.today

This isn't the first time archive.today has engaged in controversial behavior. In the past it intentionally served endless CAPTCHA loops to users relying on Cloudflare DNS. The reasons given ranged from a "philosophical disagreement" to a technical need for EDNS client subnet information, said to be required for regional compliance, such as preventing spam or handling content forbidden in certain jurisdictions (e.g., wartime propaganda). The necessity and logic of this claim have been debated, with some arguing that the block was a petty response rather than a genuine technical requirement.

Concerns about archive.today's operation extend to its perceived political leanings. Allegations have been made that it is managed by "pro-Kremlin people," that it selectively edits content, and that it employs "sneaky" methods to track visitors and archivers. Together these raise questions about its trustworthiness as an authentic archival source.

The Doxxing Debate

A significant part of the discussion revolved around the investigative article itself. The author of gyrovague.com maintains that the "investigation" consisted of looking up publicly available information and that no doxxing occurred. Some commenters, including an individual believed to be associated with archive.today, disagree. They argue that publishing personal information about someone who clearly wishes to remain anonymous constitutes doxxing even if each piece is publicly accessible, especially when assembling it takes "detective work" across various platforms. The disagreement highlights how nuanced and contentious the definition of doxxing remains online.

Tools and Insights Shared

Several valuable pieces of information and tools emerged from the discussion:

Investigating Infrastructure: For those interested in website infrastructure, tools like resolvectl query for DNS lookups and bgp.he.net or bgp.tools for IP and Autonomous System (AS) information can be invaluable. These resources help identify hosting providers and network configurations.
WordPress DoS Vector: Requesting ?s=random_string on a WordPress site both defeats caching and triggers a CPU-intensive search query. It is a known DoS vector, though it is often less effective against large hosts.
Archiving Service Differences: The discussion illuminated how archiving services differ in approach. Archive.today executes JavaScript at archival time and saves the resulting DOM, often applies site-specific mitigations, and resists takedown requests. Archive.org, by contrast, saves and replays server responses verbatim and often complies with content removal requests. This led to debate over which service is more reliable for preserving authentic content, especially controversial material.
Multi-Archive Browser Add-on: For comprehensive archival searches, commenters recommended a browser add-on that searches across multiple archive services (e.g., https://github.com/dessant/web-archives), which lets users verify information against more than one source.
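The infrastructure lookup described in the first item above can be approximated with the Python standard library. This is a hedged sketch rather than a replacement for resolvectl: resolve and bgp_lookup_url are hypothetical helper names, and the bgp.he.net/ip/<address> URL layout is an assumption about how that site's lookup pages are addressed.

```python
import ipaddress
import socket


def resolve(host: str) -> list[str]:
    """Return the sorted, de-duplicated addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def bgp_lookup_url(ip: str) -> str:
    """Build a lookup URL for an address (URL layout is an assumption).

    ipaddress.ip_address raises ValueError on anything that is not a valid IP,
    so malformed input never ends up in the URL.
    """
    return f"https://bgp.he.net/ip/{ipaddress.ip_address(ip)}"
```

Calling resolve on a hostname and passing each result to bgp_lookup_url yields pages to open by hand, where the announcing AS and hosting provider can be read off. An IP literal passed to resolve is returned unchanged without a DNS query.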

This incident serves as a stark reminder of the complexities and potential conflicts within the digital information ecosystem, from website security and content preservation to online identity and the ethics of information gathering.