wesley tanaka

Preventing mass scraping or mirroring downloads


This is a list of ideas for stopping or dissuading people from increasing webserver load by fetching lots of web pages in quick succession. My assumption is that these attackers want a copy of the content on my website, to plagiarize on their own website or for whatever other reason. I'll also assume that the reader of this page is in a similar situation and has two concerns:

  1. Bandwidth bills and webserver resource usage
  2. Prevention of the theft of their content

Techniques To Stop Scraping / Mirroring

  • robots.txt: This prevents well-behaved spiders and robots from grabbing well-defined parts of your site. Unfortunately, it probably won't stop malicious or unscrupulously motivated scraping, which I imagine is what happens most of the time. It could, however, keep a user from accidentally or unwittingly grabbing a lot of pages.
  • Use a Honeypot/Spider Trap to detect bots that disobey robots.txt and automatically block the IP address of the scraper. Create a Disallow entry in robots.txt pointing to a path on your site. Create a script at that path which records the IP address of each request somewhere, and have all the pages of your site refuse requests from IP addresses on that list. Then link to that path using CSS (display:none), a link surrounding a one-pixel transparent image, or some other method that lets spiders see the link without displaying it to human visitors.
  • User-Agent: Similar to robots.txt in benefits and disadvantages, this probably won't stop a dedicated attacker. It could possibly help keep buggy software from causing server load. One suggestion I've read is to only allow access from well-known User-Agents.
  • Block/cloak by IP: If someone is scraping your site repeatedly from the same IP address, you can serve different content to that IP address than to the rest of the world. The disadvantage of this technique is that scrapers often do not keep coming from the same IP address, either because their IP is dynamically allocated, or because they are attacking through a botnet or something like Tor.
  • Block based on access rate: mod_throttle claims to do this as an Apache module. I haven't tried it.
  • Embed a piece of JavaScript in your pages that redirects visitors back to your site when the page is being served from some other host:
    <script>
    /* Only sunsites are allowed to mirror this page and then
       only with explicit, prior permission. For details,
       send email to elharo@sunsite.unc.edu */
    // Leave copies opened from the local disk (file: URLs) alone.
    if (location.protocol.toLowerCase().indexOf("file") != 0) {
      // Any other host without "sunsite" in its name gets redirected.
      if (location.host.toLowerCase().indexOf("sunsite") < 0) {
        location.href = "http://sunsite.unc.edu/javafaq/";
      }
    }
    </script>
    While this doesn't reduce server load, it would help deter less technically savvy copyright infringers.
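  • Similar to the JavaScript idea, embed a transparent image stored on your server in your pages. If someone copies a page wholesale without bothering to rewrite the links in it, the image is still requested from your server, typically with the copying page as the referrer, which might help you find them.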
  • Block known open proxies.
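
A minimal sketch of the honeypot idea from the list above, in plain JavaScript. The trap path `/trap/`, the in-memory `blockedIps` set, and the `handleRequest` function are all hypothetical names chosen for illustration; a real site would hook this logic into whatever server or framework it runs on and would probably persist the list.

```javascript
// Hypothetical trap path. It must also appear under Disallow in robots.txt.
const TRAP_PATH = "/trap/";

// robots.txt content telling well-behaved spiders to stay away from the trap.
const ROBOTS_TXT = "User-agent: *\nDisallow: " + TRAP_PATH + "\n";

// Link that spiders can follow but that is not displayed to human visitors.
const HIDDEN_LINK = '<a href="' + TRAP_PATH + '" style="display:none"></a>';

// In-memory block list (assumption: a real site would store this somewhere durable).
const blockedIps = new Set();

// Decide how to answer a request; returns an HTTP status code.
function handleRequest(ip, path) {
  if (blockedIps.has(ip)) {
    return 403; // this address already fell into the trap
  }
  if (path.startsWith(TRAP_PATH)) {
    blockedIps.add(ip); // only a client ignoring robots.txt should get here
    return 403;
  }
  return 200; // serve the page normally
}
```

Once an address has requested the trap path, every later call to `handleRequest` for that address returns 403, while other addresses are unaffected.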
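
The User-Agent suggestion from the list above could look roughly like this. The tokens in `ALLOWED_AGENT_TOKENS` and the function name `isAllowedUserAgent` are assumptions made up for the example, not a recommended list. Since a client can send any User-Agent string it likes, this only filters out software that doesn't bother to pretend.

```javascript
// Hypothetical allowlist of substrings found in well-known User-Agent strings.
const ALLOWED_AGENT_TOKENS = ["Mozilla/", "Opera/", "Googlebot"];

// Returns true if the User-Agent header contains one of the allowed tokens.
function isAllowedUserAgent(userAgent) {
  if (!userAgent) {
    return false; // missing or empty header
  }
  return ALLOWED_AGENT_TOKENS.some((token) => userAgent.includes(token));
}
```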
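
Blocking by access rate, as mentioned in the list above, can be sketched with a sliding window per IP address. This is not how mod_throttle works (I don't know its internals); it is just one simple way to do it. The limits `WINDOW_MS` and `MAX_REQUESTS` and the function `allowRequest` are hypothetical.

```javascript
const WINDOW_MS = 10000;  // hypothetical window length: 10 seconds
const MAX_REQUESTS = 20;  // hypothetical limit per window and IP address

// Maps an IP address to the timestamps (in ms) of its recent allowed requests.
const recentRequests = new Map();

// Returns true if the request at time nowMs should be served.
function allowRequest(ip, nowMs) {
  const cutoff = nowMs - WINDOW_MS;
  // Keep only the timestamps that still fall inside the window.
  const recent = (recentRequests.get(ip) || []).filter((t) => t > cutoff);
  if (recent.length >= MAX_REQUESTS) {
    recentRequests.set(ip, recent);
    return false; // too many requests in the window
  }
  recent.push(nowMs);
  recentRequests.set(ip, recent);
  return true;
}
```

The current time is passed in explicitly so the function is easy to test; a server would call it as `allowRequest(ip, Date.now())`.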
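
For the transparent image idea from the list above, one way to spot copied pages is to look at the Referer header of requests for that image. The sketch below assumes your site lives at the placeholder host `example.com`; `OWN_HOST` and `foreignEmbedder` are hypothetical names. Browsers don't always send a Referer, so this will miss some copies.

```javascript
const OWN_HOST = "example.com"; // placeholder for your own host name

// Returns the host of a foreign page embedding the image, or null if the
// request came from your own site or the Referer is missing or unusable.
function foreignEmbedder(refererHeader) {
  if (!refererHeader) {
    return null;
  }
  let host;
  try {
    host = new URL(refererHeader).hostname.toLowerCase();
  } catch (err) {
    return null; // not a parsable URL
  }
  if (!host || host === OWN_HOST || host.endsWith("." + OWN_HOST)) {
    return null; // no host at all, or one of your own pages
  }
  return host;
}
```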