An Anti-Scraping Guide for Creative Professionals
(...and the Casually Creative)

This page will continue to be updated with new developments. The idea is to collate relevant information and resources in one place, without the PR talk of tech press releases, although those releases will still be linked to as sources for the information.

Introduction

Creators may not want their content scraped, without compensation or consent, for use by generative AI models or by piracy sites. This guide considers different ways of scraping and potential countermeasures.

Fundamentally, scraping and crawlers work by copying data. This can be precisely targeted to focus on copying certain subsets of data. As a result, a lot of protection against scraping involves making it harder for data to be cleanly extracted, but doing so as far as possible in ways unnoticed by regular users.

Most of the time, scrapers act like whales, who swim in the oceans with their mouths constantly open to catch the plankton that comes their way. Most of the time, these scraper whales just won't bother writing a whole new script just for a single idiosyncratic site.

Just as hackers love it when people use default or otherwise easily guessed passwords, so do scrapers love it when people stick to default, highly predictable structures.


This guide is in three sections: Search Engines & Crawlers; Platforms; and Services.

Search Engines & Crawlers are the sites and tools that may crawl your content.

Platforms are places you may go to engage with users of these sites, be it social media platforms or marketplace platforms.

Services include software used in the process of creating something and choices for where other people can come to find you, as in email providers and hosting services.

Each entry may also be divided into up to three sections:

Easy Countermeasures are things that can be done with a minimum of coding. All you need is to change a line or use a tool, but your ability to do this is platform-dependent.

Adept Countermeasures refers to things that require a greater degree of coding, or a greater amount of other effort.

Advanced Countermeasures refers to techniques best used by those who really know what they are doing.

Different creators will make different choices in terms of the balance between publicity and security.


You can click on the following list to go direct to the associated part of the guide:

Search Engines & Crawlers:General Practices|Microsoft|Google

Controlling web crawlers can involve, on a web page, finding or creating a line that goes like:

<meta name="robots" content = "index"> and modifying this.

When it comes to web crawlers, controlling them may involve the use of a robots.txt file that will have a format like:

User-agent: Example
Disallow: /

The above would block web crawling by a crawler which respects the robots.txt file. Not all do, but typically the crawlers used by major search engines do.
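As a rough sketch of how a well-behaved crawler reads such a file, Python's standard urllib.robotparser module can be pointed at the same two lines. The crawler names and the example.com URL are placeholders, not real services, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, with User-agent and Disallow on consecutive lines.
ROBOTS_TXT = """\
User-agent: Example
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-respecting crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "Example" and "SomeOtherBot" are made-up crawler names for illustration.
blocked = is_allowed(ROBOTS_TXT, "Example", "https://example.com/gallery/")
others = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/gallery/")
print(blocked, others)
```

Only the named crawler is turned away; every other crawler is still permitted, and a crawler that never consults the file is not affected at all.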

For crawlers which do not respect the robots.txt file there are countermeasures that can be taken here as well. Some are actually quite easy to implement and these will provide a solid amount of protection against non-targeted scraping. That is, against scraping by those who are more generally crawling the Internet and aren't particularly aware of and aiming to get at you specifically.


General Practices

Easy Countermeasures

Index Settings

One countermeasure to scraping is to instruct that a page not be indexed:

<meta name="robots" content = "noindex">

This has obvious downsides. While it keeps the page out of the index of search engines that also obtain training data for AI systems this way, it also prevents the page from being seen in search results, and the page will still be scraped by less ethical operators.

There is, however, an excellent reason to give some pages on a site the 'noindex' tag. Some of the most effective countermeasures against scraping are simply incompatible with the normal operations of search indexing. These measures can only be implemented without damaging search engine results if they are applied solely to pages that won't be indexed by search engines.

A site can be structured so that some pages attract web traffic and, from there, users can find non-indexed pages that go deeper and are better protected.

If the page is not indexed, the search engine will not penalise you for a page being too hard to crawl.

Blocking Crawlers

As described below in the entries for Search Engines and Crawlers, blocking them is relatively easy and will see results to the extent they obey the robots.txt file.

Choice of Hosting Providers

As described below in the Services section, web hosting providers may include anti-scraping services, such as blocklists of identified scrapers.

Unpredictable Names

A typical way for web scrapers to operate is to have a script that targets the desired parts of a page. For example, an HTML parser can be used to return all the content belonging to a specified class, or selectors may be used to extract all content with a specified meta tag or other tag (like the content of every <p> element), or to target an attribute value, such as all div classes that consist of or partially contain a designated text string.

One common means to fight against web scraping is not to use common class names or ids. That is because those seeking to extract data will commonly target class names that are expected to return desired content. So, for example, many pages will have the content most desired by scrapers sitting in a class name or div id that includes words such as 'body', 'content', 'main', 'entry', 'post', 'article' and 'page'.

Avoiding such simple names will limit the effectiveness of web scraping that disregards robots.txt instructions, particularly the kind of crawlers that sail across the Internet without particular awareness of any individual site, as in the case of crawlers looking to develop or update large language models.
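To illustrate, here is a minimal sketch of the kind of generic scraper described above, written with Python's built-in html.parser. The class names 'article-content' and 'qx7-lantern' and the helper names are invented for the example; real scrapers vary, and a targeted one could simply be rewritten for your site.

```python
from html.parser import HTMLParser

# Words that generic scrapers commonly look for in class names and ids.
COMMON_HINTS = ("body", "content", "main", "entry", "post", "article", "page")


class ClassTargetingScraper(HTMLParser):
    """Collects text inside any element whose class or id contains a hint."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a targeted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested element inside the targeted one
            return
        names = " ".join(v or "" for k, v in attrs if k in ("class", "id"))
        if any(hint in names.lower() for hint in COMMON_HINTS):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def scrape(html):
    scraper = ClassTargetingScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.chunks


# Sample markup only; void elements such as <br> are left out to keep
# the depth counting in this sketch simple.
predictable = '<div class="article-content"><p>My short story.</p></div>'
unpredictable = '<div class="qx7-lantern"><p>My short story.</p></div>'

print(scrape(predictable))    # the story text is collected
print(scrape(unpredictable))  # nothing is collected
```

The page looks identical to a visitor in both cases, since the class name only matters to the stylesheet and to whoever is guessing at it.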

You can be entirely consistent with the names you use and still get a measure of protection, just by not using the words that really signpost where the most desirable content for scraping is located. That being said, there is greater protection from being less consistent, such as using different names for different pages.

It is also possible to change these names over time, and there can be ways for advanced users to implement a find and replace script to automate the process, or for names to change on each page load. This is sometimes done to help protect indirectly linked assets such as typefaces, and the technique may be deemed worth doing to help protect trademarks and documents with a time-sensitive value.

There are no downsides to this approach. Search engine rankings do not take into account the names of either CSS classes or HTML elements.

This approach can be seen in action, at time of writing, on sites like BBC News, The Guardian, NPR and Wired although there are also plenty of examples of plain and simple <p> elements being consistently used on other important news sites. It has also been adopted by sites like Patreon and Pixiv.

On a related note, the very careful wrapping of elements and extensive use of JavaScript events sees Wired be one of the few sites using a method that almost totally protects their web fonts from being downloaded. (For obvious reasons, this guide will not further explain the vulnerabilities that most sites have.)

This countermeasure can also be partly a by-product of use of plugins or frameworks that autogenerate such names.

This countermeasure does require an ability to access and edit the underlying code. The capacity to have this control varies depending on the platform.

The capacity for scrapers to target every <p> element and the like can also be combated, at least when it comes to lazily instructed scrapers. All you need to do is mix up the use of CSS and inline formatting.

Instead of only using the <p> element throughout, you can alternate this with <p id> or <p style> or, for that matter, with <span id> or <span style>.

This works because, unless the scraper has been written to match these alternatives as well, its results will be more piecemeal. Even if this is accounted for, combining this with parser poisoning can be quite an effective deterrent to the most common means of crawling that disregards robots.txt instructions.
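A small sketch of what 'lazily instructed' can mean in practice: the pattern below matches only the literal string <p>, which is an assumption about how a careless scraper might be written rather than a description of any particular tool. The sample markup and the id 'k2' are made up.

```python
import re

# A careless pattern that only matches a bare <p> tag with no attributes.
NAIVE_P = re.compile(r"<p>(.*?)</p>", re.DOTALL)

uniform = "<p>First line.</p><p>Second line.</p><p>Third line.</p>"
mixed = (
    "<p>First line.</p>"
    '<p id="k2">Second line.</p>'
    '<span style="display:block">Third line.</span>'
)

uniform_result = NAIVE_P.findall(uniform)
mixed_result = NAIVE_P.findall(mixed)

print(uniform_result)  # all three lines
print(mixed_result)    # only the first line
```

A parser that matches on the tag name rather than the literal text would still pick up <p id> and <p style>, so it is the <span> variants that do the work against that kind of scraper.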

The alternation of these elements and using more complex class and element names is also an approach that at time of writing is used by the social media platform formerly known as Twitter.

Parser Poisoning

<!-- HTML comments look like this on a page -->. They are not rendered, but remain visible in the HTML code. Viewing the page source is a common way for people to bypass restrictions on downloading content such as images and videos. This is typically done by searching for whatever marks the desired content, such as '.png' or '.mp4'. A number of sites that exist to let people download content from social media sites also work this way. As noted elsewhere, scrapers typically work by using HTML parsers and may be programmed to look for the specific sections of a webpage that mark the desired information.

HTML comments can really screw with all of this.

HTML comments can be used to make the raw output of web scraping unusable and hard or even impossible to read. Even better, HTML comments are disregarded by search engines, so their use will not stuff up search engine rankings, provided they are not used excessively. The caveat is there because HTML comments, being part of the page source, are still loaded even though nothing is displayed, and so add to the size of a page and thus to loading times, which do impact search engine rankings.

HTML comments can be used simply to add junk text that poisons datasets unless they are tediously cleaned by hand, but they can also be used more strategically to combat scraping.

Scraping is greatly enabled by being able to anticipate page structure and being able to take advantage of common elements. For instance, the <h1> element is very widely used for the most important heading on a page. Downloadable media are marked by file type markers and file paths.

HTML comments allow you to have additional fake elements and media. In the example: <!-- <h1>The Top Ten Top Tens Of All Time</h1> --> a fake top heading is given that can confuse simpler crawlers and parsers that aren't scripted to selectively disregard results in HTML comments. Because HTML comments are present in the page source, they are visible to anything that reads it. The use of fake URLs, fake images, fake videos and fake font files can frustrate users viewing the page source to manually download media in an undesired way.
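The following sketch contrasts a simple pattern-matching scraper with a comment-aware parser on the same page. The page content, file paths and function names are invented for illustration, and the regular expressions are only one guess at how a simple scraper might be written.

```python
import re
from html.parser import HTMLParser

PAGE = """\
<!-- <h1>The Top Ten Top Tens Of All Time</h1> -->
<!-- <img src="/media/decoy-portrait.png"> -->
<h1>Portfolio</h1>
<img src="/media/real-portrait.png">
"""


def naive_headings(html):
    """Pattern matching over the raw source, comments included."""
    return re.findall(r"<h1>(.*?)</h1>", html)


def naive_pngs(html):
    """Searches the raw source for anything that looks like a .png path."""
    return re.findall(r"[\w/.-]+\.png", html)


class HeadingParser(HTMLParser):
    """A comment-aware parser: text inside <!-- --> never reaches it."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data)


careful = HeadingParser()
careful.feed(PAGE)
careful.close()

print(naive_headings(PAGE))  # fake heading first, then the real one
print(naive_pngs(PAGE))      # decoy path first, then the real one
print(careful.headings)      # only the real heading
```

In other words, this countermeasure works against scrapers and people who search the raw source, and not against a parser that has been told to drop comments.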

The use of HTML comments to complicate scraping is at time of writing an approach that can be seen being used by Reddit.

Because they have the potential to hurt search engine results, they are best used on unindexed pages, to give extra protection against large-scale scraping. They can also suit pages created partly for entertainment value and partly to express discontent with scraping without active consent, where the main aim is a poisonous effect on datasets gathered in disregard of robots.txt instructions.

This being said, comments can be deployed more extensively with game engines to frustrate data mining.

Cloaking

innerHTML is a property through which JavaScript can replace the content of an element. This can be implemented to replace text immediately, so that crawlers which only look at the HTML code will miss the replacement and load a different set of data, while the webpage will appear as normal to users. This should only be done on pages which are not to be indexed by search engines. The practice of showing crawlers different content from that seen by normal users, known as cloaking, is not regarded well by search engines and, if detected by quality control processes on indexed pages, will see search engine rankings suffer.

While it may be tempting for the initial text to be utter gibberish, it is better to avoid this, as gibberish is more likely to be caught by AI deployed by web scrapers to check for signs of poisonous data. This could in turn lead web scrapers to use more advanced methods (or a good ol' manual copy and paste) to scrape your site.

Setting innerHTML is one of the easiest JavaScript techniques to set up, and when used to immediately replace text on the page there are no accessibility concerns, as the screen reader will just read the updated text.
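As a sketch of the idea, the page below carries decoy text in its HTML and a short script that swaps in the real text as soon as it runs. The id 'bio', the wording and the class name are all invented; the Python part only imitates a scraper that reads the HTML without running any JavaScript.

```python
from html.parser import HTMLParser

# Decoy text in the HTML; the script replaces it for real visitors.
PAGE = """\
<p id="bio">A catdog is the offspring of a cat and a dog.</p>
<script>
  document.getElementById("bio").innerHTML =
    "I am an illustrator taking commissions.";
</script>
"""


class TextWithoutJavaScript(HTMLParser):
    """Collects page text the way an HTML-only scraper would see it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script source is skipped, as it is code rather than page text.
        if not self.in_script and data.strip():
            self.text.append(data.strip())


scraper = TextWithoutJavaScript()
scraper.feed(PAGE)
scraper.close()
print(scraper.text)  # only the decoy sentence
```

A scraper that renders pages in a full browser would see the replaced text instead, so this is protection against HTML-only crawling and not against everything.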

Provided one doesn't use utter gibberish, cloaking can be a fun way of helping to poison the data of scrapers that disregard user consent settings, causing the resulting models to hallucinate more or be less accurate by feeding them less grammatical and less factual data. It is ethical, if making use of cloaking, to not help AI models be more actively harmful.

There is a big difference between trying to get an AI model disregarding user consent settings to believe a catdog is the offspring of a cat and a dog and trying to make it more likely to spread potentially harmful conspiracy theories or other disinformation.

CSS Images

Most scrapers and crawlers focus purely on HTML. This means that certain content can be selectively protected from crawling or scraping by making greater use of CSS. For example, it is possible for images to be defined in CSS rather than in HTML, using 'background-image: url("/example.png");' or even a variable, as in 'var(--trajecient-header);' with the variable defined elsewhere. This is typically done for background images and header images, but it can also be done for images more generally, by setting a background-image on a specified part of the main page. Alt text can still be given to images defined in CSS.

The best practice is to do it like this:

<span class="CSS image class here" role="img" aria-label="alt text here"></span>

Source: Alternate Text for CSS Background Images|CanAdapt
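Here is a rough sketch of why this helps against scrapers that collect images from <img> tags. The class name 'hero-art', the file paths and the sizes are placeholders. The CSS rule is written inline only to keep the example in one piece; in practice it would more likely live in a separate stylesheet, where an HTML-only scraper never sees the path at all.

```python
from html.parser import HTMLParser

PAGE = """\
<style>
  .hero-art {
    background-image: url("/art/cover.png");
    display: block; width: 600px; height: 400px;
  }
</style>
<img src="/art/thumbnail.png" alt="Thumbnail">
<span class="hero-art" role="img" aria-label="Cover illustration"></span>
"""


class ImgSrcCollector(HTMLParser):
    """Collects image URLs the common way: from the src of <img> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(v for k, v in attrs if k == "src" and v)


collector = ImgSrcCollector()
collector.feed(PAGE)
collector.close()
print(collector.sources)  # the thumbnail only; the CSS image is not listed
```

The span still announces itself to screen readers through role="img" and aria-label, which is the point of the markup shown above.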


The greatest challenge of deterring scraping is not actually how to do it, but how to do it in ways which keep the content accessible to those who use screen readers. Using CSS background images is one strategy to deter unwanted scraping that fortunately does not come with greater accessibility challenges. Instead, the challenge is more about making sure the CSS element is properly positioned on the page.

JavaScript

One general defence against scraping is the use of JavaScript, or other interactive elements. Crawlers generally have issues with content that requires JavaScript events because many crawlers focus on parsing HTML code. Certain JavaScript events are comparatively easy to set up, like having text only appear when a button is clicked. Others are more complicated. In general, there is an extra level of effort involved to ensure there is sufficient accessibility support.

JavaScript can be used as a kind of opt-out for content within pages that are otherwise freely crawled, or to provide extra protection from crawlers which do not obey a robots.txt file.

Do not overuse JavaScript to protect information on pages indexed by search engines. If search engines have too much difficulty crawling a page, this will negatively impact search engine listings. It also makes for a worse experience for users, but in moderation, this can provide extra security and be done in ways that are intuitive for users and enhance presentation. For example, contact details may appear when selecting an image, either by hovering over it, by touch or by keyboard controls using tab order.

More extensive use of JavaScript events to protect information can be done in combination with having a page not be indexed.

Advanced Countermeasures


Search Engines & Crawlers

Microsoft

Easy Countermeasures

Microsoft is providing two options (both detailed in the same source). They do not involve the robots.txt file but instead involve adding or modifying a line in the head of a page. Use of either option will allow sites to appear in Bing search results as normal.

<meta name="robots" content = "nocache">

Source: Announcing New Options for Webmasters to Control Usage of Their Content in Bing Chat|Webmaster Bing Blogs

Doing this will opt out of content being used as training data for Microsoft's generative AI models (mostly). This option will see URLs, snippets and titles appear in Microsoft's AI services like Bing Chat, and this content will still be subject to use as training data. This is equivalent to the only option provided by Google and you could see it as striking an appropriate balance between discoverability, data security and author rights.

<meta name="robots" content = "noarchive">

Doing this will completely opt out of content being used as training data for Microsoft's generative AI models. This may be a preferable option if you have a moral objection to contributing to the usefulness of AI systems believed to be built on datasets with data extracted without consent or compensation.

<meta name="robots" content = "nocache, noarchive">

If both tags appear, the 'nocache' will cancel out the 'noarchive'. There is zero reason for this to be done on the same page, at the same time. Microsoft suggests that paywalled articles do this instead of just 'noarchive' for increased discoverability, but at the same time does not indicate that doing so would be in any way different from just using 'nocache'.

Note that the meta name field HAS to appear in the head of a page and cannot be used, for instance, to make a page generally 'noarchive' but to permit specific sections of a page to fall under 'nocache'.

The above solutions are only possible if you can edit the head of your pages, or if using a platform that provides some indirect means to add these meta tags.

It is important to note that 'noarchive' has another side-effect. Microsoft also uses it to decide whether a cached version is made available. So a site marked 'noarchive' will not have a cached version available via Bing, which means that if the server is inaccessible, the site will be down, as there will be no cached copy to fall back on. Other search engines may use 'noarchive' similarly and at least Google does. If you want a cached version of the site available via Google Search but want 'noarchive' to apply only to Bing, because you just want to opt out of being used as training data for generative AI services, you need to do this:

<meta name="bingbot" content = "noarchive">

Basically, the 'noarchive' must be in a line where the meta name only addresses the Microsoft crawler, which is Bingbot and not all crawlers.
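The targeting can be sketched as follows: a crawler reads the generic 'robots' line plus any line addressed to it by name. Treating the two as a simple union is an assumption made for this illustration, not a statement of how Bing or Google resolve conflicts, and directives_for is a hypothetical helper.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers the directives of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of directive tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def directives_for(html, crawler):
    """Directives addressed to all crawlers plus those naming this one.

    Assumption: generic and crawler-specific lines are simply combined.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(crawler.lower(), set())
    return generic | specific


HEAD = (
    "<head>"
    '<meta name="robots" content="index">'
    '<meta name="bingbot" content="noarchive">'
    "</head>"
)

bing = directives_for(HEAD, "bingbot")
google = directives_for(HEAD, "googlebot")
print(sorted(bing))    # includes noarchive
print(sorted(google))  # does not include noarchive
```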

Alternatively, if you use a content delivery network (CDN), this can optionally be used to provide access to cached files, without a search engine needing to do this.

It has been reported, at least by The Verge, that Microsoft may be taking action to stop others from scraping Microsoft's index in order to develop rival generative AI services. It is standard for a Terms of Service to prohibit unauthorised copying or crawling of data and, indeed, the Microsoft Terms of Service specifically prohibits unauthorised "web scraping, web harvesting, or web data extraction methods" and the use of Microsoft data to create, train or improve another AI service, but that only applies to what is included as training data. Source: Services Agreement|Microsoft

Therefore, if what has been reported is true, Microsoft is going further by actually taking enforcement actions against unauthorised copying of search result data, and not just the subset of search result data that is AI training data, at least when it comes to partners where search data is licensed and there is an available sanction of terminating that license, without further and more costly legal action needing to be taken.

It also makes logical sense that any future agreement that did allow Microsoft's search data to be used for development of other generative AI services would require partners to respect a 'nocache' and 'noarchive' tag because otherwise, from Microsoft's perspective, partner companies would have access to a more comprehensive set of training data than Microsoft itself. So if Microsoft is taking enforcement actions against the unauthorised copying of their search engine results, it then logically follows that the use of 'nocache' or 'noarchive' to opt out of content being used for Microsoft's generative AI models will also protect content from being used as training data by other search engines to the extent they use Microsoft's search results.

These search engines include but are not limited to:








Google

Easy Countermeasures

To opt out of training Google Bard and Google Vertex, add the following to a robots.txt:


User-agent: Google-Extended
Disallow: /

Source: Overview of Google Crawlers and Fetchers (User Agents)|Google Search Central


Doing this will still permit content to appear as normal in Google Search, Google Images, Google Video and so forth.
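As a quick check of that scoping, the sketch below builds the file from a list of tokens to block and then asks Python's standard robots.txt parser how two different user agents would be treated. build_robots_txt is a hypothetical helper and example.com is a placeholder; the parser only shows how the rules read, not how Google applies them internally.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_tokens):
    """Write one group per token, with groups separated by a blank line."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in blocked_tokens]
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(["Google-Extended"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/portfolio/"
extended_allowed = parser.can_fetch("Google-Extended", url)
search_allowed = parser.can_fetch("Googlebot", url)

print(robots_txt)
print(extended_allowed, search_allowed)
```

The rule only names Google-Extended, so the ordinary search crawler is left untouched.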

This solution is only possible if you have access to the robots.txt file, or if using a platform that provides some indirect means to alter the robots.txt file.

There is another user-agent that you may wish to consider blocking, as revealed by the same source.


User-agent: GoogleOther
Disallow: /


This is a generic crawler used for crawls to inform internal research and development. Blocking Google-Extended is stated to prevent the training of Google Bard and Google Vertex, specifically, which have both already been deployed. This is in contrast to Microsoft, who explicitly state the tags 'nocache' and 'noarchive' affect ALL Microsoft generative AI services, not Bing specifically.

It cannot be ruled out that GoogleOther might be used to gather training data for the development of other future generative AI services provided by Google.

It is possible that Google is less willing to allow users to opt out of the initial development of generative AI services. There is, in fact, precedent for this, for Google has previously used private emails as training data without user consent. This was done with measures to protect the data, but it has also been found that such measures are at heightened risk of being insufficient as LLMs develop.

The Gmail Smart Compose feature (an autocomplete feature) was explained in a presentation by Google employees titled Gmail Smart Compose: Real-Time Assisted Writing|The 25th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. On page 3, it is revealed to have been developed through the use, without consent, of over 8 billion user emails in English. The 8 billion figure is the number of emails left after the pre-processing stage. It has also been alleged that these messages, as opposed to being deleted, continued to be used as part of the training data for Google Bard. Eli Mackinnon wrote an excellent article, Why Won’t Google Give a Straight Answer on Whether Bard was Trained on Gmail Data?|Skiff (Disclosure: Skiff offers online collaboration and communications services with a focus on privacy and is not a neutral news site, but there is no issue with the reporting).


Services