The F5 Networks BIG-IP ASM includes a very neat feature called Web Scraping Protection that I want to cover briefly. I would like to highlight what the feature is and what it does when it is actively doing its job.

I was prompted to write this after noticing that there is not much documentation available on the web about the F5 BIG-IP’s Web Scraping Protection mechanism, and almost none about what it actually does to the underlying web page code presented to your end users.

Web scraping is defined as a computer software technique for extracting information from websites. The people running the web scraper program typically save the contents of what is scraped and use it for their own purposes. Sometimes it is just for archiving, such as Archive.org’s “Wayback Machine”. Several companies even sell what many consider to be legitimate commercial web scraping software. One such company is Mozenda, which lists Microsoft, IBM and Citi among its clients.

But then there are the “Others”, as I like to call them. These can range from hackers with bad intentions to companies simply seeking a competitive advantage over another company. One example that I can think of involved a few websites that make their living by offering vacation deals. These leaders of their industry would publish airfares for many popular destinations on their websites, and their competitors would use a computer program to scrape the pricing off of their pages. The competitors would then take this pricing, subtract a few dollars, load it into another program and update the pricing on their own website, thereby making their vacation deals just a little cheaper than the ones they had scraped!

Web scraping is not an illegal activity, but it can be against the “Terms of Use” for some websites. That being said, it is definitely nice to know that the BIG-IP ASM has a built-in feature that you can enable to protect your own websites from being scraped.

It does this by attempting to determine whether a web client is a human or a headless computer program. To do this it injects a piece of JavaScript code near the top of the HTML pages it serves. I will not provide the full source code for the JavaScript, but I will hopefully provide enough for those searching through Google to be able to find this page.

When you view a web page that is protected by an ASM with web scraping anomaly detection actively enabled, you will see the following elements. To actually see them, open up Firefox, browse to the website in question, then right-click and select “View Source”. You should see a JavaScript insert beginning very close to the top of the page that contains some of the following elements:

var jsepee

You can see by looking at these events that it is looking for keyboard, mouse and other data to determine whether the content is being viewed by a human or by something that falls in the “Others” category. Once it has made a determination, the web application security policy will follow whatever guidelines you have set under the policy settings.
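
To make the idea concrete, here is a minimal sketch of how a script could count interaction events and decide whether a session looks human. This is not the code that the ASM injects. The event names, the `createInteractionTracker` helper and the threshold of three events are all my own assumptions for illustration.

```javascript
// Hypothetical sketch only: this is NOT the script that the ASM injects.
// It counts interaction events and decides whether a session looks human.

// Assumed list of event types that suggest a human; the real list may differ.
const HUMAN_EVENTS = ["keydown", "mousemove", "click", "scroll", "touchstart"];

function createInteractionTracker() {
  // One counter per tracked event type.
  const counts = new Map(HUMAN_EVENTS.map((name) => [name, 0]));
  return {
    // Record one event; event types that are not tracked are ignored.
    record(type) {
      if (counts.has(type)) {
        counts.set(type, counts.get(type) + 1);
      }
    },
    // The session "looks human" once minEvents tracked events were seen.
    // The default of 3 is an arbitrary, assumed threshold.
    looksHuman(minEvents = 3) {
      let total = 0;
      for (const n of counts.values()) {
        total += n;
      }
      return total >= minEvents;
    },
  };
}

// In a browser, wire the tracker to real DOM events.
// Outside a browser (for example in Node) this block is skipped.
if (typeof document !== "undefined") {
  const tracker = createInteractionTracker();
  for (const name of HUMAN_EVENTS) {
    document.addEventListener(name, (event) => tracker.record(event.type), {
      passive: true,
    });
  }
}
```

A headless program that only fetches the HTML never fires any of these events, so a tracker like this would never report a human. That is the basic signal this kind of check relies on.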

So there you have it, yet one more reason why the F5 BIG-IP ASM is an excellent tool to be included in your defense in depth lineup.


11 comments so far

  1. Just wanted to say thanks for the post! We ran into this same problem and your blog helped us to isolate the issue. I have read some of your archives too and will be bookmarking your site!

  2. Glad it came in handy. I was hoping that folks would be able to find it via Google easily enough. Have a Happy New Year!

  3. Hi,

    Nice post. I like the way you start and then conclude your thoughts. Thanks for this information. I really appreciate your work, keep it up.

  4. Would you mind sending the full source to me? I’m seeing this on one of our websites but there is a lot more code than what you posted.

  5. I have replied to your personal e-mail address. Just an FYI to others that have read this post, I have heard from F5 Networks and they will be releasing a knowledge base article which will likely include the full source code. Once it has gone through their legal department and they get that posted up I will update the article above and provide a direct link to their article.

    Thanks again for the feedback all :)

  6. Hi, thanks for this article. I’d like to talk more in depth with you about it. If you have a minute, I’d appreciate if you could mail me.

  7. great stuff!!

  8. Do you think it can help protect mobile phone numbers on advertisement sites from bots?

  9. Thanks for writing this post. The “Others” you referred to have gotten more sophisticated over the past year and can now run all of their web scraping requests through real fully-loaded browsers (IE, Firefox, Chrome), switch between 1,000’s of IP addresses in real-time, and even go as far as to send in dummy crawlers to test the site for things like ASM.

    TheF5Guy – Do you have any new thoughts on this feature now that it’s been a year? (BTW – no love lost towards F5. I’m told they make great load balancers).

  10. I cringe when vendors stop by and attempt to demean the F5 BIG-IP product line implying they are simple load balancers. At least you had the decency to hit my blog from a line not directly linked to your business.

    As far as answering your questions: YES. The BIG-IP product line can easily protect against all of these things and, in my opinion, trump the services provided by your company, because the BIG-IPs are not limited to only providing that one service. They are more than capable of providing the services you mentioned x100. Stay tuned to F5 Networks for press releases :)

  11. Hi, thanks for the post! One of my company’s websites is constantly being scraped, and I’m looking for a way to deal with it. Can you provide me with the source code, please? If the official document is out, that would also do.

    I’m jotting this down for the benefit of others who read your blog. I’ve learnt the following from many web scraping attempts.

    * The IP address will keep changing.
    * The user agent will also keep changing, but you might find a pattern in the weird user agents being logged.
    * By logging the entire X-Forwarded-For header, you will see that most of the time these scraping requests originate from lots of proxies. You will see 3-4 IP addresses in a request, showing the proxies used.
    * Requests from modern browsers will use HTTP/1.1, but someone running a tool or CLI may show HTTP/1.0.
    * The scraper may use a fixed “referrer” in all of its requests.
    * Requests will originate from lots of countries, or they are spoofed to look that way.