Digitalblend-CL

Baidu: means too many crawls...

At the start of 2012, Tilluminati.com began experiencing slowdowns and, for no apparent reason, heavy bandwidth usage. I had no idea what was causing the excessive usage or who was sending so many web spiders to crawl the site. I knew that it could not be a hacker: all of the hits to the website had to be coming from a specific set of IP addresses. Since this sort of thing is outside of the web host's statement of support, I was on my own.

 

I caught Baidu..

After I upgraded Magento in April, I finally had a means of understanding what was going on. The bandwidth usage in early 2012 was not yet excessive, so at first I did nothing. But as the weeks went by things only got worse, and something was causing the site to use an average of 110% to 120% of the server's resources. That was when I knew I had to try to solve the problem. Magento 1.7 has an online user tracking tool, and it recorded a constant stream of hits on all the pages of the site coming from two IP address ranges, 180.76.5.0 - 255 and 180.76.6.0 - 255. Until Magento 1.7 I had no success stopping the traffic from these two sets of IP addresses; robots.txt, nothing worked. I was trying to avoid having to block these two IP address ranges, but that is what I ended up doing today.

So if you maintain servers and have a bandwidth issue, check the server logs and see whether IP addresses from Baidu keep showing up. They start with either 180.76.5 or 180.76.6 and cover the entire range of the last number, 0 to 255. The problem is that every hour Baidu sends an average of 20-60 web search spiders to crawl the site, sucking up all the bandwidth on the web hosting account.
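To make the log check concrete, here is a minimal Python sketch that counts hits per block of 256 addresses. It assumes the access log uses the common Apache format where each line starts with the client IP. The function name `hits_per_block` and the sample lines are made up for illustration; they are not taken from my real logs.

```python
import re
from collections import Counter

# Matches the client IP at the start of a Common/Combined Log Format line.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def hits_per_block(lines):
    """Count requests per block of 256 addresses (first three numbers)."""
    counts = Counter()
    for line in lines:
        match = IP_AT_START.match(line)
        if match:
            prefix = match.group(1).rsplit(".", 1)[0]
            counts[prefix + ".0/24"] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample_access_log = [
    '180.76.5.12 - - [28/May/2012:10:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '180.76.5.200 - - [28/May/2012:10:00:02 +0000] "GET /about HTTP/1.1" 200 734',
    '180.76.6.9 - - [28/May/2012:10:00:05 +0000] "GET /catalog HTTP/1.1" 200 2048',
    '203.0.113.7 - - [28/May/2012:10:01:00 +0000] "GET / HTTP/1.1" 200 512',
]

# Print the busiest blocks first.
for block, hits in hits_per_block(sample_access_log).most_common():
    print(block, hits)
```

To run it against a real log, pass an open file instead of the sample list, for example `hits_per_block(open("access.log"))`, where the file name depends on your host.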

I searched Google and looked through a number of web hosting forums; a number of webmasters and administrators have complained about the same issue.

A solution found

The solution is simple: at the bottom of the .htaccess file in the root directory that your domain is linked to on the web host, block the two IP address ranges that Baidu uses. This is the only really effective way I have found to solve the problem. So far it has worked for me: Magento is no longer logging visits from Baidu and the bandwidth usage has dropped to a normal level.

The issue is that once Baidu has found your site, it keeps sending a steady stream of search spiders to it. Yes, I know that Baidu's search spiders are supposed to obey your robots.txt file, but in practice this is not happening.

You will need to insert the code at the bottom of the .htaccess file. To block Baidu effectively, you will have to block the entire range of IP addresses that it is being served from.

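############################################
# allow all except those indicated here

# "order allow,deny" checks the allow rules first, then the deny rules,
# so a matching deny line overrides "allow from all"
order allow,deny
allow from all
# replace each xxx.xx.x.x with an address or range you want to block
deny from xxx.xx.x.x
deny from xxx.xx.x.x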

Once you have inserted the code, check your server logs to see if the IP addresses from Baidu are being blocked. I don't mind search engines crawling a site, since it helps a site get noticed, but 20-60 hits every hour is just too much.
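One way to do that check is to look for requests from the blocked ranges that still got through. The Python sketch below assumes an Apache-style access log and that a denied request is logged with status 403 (Forbidden), which is what Apache returns for a deny rule. The name `unblocked_hits`, the two ranges written with /24, and the sample lines are illustrative assumptions.

```python
import ipaddress
import re

# Client IP and HTTP status from a Common/Combined Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3})\b')

# The two Baidu ranges named in this post, written as /24 blocks.
BLOCKED = [
    ipaddress.ip_network("180.76.5.0/24"),
    ipaddress.ip_network("180.76.6.0/24"),
]


def unblocked_hits(lines, blocked=BLOCKED):
    """Return (ip, status) for blocked-range requests not answered with 403."""
    leaks = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # the log holds a hostname here instead of an IP
        status = int(match.group(2))
        if status != 403 and any(ip in net for net in blocked):
            leaks.append((str(ip), status))
    return leaks


# Hypothetical log lines, for illustration only.
sample_after_block = [
    '180.76.5.12 - - [28/May/2012:12:00:01 +0000] "GET / HTTP/1.1" 403 202',
    '180.76.6.9 - - [28/May/2012:12:00:05 +0000] "GET /catalog HTTP/1.1" 200 2048',
    '203.0.113.7 - - [28/May/2012:12:01:00 +0000] "GET / HTTP/1.1" 200 512',
]

print(unblocked_hits(sample_after_block))
```

An empty list means every logged request from those ranges was refused. Anything listed points to a deny line that is missing or mistyped.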

20120528: Update and more information.

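After doing some more work with blocking IP addresses via .htaccess, I found some more useful information. If you are interested in blocking a range of IP addresses from a certain provider or website, you can add either /16 or /24 to the end of an IP address. A /24 covers the 256 addresses that share the first three numbers, and a /16 covers the 65,536 addresses that share the first two.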

I have included an article from Wikipedia with more information about the IPv4 subnet reference.

Here is the article:

Click here to view the article
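To see what a /24 or /16 actually covers before putting it in a deny line, Python's standard `ipaddress` module can expand the notation. This is only a sketch: the helper name `describe` is made up, and the /16 shown is just an example of the notation, not a range I am recommending you block.

```python
import ipaddress


def describe(cidr):
    """Return the first address, last address, and size of a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1]), net.num_addresses


for cidr in ("180.76.5.0/24", "180.76.0.0/16"):
    first, last, count = describe(cidr)
    print(f"{cidr}: {first} to {last} ({count} addresses)")

# Membership test: is one address inside a block?
crawler = ipaddress.ip_address("180.76.6.77")
print(crawler in ipaddress.ip_network("180.76.6.0/24"))
```

The /24 form matches the "0 to 255" ranges described earlier, so one deny line with /24 covers a whole block of 256 addresses.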

I found out that another search engine is a big bandwidth hog: Yandex, a Russian search engine. I had to block its IP address along with several other Baidu IP addresses that have been crawling the website.

After looking around for information about IP Addresses to block, I came across several other pieces of useful info.

One is a list of web traffic IP Addresses.
Here is the link to the website.

This is a Time Magazine website article.
Here is the link to the article.