Geotargeting and allowing Spiders/Crawlers


seito

Hi! Any help on this topic would be much appreciated.

 

Here is the deal: I need to redirect visitors from some countries to another URL, but search engine bots must keep free access even if they come from those countries.

 

Here is what I did until now:

 

I used the MaxMind PHP API and databases to set up the redirect, then tried to modify the code to redirect ONLY if the user is from a blocked country AND IS NOT an allowed crawler... surprise, surprise, it's not working as I imagined :)!

 

Here is the code so far:

 

#!/usr/bin/php -q
<?php

// This code demonstrates how to lookup the country by IP Address
include("geoip.inc");

// Uncomment if querying against GeoIP/Lite City.
// include("geoipcity.inc");

$gi = geoip_open("GeoIP.dat",GEOIP_STANDARD);
$country = geoip_country_code_by_addr($gi, $_SERVER['REMOTE_ADDR']);
geoip_close($gi);

$my_countries = array('de', 'se', 'no', 'ee', 'it', 'lv', 'cz', 'dk', 'sk', 'at', 'ch', 'lu', 'nl', 'es', 'hu', 'sg', 'gr', 'ua', 'fi', 'ru', 'cn', 'hk', 'my', 'id', 'us', 'ar', 'mx', 'ec', 'cr', 'py', 'br');
$allowed_spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'MSNbot', 'Ask Jeeves', 'Teoma', 'Architext spider', 'FAST-WebCrawler', 'Slurp', 'Yahoo Slurp', 'ia_archiver', 'Scooter', 'crawler@fast', 'Crawler', 'InfoSeek sidewinder', 'Lycos_Spider_(T-Rex)');
$agent_name = $_SERVER['HTTP_USER_AGENT'];

if (in_array(strtolower($country), $my_countries)) {
        foreach($allowed_spiders as $s){
                if(!strpos($s,$agent_name)){
                    header('Location: www.REDIRECT URL.com');
                }
        }
exit;
}

?>

 

 

How I imagined this code should work: first check where the user is coming from, then compare the GeoIP country code against the codes of the banned countries. If the traffic is indeed from a banned country, also check whether the user agent contains any name from the array of allowed spiders (Googlebot can appear as Googlebot or as Googlebot/2.1, for example). If it is not an allowed spider, redirect the user to another address.

 

I tried many variations of this, but nothing seems to work. Each time I fetched the page as Googlebot in Webmaster Tools, I got a 302 redirect. Any help or guidance here is MUCH appreciated.
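
A note on why every fetch ends in a 302: the loop above sends the Location header as soon as a single spider name fails to match, and no user agent contains every name in the list, so the header is always sent. Below is a minimal sketch of the intended flow, assuming the decision is made only after the whole list has been checked. The helper names is_allowed_spider() and should_redirect() are made up for illustration, the lists are shortened, and the two user agent strings are just sample values.

```php
<?php
// Hypothetical helper: true if any allowed spider name occurs in the user agent.
function is_allowed_spider(string $agent, array $spiders): bool
{
    foreach ($spiders as $spider) {
        // Haystack is the user agent, needle is the spider name.
        if (stripos($agent, $spider) !== false) {
            return true;
        }
    }
    return false; // every name was checked and none matched
}

// Hypothetical helper: redirect only visitors from blocked countries
// whose user agent is not an allowed spider.
function should_redirect(string $country, string $agent, array $countries, array $spiders): bool
{
    return in_array(strtolower($country), $countries, true)
        && !is_allowed_spider($agent, $spiders);
}

$my_countries    = array('de', 'us');
$allowed_spiders = array('Googlebot', 'MSNbot', 'Slurp');

$bot     = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$browser = 'Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0';

var_dump(should_redirect('US', $bot, $my_countries, $allowed_spiders));     // bool(false)
var_dump(should_redirect('US', $browser, $my_countries, $allowed_spiders)); // bool(true)
var_dump(should_redirect('FR', $browser, $my_countries, $allowed_spiders)); // bool(false)
```

In the real script, the header() call and exit would go inside a single if (should_redirect(...)) block, so the redirect is sent at most once.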

 

Thank you PHP JEDI in advance ;)!!


Thank you for the fast reply. Sadly, it's not working... yet :)! I actually tried this at first. But the problem with !in_array is that it searches for an exact match in $allowed_spiders, and spiders can have different suffixes attached to their names. Googlebot, for instance, can be Googlebot/2.1. In that case it's not recognized and gets redirected...

 

Any other idea?

 

 


The reason strpos doesn't work is that your spider names are too generic for the way you call it: strpos($s, $agent_name) looks for the whole user agent inside the spider name. You have "Googlebot", but if the bot is actually "Googlebot/2.1" it's not going to match. The other way around, though, would match.

So you're either going to need to put all of the actual bot names in your array, or use something like similar_text or levenshtein.
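
To make the direction of the match concrete, here is a small sketch. The user agent string is only a sample value; the point is which argument is the haystack, plus the usual caveat that strpos() returns offset 0 for a match at the start of the string, which a loose !strpos() check treats the same as "not found".

```php
<?php
$agent  = 'Googlebot/2.1 (+http://www.google.com/bot.html)';
$spider = 'Googlebot';

// Spider name as haystack: the longer user agent can never be found inside it.
var_dump(strpos($spider, $agent));           // bool(false)

// User agent as haystack: the spider name is found at offset 0.
var_dump(strpos($agent, $spider));           // int(0)

// Offset 0 is falsy, so a loose check misreads this match as a miss.
var_dump(!strpos($agent, $spider));          // bool(true)

// Strict comparison against false gives the intended answer.
var_dump(strpos($agent, $spider) !== false); // bool(true)
```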


Try this code

if (in_array(strtolower($country), $my_countries)) {
         foreach($allowed_spiders as $s){
                if(!stristr($s,$agent_name)){
                      header('Location: www.REDIRECT URL.com');
                }
        }
exit;
}


No, sorry! This still gives me 302 Moved Temporarily...

 

This has now become a challenge, huh :)?! Any more ideas?

 

BTW, can I ask what the purpose of <?php ob_start(); ?> is after the code you provided?

 

 

@scootstah:

I think we are onto something now. Where could I find such a list of spider user agents? I tried searching on Google but only found really old ones (2006) or DNS records. No up-to-date user agents for bots...

Or could you please elaborate a bit more on the similar_text() and levenshtein() functions? I checked the topics at the links you provided, but I have trouble modifying the examples to work in my situation.
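
For anyone else wondering what these two functions return, here is a small sketch using the short names from this thread rather than a full user agent header. levenshtein() counts the single-character edits needed to turn one string into the other, and similar_text() counts matching characters and can also report a percentage through its third argument. With a complete user agent string the numbers drift much further apart, so a plain substring check is probably the simpler tool here.

```php
<?php
$spider = 'Googlebot';
$agent  = 'Googlebot/2.1';

// Four characters ("/2.1") must be inserted to get from $spider to $agent.
echo levenshtein($spider, $agent), "\n";            // 4

// All nine characters of "Googlebot" are shared by both strings.
echo similar_text($spider, $agent, $percent), "\n"; // 9

// $percent now holds the similarity as a percentage.
echo round($percent, 1), "\n";
```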


Try my new version of code  :shy:

 


if (in_array(strtolower($country), $my_countries)) {
        foreach($allowed_spiders as $s){
                list($val1,$val2) = explode(";",$agent_name);
                list($check) = explode("/",$val2);
                if(!stristr($s,$check)){
                        header('Location: www.REDIRECT URL.com');
                }
        }
exit;
}


No, sadly it didn't work. I still get a 302 redirect when fetching as Googlebot in Webmaster Tools.

To update you on my own changes:

- I have started to use the CloudFlare service. I think this is of no importance for our code and that $agent_name = $_SERVER['HTTP_USER_AGENT']; should still work normally.

- CloudFlare offers its own geotargeting solution. This matters for us, since users are routed through their proxy and so do not necessarily ''appear'' to come from their own country. I changed this:

 

#!/usr/bin/php -q
<?php

// This code demonstrates how to lookup the country by IP Address
include("geoip.inc");

// Uncomment if querying against GeoIP/Lite City.
// include("geoipcity.inc");

$gi = geoip_open("GeoIP.dat",GEOIP_STANDARD);
$country = geoip_country_code_by_addr($gi, $_SERVER['REMOTE_ADDR']);
geoip_close($gi);

 

 

Into this:

 

 

#!/usr/bin/php -q
<?php

$country = $_SERVER["HTTP_CF_IPCOUNTRY"];

 

 

This should return the user's REAL two-letter country code... I think I did it right, since blocking by country now seems to be stable, as far as I managed to test it through free proxy servers. I will need to motivate my WWW friends to run some tests too ;)... any volunteers ;)?
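
One thing to watch with this header: it is only present when the request actually passes through CloudFlare, so reading it directly raises a notice (and matches no country) on direct hits. Here is a minimal sketch of a defensive read. The helper name visitor_country() is made up, and treating 'XX' as "unknown country" is an assumption based on how CloudFlare is understood to label requests it cannot locate.

```php
<?php
// Hypothetical helper: returns a lowercase two-letter code, or null when unknown.
function visitor_country(array $server): ?string
{
    $code = strtolower(trim($server['HTTP_CF_IPCOUNTRY'] ?? ''));

    // Assumption: 'xx' marks an unknown country. Anything that is not
    // exactly two letters (missing header, other special values) is ignored too.
    if ($code === 'xx' || !preg_match('/^[a-z]{2}$/', $code)) {
        return null;
    }
    return $code;
}

var_dump(visitor_country(array('HTTP_CF_IPCOUNTRY' => 'DE'))); // string(2) "de"
var_dump(visitor_country(array()));                             // NULL
```

In the script itself this would be called as visitor_country($_SERVER), with a null result treated as "not a blocked country".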

 

From here on, the blocked countries, allowed bots and user agent stay the same:

 

$my_countries = array('de', 'se', 'no', 'ee', 'it', 'lv', 'cz', 'dk', 'sk', 'at', 'ch', 'lu', 'nl', 'es', 'hu', 'sg', 'gr', 'ua', 'fi', 'ru', 'cn', 'hk', 'my', 'id', 'us', 'ar', 'mx', 'ec', 'cr', 'py', 'br');
$allowed_spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'MSNbot', 'Ask Jeeves', 'Teoma', 'Architext spider', 'FAST-WebCrawler', 'Slurp', 'Yahoo Slurp', 'ia_archiver', 'Scooter', 'crawler@fast', 'Crawler', 'InfoSeek sidewinder', 'Lycos_Spider_(T-Rex)');
$agent_name = $_SERVER['HTTP_USER_AGENT'];

 

 

To put it together, this is what I have at the moment (including your solution):

 

#!/usr/bin/php -q
<?php

$country = $_SERVER["HTTP_CF_IPCOUNTRY"];
$my_countries = array('de', 'se', 'no', 'ee', 'it', 'lv', 'cz', 'dk', 'sk', 'at', 'ch', 'lu', 'nl', 'es', 'hu', 'sg', 'gr', 'ua', 'fi', 'ru', 'cn', 'hk', 'my', 'id', 'us', 'ar', 'mx', 'ec', 'cr', 'py', 'br');
$allowed_spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'MSNbot', 'Ask Jeeves', 'Teoma', 'Architext spider', 'FAST-WebCrawler', 'Slurp', 'Yahoo Slurp', 'ia_archiver', 'Scooter', 'crawler@fast', 'Crawler', 'InfoSeek sidewinder', 'Lycos_Spider_(T-Rex)');
$agent_name = $_SERVER['HTTP_USER_AGENT'];

if (in_array(strtolower($country), $my_countries)) {
        foreach($allowed_spiders as $s){
                list($val1,$val2) = explode(";",$agent_name);
                list($check) = explode("/",$val2);
                if(!stristr($s,$check)){
                        header('Location: www.REDIRECT URL.com');
                }
        }
exit;
}

?>

 


Hi.

 

If anybody needs something similar in the future, this is what worked for me in the end:

 

<?php

$country = $_SERVER["HTTP_CF_IPCOUNTRY"];
$my_countries = array('de', 'se', 'no', 'ee', 'it', 'lv', 'cz', 'dk', 'sk', 'at', 'ch', 'lu', 'nl', 'es', 'hu', 'sg', 'gr', 'ua', 'fi', 'ru', 'cn', 'hk', 'my', 'id', 'ar', 'mx', 'ec', 'cr', 'py', 'br', 'us');

if (in_array(strtolower($country), $my_countries)) {

	// Returns true when the user agent matches none of the names below.
	function getnotCrawler($userAgent) {
		// Note: 'firefox' also matches ordinary Firefox browsers, so they are never redirected.
		$crawlers = 'firefox|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|' .
		'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|' .
		'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
		$notCrawler = (preg_match("/$crawlers/i", $userAgent) == 0);
		return $notCrawler;
	}

	$notCrawler = getnotCrawler($_SERVER['HTTP_USER_AGENT']);

	if ($notCrawler) {
		header('Location: www.REDIRECTION URL.com');
		exit;
	} else {
		// Recognized crawler: no redirect.
	}

}

?>

 

$_SERVER["HTTP_CF_IPCOUNTRY"] is specific to CloudFlare, which I'm using. If you are not using it, adapt this part to your needs to get the two-letter code of the visitor's country.
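
If you would rather keep the crawler names in an array, as in the earlier attempts, the pattern can be built from that array. This is only a sketch: the helper name crawler_pattern() is made up and the list is shortened. It runs each name through preg_quote() so that entries containing regex metacharacters, such as 'Lycos_Spider_(T-Rex)' from the original list, are matched literally instead of being read as a group.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern from a list of crawler names.
function crawler_pattern(array $names): string
{
    $quoted = array_map(function ($name) {
        // Escape regex metacharacters such as ( ) and . as well as the delimiter.
        return preg_quote($name, '/');
    }, $names);

    return '/' . implode('|', $quoted) . '/i';
}

$pattern = crawler_pattern(array('Googlebot', 'msnbot', 'Lycos_Spider_(T-Rex)'));

var_dump(preg_match($pattern, 'Mozilla/5.0 (compatible; Googlebot/2.1)')); // int(1)
var_dump(preg_match($pattern, 'Lycos_Spider_(T-Rex)'));                    // int(1)
var_dump(preg_match($pattern, 'Mozilla/5.0 Firefox/115.0'));               // int(0)
```

Also worth noting when sending the redirect: the Location value should be a full URL including http:// or https://, because a value without a scheme is resolved as a path relative to the current page.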
