Defeat Spam Blogs With IP Based Content Delivery
The majority of bloggers are forced to deal with spam blogs (splogs, aka scraper blogs), and even though a variety of countermeasures exist, they just don’t seem to do the trick. Most of the time, splogs scrape only an excerpt from the post, which makes a permalink placed at the bottom of the post useless. Some harvesting software is even smart enough to strip out these attempts at foiling the scrapers, so what’s a blogger to do? Today, I introduce a way to deliver entirely different content to these spammers via IP based content delivery.
First Things First
In order for this to work, we’ll need a list of IP addresses that known offenders use. For your convenience, I’ve compiled this massive list of 5,780 offending IPs (I highly recommend you use your own unique list compiled from your own server logs). Copy those IPs and save them to your server’s root directory with a filename of your choice. Remember the filename; you’ll need it in just a second. Now that you’ve got your enemy plotted, let’s get to the code.
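Compiling your own list from server logs could be sketched roughly as follows. This is only an illustration, assuming an access log in which each line begins with the client IP (as in the common log format). The function name `top_ips_from_log`, the threshold, and the sample lines are all hypothetical, and a high request count on its own does not prove that an address belongs to a scraper.

```php
<?php
// Hypothetical helper (not part of the original script): count requests per
// client IP and return the addresses seen at least $min_hits times.
// Assumption: each log line starts with the client IP.
function top_ips_from_log(array $log_lines, int $min_hits): array {
    $counts = [];
    foreach ($log_lines as $line) {
        // Take the first space-delimited token; skip it unless it is a valid IPv4 address.
        $token = strtok(trim($line), " ");
        if ($token === false || filter_var($token, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
            continue;
        }
        $counts[$token] = ($counts[$token] ?? 0) + 1;
    }
    // Keep only the frequent addresses, most active first.
    $frequent = array_filter($counts, function ($hits) use ($min_hits) {
        return $hits >= $min_hits;
    });
    arsort($frequent);
    return array_keys($frequent);
}

// Made-up sample lines for illustration only.
$sample = [
    '203.0.113.7 - - [01/Jan/2000:00:00:01 +0000] "GET /feed/ HTTP/1.1" 200 2326',
    '203.0.113.7 - - [01/Jan/2000:00:00:02 +0000] "GET /feed/ HTTP/1.1" 200 2326',
    '203.0.113.7 - - [01/Jan/2000:00:00:03 +0000] "GET /feed/ HTTP/1.1" 200 2326',
    '198.51.100.2 - - [01/Jan/2000:00:00:04 +0000] "GET / HTTP/1.1" 200 5120',
    '203.0.113.9 - - [01/Jan/2000:00:00:05 +0000] "GET /feed/ HTTP/1.1" 200 2326',
    '203.0.113.9 - - [01/Jan/2000:00:00:06 +0000] "GET /feed/ HTTP/1.1" 200 2326',
];

// One IP per line, ready to review by hand before saving as the list file.
echo implode("\n", top_ips_from_log($sample, 2)), "\n";
```

Review the output by hand before saving it as your list; search engine crawlers and feed readers also make many requests.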
Backup, Modify, Test
Back up your theme’s single.php to single.post.original.txt or a name of your choice. Then open single.php in a text editor and insert the code below at the very top of the file.
Please note: I take absolutely no credit for this code. The original source is located here.
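<?php
function chkiplist($ip) {
    $lines = file("THE-FILENAME-OF-IP-LIST.txt");
    // Treat a missing or unreadable list as "no match".
    if ($lines === false) { return false; }
    $found = false;
    // Encode the visitor's IP as "1" plus four zero-padded octets
    // (e.g. 1192168001010) so addresses can be compared as plain numbers.
    $split_it = explode(".", $ip);
    $ip = "1" . sprintf("%03d", $split_it[0] ?? 0) .
        sprintf("%03d", $split_it[1] ?? 0) . sprintf("%03d", $split_it[2] ?? 0) .
        sprintf("%03d", $split_it[3] ?? 0);
    foreach ($lines as $line) {
        $line = chop($line);
        // "x" is accepted as an alias for the "*" wildcard.
        $line = str_replace("x", "*", $line);
        // This line was cut off in the post; closing the pattern so that it
        // strips letters is an assumption about the original.
        $line = preg_replace("|[A-Za-z]|", "", $line);
        $max = $line;
        $min = $line;
        // "*" stands for a whole octet: 000 to 999.
        if (strpos($line, "*") !== false) {
            $max = str_replace("*", "999", $max);
            $min = str_replace("*", "000", $min);
        }
        // "?" stands for a single digit: 0 to 9.
        if (strpos($line, "?") !== false) {
            $max = str_replace("?", "9", $max);
            $min = str_replace("?", "0", $min);
        }
        if ($max == "") { continue; }
        // "a.b.c.d - e.f.g.h" is a range: the upper bound is the right side.
        if (strpos($max, " - ") !== false) {
            $split_it = explode(" - ", $max);
            if (!preg_match("|\d{1,3}\.|", $split_it[1])) {
                $max = $split_it[0];
            } else {
                $max = $split_it[1];
            }
        }
        // The lower bound of a range is the left side.
        if (strpos($min, " - ") !== false) {
            $split_it = explode(" - ", $min);
            $min = $split_it[0];
        }
        // Encode the upper bound; "a-b" inside an octet contributes b.
        $split_it = explode(".", $max);
        $max = "1";
        for ($i = 0; $i < 4; $i++) {
            $octet = $split_it[$i] ?? "";
            if (strpos($octet, "-") !== false) {
                $another_split = explode("-", $octet);
                $octet = $another_split[1];
            }
            $max .= sprintf("%03d", $octet);
        }
        // Encode the lower bound; "a-b" inside an octet contributes a.
        $split_it = explode(".", $min);
        $min = "1";
        for ($i = 0; $i < 4; $i++) {
            $octet = $split_it[$i] ?? "";
            if (strpos($octet, "-") !== false) {
                $another_split = explode("-", $octet);
                $octet = $another_split[0];
            }
            $min .= sprintf("%03d", $octet);
        }
        if (($ip <= $max) && ($ip >= $min)) {
            $found = true;
            break;
        }
    }
    return $found;
}
$status = chkiplist($_SERVER['REMOTE_ADDR']);
?>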
Ok, now what? Change the third line so the filename you saved your IPs to is specified, then look for:
<?php the_content(); ?>
Immediately before that line, add something similar to the following:
<?php if ($status == 1): ?>
Hey, thanks for scraping my post:
<a href="<?php the_permalink(); ?>" title="<?php the_title(); ?>">
<?php the_title(); ?><br />
Click here to see the site this content was stolen from!
<?php the_excerpt(); ?>
<?php the_title(); ?></a>
Original Source: <?php the_permalink(); ?>
<?php else: ?>
The spam bots will see this:
Hey, thanks for scraping my post:
Defeat Spam Blogs With IP Based Content Delivery
Defeat Spam Blogs With IP Based Content Delivery
Original Source:
www.nullamatix.com/defeat-spam-blogs-with-ip-based-content-delivery/
We’re not done yet. To prevent any PHP errors, you’ll need to add this:
<?php endif; ?>
immediately after:
<?php the_content(); ?>
The whole thing, minus the chkiplist() function defined above, should look something like this:
<div class="entry-content">
    <?php if ($status == 1): ?>
    Hey, thanks for scraping my post:
    <a href="<?php the_permalink(); ?>" title="<?php the_title(); ?>">
    <?php the_title(); ?><br />
    Click here to see the site this content was stolen from!
    <?php the_excerpt(); ?>
    <?php the_title(); ?></a>
    Original Source: <?php the_permalink(); ?>
    <?php else: ?>
    <?php the_content(); ?>
    <?php endif; ?>
</div>
To test everything out and make sure your blog is up and running properly, just visit a post like you normally would. If the content displays as usual, you’re good to go. To test the scraper’s view, just add your own IP to the list of known spammers, reload the post, and remove the IP again when you’re done. The script above also supports wildcards, among other variations. Check out the original source mentioned above for more details.
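To get a feel for how the wildcard matching works, it may help to look at the encoding on its own. The sketch below is only an illustration: `encode_ip`, `pattern_bounds`, and `ip_matches` are hypothetical helper names that mirror the idea in chkiplist() and are not part of the original script. Each octet is zero-padded to three digits and prefixed with "1", and a "*" is expanded to 000 for the lower bound and 999 for the upper bound, so a match is just a range check.

```php
<?php
// Hypothetical helpers for illustration; not part of the original script.

// Encode a dotted IPv4 string as "1" followed by four zero-padded octets.
function encode_ip(string $ip): string {
    $encoded = "1";
    foreach (explode(".", $ip) as $octet) {
        $encoded .= sprintf("%03d", (int) $octet);
    }
    return $encoded;
}

// Expand "*" wildcards into [lower, upper] encoded bounds.
function pattern_bounds(string $pattern): array {
    $min = encode_ip(str_replace("*", "000", $pattern));
    $max = encode_ip(str_replace("*", "999", $pattern));
    return [$min, $max];
}

// True when $ip falls inside the bounds of $pattern.
function ip_matches(string $ip, string $pattern): bool {
    [$min, $max] = pattern_bounds($pattern);
    $value = encode_ip($ip);
    // All three strings are 13 digits long for four-octet input, so comparing
    // them as strings orders them the same way as comparing them as numbers.
    return strcmp($value, $min) >= 0 && strcmp($value, $max) <= 0;
}

var_dump(encode_ip("192.168.1.10"));            // string(13) "1192168001010"
var_dump(ip_matches("10.0.5.77", "10.0.*.*"));  // bool(true)
var_dump(ip_matches("10.1.5.77", "10.0.*.*"));  // bool(false)
```

Going by the code above, a list entry may also use "x" in place of "*", "?" for a single digit, and "a.b.c.d - e.f.g.h" for an explicit range.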
To Conclude…
This won’t immediately work on every new splog that comes out, but if you actively check your server logs, you can stop most of ‘em by adding the offending IP(s). Now for the real question: what other purposes might this nifty little script serve? Just use your imagination - there is a hidden agenda behind this entire post.



This idea is interesting, but I have a few concerns.
First, shouldn’t this code go in the RSS feed and not the single.php page? Most scraping is done through the RSS feed itself and not the Web site, so I’m not sure how much is gained by putting this code in the site itself.
Second, I worry about the possibility, or perhaps probability, of false positives. Spammers change IP addresses regularly and it is entirely possible that once an IP is abandoned, it could be picked up by someone who wants to make legitimate use of the content. If the RIAA can’t consistently pinpoint a person by their IP address, I don’t see what hope we have.
Finally, in the same vein, since spammers change IP addresses so regularly, many will just dodge the list. It is a big part of the reason why fighters of email spam have done away with IP detection as a tool.
I like the idea in principle, but I don’t think that this implementation of it is going to be effective enough to warrant the potential risks.
That is just my opinion though, I’m sure many others will disagree with me.
Thank you for writing this and for providing another tool. Even though I won’t be using it, I hope that, perhaps, others are able to find it productive!
Jonathan,
Thanks for taking the time to comment on the post. Your first suggestion makes complete sense and I honestly can’t believe I failed to mention that in the article. You’re right, most scrapers probably do scrape RSS feeds rather than the HTML.
False positives are inevitable, no doubt about it.
Your third and final remark again makes total sense, but geographically speaking, if you’re not concerned with Russian and/or Romanian visitors, this little script would work perfectly.
The real reason I wrote this article wasn’t because I wanted to block splog bots. Just use your creativity and imagine what else you could use this for.
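The feed suggestion from the comments could be sketched roughly as follows. This is an untested illustration, assuming WordPress’s `the_content_feed` filter and a chkiplist() function that is loaded where the filter is registered (for example in the theme’s functions.php rather than single.php). The names `scraper_feed_notice` and `filter_feed_for_scrapers` are hypothetical.

```php
<?php
// Hypothetical helper: build the notice served to listed IPs in place of the post body.
function scraper_feed_notice(string $title, string $url): string {
    // Escape both values; the fourth argument avoids double-encoding existing entities.
    $safe_title = htmlspecialchars($title, ENT_QUOTES, 'UTF-8', false);
    $safe_url   = htmlspecialchars($url, ENT_QUOTES, 'UTF-8', false);
    return 'Hey, thanks for scraping my post: <a href="' . $safe_url . '">' . $safe_title . '</a>';
}

// Hypothetical filter callback: swap the feed content only for listed IPs.
// Assumption: chkiplist() from the article has been made available here.
function filter_feed_for_scrapers($content) {
    if (function_exists('chkiplist') && chkiplist($_SERVER['REMOTE_ADDR'] ?? '')) {
        return scraper_feed_notice(get_the_title(), get_permalink());
    }
    return $content;
}

// Register the callback only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('the_content_feed', 'filter_feed_for_scrapers');
}
```

If the feed is served through a third-party feed proxy, the visitor address seen here would be the proxy’s, so this approach only helps when scrapers fetch the feed from your server directly.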