Tuesday, July 7, 2009

Grouping blogger.com blogs based on the AdSense account. Does Google do enough to protect its users' privacy?

Hello, as you may be aware, Google's popular blogger.com/blogspot.com blogging site allows you to easily monetize your blog by serving AdSense advertisements on it.

While a blogger has the option of not displaying their profile on the page, it is still possible to work out which blogs share an owner, because every page that serves ads exposes the owner's AdSense publisher ID!


/*
* 7/7/2009
*
* gcc -I/usr/include/libxml2 -o getblognames getblognames.c -lcurl -lxml2
*
* Generates a list of blogspot.com/blogger.com blog URLs
* by accessing http://www.blogger.com/next-blog to get a random blog redirect.
*
* Based on http://cool.haxx.se/cvs.cgi/curl/docs/examples/getinmemory.c
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>    /* strcasecmp() */

#include <curl/curl.h>
/* <curl/types.h> was dropped from later libcurl releases; curl.h is enough */
#include <curl/easy.h>

#include <libxml/HTMLparser.h>





struct MemoryStruct {
    char *memory;   /* downloaded data, kept NUL-terminated */
    size_t size;    /* bytes stored, not counting the terminator */
};

static void *myrealloc(void *ptr, size_t size)
{
    /* There might be a realloc() out there that doesn't like reallocing
       NULL pointers, so we take care of it here */
    if (ptr)
        return realloc(ptr, size);
    else
        return malloc(size);
}

static size_t
WriteMemoryCallback(void *ptr, size_t size, size_t nmemb, void *data)
{
    size_t realsize = size * nmemb;
    struct MemoryStruct *mem = (struct MemoryStruct *) data;
    char *grown = myrealloc(mem->memory, mem->size + realsize + 1);

    /* Out of memory: returning a short count makes libcurl abort the
       transfer instead of silently dropping data */
    if (grown == NULL)
        return 0;

    mem->memory = grown;
    memcpy(&(mem->memory[mem->size]), ptr, realsize);
    mem->size += realsize;
    mem->memory[mem->size] = 0;     /* keep the buffer NUL-terminated */
    return realsize;
}

/* Walk the tree depth-first and print the href of every <a> element */
void print_ahrefs(xmlNode * a_node)
{
    xmlNode *cur_node = NULL;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next)
    {
        if (cur_node->type == XML_ELEMENT_NODE
            && !strcasecmp((char *) cur_node->name, "a"))
        {
            xmlChar *url = xmlGetProp(cur_node, (xmlChar *) "href");

            if (url)    /* xmlGetProp() returns NULL when there is no href */
            {
                printf("%s\n", url);
                fflush(stdout);
                xmlFree(url);
            }
        }

        print_ahrefs(cur_node->children);
    }
}

int main(int argc, char **argv)
{
    CURL *curl;
    CURLcode res;

    struct MemoryStruct chunk;

    curl_global_init(CURL_GLOBAL_ALL);

    curl = curl_easy_init();
    if (curl == NULL)
    {
        fprintf(stderr, "curl_easy_init() failed\n");
        exit(1);
    }

    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://www.blogger.com/next-blog");

    /* send all data to this function */
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);

    /* we pass our 'chunk' struct to the callback function */
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *) &chunk);

    while (1)
    {
        chunk.memory = NULL;    /* we expect realloc(NULL, size) to work */
        chunk.size = 0;         /* no data at this point */

        res = curl_easy_perform(curl);
        if (res != CURLE_OK)
        {
            fprintf(stderr, "%s\n", curl_easy_strerror(res));
            exit(1);
        }

        if (chunk.memory)
        {
            /* WriteMemoryCallback keeps the buffer NUL-terminated,
               so it can be parsed as a string */
            htmlDocPtr doc =
                htmlParseDoc((unsigned char *) chunk.memory, NULL);
            xmlNode *root_element;

            if (doc == NULL)
            {
                fprintf(stderr, "error parsing HTML\n");
                exit(1);
            }

            root_element = xmlDocGetRootElement(doc);
            print_ahrefs(root_element);
            xmlFreeDoc(doc);    /* no htmlFreeDoc(), this seems to work to prevent memory leak however. */
            free(chunk.memory);
        }
    }

    /* not reached: the loop above only ends through exit() */
    curl_easy_cleanup(curl);
    curl_global_cleanup();

    exit(0);
}


The above code outputs random Blogger URLs line by line. Using a persistent HTTP connection, it repeatedly retrieves http://www.blogger.com/next-blog and extracts the redirection link. It requires libcurl and libxml2 to compile.

The next program is a simple shell script that parses the output of the program above and creates a CSV (comma-separated values) file with two columns: blogURL, AdsensePublisherID
If the blog does not have AdSense, the second column is left blank.



#!/bin/sh
#
# cat blogsURLlist.txt | ./getadpublishers.sh > blogads.csv


while read -r url ; do
    # Print only the digits of the first publisher ID found; empty if none
    pub=$(wget -qO- "$url" | grep google_ad_client | head -n 1 |
        sed -n 's:.*google_ad_client[ ]*=[ ]*"pub-\([0-9]*\)".*:\1:p')
    printf '"%s" , "%s"\n' "$url" "$pub"
done
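
The extraction step in the script can be tried offline before pointing it at live blogs. What follows is only a minimal sketch and not part of the script above: the function names extract_pub_id and sample_page are made up for illustration, and the sample markup is an assumption modelled on the google_ad_client pattern the script greps for.

```shell
#!/bin/sh
# Sketch: read HTML on stdin and print the digits of the first
# google_ad_client publisher ID, or nothing if the page has none.
extract_pub_id() {
    sed -n 's/.*google_ad_client[ ]*=[ ]*"pub-\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Hypothetical sample page (made-up publisher ID), standing in for wget output.
sample_page() {
    printf '%s\n' '<script type="text/javascript"><!--' \
        'google_ad_client = "pub-1234567890123456";' \
        'google_ad_width = 728;' \
        '//--></script>'
}

sample_page | extract_pub_id
```

Because sed -n with the p flag prints only lines where the substitution succeeded, a page without ads produces empty output, which matches the blank second column described above.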


Now, to put it all into action:

./getblognames | ./getadpublishers.sh > blogads.csv


This will run indefinitely as there are millions of blogs, and it does not check for duplicates!
It may be a wise idea to run the getblognames program independently to generate lists of the blog URLs beforehand, but I will simply pipe it to the shell script here.
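
One way to deal with the duplicates is to filter the URL stream before it reaches the second script. This is a sketch under assumptions, not something I have run against the full stream: the function name dedupe_urls and the sample URLs are made up, and the filter keeps every URL it has seen in memory, so it grows as the run goes on.

```shell
#!/bin/sh
# Sketch: pass each distinct line through once, in first-seen order.
# awk prints a line only when its counter is still zero, then increments it,
# so this works on a stream without sorting the input first.
dedupe_urls() {
    awk '!seen[$0]++'
}

# Made-up sample input with one repeated URL.
printf '%s\n' \
    'http://one.example/' \
    'http://two.example/' \
    'http://one.example/' | dedupe_urls
```

Inside a wrapper script that defines the function, it would slot into the pipeline as ./getblognames | dedupe_urls | ./getadpublishers.sh > blogads.csv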

Now it will be trivial to determine which Google Adsense publisher IDs are shared between supposedly independent blogs!
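
As a rough sketch of that grouping step, the function below reads the CSV and lists every publisher ID that shows up on more than one blog. The name shared_publishers and the sample rows are made up for illustration, and it assumes that neither URLs nor IDs contain a double quote, since it splits each line on that character.

```shell
#!/bin/sh
# Sketch: group blog URLs by publisher ID and keep IDs used by 2+ blogs.
# Input lines look like: "blogURL" , "publisherID"  (ID may be empty).
# Splitting on the double quote puts the URL in $2 and the ID in $4.
shared_publishers() {
    awk -F'"' '
        $4 != "" && !seen[$4 SUBSEP $2]++ {
            count[$4]++                       # distinct blogs per ID
            blogs[$4] = blogs[$4] " " $2      # URLs in first-seen order
        }
        END {
            for (id in count)
                if (count[id] > 1)
                    print id ":" blogs[id]
        }
    ' | sort
}

# Made-up sample rows: two blogs share ID 111, one has no ads.
printf '%s\n' \
    '"http://one.example/" , "111"' \
    '"http://two.example/" , ""' \
    '"http://three.example/" , "111"' \
    '"http://four.example/" , "222"' | shared_publishers
```

Each output line is a publisher ID followed by the blogs that carry it. Rows with an empty second column are skipped, and a URL that appears twice with the same ID is counted once.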

If Google wants to protect the identity of its publishers, perhaps it should assign multiple publisher IDs to each publisher, so that they can put a unique one on each page!

I will post the results when I have some. Hopefully there will be lulz to be had!