Friday, March 5, 2010

Hosting html files on Google Code SVN

By default, the file is served with HTTP header Content-type: text/plain

To change it to text/html so that the client's web browser renders it,

1. Checkout your SVN repo.
2. svn propset svn:mime-type 'text/html' somehtmlfile.html
3. Commit


PNG image files (and other binary files in SVN) are served with Content-type: application/octet-stream .
So for a PNG image replace step 2 above with:
svn propset svn:mime-type 'image/png' file.png

Tuesday, January 26, 2010

Shell script for changing the interpreter in other scripts

For changing the first line in an interpreted script, the hash-bang (#!) line.

To get the most up to date version, download here.

#!/bin/bash
# Copyright (c) 2010 Sean A.O. Harney
#
# set-interpreter.sh
#
# Usage: set-interpreter.sh <new interpreter> <script file>
#
# Examples: ./set-interpreter.sh /usr/local/bin/omf file.omf
# ./set-interpreter.sh "/bin/bash -r" myshellscript.sh
#
# In the second example above the first line of myshellscript.sh
# would become the following: #!/bin/bash -r
#
# Yes, there is probably a way to do this in sed with one line.
#
#
# To set the first line of all files ending with .omf to #!/bin/omf run the following cmd:
#
# find . -name "*\.omf" -exec ./set-interpreter.sh /bin/omf {} \; ;
#


if [ $# -ne 2 ] ; then
echo -e "Usage:\t$0 <new interpreter> <script file>";
exit 1 ;
fi

NEW_INTERP="$1" ;
SCRIPT_FILE="$2" ;

if [ ! -r "${SCRIPT_FILE}" ] ; then
echo "Cannot proceed, ${SCRIPT_FILE} either does not exist or is not readable." ;
exit 1 ;
fi

FIRST_LINE=`head -n 1 "${SCRIPT_FILE}"` ;

if [[ ! "${FIRST_LINE}" =~ \#!.* ]] ; then
echo "Cannot proceed, ${SCRIPT_FILE} is not a valid script. It does not start with hash-bang (#!)." ;
echo "${SCRIPT_FILE} has not been modified." ;
exit 1 ;
fi

NEW_FIRST_LINE="#!${NEW_INTERP}" ; # Add the pound-hash.

if [ "${FIRST_LINE}" == "${NEW_FIRST_LINE}" ] ; then
echo "Cannot proceed, ${SCRIPT_FILE} already contains the desired new first line." ;
echo "${SCRIPT_FILE} has not been modified." ;
exit 1 ;
fi


TMP_FILE=`mktemp` ;
if [ $? -ne 0 ] ; then
echo "Cannot proceed, the mktemp program has failed." ;
echo "${SCRIPT_FILE} has not been modified." ;
exit 1 ;
fi


echo "${NEW_FIRST_LINE}" > "${TMP_FILE}" ;
tail --lines=+2 "${SCRIPT_FILE}" >> "${TMP_FILE}" ; # everything but the first line.

if [[ `cat "${TMP_FILE}" | wc -l` -ne `cat "${SCRIPT_FILE}" | wc -l` ]] ; then
echo "Cannot proceed, ${TMP_FILE} does not have the same line-count as ${SCRIPT_FILE} . Something went wrong." ;
echo "${SCRIPT_FILE} has not been modified." ;
exit 1 ;
fi

BAK_FILE="${SCRIPT_FILE}.bak" ;
while [ -e "${BAK_FILE}" ]
do
BAK_FILE="${BAK_FILE}.bak" ; # file.bak , file.bak.bak etc.
done

cp -b "${SCRIPT_FILE}" "${BAK_FILE}" ;
echo "Backed up ${SCRIPT_FILE} to ${BAK_FILE} . Preserved timestamps." ;

cat "${TMP_FILE}" > "${SCRIPT_FILE}" ; # better than mv since the $SCRIPT_FILE permissions et al will be preserved.
rm -f "${TMP_FILE}" ;
echo "Changed ${FIRST_LINE} to ${NEW_FIRST_LINE}" ;
echo "${SCRIPT_FILE} has been successfully modified." ;
exit 0 ;

Saturday, January 23, 2010

FindNewFiles, a Java class to poll a directory for new files

To get the most up to date version, download here.

Example usage:

...

FindNewFiles fnf;
ArrayList<java.util.File> newFiles;

try {
fnf = new FindNewFiles("/tmp"); /* or windows path */
} catch(Exception e) {
System.out.println("Exception: " + e.getMessage());
return;
}

while(true) {
/* get files which were not in the directory the previous time method was invoked */
newFiles = fnf.findNewFiles();
if(newFiles != null)
{
/* then found new files */
/* do something */
}

/* do something, perhaps sleep */
}

...

Thursday, January 21, 2010

Linux shell script to print interface's IP Address

Prints out the IP Address of the given network interface followed by a trailing newline.

To install, copy and paste to a new file named
/usr/local/sbin/ipaddr
and execute the following command:
chmod 755 /usr/local/sbin/ipaddr

Usage: ipaddr <interface>
e.g. ipaddr eth0


#!/bin/sh
# Copyright (c) 2010 Sean A.O. Harney
# This script is licensed under GNU GPL version 2.0 or above

if [ $# -ne 1 ] ; then
echo -e "Usage:\t$0 <interface>" ;
exit 1 ;
fi

/sbin/ifconfig $1 | \
grep 'inet addr:' | \
awk '{ split($2, ar, ":") ; print ar[2] }' ;

exit 0 ;

Friday, October 2, 2009

Powers of Binomials in C

Expanding (a + b)^n yields the summation of the 2^n terms. This program will show you the expanded formula. The example below shows (1.6 + 0.4)^3 expanded and solved:

$ ./binomial 1.6 0.4 3
(1.600000 + 0.400000)^3
= (1.600000 + 0.400000)(1.600000 + 0.400000)(1.600000 + 0.400000)
= (1.600000)(1.600000)(1.600000) + (1.600000)(1.600000)(0.400000) + (1.600000)(0.400000)(1.600000) + (1.600000)(0.400000)(0.400000) + (0.400000)(1.600000)(1.600000) + (0.400000)(1.600000)(0.400000) + (0.400000)(0.400000)(1.600000) + (0.400000)(0.400000)(0.400000)
= 8.000000



Rather than use the FOIL method to compute the 2^20 terms of the expansion, it makes use of the fact that all of the terms consist of of either 4 or 2; a binary dichotomy. Instead an algorithm to convert a the value of a variable of type long to a binary representation is used.


$ time ./binomial 4.0 2.0 20 | tail -n 1
= 3656158440062976.000000

real 0m17.482s
user 0m16.650s
sys 0m0.830s


It may be slow, but at least accurately computed (4 + 2)^20 properly using a very similar algorithm to that of a person doing algebra with pencil and paper who had never heard of the Binomial Theorem.
For solving Powers of Binomials problems, using Binomial Theorem instead would have been an approach which used less running time, of course. Or the following:

double result = pow(a + b, exp);


Here is my binomial.c file:


/*
* Copyright (c) 2009, Sean A.O. Harney
*
* Calculate (a+b)^exp
* TODO: Breaks with exponents larger than 30, check for arch specific limit when parsing cmdline args
* TODO: Change it to use C99 fixed width integer types instead of long.
*/

#include <stdio.h>
#include <stdlib.h>

double binomial_expansion(double a, double b, int exp);
double calcterm(unsigned long x, int width, double a, double b);

int main(int argc, char *argv[])
{
double a, b;
int exp; //positive.
char *endptr;

if (argc < 4)
{
fprintf(stderr, "Usage:\t%s a b exp\n", argv[0]);
exit(1);
}

a = strtod(argv[1], &endptr); // C89
if (endptr == NULL)
{
perror("strtod()");
exit(1);
}

b = strtod(argv[2], &endptr);
if (endptr == NULL)
{
perror("strtod()");
exit(1);
}

exp = strtol(argv[3], &endptr, 10); // atoi() does not error check
if (endptr == NULL)
{
perror("strtol()");
exit(1);
}
if (exp <= 0)
{
fprintf(stderr, "exp must be positive integer.\n");
exit(1);
}

binomial_expansion(a, b, exp);
exit(0);
}


// return value as computed
double binomial_expansion(double a, double b, int exp)
{
long num_terms; // will be 2^exp terms
long i;
double total = 0;

printf("(%f + %f)^%d\n\t= ", a, b, exp);

for (i = 0, num_terms = 2; i < exp; i++)
{
if (i < exp - 1)
num_terms *= 2; //raise num_terms by one power for all iteration but last
printf("(%f + %f)", a, b);
}


printf("\n\t= ");
for (i = 0; i < num_terms; i++)
{
total += calcterm(i, exp, a, b);
if (i < num_terms - 1)
printf(" + ");
}

printf("\n\t= %f\n", total);

return total;
}


// modified from a convert int to binary string representation function
double calcterm(unsigned long x, int width, double a, double b)
{
double s[64]; // max exp size
int i = width - 1;
double total = 0.0;

do
{
s[i--] = (x & 1) ? b : a; // b for 1, a for 0
x >>= 1;
}
while (x > 0);
while (i >= 0)
s[i--] = a; // fill with a (0), all must be fixed width

for (i = 0; i < width; i++)
{
if (i == 0)
total = s[0];
else
total *= s[i];
printf("(%f)", s[i]);
}

return total;
}

Tuesday, July 7, 2009

Grouping blogger.com blogs based on the adsense account. Does Google do enough to protect its user's privacy?

Hello, as you may be aware, Google's popular blogger.com/blogspot.com blogging site allows you to easily monetize your blog by serving Adsense advertisements on it.

While a blogger has the option of not displaying their profile on the page, it is still possible to group who owns which blogs because of the Adsense publisher id!


/*
* 7/7/2009
*
* gcc -lcurl -I/usr/include/libxml2 -lxml2 -o getblognames getblognames.c
*
* Generates a list of blogspot.com/blogger.com blog URLs
* by accessing http://www.blogger.com/next-blog to get a random blog redirect.
*
* Based on http://cool.haxx.se/cvs.cgi/curl/docs/examples/getinmemory.c
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <curl/curl.h>
#include <curl/types.h>
#include <curl/easy.h>

#include <libxml/HTMLparser.h>





struct MemoryStruct {
char *memory;
size_t size;
};

static void *myrealloc(void *ptr, size_t size);

static void *myrealloc(void *ptr, size_t size)
{
/* There might be a realloc() out there that doesn't like reallocing
NULL pointers, so we take care of it here */
if (ptr)
return realloc(ptr, size);
else
return malloc(size);
}

static size_t
WriteMemoryCallback(void *ptr, size_t size, size_t nmemb, void *data)
{
size_t realsize = size * nmemb;
struct MemoryStruct *mem = (struct MemoryStruct *) data;

mem->memory = myrealloc(mem->memory, mem->size + realsize + 1);
if (mem->memory)
{
memcpy(&(mem->memory[mem->size]), ptr, realsize);
mem->size += realsize;
mem->memory[mem->size] = 0;
}
return realsize;
}

void print_ahrefs(xmlNode * a_node)
{
xmlNode *cur_node = NULL;

for (cur_node = a_node; cur_node; cur_node = cur_node->next)
{
if (cur_node->type == XML_ELEMENT_NODE)
{
if (!strcasecmp((char *) cur_node->name, "a"))
{
unsigned char *url =
xmlGetProp(cur_node, (xmlChar *) "href");
printf("%s\n", url);
fflush(stdout);
xmlFree(url);
}
}

print_ahrefs(cur_node->children);
}
}

int main(int argc, char **argv)
{
CURL *curl;
CURLcode res;

struct MemoryStruct chunk;

curl_global_init(CURL_GLOBAL_ALL);

curl = curl_easy_init();

curl_easy_setopt(curl, CURLOPT_URL,
"http://www.blogger.com/next-blog");

/* send all data to this function */
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);

/* we pass our 'chunk' struct to the callback function */
curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *) &chunk);

while (1)
{
chunk.memory = NULL; /* we expect realloc(NULL, size) to work */
chunk.size = 0; /* no data at this point */

res = curl_easy_perform(curl);
if (res != 0)
{
fprintf(stderr, "%s\n", curl_easy_strerror(res));
exit(1);
}

if (chunk.memory)
{
//assumes it is a string
//printf("%s\n", chunk.memory);
htmlDocPtr doc =
htmlParseDoc((unsigned char *) chunk.memory, NULL);
xmlNode *root_element;

if (doc == NULL)
{
fprintf(stderr, "error parsing HTML\n");
exit(1);
}

root_element = xmlDocGetRootElement(doc);
print_ahrefs(root_element);
xmlFreeDoc(doc); //no htmlFreeDoc(), this seems to work to prevent memory leak however.
free(chunk.memory);
}

}
curl_easy_cleanup(curl);
curl_global_cleanup();

exit(0);
}


The above code will output random blogger URLs line by line. Using a persistent HTTP connection, it will continue to retreive http://www.blogger.com/next-blog and extract out the redirection link. It requires libcurl and libxml2 to compile.

The next program is a simple shell script which will parse the output of the above program and in turn create a CSV (comma seperated variable) file consisting of blogURL, AdsensePublisherID
If the blog does not have adsense, the second column will be left blank.



#!/bin/sh
#
# cat blogsURLlist.txt | ./getadpublishers.sh > blogads.csv


while read url ; do
echo "\"$url\" , \""`wget -qO- "$url" |grep google_ad_client| \
head -n 1| sed "s:google_ad_client[ ]*=[ ]*\"pub-\([0-9]*\)\";:\1:"`'"'
done


Now, to put it all into action:

./getblognames | ./getadpublishers.sh > blogs.xml


This will run indefinitely as there are millions of blogs, and it does not check for duplicates!
It may be a wise idea to run the getblognames program independently to generate lists of the blog URLs beforehand, but I will simply pipe it to the shell script here.

Now it will be trivial to determine which Google Adsense publisher IDs are shared between supposedly independent blogs!

If Google wants to protect the identity of their Advertisers perhaps they should assign multiple publisher IDs to each publisher, so that they can put a unique one on each page!

I will post the results when I have some, hopefully there will be lulz to be had!