[Search-l] Coop - Custom search engine

Jeremie Miller jeremie at jabber.org
Sat Sep 27 22:15:21 UTC 2008


> Could you provide an example? Looks like it will be great, but I'm not
> chasing the scope.
> Would the script provide select a subset for a given query, or be for a
> wide subset?
> Would it score the matches?

Dennis just ran some test scripts for me yesterday with great success!  
Heh ;)

The scripts were very simple, map.pl:
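	# Emit "host/path<TAB>1" for every src=http://... found in the HTML on STDIN.
	while(<STDIN>){
	        # Strip each match so the loop advances; $1 is the URL after http://
	        while(s/src[ =]+["']*http:\/\/([^\s"'>]+)//i){
	                print "$1\t1\n";
	        }
	}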

and reduce.pl:
	# Sum the counts emitted by map.pl for each key (URL).
	my %keyz;
	while(<STDIN>){
	        chomp;
	        my($key,$cnt) = split;
	        $keyz{$key} += $cnt;
	}
	foreach my $key (keys %keyz){
	        print "$key\t$keyz{$key}\n";
	}

The map stage feeds the script the raw HTML on STDIN, and the script
uses a regex to find any src=URL. Its output is collected and then fed
into reduce, which sorts based on the key (URL) and aggregates a total
count.
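
To try the same idea without a Nutch setup, here is a rough sketch that
runs both steps in one Perl process. The sub names (map_html,
reduce_pairs) and the sample HTML are made up for illustration; this is
not how Nutch itself wires the stages together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map step: return one [key, count] pair per src=http://... in the HTML.
sub map_html {
    my ($html) = @_;
    my @pairs;
    while ($html =~ /src[ =]+["']*http:\/\/([^\s"'>]+)/gi) {
        push @pairs, [$1, 1];
    }
    return @pairs;
}

# Reduce step: sum the counts for each key.
sub reduce_pairs {
    my (@pairs) = @_;
    my %total;
    $total{ $_->[0] } += $_->[1] for @pairs;
    return %total;
}

# Sample input, invented for this example.
my $html = <<'HTML';
<img src="http://example.com/a.png">
<script src='http://example.com/app.js'></script>
<img src="http://example.com/a.png">
HTML

my %counts = reduce_pairs(map_html($html));
print "$_\t$counts{$_}\n" for sort keys %counts;
```

With that sample input it should print example.com/a.png with a count
of 2 and example.com/app.js with a count of 1, tab-separated.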

My example doesn't do anything useful really, but it shows just how  
easily anyone can now provide scripts to plug into the processing  
stages of Nutch and do interesting things.

Dennis is also creating a format that either of these can output that  
will look something like:
URL \t fieldname \t fieldvalue \t flags \n

The fields can have any string value, can be stored (so they'll be  
returned along with that url on any result in the JSON) and/or indexed  
(so they can be required in any query like fieldname:foo).
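
As a sketch only, a script could emit lines in that layout like this.
The format is still being worked out, so the flag spellings below
("stored", "indexed") and the helper name field_line are my own
placeholders, and collapsing tabs inside values is an assumption to
keep the columns intact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one record: URL \t fieldname \t fieldvalue \t flags \n
# Tabs and newlines inside a value would break the layout, so they are
# collapsed to a single space (an assumption, not part of the format).
sub field_line {
    my ($url, $name, $value, $flags) = @_;
    s/[\t\r\n]+/ /g for ($url, $name, $value, $flags);
    return join("\t", $url, $name, $value, $flags) . "\n";
}

# Hypothetical URLs, field names, and flag spellings.
print field_line('http://example.com/',      'category', 'news',       'stored,indexed');
print field_line('http://example.com/about', 'owner',    "Jane\tDoe",  'stored');
```

Under those assumptions the first line would let a query like
category:news match that URL.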

Some uses, such as having a DB of URLs that you just want to be able to  
filter on, could just build a file in the above format and wouldn't  
even need any map/reduce steps.

Jer



