[Search-l] Coop - Custom search engine
Jeremie Miller
jeremie at jabber.org
Sat Sep 27 22:15:21 UTC 2008
> Could you provide an example? Looks like it will be great, but I'm not
> chasing the scope.
> Would the script select a subset for a given query, or be for a wide
> subset?
> Would it score the matches?
Dennis just ran some test scripts for me yesterday with great success!
Heh ;)
The scripts were very simple, map.pl:
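# map.pl: print "URL<TAB>1" for each src=http://... in the HTML on STDIN
use strict;
use warnings;
while (my $line = <STDIN>) {
    # s/// removes each match, so the inner loop ends when none are left
    while ($line =~ s/src[ =]+["']*http:\/\/([^\s"'>]+)//i) {
        print "$1\t1\n";
    }
}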
and reduce.pl:
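# reduce.pl: sum the counts printed by map.pl for each key
use strict;
use warnings;
my %keyz;
while (my $line = <STDIN>) {
    chomp $line;    # chomp strips only the newline; chop removes any last character
    my ($key, $cnt) = split ' ', $line;
    next unless defined $cnt;    # skip blank or malformed lines
    $keyz{$key} += $cnt;
}
foreach my $key (keys %keyz) {
    print "$key\t$keyz{$key}\n";
}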
The map stage gives the script the raw HTML on STDIN and uses a regex to
find any src=URL. Its output is collected, then fed into reduce, which
sorts based on the key (URL) and aggregates a total count.
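To make the data flow concrete, here is a rough local simulation of the two stages. It is written in Python rather than Perl, on the assumption that any language that reads STDIN and writes STDOUT would do. It is only a sketch: the names map_stage and reduce_stage and the sample HTML are made up for illustration, and the sorted() call merely imitates the collection step that happens between the stages.

```python
import re
from collections import defaultdict

# Same idea as the map.pl regex: capture what follows src=http:// up to
# whitespace, a quote, or '>'.
SRC_RE = re.compile(r"""src[ =]+["']*http://([^\s"'>]+)""", re.IGNORECASE)

def map_stage(html_lines):
    """Yield one 'URL<TAB>1' line per src=URL found (stand-in for map.pl)."""
    for line in html_lines:
        for match in SRC_RE.finditer(line):
            yield f"{match.group(1)}\t1"

def reduce_stage(mapped_lines):
    """Sum the counts for each key (stand-in for reduce.pl)."""
    totals = defaultdict(int)
    for line in mapped_lines:
        key, cnt = line.rstrip("\n").split("\t")
        totals[key] += int(cnt)
    return dict(totals)

# Made-up sample input, not real crawl data.
sample_html = [
    '<img src="http://example.com/a.png"> <script src=http://example.com/x.js>',
    "<img SRC='http://example.com/a.png'>",
]

# Sorting here only imitates the collection step between the stages.
counts = reduce_stage(sorted(map_stage(sample_html)))
for url, total in sorted(counts.items()):
    print(f"{url}\t{total}")
```

Each URL appears once in the output with the number of times it was referenced in the sample input.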
My example doesn't do anything useful really, but it shows just how
easily anyone can now provide scripts to plug into the processing
stages of Nutch and do interesting things.
Dennis is also creating a format that either of these can output that
will look something like:
URL \t fieldname \t fieldvalue \t flags \n
The fields can have any string value, can be stored (so they'll be
returned along with that url on any result in the JSON) and/or indexed
(so they can be required in any query like fieldname:foo).
Some uses, such as having a DB of URLs that you just want to be able to
filter on, could just build the above file format and wouldn't even
need any map/reduce steps.
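As a rough illustration of that last case, the sketch below writes such a file straight from a small table of URLs, in Python. Several things here are assumptions and not part of the format as described: the helper name format_record, the field name "category", the sample URLs, and above all the spelling of the flags column (a comma-separated list of "stored" and "indexed"), since the exact flag encoding has not been given.

```python
import io

def format_record(url, fieldname, fieldvalue, stored=True, indexed=True):
    """Build one 'URL<TAB>fieldname<TAB>fieldvalue<TAB>flags' line.

    ASSUMPTION: flags are written as a comma-separated list of the words
    'stored' and 'indexed'; the real flag encoding is not specified.
    """
    for part in (url, fieldname, fieldvalue):
        if "\t" in part or "\n" in part:
            raise ValueError("tabs and newlines would break the line format")
    flags = ",".join(
        name for name, on in (("stored", stored), ("indexed", indexed)) if on
    )
    return f"{url}\t{fieldname}\t{fieldvalue}\t{flags}\n"

# Hypothetical URL table standing in for "a DB of URLs".
url_db = {
    "http://example.com/": "news",
    "http://example.org/docs": "reference",
}

out = io.StringIO()  # a real run would write to a file instead
for url, category in sorted(url_db.items()):
    # Indexed but not stored: usable as a filter, not returned in results.
    out.write(format_record(url, "category", category, stored=False, indexed=True))

print(out.getvalue(), end="")
```

With records like these, a query of the form category:news would presumably be able to filter on the field, in the same way as the fieldname:foo example above.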
Jer