[Search-l] Grub + Wiki = Open Crawler

jer jeremie at jabber.org
Mon Aug 13 19:53:22 UTC 2007


So it's about time I start a thread about what my vision of Grub's
future is.  This isn't a decree nor a detailed plan, simply the start
of a discussion, and I welcome input and feedback from all.

Grub doesn't have a specific purpose as implemented, it's more or
less just a generic "distributed crawler" function that updates a
bunch of stats.  Then what?  Well, before we even talk about
indexing, I don't think we've really thought enough about the  
crawling.  If we're building an open platform, that openness needs to  
be thorough, and start from the very beginning as a solid foundation.

I want to re-vamp the php web admin interface that is grub.org, and  
mind-meld it with a wiki, where that wiki actually admins the  
functions of Grub and all feedback is published back through it.   
Most importantly, this includes the ability for anyone to grab the  
cached copies of the crawled content either per-page or in bulk,  
under an open content license.  All of the collective knowledge on  
the wiki is also published as open content for anyone else that wants  
to use it on their own crawler.

I'm not talking about having a wiki page for every single URL or  
every domain (that doesn't scale well), but the option to have one  
_if_needed_ is important.  The wiki would host a special "Site" page
that anyone can create, and that uses a regex to match a single URL or
any grouping of URLs.  These Site pages can host human-guided hints to a
crawler about that set of URLs.  I don't have a fixed and all- 
encompassing list of what these hints are or what attributes are  
valuable, I only have a list of suggestions based on my experience,  
and I expect this to change a lot as we go :)
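
To make the "Site" page idea concrete, here is a minimal sketch of how a
crawler might look up hints for a URL.  Everything in it is an assumption
for illustration: the SitePage class, the hints_for function, the hint
names (recheck_hours, blocked) and the example.com patterns are
hypothetical, not part of Grub.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SitePage:
    """Hypothetical wiki "Site" page: a regex plus human-supplied hints."""
    title: str
    pattern: str                      # regex matched against the full URL
    hints: dict = field(default_factory=dict)

    def matches(self, url: str) -> bool:
        return re.search(self.pattern, url) is not None


def hints_for(url, pages):
    """Merge hints from every matching page; later pages override earlier ones."""
    merged = {}
    for page in pages:
        if page.matches(url):
            merged.update(page.hints)
    return merged


# Assumed example pages, ordered from general to specific.
PAGES = [
    SitePage("Example blog", r"^https?://blog\.example\.com/",
             {"recheck_hours": 24}),
    SitePage("Example blog archive", r"^https?://blog\.example\.com/archive/",
             {"recheck_hours": 720, "blocked": False}),
]

print(hints_for("http://blog.example.com/archive/2007/08", PAGES))
```

Ordering the pages from general to specific is just one possible way to
resolve overlapping patterns; the wiki could equally pick the longest
match or leave it to discussion.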

Some quick ideas on what a wiki "Site" page would enable humans to  
contribute:
	- block domains, IPs, paths
	- spider traps, accidental or intentional
	- identify throwaway parts of URLs (like query strings with session  
identifiers)
	- identical content that might be too obfuscated for auto-detection
	- override hostname aliasing to the right hostname
	- suggest update frequencies or best times to check
	- discoverability (are new urls trustworthy)
	- url models (blog indexes vs. permalinks, mailing list and board  
styles)
	- override titles (suggest where to get the titles in the content)
	- influence-ability (open wikis and comment areas that get spammy)
	- a robots and sitemap definition (if the domain is missing them)
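
Two of the hints above (throwaway query parameters and hostname aliasing)
could be applied roughly like this before a URL is queued.  This is only
a sketch under assumptions: the parameter names, the alias table and the
canonicalize function are made up for the example, and in practice they
would come from a Site page rather than being hard-coded.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed hints that a Site page might carry for one group of URLs.
THROWAWAY_PARAMS = {"sid", "phpsessid", "sessionid"}
HOST_ALIASES = {"example.com": "www.example.com"}


def canonicalize(url):
    """Drop throwaway query parameters and map alias hostnames.

    The fragment and any user:password part are dropped as well.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is already lowercased
    host = HOST_ALIASES.get(host, host)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in THROWAWAY_PARAMS]
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))


print(canonicalize("http://example.com/forum/view.php?t=42&PHPSESSID=abc123"))
```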

I would eventually like to see it advance even further and provide
some mechanisms for human input on understanding the content as
well: finding dates and common patterns in the URLs, hints about what
parts of the content are dynamic or static, etc.

I don't expect this to be the *authority* on any of these points, but  
instead be the human override, the guide, for when automated systems  
fail or make mistakes.

It will be treated like any other wiki, with full transparency.   
Anyone can create a Site page and suggest all kinds of hints, and all  
their actions are visible.  Even Grub will get tied directly into  
this, so that the content a user crawls is verifiable by others.

What I *hope* to end up with is a purely socially driven crawler  
feeding back into a quality repository that is free and open for  
anyone to build on.  There should also be local hosting for anyone  
experimenting with open source projects that need direct/fast access  
to this compiled repository.

I can think of a couple immediate questions people might ask:
	* What about the riff-raff that want to do bad things with the  
repository?  I have a fundamental belief that the good of making a  
quality foundation available outweighs the bad.  Also, everyone's use
of it will be transparent; if the community feels someone is doing
something inappropriate, it will be handled like any other wiki page,
by discussion and consensus.
	* How does it get new URLs?  As part of the wiki, anyone can upload  
new lists of URLs, and anyone else can see those lists that they  
uploaded.  If there are spare crawling resources, the new URLs would be
checked.
	* What if people are spamming and being ignored, won't that pollute  
the repository? A simple social network function either needs to be  
created or derived to curb this, whereby individuals that aren't  
connected to or trusted by others have less priority.  Any derived
reputation network is included in the outputs so that any later
function can decide which parts are more valuable.
	* This isn't a search engine!  Nope.  It's just a step, one that  
should help anyone wanting to build a search engine or experiment in  
this space.
	* Why isn't it working yet, you've had months now? My cloning  
machine is br0k3n.
	* But Wikia is commercial and owns all of this!  Nope, Wikia is  
sponsoring its development, but it's all open source and everything
is published under an open license, nobody owns it and everybody owns
it :)  Wikia won't be the only one sponsoring it either.
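
On the spam question above, one very simple way to give less priority to
uploaders nobody trusts is to count endorsements and sort the queue by
that count.  This is a sketch only: the trust_scores and prioritize
functions, the (truster, trusted) pair format and the names in the
example are assumptions, and a real reputation network would need
something more robust than a raw count.

```python
from collections import defaultdict


def trust_scores(endorsements):
    """Count how many distinct other users trust each user."""
    trusted_by = defaultdict(set)
    for truster, trusted in endorsements:
        if truster != trusted:           # self-endorsement does not count
            trusted_by[trusted].add(truster)
    return {user: len(trusters) for user, trusters in trusted_by.items()}


def prioritize(uploads, endorsements):
    """Order (uploader, url) pairs so better-trusted uploaders come first."""
    scores = trust_scores(endorsements)
    # sorted() is stable, so equal scores keep their upload order.
    return sorted(uploads, key=lambda upload: -scores.get(upload[0], 0))


# Assumed example data.
endorsements = [("alice", "bob"), ("carol", "bob"),
                ("bob", "alice"), ("mallory", "mallory")]
uploads = [("mallory", "http://spam.example/"),
           ("alice", "http://a.example/"),
           ("bob", "http://b.example/")]

for uploader, url in prioritize(uploads, endorsements):
    print(uploader, url)
```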

Jer





