[Search-l] Grub + Wiki = Open Crawler
jer
jeremie at jabber.org
Mon Aug 13 19:53:22 UTC 2007
So it's about time I started a thread about what my vision of Grub's
future is. This isn't a decree or a detailed plan, simply the start
of a discussion, and I welcome input and feedback from all.
Grub doesn't have a specific purpose as implemented; it's more or
less just a generic "distributed crawler" function that updates a
bunch of stats. Then what? Well, before we even talk about
indexing, I don't think we've really thought enough about the
crawling. If we're building an open platform, that openness needs to
be thorough, and start from the very beginning as a solid foundation.
I want to revamp the PHP web admin interface that is grub.org, and
mind-meld it with a wiki, where that wiki actually admins the
functions of Grub and all feedback is published back through it.
Most importantly, this includes the ability for anyone to grab the
cached copies of the crawled content either per-page or in bulk,
under an open content license. All of the collective knowledge on
the wiki is also published as open content for anyone else that wants
to use it on their own crawler.
I'm not talking about having a wiki page for every single URL or
every domain (that doesn't scale well), but the option to have one
_if_needed_ is important. The wiki would host a special "Site" page
that anyone can create, and that uses a regex to match single or any
grouping of URLs. These Site pages can host human-guided hints to a
crawler about that set of URLs. I don't have a fixed and all-
encompassing list of what these hints are or what attributes are
valuable, I only have a list of suggestions based on my experience,
and I expect this to change a lot as we go :)
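As a rough illustration of the "Site" page idea above, here is a minimal
Python sketch of a page that carries a regex plus a bag of hints, and a
lookup that merges the hints of every page matching a URL. The class name
`SitePage`, the function `hints_for`, the hint keys, and the example URLs
are all assumptions made up for this sketch; nothing here is an existing
Grub or wiki API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SitePage:
    """Hypothetical model of a wiki "Site" page: one regex, many hints."""
    title: str
    url_pattern: str                       # regex matching one URL or a group
    hints: dict = field(default_factory=dict)

    def matches(self, url: str) -> bool:
        return re.search(self.url_pattern, url) is not None


def hints_for(url, pages):
    """Merge hints from every matching page; later pages override earlier ones."""
    merged = {}
    for page in pages:
        if page.matches(url):
            merged.update(page.hints)
    return merged


# Hypothetical pages: a broad one for a blog, a narrower one for a spider trap.
pages = [
    SitePage("Example blog", r"^https?://blog\.example\.org/",
             {"recheck_hours": 24}),
    SitePage("Example blog calendar", r"^https?://blog\.example\.org/calendar/",
             {"blocked": True}),
]
```

A crawler would call `hints_for(url, pages)` before fetching. Letting later
pages override earlier ones is only one possible way to resolve overlapping
regexes; the wiki community could just as well pick another rule.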
Some quick ideas on what a wiki "Site" page would enable humans to
contribute:
- block domains, IPs, paths
- spider traps, accidental or intentional
- identify throwaway parts of URLs (like query strings with session
identifiers)
- identical content that might be obfuscated from auto-detection
- override hostname aliasing to the right hostname
- suggest update frequencies or best times to check
- discoverability (are new URLs trustworthy)
- URL models (blog indexes vs. permalinks, mailing list and board
styles)
- override titles (suggest where to get the titles in the content)
- influence-ability (open wikis and comment areas that get spammy)
- a robots and sitemap definition (if the domain is missing them)
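Two of the hints in the list above (throwaway URL parts and hostname
aliasing) are easy to picture in code. The following Python sketch applies
them to a URL before it is queued; the parameter names in
`THROWAWAY_PARAMS`, the alias table, and the function name `normalize` are
hypothetical examples, not an agreed-on hint format.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical hints a Site page might carry; names and values are assumptions.
THROWAWAY_PARAMS = {"sid", "phpsessid", "sessionid"}
HOST_ALIASES = {"example.org": "www.example.org"}


def normalize(url, throwaway=THROWAWAY_PARAMS, aliases=HOST_ALIASES):
    """Drop session-style query parameters and map host aliases to one name.

    This sketch also discards the fragment and any user:password part.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = aliases.get(host, host)         # human-supplied alias override
    if parts.port:
        host = f"{host}:{parts.port}"
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in throwaway]
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))
```

With these assumed hints, `normalize("http://example.org/post?id=7&PHPSESSID=abc123")`
gives `http://www.example.org/post?id=7`, so two crawls of the same page
under different session identifiers end up as one entry.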
I would eventually like to see it advance even further and provide
some mechanisms for providing input to understanding the content as
well, finding dates and common patterns in the URLs, hints about what
parts of the content are dynamic or static, etc.
I don't expect this to be the *authority* on any of these points, but
instead be the human override, the guide, for when automated systems
fail or make mistakes.
It will be treated like any other wiki, with full transparency.
Anyone can create a Site page and suggest all kinds of hints, and all
their actions are visible. Even Grub will get tied directly into
this, so that the content a user crawls is verifiable by others.
What I *hope* to end up with is a purely socially driven crawler
feeding back into a quality repository that is free and open for
anyone to build on. There should also be local hosting for anyone
experimenting with open source projects that need direct/fast access
to this compiled repository.
I can think of a couple of immediate questions people might ask:
* What about the riff-raff that want to do bad things with the
repository? I have a fundamental belief that the good of making a
quality foundation available outweighs the bad. Also, everyone using
it will be transparent; if the community feels someone is doing
something inappropriate, it will be treated like any other wiki page,
by discussion and consensus.
* How does it get new URLs? As part of the wiki, anyone can upload
new lists of URLs, and anyone else can see the lists they
uploaded. If there are spare crawling resources, the new URLs would be
checked.
* What if people are spamming and being ignored, won't that pollute
the repository? A simple social network function either needs to be
created or derived to curb this, whereby individuals that aren't
connected to or trusted by others have less priority. Any derived
reputation network is included in the outputs so that any later
function can decide which parts are more valuable.
* This isn't a search engine! Nope. It's just a step, one that
should help anyone wanting to build a search engine or experiment in
this space.
* Why isn't it working yet, you've had months now? My cloning
machine is br0k3n.
* But Wikia is commercial and owns all of this! Nope, Wikia is
sponsoring its development, but it's all open source and everything
published under an open license, nobody owns it and everybody owns
it :) Wikia won't be the only one sponsoring it either.
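To make the spam question above a bit more concrete, here is a very small
Python sketch of one way a "simple social network function" could lower
the priority of unconnected uploaders. Counting how many distinct other
users vouch for someone is an assumption chosen for brevity, not a
proposed design, and the names `trust_scores` and `prioritize` are
hypothetical.

```python
from collections import defaultdict


def trust_scores(edges):
    """edges: iterable of (truster, trusted) pairs.

    Score = number of distinct *other* users vouching for a user.
    Self-trust is ignored.
    """
    vouchers = defaultdict(set)
    for truster, trusted in edges:
        if truster != trusted:
            vouchers[trusted].add(truster)
    return {user: len(who) for user, who in vouchers.items()}


def prioritize(uploads, edges):
    """uploads: list of (user, url). Most-vouched-for uploaders come first.

    The sort is stable, so uploads with equal scores keep their order.
    """
    scores = trust_scores(edges)
    return sorted(uploads, key=lambda item: -scores.get(item[0], 0))
```

Nothing is thrown away here: uploads from unknown users simply sink to the
bottom of the queue, and the scores could be published alongside the
crawled data so later consumers can apply their own cut-off.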
Jer