[Search-l] Grub + Wiki = Open Crawler
Seth Ford
seth.ford at gmail.com
Mon Aug 13 20:55:02 UTC 2007
A mixed model of a meritocracy built on a strong paid editorial base,
monitoring and guiding the community? I have no problem with Wikia making
money; personally I think AdSense in Wikipedia would be a productive
progression as well (hey, if PBS can do it...). Personally I think it is
about modeling what the Apache Server group has going. Maybe it takes a
little money to keep a core of well-qualified editors around, but that isn't
mutually exclusive with being open as well.
On 8/13/07, jer <jeremie at jabber.org> wrote:
>
> So it's about time I start a thread about what my vision of Grub's
> future is. This isn't a decree nor a detailed plan, simply the start
> of a discussion, and I welcome input and feedback from all.
>
> Grub doesn't have a specific purpose as implemented, it's more or
> less just a generic "distributed crawler" function that updated a
> bunch of stats. Then what? Well, before we even talk about
> indexing, I don't think we've really thought enough about the
> crawling. If we're building an open platform, that openness needs to
> be thorough, and start from the very beginning as a solid foundation.
>
> I want to re-vamp the php web admin interface that is grub.org, and
> mind-meld it with a wiki, where that wiki actually admins the
> functions of Grub and all feedback is published back through it.
> Most importantly, this includes the ability for anyone to grab the
> cached copies of the crawled content either per-page or in bulk,
> under an open content license. All of the collective knowledge on
> the wiki is also published as open content for anyone else that wants
> to use it on their own crawler.
>
> I'm not talking about having a wiki page for every single URL or
> every domain (that doesn't scale well), but the option to have one
> _if_needed_ is important. The wiki would host a special "Site" page
> that anyone can create, and that uses a regex to match single or any
> grouping of URLs. These Site pages can host human-guided hints to a
> crawler about that set of URLs. I don't have a fixed and all-
> encompassing list of what these hints are or what attributes are
> valuable, I only have a list of suggestions based on my experience,
> and I expect this to change a lot as we go :)
>
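To make the "Site" page idea concrete, here is a minimal Python sketch of a page that carries a regex and a bag of hints. Everything in it is an assumption for illustration: the `SitePage` class, the `hints_for` helper, the hint names (`recheck_hours`, `blocked`) and the example URLs are all hypothetical, and nothing here is taken from Grub's actual code.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SitePage:
    """Hypothetical model of a wiki "Site" page: a URL regex plus hints."""
    title: str
    pattern: str                      # regex matching one URL or a group of URLs
    hints: dict = field(default_factory=dict)

    def matches(self, url: str) -> bool:
        return re.search(self.pattern, url) is not None


def hints_for(url, pages):
    """Merge hints from every matching page; later pages override earlier ones."""
    merged = {}
    for page in pages:
        if page.matches(url):
            merged.update(page.hints)
    return merged


# Hypothetical pages: a broad one for a whole blog, a narrower one for its archive.
pages = [
    SitePage("Example blog", r"^https?://blog\.example\.org/",
             {"recheck_hours": 24}),
    SitePage("Example blog archive", r"^https?://blog\.example\.org/archive/",
             {"recheck_hours": 720, "blocked": False}),
]
```

Calling `hints_for("http://blog.example.org/archive/2007/08/", pages)` merges both pages, with the narrower archive page listed last so its values win. How conflicts between overlapping Site pages should really be resolved is an open design question; list order is just the simplest choice for a sketch.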
> Some quick ideas on what a wiki "Site" page would enable humans to
> contribute:
> - block domains, IPs, paths
> - spider traps, accidental or intentional
> - identify throwaway parts of URLs (like query strings with
> session identifiers)
> - identical content that might be obfuscated for auto-detection
> - override hostname aliasing to the right hostname
> - suggest update frequencies or best times to check
> - discoverability (are new urls trustworthy)
> - url models (blog indexes vs. permalinks, mailing list and board
> styles)
> - override titles (suggest where to get the titles in the content)
> - influence-ability (open wikis and comment areas that get spammy)
> - a robots and sitemap definition (if the domain is missing them)
>
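Two of the hints in the list above (throwaway query parameters and hostname alias overrides) could be applied by a crawler roughly like this. It is only a sketch using Python's standard `urllib.parse`; the `normalize_url` function, its parameters, and the example hostnames and parameter names are hypothetical, not part of Grub.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url, throwaway_params=(), host_aliases=None):
    """Apply human-supplied hints to a URL before it is queued or stored.

    throwaway_params: query parameter names to drop (e.g. session IDs).
    host_aliases: mapping from an alias hostname to the preferred hostname.
    The fragment is always dropped, since it is never sent to the server.
    """
    host_aliases = host_aliases or {}
    parts = urlsplit(url)
    host = parts.hostname or ""            # hostname is already lowercased
    host = host_aliases.get(host, host)    # override alias with preferred host
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in throwaway_params]
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))


# Hypothetical hints that a Site page for example.org might carry.
clean = normalize_url(
    "http://www.example.org/list?id=7&PHPSESSID=abc123#top",
    throwaway_params={"PHPSESSID"},
    host_aliases={"www.example.org": "example.org"},
)
```

Here `clean` ends up as `http://example.org/list?id=7`, so two visits with different session identifiers collapse to one entry in the repository. Note that this sketch discards any user:password part of the URL, which seems reasonable for a public crawler but is a choice, not a requirement.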
> I would eventually like to see it advance even further and provide
> some mechanisms for providing input to understanding the content as
> well, finding dates and common patterns in the URLs, hints about what
> parts of the content are dynamic or static, etc.
>
> I don't expect this to be the *authority* on any of these points, but
> instead be the human override, the guide, for when automated systems
> fail or make mistakes.
>
> It will be treated like any other wiki, with full transparency.
> Anyone can create a Site page and suggest all kinds of hints, and all
> their actions are visible. Even Grub will get tied directly into
> this, so that the content a user crawls is verifiable by others.
>
> What I *hope* to end up with is a purely socially driven crawler
> feeding back into a quality repository that is free and open for
> anyone to build on. There should also be local hosting for anyone
> experimenting with open source projects that need direct/fast access
> to this compiled repository.
>
> I can think of a couple immediate questions people might ask:
> * What about the riff-raff that want to do bad things with the
> repository? I have a fundamental belief that the good of making a
> quality foundation available outweighs the bad. Also, everyone using
> it will be transparent; if the community feels someone is doing
> something inappropriate, it will be treated like any other wiki page,
> by discussion and consensus.
> * How does it get new URLs? As part of the wiki, anyone can upload
> new lists of URLs, and anyone else can see the lists that they
> uploaded. If there are spare crawling resources, the new URLs would be
> checked.
> * What if people are spamming and being ignored, won't that pollute
> the repository? A simple social network function either needs to be
> created or derived to curb this, whereby individuals that aren't
> connected or trusted by others have less priority. Any derived
> reputation network is included in the outputs for any later
> function to decide which parts are more valuable.
> * This isn't a search engine! Nope. It's just a step, one that
> should help anyone wanting to build a search engine or experiment in
> this space.
> * Why isn't it working yet, you've had months now? My cloning
> machine is br0k3n.
> * But Wikia is commercial and owns all of this! Nope, Wikia is
> sponsoring its development, but it's all open source and everything
> is published under an open license; nobody owns it and everybody owns
> it :) Wikia won't be the only one sponsoring it either.
>
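The answers about uploaded URL lists and the "simple social network function" could fit together along these lines. This is one possible reading, sketched in Python under loud assumptions: trust is modeled as hops from a set of seed users through an endorsement graph, halving at each hop, and uploads are crawled in order of the uploader's score. The function names, the decay value, and the sample users are all hypothetical; the original post does not specify any particular algorithm.

```python
import heapq
from collections import deque


def trust_scores(endorsements, seeds, decay=0.5):
    """Breadth-first walk from seed users; score = decay ** hops from a seed.

    endorsements: mapping user -> list of users they vouch for.
    Users never reached from a seed simply have no entry (treated as 0).
    """
    scores = {seed: 1.0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for other in endorsements.get(user, ()):
            if other not in scores:
                scores[other] = scores[user] * decay
                queue.append(other)
    return scores


def crawl_order(uploads, scores):
    """uploads: list of (user, url). Most trusted uploader's URLs come first;
    ties keep upload order. Untrusted uploads are kept, just checked last."""
    heap = []
    for index, (user, url) in enumerate(uploads):
        heapq.heappush(heap, (-scores.get(user, 0.0), index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical data: alice is a seed, vouches for bob, who vouches for carol.
scores = trust_scores({"alice": ["bob"], "bob": ["carol"]}, seeds=["alice"])
order = crawl_order(
    [("mallory", "http://spam.example/"),
     ("carol", "http://c.example/"),
     ("alice", "http://a.example/")],
    scores,
)
```

With this data the unconnected uploader's URL lands at the back of the queue rather than being rejected, which matches the "less priority" wording and the idea of checking new URLs when there are spare crawling resources. The `scores` mapping is also the kind of derived reputation data that could be published alongside the crawl output for later functions to use.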
> Jer