[Search-l] Grub + Wiki = Open Crawler

Seth Ford seth.ford at gmail.com
Mon Aug 13 20:55:02 UTC 2007


A mixed model: a meritocracy built on a strong paid editorial base
monitoring and guiding the community?  I have no problem with Wikia making
money; personally I think AdSense in Wikipedia is a productive progression
as well (hey, if PBS can do it...). I think it is about modeling
what the Apache Server group has going... Maybe it takes a little money to
keep a core of well-qualified editors around, but that isn't mutually
exclusive with being open.

On 8/13/07, jer <jeremie at jabber.org> wrote:
>
> So it's about time I start a thread about what my vision of Grub's
> future is.  This isn't a decree or a detailed plan, simply the start
> of a discussion, and input and feedback from all are welcome.
>
> Grub doesn't have a specific purpose as implemented, it's more or
> less just a generic "distributed crawler" function that updated a
> bunch of stats.  Then what?  Well, before we even talk about
> indexing, I don't think we've really thought enough about the
> crawling.  If we're building an open platform, that openness needs to
> be thorough, and start from the very beginning as a solid foundation.
>
> I want to re-vamp the php web admin interface that is grub.org, and
> mind-meld it with a wiki, where that wiki actually admins the
> functions of Grub and all feedback is published back through it.
> Most importantly, this includes the ability for anyone to grab the
> cached copies of the crawled content either per-page or in bulk,
> under an open content license.  All of the collective knowledge on
> the wiki is also published as open content for anyone else that wants
> to use it on their own crawler.
>
> I'm not talking about having a wiki page for every single URL or
> every domain (that doesn't scale well), but the option to have one
> _if_needed_ is important.  The wiki would host a special "Site" page
> that anyone can create, and that uses a regex to match a single URL or any
> grouping of URLs.  These Site pages can host human-guided hints to a
> crawler about that set of URLs.  I don't have a fixed and all-
> encompassing list of what these hints are or what attributes are
> valuable, I only have a list of suggestions based on my experience,
> and I expect this to change a lot as we go :)
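
To make the "Site" page idea concrete, here is a minimal sketch, assuming a Site page is nothing more than a title, one regex, and a free-form dictionary of hints. The names SitePage, hints_for, and the hint keys are hypothetical and not part of Grub; the rule that later pages override earlier ones is an arbitrary choice for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SitePage:
    """Hypothetical wiki "Site" page: one regex plus human-entered hints."""
    title: str
    url_pattern: str
    hints: dict = field(default_factory=dict)

    def matches(self, url: str) -> bool:
        # re.search, so the page author decides whether to anchor with ^ or $.
        return re.search(self.url_pattern, url) is not None


def hints_for(url, pages):
    """Merge the hints of every Site page whose regex matches the URL.

    Pages are applied in order, so a later (more specific) page overrides
    an earlier one. That ordering rule is an assumption of this sketch.
    """
    merged = {}
    for page in pages:
        if page.matches(url):
            merged.update(page.hints)
    return merged


pages = [
    SitePage("Example blog", r"^https?://blog\.example\.org/",
             {"recheck_hours": 24}),
    SitePage("Example blog archive", r"^https?://blog\.example\.org/archive/",
             {"recheck_hours": 720}),
]

print(hints_for("http://blog.example.org/archive/2007/08/", pages))
```

A URL that matches no Site page simply gets an empty set of hints, which fits the "only _if_needed_" approach: the crawler falls back to its automated behavior.
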
>
> Some quick ideas on what a wiki "Site" page would enable humans to
> contribute:
>         - block domains, IPs, paths
>         - spider traps, accidental or intentional
>         - identify throwaway parts of URLs (like query strings with
> session identifiers)
>         - identical content that might be obfuscated for auto-detection
>         - override hostname aliasing to the right hostname
>         - suggest update frequencies or best times to check
>         - discoverability (are new urls trustworthy)
>         - url models (blog indexes vs. permalinks, mailing list and board
> styles)
>         - override titles (suggest where to get the titles in the content)
>         - influence-ability (open wikis and comment areas that get spammy)
>         - a robots and sitemap definition (if the domain is missing them)
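
A few of the hints above (blocked paths, throwaway query parameters, hostname aliasing) could be applied by a crawler roughly as follows. This is only a sketch under stated assumptions: the HINTS layout, its key names, and the normalize function are hypothetical, and the example hostnames and parameter names are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical hint values, as a human might enter them on a Site page.
HINTS = {
    "throwaway_params": {"PHPSESSID", "sid"},               # session identifiers
    "canonical_host": {"example.org": "www.example.org"},   # alias -> right host
    "blocked_path_prefixes": ["/calendar/"],                # e.g. a spider trap
}


def normalize(url, hints):
    """Return the canonical form of a URL, or None if the hints block it."""
    parts = urlsplit(url)

    # Block paths that humans have flagged.
    if any(parts.path.startswith(p) for p in hints["blocked_path_prefixes"]):
        return None

    # Override hostname aliasing to the preferred hostname.
    host = parts.netloc.lower()
    host = hints["canonical_host"].get(host, host)

    # Drop throwaway query parameters such as session identifiers.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in hints["throwaway_params"]]

    # The fragment is discarded, since it does not change the fetched page.
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))


print(normalize("http://example.org/post?id=7&PHPSESSID=abc123", HINTS))
print(normalize("http://example.org/calendar/2007/08", HINTS))
```

Two URLs that differ only in a session identifier then collapse to the same canonical URL, so the page is crawled and stored once.
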
>
> I would eventually like to see it advance even further and provide
> some mechanisms for providing input to understanding the content as
> well, finding dates and common patterns in the URLs, hints about what
> parts of the content are dynamic or static, etc.
>
> I don't expect this to be the *authority* on any of these points, but
> instead be the human override, the guide, for when automated systems
> fail or make mistakes.
>
> It will be treated like any other wiki, with full transparency.
> Anyone can create a Site page and suggest all kinds of hints, and all
> their actions are visible.  Even Grub will get tied directly into
> this, so that the content a user crawls is verifiable by others.
>
> What I *hope* to end up with is a purely socially driven crawler
> feeding back into a quality repository that is free and open for
> anyone to build on.  There should also be local hosting for anyone
> experimenting with open source projects that need direct/fast access
> to this compiled repository.
>
> I can think of a couple immediate questions people might ask:
>         * What about the riff-raff that want to do bad things with the
> repository?  I have a fundamental belief that the good of making a
> quality foundation available outweighs the bad.  Also, everyone using
> it will be transparent; if the community feels someone is doing
> something inappropriate, it will be treated like any other wiki page,
> by discussion and consensus.
>         * How does it get new URLs?  As part of the wiki, anyone can
> upload new lists of URLs, and anyone else can see the lists that they
> uploaded. If there are spare crawling resources the new URLs would be
> checked.
>         * What if people are spamming and being ignored, won't that
> pollute the repository? A simple social network function either needs
> to be created or derived to curb this, whereby individuals that aren't
> connected or trusted by others have less priority.  Any derived
> reputation network is included in the outputs for any later
> function to decide which parts are more valuable.
>         * This isn't a search engine!  Nope.  It's just a step, one that
> should help anyone wanting to build a search engine or experiment in
> this space.
>         * Why isn't it working yet, you've had months now? My cloning
> machine is br0k3n.
>         * But Wikia is commercial and owns all of this!  Nope, Wikia is
> sponsoring its development, but it's all open source and everything
> published under an open license, nobody owns it and everybody owns
> it :)  Wikia won't be the only one sponsoring it either.
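
On the spam question in the list above, the "less priority for people nobody trusts" idea can be sketched very simply. Everything here is an assumption for illustration: the trust_scores and prioritize names are hypothetical, and counting distinct vouchers is just one crude stand-in for whatever social network function ends up being created or derived.

```python
from collections import defaultdict


def trust_scores(endorsements):
    """Score each user by how many distinct other users vouch for them.

    endorsements is an iterable of (truster, trusted) pairs. Vouching for
    yourself is ignored.
    """
    vouchers = defaultdict(set)
    for truster, trusted in endorsements:
        if truster != trusted:
            vouchers[trusted].add(truster)
    return {user: len(who) for user, who in vouchers.items()}


def prioritize(uploads, scores):
    """Order uploaded (uploader, url) pairs, most-trusted uploader first.

    Nothing is dropped, only reordered. sorted() is stable, so uploads
    from equally trusted users keep their original order.
    """
    return sorted(uploads, key=lambda item: -scores.get(item[0], 0))


endorsements = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("mallory", "mallory"),  # self-endorsement, does not count
]
uploads = [
    ("mallory", "http://spam.example/"),
    ("alice", "http://a.example/"),
    ("bob", "http://b.example/"),
]

scores = trust_scores(endorsements)
for uploader, url in prioritize(uploads, scores):
    print(scores.get(uploader, 0), uploader, url)
```

Because the scores travel with the output rather than being used to delete anything, a later function can still decide for itself which parts are more valuable.
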
>
> Jer