[Search-l] Grub Update

jer jeremie at jabber.org
Thu Aug 2 18:01:51 UTC 2007


> I'm a bit new to this wikia search thing but the concept of using Grub
> is a bit confusing. It is almost an implementation of the Infinite
> Number of Monkeys approach to spidering the web. It still requires a
> powerful backend to make sense of all the data spidered and that was
> always Grub's flaw.

Correct. This is just one portion of a search platform: it is
entirely the content side. There are still deep, big things to be done
after this, but you can't get there without having a strong
content foundation first.

> The state of the web has changed since Grub was a player. Most of the
> larger sites now block spidering by DSL and dialup connections.

Most of the larger sites are easy to crawl as well.

> Some
> directories block on User Agent and from what I remember, Grub was one
> string that used to get blocked a lot.

Yep, this needs to be addressed to make sure Grub is behaving
properly. Now that it is an open source project, any naughty
behaviour will at least be a lot more transparent.
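
As an illustration of what "behaving properly" could mean in practice, a crawler can check a site's robots.txt rules against its own user agent before fetching a URL. The sketch below uses Python's standard urllib.robotparser. The helper name `allowed`, the agent string "grub", and the example rules are assumptions made for illustration; they are not taken from Grub's code or from its real user-agent string.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: a site that keeps "grub" out of /private/ only
rules = """\
User-agent: grub
Disallow: /private/

User-agent: *
Disallow:
"""

print(allowed(rules, "grub", "http://example.org/private/page"))  # False
print(allowed(rules, "grub", "http://example.org/public/page"))   # True
```

A crawler that runs this check before every request, and sends an honest user-agent string, gives site owners a way to restrict it without having to block it outright.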

> This is the other flaw with a Grub approach - there is no quality
> assurance of the index. Many small search engines have followed the
> Infinite Monkeys approach to indexing, following each URL to find
> more. The problem with this approach is that it relies on the back
> end to give the data context. They tend to last about 18 months on
> average.

There's another side to Grub that we've not talked about at all, and
that's very much related to creating and increasing the quality of
the work that Grub is doing. The web interface and user management
in Grub right now are pretty straightforward, and lack the tools
you'd need to monitor and increase the quality of the crawling. We
need to start a big discussion about a mash-up of Grub and a wiki,
where the wiki serves as the primary driver and dashboard of all
activities.

This Grub+wiki would be the first step for a social framework to  
manage an open crawling/content foundation for search.  The wiki  
would have some fields to help direct crawlers, including blocking,  
timing, depth, discovery, frequency, etc.  The Grub results would be  
correlated back into the wiki so that crawl samples are easily  
checked and problems can be discovered quickly.  All of the  
intelligence in the wiki and all of the crawl output will be  
available under an open doc license, and then begins the next stage,  
building out a platform for bulk access and processing, *grin*.
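
To make the idea of wiki fields directing crawlers a little more concrete, here is a minimal sketch of what one per-site record might look like and how a crawler could consult it. Every name in it (`CrawlDirective`, `should_fetch`, the field names and default values) is hypothetical; nothing like this is specified for Grub or the wiki yet, and the fields simply mirror the list above (blocking, timing, depth, discovery, frequency).

```python
from dataclasses import dataclass


@dataclass
class CrawlDirective:
    """Hypothetical per-site record a wiki page might hold for crawlers."""
    site: str
    blocked: bool = False        # blocking: never crawl this site
    delay_seconds: float = 5.0   # timing: pause between requests
    max_depth: int = 3           # depth: link hops from the start page
    discover_links: bool = True  # discovery: follow newly found URLs
    recrawl_hours: int = 24      # frequency: minimum gap between crawls


def should_fetch(d: CrawlDirective, depth: int, hours_since_last: float) -> bool:
    """Return True if a page at this depth is due for crawling."""
    if d.blocked:
        return False
    if depth > d.max_depth:
        return False
    return hours_since_last >= d.recrawl_hours


example = CrawlDirective(site="example.org", max_depth=2, recrawl_hours=12)
print(should_fetch(example, depth=1, hours_since_last=13))  # True
print(should_fetch(example, depth=3, hours_since_last=13))  # False
```

Keeping the directives as plain, editable records is what would let wiki users adjust crawler behaviour without touching crawler code.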

> It should be interesting to see how things turn out.

Indeed :)

Jer



