[Search-l] Grub Update
jer
jeremie at jabber.org
Thu Aug 2 18:01:51 UTC 2007
> I'm a bit new to this wikia search thing but the concept of using Grub
> is a bit confusing. It is almost an implementation of the Infinite
> Number of Monkeys approach to spidering the web. It still requires a
> powerful backend to make sense of all the data spidered and that was
> always Grub's flaw.
Correct, this is just one portion of a search platform; this is
entirely the content side. There are still deep/big things to be done
after this, but you can't get there without having a strong
content foundation first.
> The state of the web has changed since Grub was a player. Most of the
> larger sites now block spidering by DSL and dialup connections.
Most of the larger sites are easy to crawl as well.
> Some directories block on User Agent and from what I remember, Grub was one
> string that used to get blocked a lot.
Yep, this needs to be addressed to make sure Grub is behaving
properly. It being an open source project now will obviously make
any naughty behaviour at least a lot more transparent.
> This is the other flaw with a Grub approach - there is no quality
> assurance of the index. Many small search engines have followed the
> Infinite Monkeys approach to indexing, following each URL to find
> more.
> The problem with this approach is that it relies on the back end to give
> the data context. They tend to last about 18 months on average.
There's another side to Grub that we've not talked about at all, and
that's very much related to creating and increasing the quality of
the work that Grub is doing. The web interface and user management
in Grub right now are pretty straightforward, and lack the tools
you'd need to monitor and increase the quality of the crawling. We
need to start a big discussion about a mash-up of Grub and a wiki,
where the wiki is serving as the primary driver and dashboard of all
activities.
This Grub+wiki would be the first step for a social framework to
manage an open crawling/content foundation for search. The wiki
would have some fields to help direct crawlers, including blocking,
timing, depth, discovery, frequency, etc. The Grub results would be
correlated back into the wiki so that crawl samples are easily
checked and problems can be discovered quickly. All of the
intelligence in the wiki and all of the crawl output will be
available under an open doc license, and then begins the next stage,
building out a platform for bulk access and processing, *grin*.
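
To make the wiki fields above a bit more concrete, here is a rough sketch
of what a per-site crawl directive might look like once a crawler has
read it from the wiki. Nothing here is existing Grub code: the
CrawlDirective class, its field names, the default values, and the
should_fetch helper are all hypothetical, chosen only to mirror the
blocking/timing/depth/discovery/frequency fields mentioned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlDirective:
    """Hypothetical per-site record that a wiki page could hold for crawlers."""
    host: str
    blocked: bool = False              # blocking: never crawl this host
    min_delay_s: float = 1.0           # timing: pause between requests to the host
    max_depth: int = 3                 # depth: maximum link hops from a seed page
    follow_new_links: bool = True      # discovery: queue URLs found on fetched pages
    recrawl_after_s: float = 86400.0   # frequency: minimum age before a revisit


def should_fetch(directive: CrawlDirective,
                 depth: int,
                 seconds_since_last_fetch: Optional[float]) -> bool:
    """Decide whether a page on directive.host is due for a fetch.

    seconds_since_last_fetch is None when the page has never been fetched.
    """
    if directive.blocked:
        return False
    if depth > directive.max_depth:
        return False
    if seconds_since_last_fetch is None:
        return True
    return seconds_since_last_fetch >= directive.recrawl_after_s


# Example: a default directive and a host that the wiki editors have blocked.
default_site = CrawlDirective(host="example.org")
blocked_site = CrawlDirective(host="example.net", blocked=True)

print(should_fetch(default_site, depth=1, seconds_since_last_fetch=None))
print(should_fetch(blocked_site, depth=1, seconds_since_last_fetch=None))
```

The point of keeping the directive as plain data is that wiki editors
could change it without touching crawler code, and crawl samples could be
checked against the same record when results are correlated back.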
> It should be interesting to see how things turn out.
Indeed :)
Jer