[Search-l] [search wikia] crawl, index, architecture... what about semantic ?
Gérard Dupont
ger.dupont at gmail.com
Tue Jan 8 11:41:10 UTC 2008
Hi folks!
Here are a few more ideas about the messages quoted below. I'll comment
inline as I go (sorry for this odd presentation of my ideas):
> ------------------------------
>
> Message: 25
> Date: Tue, 8 Jan 2008 00:20:26 +0000
> From: "Peter Burden" <peter.burden at gmail.com>
> Subject: Re: [Search-l] Which does the figure next to each URL mean?
> To: "Mailing list for Search Wikia" <search-l at wikia.com>
> Message-ID:
> <db8759410801071620v774f4d6fr52d932be68dba2eb at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
> I hate to put a bit of a spoke into this discussion but alternative
> ranking
> algorithms can only operate on the page data that has been collected
> in the crawl and page analysis phases of the search engine operation.
> If, for example, you wanted to consider the impact of common words (such
> as "the", "a" etc.,) no amount of clever algorithmic tricks will overcome
> the fact that the crawl and analysis phases have discarded them.
>
> [I have posted elsewhere on the importance of not discarding such
> information, consider searches for "The Times" (my morning newspaper)
> or "Vitamin A" - Google gets these right, Wikia doesn't]
I agree with that, but I think these are still "specific" cases; for most
searches the stopwords need to be removed.
> SEs also seem to discard punctuation giving the effect of phrase
> searches yielding results in which phrases appear to span clause
> and sentence boundaries. Neither Google nor Wikia get this right.
>
> You might want to give more weight to terms that appear in headings
> [HTML <h1>...</h1> etc.] or other special page contexts or styles. If the
> analysis has discarded this information you can't.
Who said that the analysis discards this information and does not take it
into account? I do not know the deepest secrets of Search Wikia, but it is
common for search engines to keep such relevant information.
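
To make the point concrete, here is a minimal Python sketch of an analysis step that keeps track of where each term was found, so that a ranking function can later weight headings more than body text. This is only an illustration, not how Search Wikia, Lucene or Nutch actually do it: the names FieldAwareParser, analyse and score, and the boost values, are all assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boost values, chosen only for illustration.
FIELD_BOOST = {"heading": 3.0, "body": 1.0}
HEADING_TAGS = {"h1", "h2", "h3"}


class FieldAwareParser(HTMLParser):
    """Counts terms per field instead of flattening the page to plain text."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a heading tag
        self.counts = {"heading": Counter(), "body": Counter()}

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        field = "heading" if self.depth else "body"
        self.counts[field].update(re.findall(r"\w+", data.lower()))


def analyse(html):
    """Return per-field term counts for one page."""
    parser = FieldAwareParser()
    parser.feed(html)
    parser.close()
    return parser.counts


def score(counts, term):
    """Weighted term frequency: heading hits count more than body hits."""
    term = term.lower()
    return sum(FIELD_BOOST[field] * counts[field][term] for field in FIELD_BOOST)
```

Because the heading/body distinction survives the analysis, the boosts can be changed at query time without crawling again; if the analysis had flattened the page, that would no longer be possible.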
> The best that could be done is to offer the "advanced" user the possibility
> of tweaking all the weighting terms that are available. This would
> certainly
> be a useful and welcome feature for Wikia but it requires much more
> clarity
> about what the numbers actually mean, if you click on the scores in Wikia
> results, you get complete gobblydegook (well I think it's gobblydegook)
>
> Anything more sophisticated implies that people will generate their own
> crawlers, analysers etc., and the results can be federated in a fashion
> reminiscent of the Atlas proposals.
I fully agree with this, and it leads to my remarks about semantics: the
metadata that is analyzed and saved by the crawler/indexer should have a
meaning. As I already said, I'm not involved enough in the development of
the engine, but I'm familiar with Lucene and a bit with Nutch. Those can
extract some salient information and index it, but no meaning is attached
to this information. If the idea is to have distributed AND personalized
crawlers that can extract any kind of advanced information, that
information has to be expressed in a standard way that can address the
"meaning" problem.
Regarding the h1 example you mentioned: you can call it "H1" information,
or "title", or "titre" (in French), but in the end the indexer needs to know
that these are the same content, just extracted by different crawlers.
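
As a rough illustration of the h1/title/titre case, the following Python sketch maps crawler-specific field names onto one shared concept before indexing. The alias table, the "unmapped:" prefix and the function name normalise_record are assumptions made up for this example; a real system would more likely rely on an agreed vocabulary than on a hard-coded dictionary.

```python
# Hypothetical alias table: crawler-specific field name -> shared concept.
FIELD_ALIASES = {
    "h1": "title",
    "title": "title",
    "titre": "title",
}


def normalise_record(record):
    """Rename the fields of one crawler record to shared concepts.

    Fields with no known meaning are kept, but flagged as unmapped so the
    indexer can decide what to do with them.
    """
    normalised = {}
    for name, value in record.items():
        concept = FIELD_ALIASES.get(name.strip().lower(), "unmapped:" + name)
        normalised.setdefault(concept, []).append(value)
    return normalised


# Two crawlers describing the same page with different field names.
crawler_a = {"H1": "The Times"}
crawler_b = {"titre": "The Times", "auteur": "Unknown"}
```

With this mapping both records end up with a "title" entry, while "auteur" stays visible as an unmapped field rather than being silently merged or dropped.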
> So you should start thinking what page information (metadata) you actually
> want and then start designing and coding crawlers and analysers to collect
> it
The idea is not to define what the good metadata are, since I guess that
will change over time. But defining a standard way to express the meaning
of extracted metadata could be much more useful. Then every kind of new
metadata extractor can be placed in the loop, and the indexer will be able
to know which metadata can be fused and which are not reliable enough to be
useful (perhaps this last step should be studied more deeply...).
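
One possible reading of that fusion step, sketched in Python under strong assumptions: each extractor reports (concept, value, reliability) triples, where the reliability is a number between 0 and 1 that is simply assumed to be available; how to estimate it is exactly the part that needs more study. The threshold value and the name fuse are hypothetical.

```python
from collections import defaultdict

# Hypothetical cut-off below which an observation is ignored.
MIN_RELIABILITY = 0.5


def fuse(observations, min_reliability=MIN_RELIABILITY):
    """Fuse metadata reported by several extractors.

    observations: iterable of (concept, value, reliability) triples.
    For each concept, observations below the threshold are dropped and the
    value with the highest summed reliability is kept.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for concept, value, reliability in observations:
        if reliability >= min_reliability:
            votes[concept][value] += reliability
    return {concept: max(values, key=values.get)
            for concept, values in votes.items()}


observations = [
    ("title", "The Times", 0.9),
    ("title", "Times Online", 0.6),
    ("title", "The Times", 0.7),
    ("language", "fr", 0.2),  # too unreliable, ignored
]
fused = fuse(observations)
```

Summing reliabilities is just one simple voting rule; it is used here only because it is short, not because it is known to be the right choice.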
> > Perry
> > ushow2, Inc.
> >
> > P.S. The above supposes you have a window that craftily shows when you
> > search for "YOURSEARCHTOKEN" miniatures of link choices between
> > several top engines and wikia teams? Let users and what they click on
> > help take the measurement. Usage is the yardstick.
> >
> > > > Maybe each team does specialized scoring for certain topics, so
> > > > that way
> > > > people can go on Wikia search try out various searches and providing
> > > > community feedback to what team's doing the job well.
> > > >
> > ...
> > > > Just spitting out ideas. Long live open source.
> > > >
> > > > - Bryan
> > _______________________________________________
> > Wikia Search mailing list
> > http://alpha.search.wikia.com/
> > Change options or unsubscribe:
> http://lists.wikia.com/mailman/options/search-l
> >
> >
> ------------------------------
>
> Message: 27
> Date: Mon, 7 Jan 2008 19:09:09 -0600
> From: Bryan Bishop <kanzure at gmail.com>
> Subject: Re: [Search-l] Which does the figure next to each URL mean?
> To: Mailing list for Search Wikia <search-l at wikia.com>
> Message-ID: <200801071909.09412.kanzure at gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Monday 07 January 2008, Peter Burden wrote:
> > Anything more sophisticated implies that people will generate their
> > own crawlers, analysers etc., and the results can be federated in a
> > fashion reminiscent of the Atlas proposals.
>
> Perhaps grub can be an enabling platform for us to schedule and run our
> own custom crawlers that look for specific data. But I doubt there's as
> much discarded information as you say there is ... I bet that it's just
> discarded when the servers start to process the user's search request.
>
> - Bryan
> ________________________________________
> Bryan Bishop
> http://heybryan.org/
>
> ------------------------------
>
> Message: 29
> Date: Tue, 08 Jan 2008 02:19:10 +0000
> From: John McCormac <jmcc at hackwatch.com>
> Subject: [Search-l] Indices, Social Search and Distributed Crawling
> To: Mailing list for Search Wikia <search-l at wikia.com>
> Message-ID: <4782DD9E.6060508 at hackwatch.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> While the index is very much a proof of concept index, it makes me wonder
> if a link based,
> distributed crawling approach is broken. The Social Search aspect has the
> capability to provide a
> good, human rated index but distributed crawling is an idea from a time
> when bandwidth and
> processing power was far more expensive than it is now.
>
> Regards...jmcc
> --
> ******************************************************
> John McCormac * e-mail: jmcc at whoisireland.com
> MC2 * voice: +353-51-873640
> 22 Viewmount * web: http://www.whoisireland.com/
> Waterford * The Irish Domains Directory
> Ireland * Irish Domain Stats & Market Research
> ******************************************************
>
>
Any comments are welcome.
Cheers,
G.Dupont