[Search-l] Wikia - Global focus or country level search?
Achim Ruopp
achim at digitalsilkroad.net
Sun Aug 12 15:20:03 UTC 2007
Another thing that has to happen before language analysis is that the
indexer (or crawler?) analyzes the character encoding of the pages. There
are algorithms available for that and some open source implementations
(Mozilla, ICU).
Even then you will not get the right data from all the pages - Google
estimates that about 4% of all pages contain conversion errors or unassigned
(i.e. non-existent) characters (see
http://www.macchiato.com/slides/unicode_at_google.pdf - slide 13).
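
To make the encoding step concrete, here is a rough sketch in Python using only the standard library. It is not how the Mozilla or ICU detectors work (those use statistical models); it only checks for a byte order mark and then tries a short list of candidate encodings. The names guess_encoding and count_bad_chars and the fallback list are my own assumptions for illustration. The second function counts the kind of damage mentioned above: replacement characters left by a failed conversion, and unassigned code points.

```python
import codecs
import unicodedata

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, fallbacks=("utf-8", "cp1252")):
    """Return a plausible encoding name for the bytes in raw.

    Hypothetical helper: a BOM wins if present, otherwise the first
    candidate that decodes without error is returned.
    """
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    for name in fallbacks:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never fails.
    return "latin-1"


def count_bad_chars(text):
    """Count replacement characters and unassigned (category Cn) code points."""
    return sum(
        1 for ch in text
        if ch == "\ufffd" or unicodedata.category(ch) == "Cn"
    )
```

A crawler could decode each page with the guessed encoding and errors="replace", then use count_bad_chars to flag pages that probably were converted wrongly before they reach language analysis.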
Achim Ruopp
Dennis Kubes wrote:
> -----Original Message-----
> From: search-l-bounces at wikia.com
> [mailto:search-l-bounces at wikia.com] On Behalf Of Dennis Kubes
> Sent: Saturday, August 11, 2007 8:39 PM
> To: search-l at wikia.com
> Subject: Re: [Search-l] Wikia - Global focus or country level search?
>
> Localization on the front end so the same search website can
> be used and appropriate content pages can be retrieved, yes.
> But being able to restrict queries and search results to
> specific languages requires two components:
>
> 1) A very good language identifier. This is usually run on
> content while it is being fetched and stored because you want
> to make it available to processing jobs including indexing
> jobs. There are different algorithms for this but many
> current approaches use character ngrams and distributions to
> identify language from text.
>
> 2) Fields within the index to store language and restrict
> queries, or a better option is completely separate indexes
> and search servers based on language. One of the benefits of
> separate indexes for languages is the ability to scale
> capacity for a given language.
>
> Some numbers we have seen suggest that about 35% of webpages crawled
> are non-English. I would agree though that language needs to
> be designed in from the start even if we only start with English.
>
> Dennis Kubes
>
> Jimmy Wales wrote:
> > I think good quality demands localization, and so it needs to be
> > designed in from the start.
> >
> > John McCormac wrote:
> >> Will Wikia have a more US focus with .com/net/org/biz/info being the
> >> primary search targets? Or will each country have its own SE as in
> >> the "pages from $country" thing with Google etc? Has such a question
> >> been considered or is it way down the list?
> >>
> >> Regards...jmcc
> >
> >
> > _______________________________________________
> > Search-l mailing list
> > Search-l at wikia.com
> > http://lists.wikia.com/mailman/listinfo/search-l
> > Change options or unsubscribe:
> > http://lists.wikia.com/mailman/options/search-l
>
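
The two components Dennis describes in the quoted message (a character n-gram language identifier, and separate indexes per language) can be sketched together in a few lines of Python. This is a toy under loud assumptions: the training samples are single sentences I made up, cosine similarity over trigram counts is just one of several possible scoring methods, and PartitionedIndex is a hypothetical in-memory stand-in for real per-language search servers.

```python
from collections import Counter, defaultdict
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams, padded with a space at each edge."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Assumption: toy training text, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify_language(text):
    """Return the language whose trigram profile is closest to the text."""
    doc = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))


class PartitionedIndex:
    """Hypothetical in-memory index that keeps one inverted index per language."""

    def __init__(self):
        # language -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text):
        """Identify the language at ingest time and index into that partition."""
        lang = identify_language(text)
        for term in text.lower().split():
            self.postings[lang][term].add(doc_id)
        return lang

    def search(self, term, lang):
        """Look up a term in a single language partition only."""
        return sorted(self.postings.get(lang, {}).get(term.lower(), ()))
```

Because each language has its own partition, a query restricted to one language never touches the others, which is also what would let capacity be scaled per language as described in point 2.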