[Search-l] [Fwd: Re: Wikia search goes live today]

Dennis Kubes kubes at apache.org
Tue Jan 8 23:53:16 UTC 2008


The message below is in reply to this thread on the Lucene mailing list.

http://www.nabble.com/Wikia-search-goes-live-today-to14665259.html

Thought it might help answer some questions:

Sorry about not responding to this before now, been a little busy :).

For those of you who don't know me, I am a committer on the Nutch
project.  I have been working with Wikia since early July and more
actively since the beginning of November.  Before Wikia I helped start
another search engine based on Nutch called Visvo.com.

For the record, yes Search Wikia is using and will be supporting
Nutch/Hadoop/Lucene/Solr/HBase. It is the intention of Search Wikia to
help develop these projects and their communities.  We have no intention
of keeping the changes we make "proprietary". Everything that Search
Wikia develops (barring any user or personal data) will be considered
open source and freely available.  Any improvements made to the apache
projects will be immediately donated back to the community through the
respective project.

Making search open and transparent is not just limited to source code.
It is our intention to make the Search Wikia data freely open and
available as well.  This means that people will be able to download the
crawl data, link data, content shards, and completed indexes.  Also the
social networking functionality, named foowi, will become its own open
source project (probably with an apache license), and will be available
to download, use, and improve.

And Search Wikia is not alone in this.  Visvo.com in coordination with
Wikia will be releasing all of its data and source code improvements to
the community under an OSI approved license, including a python
framework for managing hadoop configurations on distributed machines,
automating the fetching and indexing process, and for managing search
shards.

In terms of the Nutch logo: there are two standard Nutch installations
and index farms at the following URLs.  One is an index hosted at the
ISC and the other is Visvo's open index.  The ISC index has
approximately 35M pages while Visvo's index has a little over 50M pages.

http://search.isc.swlabs.org
http://open-index.visvo.com

The main Search Wikia site is hosted in a secure underground hosting
facility in a bunker in Iowa (http://usshc.com/) and makes calls to
these indexes.  So when showing cached pages and explain plans, those
requests go to their respective indexes.

Both indexes are available for search by either browser-based or web
2.0-based clients. We are currently using NUTCH-594 to serve results
from these indexes in both XML and JSON formats.  An example request
searching for java would be:

http://search.isc.swlabs.org/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
http://open-index.visvo.com/nutchsearch?query=java&hitsPerSite=1&lang=en&hitsPerPage=10&type=json
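
For anyone who wants to script against these endpoints, here is a minimal
Python sketch that builds the same request URL from its parameters.  The
function names (build_search_url, fetch_results) are made up for
illustration, and the layout of the JSON that comes back is not described
here, so the sketch only decodes the response and does not assume any
particular fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base, query, hits_per_page=10, hits_per_site=1,
                     lang="en", fmt="json"):
    """Build a nutchsearch URL with the parameters from the example request.

    The parameter names (query, hitsPerSite, lang, hitsPerPage, type) are
    taken from the example URLs; the defaults simply mirror that example.
    """
    params = {
        "query": query,
        "hitsPerSite": hits_per_site,
        "lang": lang,
        "hitsPerPage": hits_per_page,
        "type": fmt,
    }
    # urlencode escapes the values and keeps the insertion order of the dict.
    return base.rstrip("/") + "/nutchsearch?" + urlencode(params)


def fetch_results(url, timeout=10):
    """Fetch a JSON response and decode it (hypothetical helper).

    No assumption is made about the structure of the decoded object.
    """
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for base in ("http://search.isc.swlabs.org",
                 "http://open-index.visvo.com"):
        print(build_search_url(base, "java"))
```

Calling fetch_results on one of the printed URLs would return the decoded
JSON, assuming the index is reachable and type=json is honored as described
above.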

So we are busy working on getting the data available for download.
Hopefully we should have a site set up within the next day or so.  If
anybody has any questions or would like to get some specific data feel
free to send me an email.

Dennis Kubes


