[Search-l] Sorry to do this but its coming, yes a rant :-(

Dennis Kubes kubes at apache.org
Tue Apr 1 03:08:34 UTC 2008



Mark (Markie) wrote:
> hmmm sent one reply to this, but forgot to comment :-p
> 
> so, comments below
> 
> mark
> 
> On Mon, Mar 31, 2008 at 2:04 AM, Dennis Kubes <kubes at apache.org> wrote:
> 
>     Hi Markie,
> 
>     First let me say that if anything has been missed, or promised and then
>     not delivered, it was not intentional.  
> 
> 
> okay so maybe they're not intentional, but any chance of them being 
> sorted? :-p

Absolutely. :)

>  
> 
>     Second, I would agree with you
>     that while we have been working to make changes to improve the accuracy
>     of the search results, we have not been doing a very good job of keeping
>     the community informed about those or other changes and that is
>     something we need to work on.
> 
>     For my part I will attempt to communicate more of what we are working on
>      in terms of the search engine internals, starting now.  
> 
> 
> excellent :-D many thanks
>  
> 
>     Probably the
>     biggest improvement we have seen in terms of relevancy is changing how
>     inbound link text is indexed.
> 
>     Inbound link text is the text of anchors pointing to a page.  We currently
>     index that text along with a given page.  So for example if page x links
>     to page y and the anchor text reads "hotels" that text will get put into
>     the index under page y.  The problem we were having was we would index
>     the first N number of links pointing to a page without regard for what
>     were the best links.  That provided for some weird results when we
>     launched, for instance google.com would come up in a search for dallas
>     hotels because it had one inbound link that said "dallas" and another
>     that said "hotels".  To fix this we started looking at inbound links
>     according to the score of their parent (pointing from) page.  The idea
>     behind this was that higher scoring pages would have better outbound
>     links.  In our current index we first determine what the *best* links
>     are by their parent page's score and then index the first N best links.
>     And what we have seen as a result is a big increase in the relevancy of
>     the search results.
> 
> 
> excellent, as this has been one of our major problems, so I'm glad to 
> hear that work is being done to sort the problem

Funny thing is we didn't know the extent to which this change would 
help.  Once we deployed it we saw a noticeable improvement.  One of the 
problems I see is being able to determine if a change actually improves 
search results (besides just looking at the results).  Any ideas on how 
to determine this?
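
To make the inbound link change described above concrete, here is a minimal
sketch in Python. It is not the Nutch code: the Inlink record, the
best_anchors function, the URLs and the scores are all made up for
illustration. It only shows the idea of ranking inbound links by the score of
the page they come from before keeping the first N.

```python
from typing import List, NamedTuple


class Inlink(NamedTuple):
    """Hypothetical record for one inbound link (not a Nutch class)."""
    from_url: str        # page the link points from (the "parent")
    anchor: str          # anchor text of the link
    parent_score: float  # score of the parent page


def best_anchors(inlinks: List[Inlink], n: int) -> List[str]:
    """Return the anchor text of the n inlinks with the highest parent score."""
    ranked = sorted(inlinks, key=lambda link: link.parent_score, reverse=True)
    return [link.anchor for link in ranked[:n]]


# Made-up inbound links for a search engine home page, in crawl order.
inlinks = [
    Inlink("http://a.example/", "dallas", 0.1),
    Inlink("http://b.example/", "hotels", 0.2),
    Inlink("http://c.example/", "search engine", 5.0),
    Inlink("http://d.example/", "web search", 3.0),
]

# Old behaviour: index the first N links regardless of quality.
first_n = [link.anchor for link in inlinks[:2]]

# New behaviour: index the N links whose parent pages score highest.
top_n = best_anchors(inlinks, 2)

print(first_n)
print(top_n)
```

With the old behaviour the page picks up "dallas" and "hotels" from two
low-scoring parents; with the new one it keeps only the anchors from the
higher-scoring parents.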

>  
> 
> 
>     Here is a list of the things I see that could help improve search
>     relevancy going forward:
> 
>     - Being able to score elements of web pages.  For example, determining
>     whether a piece of text is in an h1, h2, div, etc.  Currently our web
>     page parsers don't support that.
> 
> 
> is this code available anywhere in wikia's svn?

Many of the improvements are being made to Nutch directly.  Any changes 
to Nutch are submitted back to the Apache JIRA and, if the Nutch 
community OKs them, committed to the Apache SVN repository.  A lot of the 
discussion of Nutch internals happens on the Nutch user and dev 
mailing lists.

Dennis

>  
> 
> 
> 
>     - Better integration of the star system into the rankings and better
>     ability for the community to tag pages as spam.  This is part of the KT
>     stuff Jer has been working on.
> 
> 
> :-D
>  
> 
> 
> 
>     - Overall improvement in the search algorithm.  Currently the algorithm
>     is based on Nutch's OPIC implementation.  Long story short this
>     algorithm is unstable after a few iterations because web page scores keep
>     increasing exponentially.  This is more of a Nutch problem and has
>     already been discussed on the Nutch lists but essentially we need a new
>     process for scoring and probably a new algorithm that is more
>     PageRank-like and has some type of convergence.
> 
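
For readers wondering what "PageRank-like with some type of convergence"
could look like, here is a minimal sketch in Python. It is a textbook power
iteration with a damping factor, not Nutch's OPIC code and not a proposed
replacement for it; the function name, the parameters and the tiny link graph
are assumptions made up for illustration. The point is that the scores are
redistributed on each pass so that they always sum to 1 and the change
between passes shrinks, instead of growing without bound.

```python
from typing import Dict, List, Tuple


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             tol: float = 1e-8,
             max_iter: int = 200) -> Tuple[Dict[str, float], int]:
    """Power iteration over a link graph given as {page: [pages it links to]}.

    Returns the scores and the number of iterations that were run.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for iteration in range(1, max_iter + 1):
        # Score held by pages with no outlinks is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes a damped, equal share of its score to each outlink.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share

        # Total change between passes; stop once it is below the tolerance.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break

    return rank, iteration


# Made-up four page graph: a, b and d all link to c, and c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores, iterations = pagerank(graph)
print(iterations, scores)
```

Because every pass hands out exactly the score that was there before, the
total stays fixed no matter how many iterations are run, which is the
stability property the bullet above is asking for.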
>     There are other items as well but I think these things would help show a
>      dramatic improvement in search quality.
> 
>     Last let me say that anybody should feel free to email me at any time.
>     If something isn't being done fast enough or something seems to be
>     getting left out, give me a nudge. :)
> 
> 
> /me adds you to contacts :-p
>  
> 
> 
>     Dennis
> 
> 
>     Mark (Markie) wrote:
>      > re-sending in case it was missed, from 4/5 days ago, maybe the people
>      > copied in (wikia staff/founders) would be willing to give a small
>      > amount of time to reply?!?
>      >
>      > mark
>      >
>      > ---------- Forwarded message ----------
>      > From: *Mark (Markie)* <newsmarkie at googlemail.com>
>      > Date: Wed, Mar 26, 2008 at 11:13 PM
>      > Subject: Sorry to do this but its coming, yes a rant :-(
>      > To: Mailing list for Search Wikia <search-l at wikia.com>, Search
>      > Wiki <searchwiki at wikia.com>, Jimmy Wales <jwales at wikia.com>,
>      > jer <jeremie at jabber.org>, dennis at igfoo.com
>      >
>      >
>      > Right, I'm afraid the time has come once again where I have been
>      > wondering to myself, and I feel that things need to be said, so
>      > here they are.
>      >
>      > *What's happening with the project?  AFAIK *very* little seems to
>      > have happened overall since the launch (though I know some things
>      > have happened).  Now I know that things are probably happening with
>      > the team, but any chance of actually telling the users about this,
>      > cos it's not looking good from here atm.
>      >
>      > I've copied in the so called pillars of search
>      >
>      >    1. *Transparency* - riiiiiiight :-(
>      >    2. *Community* - hmmm contribute to stale projects?
>      >    3. *Quality* - well....
>      >    4. *Privacy <http://search.wikia.com/wiki/search:Privacy>* - hmm
>      >       yes that seems to have been done to an extent (by the
>      >       community mind)
>      >
>      >
>      > I've been on the project since Dec 2006, and so have been waiting a
>      > long time for this to happen, so it's not purely a case of I want
>      > everything to happen NOW, I just want it to look like SOMETHING
>      > will happen SOON.
>      >
>      > *This brings me onto the next topic: where is the project going???
>      > There has been practically no progress, and frankly I can't see much
>      > being done from my point of view.  The launch has happened, many
>      > people were interested and contributed but have now left, because
>      > NOTHING has happened.  So overall the net gain of launching the
>      > project?? Bad press and a few (relative to the web) minis.
>      >
>      > *Many things have been promised by various people, which haven't
>      > happened.  Most specifically this has come from one certain member
>      > of staff, who has said that they will do many things, but even the
>      > most basic of tasks seem not to have happened.  So broken/missed
>      > promises.  Well iirc (name here) said he would make sure that the
>      > about pages etc were created, hmm...
>      > (http://alpha.search.wikia.com/about.html in case you forgot where
>      > those were).  This is a Wikia project, any chance of getting ANY
>      > involvement/input/co-ordination from the team who, ultimately, want
>      > us to make them more successful and a profit (if we're being frank).
>      >
>      > Now I know I haven't been that active recently on the wiki, but I
>      > have been reading the mailing lists and talking in irc.  The main
>      > reason for me not being active on the wiki is that I just don't
>      > have the motivation to do anything because of the above.  Frankly
>      > atm it's a stale project, but hopefully this rant (which I hate
>      > doing) will mean that the project becomes better.
>      >
>      > If I have offended anyone above then I am sorry, but I feel that
>      > certain things need to be said right now, in order to make the
>      > project better, which is my aim.
>      >
>      > Many thanks and look forward to the responses to this, especially
>      > from Wikia staff
>      >
>      > Regards
>      >
>      > mark
>      >
>      > (user:Markie)
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      > _______________________________________________
>      > Wikia Search mailing list
>      > http://alpha.search.wikia.com/
>      > Change options or unsubscribe:
>     http://lists.wikia.com/mailman/options/search-l
> 
> 


