[Search-l] [relevancy of search results]

William Surowiec wsurowiec at gmail.com
Wed Jun 6 18:59:33 UTC 2007


(This is a plain text reposting of an earlier, accidental html posting 
with an additional link at the end.)

An interesting article 
(http://jcmc.indiana.edu/vol12/issue3/vancouvering.html#schemas) has 
begun to change my mind.

I have been somewhat of a "lurker", waiting to gain access to crawl 
results so I can pass them through a "natural language processing" 
pipeline (see UIMA: http://incubator.apache.org/uima/). I admit to not 
believing in the success of a voluntary group rating system. (Note that 
this is far from saying I believe the opposite, that it will fail.) I 
know that I do not, and cannot, know the outcome till we get there.

The following quote from the article has forced me to question the 
potential efficacy of both my approach and "the" (quotes because it is 
only my impression of what I believe is still evolving) voluntary group 
rating system being discussed.

*** quoted text follows ***

What is relevance? In a small, well-defined database, it is relatively 
easy to sort relevant from irrelevant documents. On the Web, this is not 
necessarily as simple. One interviewee commented that the standard of 
relevance has changed from when he began to work with information 
retrieval systems:

[W]here the systems used to only be the Dialogues and the Lexis-Nexises, 
you know, I think they strove for a more academic standard of relevance, 
where you define relevance as the relationship between the subject that 
is in the document with what the user is asking about. So it is sort of 
topical relevance. Whereas in the practical world where the search 
engines are reaching today, something being useful to the user and 
something where the user grabs the information and continues, has 
become, I think, more important and less emphasis on say, getting the 
best document. (Interviewee G)

In other words, as this interviewee says elsewhere, it is about 
"satisfying users." Relevance has changed from some type of topical 
relevance based on an applied classification to something more subjective.

*** end quoted text ***

If this is so (and others may fairly argue against that point) then a 
determination of the user relevance of a link needs to be in alignment 
with the intentions of the user. Relevance is then inherent neither in 
the document nor in _any_ metadata associated with the link that is not 
so aligned.

I believe this leads to requiring knowledge about the user that cannot 
be derived solely from the query - to impute the user's intent will require:

  1. identification of the user (may be anonymous, but a specific
     anonymous user - a token in the user's possession)

  2. the newly entered query from this user

  3. the search history (the ordered collection of queries, results
     returned, and user actions taken) of this user and many others

  4. an ability to impute a current relevancy value for a link in a
     result set for a query given this user and the actions taken by
     similar user/query requests - the hard part
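
As a minimal sketch of how items 1 to 4 might fit together (an 
illustration under stated assumptions, not a proposal for the actual 
design): assume the search history is a list of (user token, query, 
link, clicked) records, and assume "similar" users are those whose 
clicked links overlap with this user's. The names impute_relevance and 
jaccard, the record layout, and the choice of Jaccard overlap as the 
similarity measure are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two sets of clicked links (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def impute_relevance(log, user, query, link):
    """Estimate how relevant `link` is to `user` for `query`.

    `log` is a list of (user_token, query, link, clicked) tuples
    (a hypothetical record layout). The score is the similarity-weighted
    fraction of *other* users who clicked `link` when it was returned
    for the same query; it is 0.0 when no similar user has seen it.
    """
    # Item 3: build each user's click history from the shared log.
    clicks = defaultdict(set)
    for token, _, shown, clicked in log:
        if clicked:
            clicks[token].add(shown)

    # Item 4: weight each other user's action by similarity to this user.
    weighted_clicks = total_weight = 0.0
    for token, q, shown, clicked in log:
        if token == user or q != query or shown != link:
            continue
        weight = jaccard(clicks[user], clicks[token])
        weighted_clicks += weight * clicked
        total_weight += weight
    return weighted_clicks / total_weight if total_weight else 0.0
```

Ranking a result set for a user would then amount to sorting its links 
by this score. Note that a brand new token has no history and so no 
similar users, which is part of why this is the hard part.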

I know that collecting this data will justifiably be offensive to some - 
given enough data, an anonymous user may be identified, and a careless 
user far sooner. And, as we are open, this data _will_ be closely 
examined, sometimes by not nice people. Some users will doubtless be 
hurt. (It is neither cold-heartedness nor insensitivity that prevents me 
from softening that statement - if we collect this data we should do 
it knowing the consequences.)

Given enough data I believe this approach will both be used and yield 
more relevant results than any other. The "used" and "yield" parts of 
that sentence are the conversion wrought in me by the article. I now 
doubt a user would make the effort to use even a "semantic search", if 
one were available, over a simple keyword search yielding good enough 
results with less effort on their part - sigh. Of course a semantic 
search would be preferentially used by "intelligent agents" - both 
software and some humans. But I sense neither is our target audience.

I believe user history (aka personalization) will be a component in the 
approach taken by the "big boys". (I am intentionally trying to 
communicate a negative in that phrasing, as I am annoyed by the belief 
that it is being done quietly by those who will possess a de facto, 
significant, and user-appreciated advantage that will be well managed 
so as not to "cause trouble.")

I do not claim that technical feasibility, or the fact that others are 
doing it, is sufficient reason for us to do it. But I do not believe 
there is another way to deliver the most relevant results to a user. (I 
am open to any data - especially contrary data.)

One saving grace we might have, if we were to do this, would be our 
openness. This will help research efforts, inform the public, and 
possibly influence rule makers and others.

We now have servers - they are being provisioned. Shall we load the data 
released by AOL last year and begin exploring how to use this type of data?

Bill

ps - I discovered the article via a blog entry by Seth Finkelstein 
(http://sethf.com/). I intend this as a public thank you but realize it 
may yield other fruit :)

pps - I've become aware of an additional article bearing on this point: 
http://jeffnolan.com/wp/2007/05/22/google-flirts-with-evil/





