[Search-l] Smurf - towards an architecture for participation in search

Nitin Borwankar nitin at borwankar.com
Tue May 1 17:46:12 UTC 2007


Hi Jer, Jimmy,

Have been lurking on this list for a while but have separately spoken to 
Jimmy and Jer about similar ideas.
Here's a first cut at describing an architecture of participation for 
search.
If this is even mildly interesting I'd be glad to help develop it further.
I am not attached to any of the ideas expressed below and am open to 
alternate approaches
- anything that eventually will lead to better search, "built like the 
Internet was built".
Am currently out of the country but will be back May 10th.
-- 

Nitin Borwankar

http://walruscarpenter.wordpress.com	Of shoes  and ships  and sealing wax  of cabbages and kings
http://greener.com    Find, Learn, Act .... Greener, the search engine for the planet
http://tagschema.com  Implementation of tag database applications

nitin at borwankar.com
510-872-7066

=========

Smurf - Towards an architecture of participation in search
----------------------------------------------------------
Nitin Borwankar, El Sobrante, CA.
Published under a Creative Commons (attribution, no other restrictions) License


Summary:
Semantic routing over an arbitrary collection of vertical search engines creates an architecture for harnessing collective intelligence in search.

Architectural Motivation
-------------------------
The evolution of the Internet has seen a continual tension between two opposing tendencies.  One of them is the "telco model" of complexity at the core with simple edges, i.e. a telecom switch with telephones at the endpoints. The other is the native model of the Internet with simplicity at the core and complexity at the edge, i.e. a simple set of rules for routing packets with each router having local knowledge but no central authority.  While this native model has won the evolutionary struggle at the lower layers of the network, we seem to have to learn this lesson again and again at the upper application layers every few years.  So it was with bulletin boards like Compuserve vs. the web.  So it is with search.

The current dominant model of search follows the "telco model" of a complex core (massive algorithm crunching compute farms) with simple endpoints - thin client browsers. The wave of architectural tension between the core and the edges now crashes on the beach of search.  With this wave comes the movement of complexity to the edges, of simple routing rules at the core and free participatory interoperability without a central authority, in other words "search built just like the Internet was built", not like Compuserve was built.


The status quo, an architecture of exclusion in search
------------------------------------------------------

The low rumble of the grumble of dissatisfaction with the quality of search has been steadily growing in volume. And with good cause.

There is currently a dominant architecture for search on the Internet. It is a monolithic and closed one. It is not an architecture of participation. It provides no ability for users to enhance the content or quality of the index, nor to flexibly manipulate the search results via filtering and mixing with their own content.  

Aside from the search results themselves, the ability to associate advertising content with search results is also a closed and monolithic process allowing no publisher participation or control in the selection of ads appropriate to content.  

These undesirable characteristics of the dominant architecture appear to be vendor independent.  All major search engines appear to have followed this architectural paradigm. This architecture has become, inadvertently or by design, an architecture of exclusion, of collective disempowerment and hence of stagnation in innovative energy.

User roles of active participation
-----------------------------------

This discussion proposes an alternate architecture - an architecture of participation - by which individual users may actively participate in the creation of a global search infrastructure in the spirit of participation and contribution engendered by Wikipedia.

In this architecture we envisage users playing two major contributory roles.  In the role of search providers they may independently index subsets of the web in semantic clusters (vertical search engines), and provide a "vertical search service" to the search infrastructure.
  
In the role of search users they may provide dynamic feedback about the quality of search results. This dynamic feedback can be for two components of the result.  One, the scoring of search results hence providing a collectively evolved page ranking system. Two, the scoring of search providers based on the subjective assessment of search result quality from that provider.

An example motivating semantic routing
--------------------------------------

Say a vertical search provider indexes a collection of documents/web pages related to the environment and declares to the world that these pages are represented, say, by the following set of keywords or "tag bundle" :-

ecology, environment, green, eco-friendly, solar power, biodiesel, global warming 

by including them in the <Tags> element of the OpenSearch Description Document for that search provider. 
(See <a href="http://www.opensearch.org"> OpenSearch.org </a> and <a href="http://opensearch.a9.com/"> Open search at A9 </a> for details on OpenSearch.)
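As a rough illustration of the tag bundle idea, here is a minimal Python sketch that pulls the tags out of a description document. The document below is made up for this example (the provider name and URL are hypothetical), and the sketch assumes tags are whitespace-delimited single words, so the multi-word tags from the list above are hyphenated here.

```python
import xml.etree.ElementTree as ET

# Hypothetical description document for the "green" provider; names and URL are invented.
OSDD = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Green Search</ShortName>
  <Description>Vertical search over environment-related pages</Description>
  <Tags>ecology environment green eco-friendly solar-power biodiesel global-warming</Tags>
  <Url type="application/rss+xml"
       template="http://example.com/search?q={searchTerms}"/>
</OpenSearchDescription>"""


def extract_tag_bundle(osdd_xml):
    """Return the set of lower-cased tags found in the first Tags element."""
    root = ET.fromstring(osdd_xml)
    for elem in root.iter():
        # Compare local names only, so the sketch does not depend on
        # which namespace version the provider used.
        if elem.tag.rsplit("}", 1)[-1] == "Tags":
            return {t.lower() for t in (elem.text or "").split()}
    return set()


tag_bundle = extract_tag_bundle(OSDD)
```

A document without a Tags element simply yields an empty bundle, which a router could treat as "never matches".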

So, let's say I am searching for information on biodiesel.  A friend of mine who knows about this particular green search engine will say "hey, go search on such and such search engine".  This friend has matched my query, 'biodiesel',  to that green search engine by mentally matching keywords in her head.  This friend has played the role of a semantic router. 

The semantic router in our framework is an automated version of this matching process using some simple keyword representation and matching between query keywords and tag bundles. This matching process can start simple and iteratively evolve over time to involve more sophisticated semantic representations.

Evolving an architecture of participation from first principles
---------------------------------------------------------------

While there are a number of vertical search engines on the Internet, a framework for routing an arbitrary query to the right search engine, or the right set of search engines, is missing.  Such a framework needs to be created in an open, collaborative fashion.  Closed collections of vertical search engines already exist and have not made much of a difference in the culture of participation (or complete lack thereof) in search.  

The mechanism that matches and then routes queries to search providers is what we call, for the purposes of this discussion, a "semantic router".   This sounds a lot fancier than it currently is but leaves conceptual "room to grow" in future.  The so-called semantic router is, in this proposal, a tag-matching engine which matches keywords in the query to tags provided by a vertical search provider. With a strong enough match, the query is routed to that provider.  We will deliberately avoid defining the term "strong enough match" more precisely at this point.
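To make the matching step concrete, here is a minimal Python sketch of a tag-matching router. Everything in it is an assumption made for illustration: the provider names, the scoring rule (fraction of query keywords present in a tag bundle), and the 0.5 threshold standing in for "strong enough match", which the proposal leaves undefined.

```python
def tokenize(text):
    """Split a query into a set of lower-cased keywords."""
    return set(text.lower().split())


def match_score(query, tag_bundle):
    """Fraction of query keywords found in the tag bundle (0.0 to 1.0)."""
    keywords = tokenize(query)
    if not keywords:
        return 0.0
    return len(keywords & tag_bundle) / len(keywords)


def route(query, providers, threshold=0.5):
    """Return (provider, score) pairs at or above the threshold, best first.

    The threshold is an illustrative stand-in for "strong enough match".
    """
    scored = [(name, match_score(query, tags)) for name, tags in providers.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical providers and their tag bundles.
PROVIDERS = {
    "green-search": {"ecology", "environment", "green", "biodiesel"},
    "car-search": {"cars", "engines", "biodiesel", "hybrid"},
    "cooking-search": {"recipes", "baking", "vegetarian"},
}

routes = route("biodiesel engines", PROVIDERS)
```

With these made-up bundles the query "biodiesel engines" is routed to car-search first and green-search second, and cooking-search is skipped. A more sophisticated semantic representation could replace match_score without changing route.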

We avoid the term "tag based router" because this term already has a different meaning in network engineering.  Moreover, the key idea is that there is some level of 'meaning' involved in the matching process.  At first cut we use a crude 'tag bundle' approach.  In future we may use different semantic representations for vertical search engines, so we don't want to "pour concrete" on the term "tag bundle" or "tag ..." anything.

The key insight here is that a 'tag bundle' or a 'keyword bundle' roughly defines a context - i.e. a subject domain and hence an area to search over. In future we may use something else other than tag bundles to define this and something other than crude tag matching to do query routing.

So we formalize all this talk roughly as follows :-

A Semantic Routing Framework (SmRF or preferably Smurf) is proposed consisting of the following elements
---------------------------------------------------------------------------------------------------------

1) A Vertical Search Provider (VSP), i.e. a web site that: 

** operates a Vertical Search Engine (VSE), 
** publishes search results e.g. via OpenSearch v 1.0 and 
** provides the keyword set e.g. in the Tag element of the OpenSearch Description document.

2) A Semantic Router, i.e. a web-based server that provides the following functions:

** an endpoint that accepts submissions of OpenSearch Description document (OSDD) URLs.  A VSP submits an OpenSearch Description document via a POST operation to this URL endpoint.
** a facility that extracts the contents of the Tag element in the OSDD and associates the tag bundle with the VSE in an internal map (tag-bundle handler)
** a tag matching engine that takes the keywords in the query and does a best match against tag bundles of available VSEs (tag-matcher engine)
** a facility that collates the search results and may group by VSE and order by score, or present in some other format.
** a facility that captures user feedback in the form of clickthroughs and of thumbs up/down on links and VSEs
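The collation facility in the list above could look roughly like the following Python sketch. The result shape (dicts with "vse", "link" and "score" keys) and the choice to order groups by their best-scoring link are assumptions for this example, not part of the proposal.

```python
from collections import defaultdict


def collate(results):
    """Group results by VSE, sort links within each group by score,
    and order the groups by their best-scoring link."""
    groups = defaultdict(list)
    for result in results:
        groups[result["vse"]].append(result)
    for items in groups.values():
        items.sort(key=lambda r: r["score"], reverse=True)
    return sorted(groups.items(), key=lambda kv: kv[1][0]["score"], reverse=True)


# Hypothetical results returned by two vertical search engines.
RESULTS = [
    {"vse": "green-search", "link": "http://example.com/a", "score": 0.4},
    {"vse": "car-search", "link": "http://example.org/b", "score": 0.9},
    {"vse": "green-search", "link": "http://example.com/c", "score": 0.7},
]

collated = collate(RESULTS)
```

Here car-search comes first because it holds the top-scoring link, and within green-search the 0.7 link precedes the 0.4 link.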

3) A Ranking engine for page rank and site rank

** A page-rank facility that maintains a dynamically updated score for each link. This score is generated from aggregated user feedback and is used for sorting results.
** A site-rank facility that maintains a dynamically updated score for each site. This score is generated from aggregated user feedback and is used for routing queries and sorting result sets.
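One simple way to keep such scores is sketched below in Python. The proposal does not specify a scoring algorithm (it is listed as an open issue), so the smoothed fraction of thumbs-up votes used here is purely an assumption, as are the class and method names. The same class could hold link scores or site scores, keyed by URL or by provider name.

```python
from collections import defaultdict


class FeedbackScores:
    """Aggregates thumbs up/down votes into a score between 0 and 1."""

    def __init__(self):
        self.up = defaultdict(int)
        self.down = defaultdict(int)

    def vote(self, key, thumbs_up):
        """Record one thumbs-up (True) or thumbs-down (False) for a link or site."""
        if thumbs_up:
            self.up[key] += 1
        else:
            self.down[key] += 1

    def score(self, key):
        """Smoothed fraction of positive votes; 0.5 when there are no votes yet."""
        up = self.up.get(key, 0)
        down = self.down.get(key, 0)
        return (up + 1) / (up + down + 2)


scores = FeedbackScores()
for thumbs_up in (True, True, True, False):
    scores.vote("http://example.com/a", thumbs_up)
```

The smoothing keeps a link with a single thumbs-up from outranking one with many mostly positive votes; whether scores are recomputed on every vote or in periodic batches is left open here.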

Issues to be resolved 
---------------------
  
a) most efficient scalable techniques for tag matching and routing
b) using "related tags" for a site in addition to tags supplied in OpenSearch Description document 
c) feedback mechanisms and algorithms for page rank
d) feedback mechanisms and algorithms for site rank
e) whether to update scores quasi-statically (every hour, day, or week)

While these are non-trivial issues, and should not be glossed over, these issues a)-e) are not the issues that have held up creating an architecture of participation in search. Moreover they have been tackled in other contexts and while it may be an exaggeration to say that they are "well understood", they are at least "well known" and "actively being worked on".  Given sufficient participation in this or some similar effort these issues will become well understood. So we will not be ignoring them.  There will be energetic participation in all of these.

Wrapup
------

Well wishers are encouraged to widely disseminate these ideas. Naysayers are invited to actively and vigorously critique them. 

----------

Why Smurf is different from ....

A9 search aggregator
--------------------
* In the A9 aggregator, a human being has to pick the sites that a search is directed to; this is a major show stopper

JagTag
------

* Unclear how query is matched to engine
* The UI in JagTag is multi-step

"Federated Search"
------------------
* More of a description than a concrete architecture or protocol
* Smurf is semantic federated search




