[atlas-l] back to knuggets

jer jeremie at jabber.org
Tue Sep 11 18:29:31 UTC 2007


I've been toying with my simple Factory prototype, and need to iron  
out some more details on knuggets :)

One bold assertion I'm trying to stay true to is that a knugget is as
far removed from the original "document" as possible: no byte counts,
physical locations, or other inner attributes of the document.  The
grounding for this is the fundamental division of responsibility
between Factories and Collectors: Factories must know the content,
while Collectors just rank the pieces of knowledge.

A simple hierarchy of the original content *is* preserved though,
such as the relationship between a header of a section and its
contents, or the title of a table and its rows, but it's never
labeled "header" or "rows" and is only encoded as a parent/child
relationship between the resulting knuggets.

Another assertion is that all content analysis, tokenization in
particular, must be done by the Factory.  Collectors shouldn't each
have to implement this, and if possible I would like to say that a
Collector can even be language neutral (in general; some Collectors
may get better ranking via some localization).

The string of content that is the heart of each knugget is therefore
defined by the Factory as a series of tokens.  This series is
encoded in Fuz and has two restrictions on whitespace: no newlines
and no tabs.  The tab character is used simply as a delimiter between
tokens.  All other significant white space is preserved and treated
as individual tokens as well (extra white space should be normalized
to a single space).  This adds some verbosity, but it's worth it for
its simplicity, as this tokenized tab-delimited sequence becomes the
full original string by simply removing the tabs.

So the sentence:
	Roses are (usually) red and Violets are blue.
Becomes:
	Roses	 	are	 	(	usually	)	 	red	 	and	 	Violets	 	are	 	blue	.
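
As a rough illustration of the round trip, here is a minimal Python
sketch.  The token classes (a run of word characters, a single space,
or a single punctuation mark) and the function names are my own
assumptions for the example, not part of any spec, and the Fuz
encoding itself is not shown.

```python
import re

# Assumed token classes (not defined in the post): a run of word
# characters, one whitespace character, or one punctuation character.
TOKEN_RE = re.compile(r"\w+|\s|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping each single space as a token."""
    # Collapse any run of whitespace (tabs and newlines included) to
    # one space, so neither tabs nor newlines survive into the tokens.
    normalized = re.sub(r"\s+", " ", text.strip())
    return TOKEN_RE.findall(normalized)


def encode(tokens):
    """Join tokens with the tab delimiter."""
    return "\t".join(tokens)


def decode(encoded):
    """Recover the (normalized) string by removing the tabs."""
    return encoded.replace("\t", "")


sentence = "Roses are (usually) red and Violets are blue."
tokens = tokenize(sentence)
encoded = encode(tokens)
assert decode(encoded) == sentence
```

Because tabs can never appear inside a token, removing them is always
safe, which is what makes the decode step a one-liner.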

This is the heart of the knugget, and being so, it has an addressing
scheme: the first token is 0 and each subsequent token is numbered by
its position after that (whitespace tokens are counted too).  Other
parts of the knugget will refer to the tokens based on their address.

A knugget can (and should) contain much richer semantics than simply
the list of individual tokens, as deemed appropriate by the Factory
and the content (such as performing NLP, NER, or TC and including that
metadata for the relevant tokens).  The most common way of doing this
is for the knugget to include one or more meta-token (mt) definitions,
each of which is a single token or a grouping of tokens.  A meta-token
is defined simply by a list of addresses written with numbers, dashes,
and commas.  A dash denotes a contiguous range of addresses, and a
comma separates individual addresses or ranges.  These can be
combined, such that in the above example mt 12-16 is "Violets are
blue" and mt 0-3,8 is "Roses are red".
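
To make the address notation concrete, here is a small Python sketch
that expands a meta-token definition and resolves it against the
example tokens.  The function names parse_addresses and resolve are
hypothetical, and I'm assuming ranges are inclusive at both ends, as
the 12-16 example implies.

```python
def parse_addresses(spec):
    """Expand a spec such as "0-3,8" into [0, 1, 2, 3, 8]."""
    addresses = []
    for part in spec.split(","):
        if "-" in part:
            # A dash is an inclusive range of token addresses.
            start, end = part.split("-")
            addresses.extend(range(int(start), int(end) + 1))
        else:
            addresses.append(int(part))
    return addresses


def resolve(tokens, spec):
    """Concatenate the tokens a meta-token definition points at."""
    return "".join(tokens[i] for i in parse_addresses(spec))


# The tab-delimited example sentence, split back into its tokens.
encoded = ("Roses\t \tare\t \t(\tusually\t)\t \tred\t \tand\t \t"
           "Violets\t \tare\t \tblue\t.")
tokens = encoded.split("\t")

assert resolve(tokens, "12-16") == "Violets are blue"
assert resolve(tokens, "0-3,8") == "Roses are red"
```

Note that 0-3 picks up the space at address 3, which is why "red" at
address 8 ends up properly separated from "are".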

Any meta-token should be treated generically by a Collector as an
individual keyword and indexed as such for the Brokers.  A Broker can
then ask specifically for a compound phrase like "Apple Computer" and,
if the Factory had highlighted that as an mt, it would be matched
exactly.
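
One way a Collector might do this is sketched below in Python.
Everything here is an assumption made for illustration: the
KeywordIndex class, the knugget ids, the sample sentences, and the
choice of case-sensitive exact matching are not taken from any actual
Collector.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy inverted index: keyword -> set of knugget ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, knugget_id, tokens, meta_tokens):
        """Index one knugget.

        tokens is the list of token strings; meta_tokens is a list of
        address lists, one per meta-token definition.
        """
        # Plain tokens are keywords (whitespace tokens are skipped).
        for token in tokens:
            if not token.isspace():
                self._postings[token].add(knugget_id)
        # Each meta-token is indexed as one keyword, same as a token.
        for addresses in meta_tokens:
            phrase = "".join(tokens[i] for i in addresses)
            self._postings[phrase].add(knugget_id)

    def lookup(self, keyword):
        """Return the ids of knuggets indexed under this exact keyword."""
        return set(self._postings.get(keyword, set()))


index = KeywordIndex()
# k1: the Factory marked addresses 0-2 ("Apple Computer") as an mt.
index.add("k1", ["Apple", " ", "Computer", " ", "ships", " ",
                 "laptops", "."], [[0, 1, 2]])
# k2: no meta-tokens were defined by the Factory.
index.add("k2", ["An", " ", "apple", " ", "near", " ", "a", " ",
                 "computer", "."], [])

assert index.lookup("Apple Computer") == {"k1"}
```

The Collector never has to understand what "Apple Computer" means; it
only sees one more keyword, which keeps it language neutral.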

This structure may all sound a bit unusual for a search engine, but
I believe this approach is an important one for search as a platform.

Jer



