[atlas-l] back to knuggets
jer
jeremie at jabber.org
Tue Sep 11 18:29:31 UTC 2007
I've been toying with my simple Factory prototype, and need to iron
out some more details on knuggets :)
One bold assertion I'm trying to stay pure to is that the original
"document" is as removed from a knugget as possible. No byte counts,
physical locations, or other inner attributes of the document. The
grounding for this is the fundamental division of responsibility
between Factories and Collectors: Factories must know the content,
Collectors just rank the pieces of knowledge.
A simple hierarchy of the original content *is* preserved though,
such as the relationship between a header of a section and its
contents, or the title of a table and its rows, but it's never
labeled "header" or "rows" and is only encoded as a parent/child
relationship between the resulting knuggets.
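To make the parent/child idea concrete, here is a minimal sketch in Python. The Knugget class and its field names are hypothetical, made up only for illustration; the point is that the link between a table title and a row is carried purely by the parent/child relationship, with no "title" or "row" label anywhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical structure, not a defined knugget format.
# eq=False keeps comparisons identity-based, which avoids
# recursing through the parent <-> children cycle.
@dataclass(eq=False)
class Knugget:
    tokens: List[str]
    parent: Optional["Knugget"] = None
    children: List["Knugget"] = field(default_factory=list)

    def add_child(self, tokens: List[str]) -> "Knugget":
        """Create a child knugget and link it in both directions."""
        child = Knugget(tokens=tokens, parent=self)
        self.children.append(child)
        return child


# A table title and one of its rows: nothing says "title" or "row",
# only the relationship between the two knuggets is recorded.
title = Knugget(tokens=["Flower", " ", "colors"])
row = title.add_child(["Roses", " ", "are", " ", "red"])
```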
Another assertion is that all the content analysis must be done by
the Factory, specifically, tokenization. Every Collector shouldn't
have to implement this, and if possible, I would like to say that a
Collector could even be language neutral (in general, some Collectors
may have better ranking via some localization).
The string of content that is the heart of each knugget is therefore
defined by the Factory as a series of tokens. This series is
encoded in Fuz and has two restrictions on whitespace: no newlines
and no tabs. The tab character is used simply as a
delimiter between tokens. All other significant white space is
preserved and treated as individual tokens as well (extra white
space should be normalized to single spaces). This adds some
verbosity, but it's worth it for its simplicity, as this tokenized
tab-delimited sequence will become the full original string by simply
removing the tabs.
So the sentence:
Roses are (usually) red and Violets are blue.
Becomes:
Roses are ( usually ) red and Violets are blue .
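As a rough sketch of that round trip in Python: the splitting rule below (runs of word characters, single spaces, and single punctuation characters) is only an assumption for illustration, since the real tokenization is whatever the Factory decides. The function names are hypothetical too.

```python
import re


def tokenize(text):
    """Split text into word, single-space, and punctuation tokens.

    The regex rule is an illustrative assumption, not a defined
    Factory behavior.
    """
    # Collapse newlines, tabs, and runs of white space to one space,
    # so no token can contain a tab or a newline.
    normalized = re.sub(r"\s+", " ", text)
    return re.findall(r"\w+| |[^\w\s]", normalized)


def encode(tokens):
    """Join tokens with the tab delimiter."""
    return "\t".join(tokens)


def decode(encoded):
    """Recover the original string by simply removing the tabs."""
    return encoded.replace("\t", "")


sentence = "Roses are (usually) red and Violets are blue."
tokens = tokenize(sentence)
encoded = encode(tokens)
print(repr(encoded))
print(decode(encoded) == sentence)
```

Printing repr(encoded) makes the tab delimiters and the single-space tokens visible, which a plain rendering of the string tends to hide.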
This is the heart of the knugget, and being so, it has an addressing
scheme, with the first token being 0 and each following token
numbered in sequence after that (white space tokens count too).
Other parts of the knugget will refer to the tokens based on their
address.
A knugget can (should) contain much richer semantics than simply the
list of individual tokens, as deemed appropriate by the Factory and
the content (such as performing NLP, NER, TC, and including that
metadata for the relevant tokens). The most common way of doing this
is for the knugget to include one or more meta-token (mt) definitions,
each of which is any single token or group of tokens. These
meta-tokens are defined simply by a list of addresses using numbers,
dashes, and commas. A dash means a range, and a comma separates
individual addresses or ranges. These can be combined, such that the
mt 12-16 for the above example is "Violets are blue" and mt 0-3,8 is
"Roses are red".
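Resolving an mt definition back to text could look something like the following Python sketch. The function names parse_mt and mt_text are hypothetical, and the dash is assumed to be an inclusive range, which is what the two examples above imply.

```python
def parse_mt(spec):
    """Expand an mt definition such as "0-3,8" into token addresses.

    Assumes a dash is an inclusive range and a comma separates
    single addresses or ranges.
    """
    addresses = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            addresses.extend(range(int(start), int(end) + 1))
        else:
            addresses.append(int(part))
    return addresses


def mt_text(tokens, spec):
    """Concatenate the tokens an mt definition points at."""
    return "".join(tokens[i] for i in parse_mt(spec))


# The example sentence as a tab-delimited token string,
# with each space as its own token.
TOKENS = (
    "Roses\t \tare\t \t(\tusually\t)\t \tred\t \tand\t \t"
    "Violets\t \tare\t \tblue\t."
).split("\t")

print(mt_text(TOKENS, "12-16"))  # Violets are blue
print(mt_text(TOKENS, "0-3,8"))  # Roses are red
```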
Any meta-token should be treated generically by a Collector as an
individual keyword and indexed as such for the Brokers. A Broker can
then ask specifically for a compound phrase like "Apple Computer" and,
if the Factory had highlighted that as a mt, it would be matched
exactly.
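A toy version of that Collector-side index, in Python, might look like this. Everything here is an assumption for illustration: the knugget ids, the dict layout, the choice to also index plain non-white-space tokens, and the mts being stored as already-expanded address lists. Ranking is left out entirely.

```python
from collections import defaultdict


def build_index(knuggets):
    """Map each keyword to the set of knugget ids containing it.

    Meta-tokens are indexed exactly like single tokens, so the
    Collector never needs to know what a phrase means.
    """
    index = defaultdict(set)
    for kid, knugget in knuggets.items():
        tokens = knugget["tokens"]
        # Plain tokens, skipping the white space ones.
        for token in tokens:
            if not token.isspace():
                index[token].add(kid)
        # Each meta-token becomes one keyword as well.
        for addresses in knugget["mts"]:
            phrase = "".join(tokens[i] for i in addresses)
            index[phrase].add(kid)
    return index


# Hypothetical knuggets; only k1 has "Apple Computer" marked as an mt.
KNUGGETS = {
    "k1": {
        "tokens": "Apple\t \tComputer\t \tsells\t \tlaptops\t.".split("\t"),
        "mts": [[0, 1, 2]],
    },
    "k2": {
        "tokens": "An\t \tapple\t \ta\t \tday\t.".split("\t"),
        "mts": [],
    },
}

index = build_index(KNUGGETS)
print(index.get("Apple Computer"))  # {'k1'}
```

A Broker asking for "Apple Computer" gets an exact hit on k1 only, because the Factory marked that phrase; the Collector did no phrase detection of its own.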
This structure is all a bit unusual sounding for a search engine, but
I believe this approach is an important one for search as a platform.
Jer