[atlas-l] back to knuggets
Peter Burden
peter.burden at gmail.com
Tue Sep 11 22:06:11 UTC 2007
On 11/09/2007, jer <jeremie at jabber.org> wrote:
>
> I've been toying with my simple Factory prototype, and need to iron
> out some more details on knuggets :)
>
> One bold assertion I'm trying to stay pure to is that the original
> "document" is as removed from a knugget as possible. No byte counts,
> physical locations, or other inner attributes of document. The
> grounding for this is the fundamental division of responsibility
> between Factories and Collectors: Factories must know the content,
> Collectors just rank the pieces of knowledge.
If I recall the original scheme, the collectors perform the ranking based
on information they receive from the factories, which perform the "spidering"
function. I can imagine many ranking schemes that would use some
of the "inner" or "physical" attributes of a document, such as the ratio of
text data to total markup, age, in- and out-link counts, word length /
readability, etc. To exclude this information from the collector is to
throw away a potentially significant amount of useful information. The
knowledge in a document is not just a function of the text; many other
aspects play a role.

I.e. collectors are ranking documents, not pieces of knowledge, since
ultimately users are searching for knowledge that is embedded in
documents and expect to be presented with documents. Of course,
if the system is intelligent enough to distill knowledge from documents,
that's a different story.
> A simple hierarchy of the original content *is* preserved though,
> such as the relationship between a header of a section and its
> contents, or the title of a table and its rows, but it's not ever
> labeled "header" or "rows" and is only encoded as a parent/child
> relationship between the resulting knuggets.
Is there an assumption that text higher in the hierarchy may be
more important? This is questionable especially given the way
some HTML authors and authoring tools deliberately misuse
structural elements to achieve presentational effects.
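To make the parent/child idea concrete, here is a minimal sketch of knuggets linked only by a parent reference, with no "header" or "rows" labels. The class name `Knugget`, its fields, and the `depth` helper are hypothetical; nothing in the thread specifies how Fuz actually encodes the relationship.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Knugget:
    """Hypothetical in-memory knugget: tokens plus an unlabeled parent link."""
    kid: int                      # assumed identifier, not part of any spec
    tokens: List[str]
    parent: Optional[int] = None  # None marks a top-level knugget


def depth(knuggets: Dict[int, Knugget], kid: int) -> int:
    """Count how many parent links separate a knugget from the top level."""
    levels = 0
    while knuggets[kid].parent is not None:
        kid = knuggets[kid].parent
        levels += 1
    return levels


# A section heading and its body text, related only as parent and child.
store = {
    1: Knugget(1, ["Flowers"]),
    2: Knugget(2, ["Roses", " ", "are", " ", "red", "."], parent=1),
}
```

A Collector could derive depth from such links, but whether depth should influence ranking is exactly the open question raised above; the sketch takes no position on it.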
> Another assertion is that all the content analysis must be done by
> the Factory, specifically, tokenization. Every collector shouldn't
> have to implement this, and if possible, I would like to say that a
> Collector could even be language neutral (in general, some Collectors
> may have better ranking via some localization).
OK. Factories will tokenise; that is sensible. Fuz seems a simple,
sensible "wire" protocol, although issues of internationalisation
need to be addressed properly (e.g. what charset is being used
for knuggets?).
> The string of content that is the heart of each knugget is therefore
> defined by the Factory as a series of tokens. This series is
> encoded in Fuz and has two restrictions on whitespace: no newlines
> and no tabs. The tab character is used simply as a
> delimiter between tokens. All other white space (important ones;
> extra white spaces should be normalized to just single ones) is
> preserved and treated as individual tokens as well. This adds some
> verbosity, but it's worth it for its simplicity, as this tokenized
> tab-delimited sequence will become the full original string by simply
> removing the tabs.
Not sure I fully understand this. I've always assumed tokenisation splits
the text into "words", which are pieces of text delimited by what is
traditionally called white space (spaces, tabs, newlines), along with
certain HTML markup elements (such as <p>) if it's HTML. What is an
"important" white space?

[There is an interesting point about punctuation embedded in words.
In the example below "(usually)" seems to be split into 3 tokens. This is
potentially useful, as the separate tokenisation of the full stop at the end
of a sentence prevents the construction of "false" phrases from the last
word of one sentence and the first word of the next. Would text elements
with internal punctuation, such as "motor-cycle" and "St.Paul", also be
split into 3 tokens?

HTML markup does not always indicate word boundaries; consider the
emboldening of the first character of "Certified" using the HTML fragment
"<b>C</b>ertified".]
> So the sentence:
> Roses are (usually) red and Violets are blue.
> Becomes:
> Roses are ( usually ) red and Violets are blue .
>
> This is the heart of the knugget, and being so, it has an addressing
> scheme, with the first token being 0 and each individual one just the
> count after that. Other parts of the knugget will refer to the
> tokens based on their address.
How does this numbering relate to the hierarchy of knuggets? I.e.
are these word numbers within the document or word number within
the knugget?
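A minimal sketch of the tokenisation as described in the quoted text may help pin down the numbering. It assumes that words, single spaces, and individual punctuation marks are each one token, and that addresses count from 0 within a single knugget; the regular expression and the function names `tokenize`, `encode`, and `decode` are illustrative choices, not anything defined by Fuz.

```python
import re

# Assumed rule: a run of word characters, one whitespace character,
# or one punctuation character each form a single token.
TOKEN_RE = re.compile(r"\w+|\s|[^\w\s]")


def tokenize(text: str) -> list:
    """Collapse whitespace runs (including newlines and tabs) to single spaces, then split."""
    normalized = re.sub(r"\s+", " ", text.strip())
    return TOKEN_RE.findall(normalized)


def encode(tokens: list) -> str:
    """Join tokens with tabs, the delimiter named in the quoted description."""
    return "\t".join(tokens)


def decode(line: str) -> str:
    """Removing the tabs yields the (whitespace-normalized) original string."""
    return line.replace("\t", "")


tokens = tokenize("Roses are (usually) red and Violets are blue.")
for address, token in enumerate(tokens):
    print(address, repr(token))
```

Under these assumptions the sentence yields 18 tokens (addresses 0 to 17), with "Violets" at address 12, and both "motor-cycle" and "St.Paul" split into 3 tokens. A different punctuation rule would give different addresses, which is why the rule needs to be specified.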
> A knugget can (should) contain much richer semantics than simply the
> list of individual tokens, as deemed appropriate by the Factory and
> the content (such as performing NLP, NER, TC, and including that
> meta-data for the relevant tokens). The most common way of doing this is
> for the knugget to include one or more meta-token (mt) definitions, each
> of which is any single token or grouping of tokens. These meta-tokens are
> defined simply by a list of addresses with numbers, dashes, and
> commas. The dash means a range, and the comma allows a list of
> specific addresses. These can be combined, such that the mt 12-16
> for the above example is "Violets are blue" and mt 0-3,8 is "Roses
> are red".
Unless I've seriously misunderstood what's going on, "Violets are blue" is
mt 7-9 and "Roses are red" is mt 0,1,5. Some more examples please!
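The two sets of numbers can be compared with a short sketch of the address notation. It assumes a dash is an inclusive range and a comma separates entries; the function names and the two token lists are illustrative. One list counts each space as a token, as the quoted description states; the other counts only words and punctuation, which is the reading behind 7-9 and 0,1,5.

```python
def parse_addresses(spec: str) -> list:
    """Expand a spec such as '0-3,8' into [0, 1, 2, 3, 8] (ranges assumed inclusive)."""
    addresses = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            addresses.extend(range(int(first), int(last) + 1))
        else:
            addresses.append(int(part))
    return addresses


# Reading 1: every single space is a token of its own.
TOKENS = ["Roses", " ", "are", " ", "(", "usually", ")", " ", "red", " ",
          "and", " ", "Violets", " ", "are", " ", "blue", "."]

# Reading 2: whitespace is not counted, only words and punctuation.
WORDS = [t for t in TOKENS if not t.isspace()]


def resolve(tokens: list, spec: str) -> str:
    """Concatenate the addressed tokens; spaces are already tokens here."""
    return "".join(tokens[i] for i in parse_addresses(spec))


def resolve_words(words: list, spec: str) -> str:
    """Join the addressed tokens with spaces, since none are stored."""
    return " ".join(words[i] for i in parse_addresses(spec))


print(resolve(TOKENS, "12-16"))       # Violets are blue
print(resolve(TOKENS, "0-3,8"))       # Roses are red
print(resolve_words(WORDS, "7-9"))    # Violets are blue
print(resolve_words(WORDS, "0,1,5"))  # Roses are red
```

Both numberings are self-consistent under their own assumption, so the disagreement appears to come down to whether whitespace tokens are counted.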
> Any meta-token should be treated generically by a Collector as an
> individual keyword and indexed as such for the Brokers. A Broker can
> then ask specifically for a compound phrase like "Apple Computer" and,
> if the Factory had highlighted that as a mt, it would be matched
> exactly.
So is a meta-token equivalent to a phrase? Are factories expected to
identify all phrases in a document? Some years ago, exploring the
capabilities of search, I came across what I called the "Spice Girls"
problem. This, as some will recall, was the name of an immensely popular
band. How would a factory recognise "Spice Girls" as a significant phrase?
I toyed with some ideas about a text parser that would spot such phrases in
distinctive contexts and automatically become aware of such phrases and
their special meanings, but didn't get too far with the idea.
How important is it that all factories tokenise in a consistent fashion?
Wild inconsistencies will cause big problems further down the line
unless there is some way in which collectors/brokers can be aware
of the tokenisation strategies used by individual factories. Of course they
could just examine the tokenised material but this is probably neither
easy nor efficient.
> This structure is all a bit unusual sounding for a search engine, but
> I believe this approach is an important one for search as a platform.
The knugget as a sequence of "words" is excellent, and the general structure,
especially in a distributed context, looks really nice, but some details
probably need to be tightened up. I used to teach comms, and one of the rules
of protocol design seems to be "if a protocol can be interpreted differently
by two implementers, then it will be interpreted differently".
> Jer