[atlas-l] back to knuggets
jer
jeremie at jabber.org
Wed Oct 10 23:54:40 UTC 2007
A very late followup I missed, sorry! ...
> If I recall the original scheme the collectors perform the ranking
> based on information they receive from the factories which perform
> the "spidering" function.
Yep.
> I can imagine many ranking schemes that would use some
> of the "inner" or "physical" attributes of a document such as ratio of
> text data to total markup, age, in and out link counts, word
> length / readability, etc. To exclude this information from the
> collector is to throw away a potentially significant amount of useful
> information. The knowledge in a document is not just a function of
> the text; many other aspects play a role.
Not all of that is thrown away by any means: links, words, and basic
framing all still come through, but the Factory does rather deeply
normalize the original data and remove most of the markup. If
potentially rank-relevant intrinsic information is being discarded,
then I'd rather see the Factory normalize that as well; perhaps there
is a simple 'quality' attribute that could be derived.
Much of this will only really be worth discussing once there are some
prototypes and a partial stack to play with :)
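
As a rough illustration of the 'quality' idea, here is a minimal sketch
in which a Factory derives one number from the raw HTML before the
markup is stripped. Which signals would feed such an attribute is not
settled anywhere in this thread; the text-to-markup ratio (one of the
attributes mentioned in the question) and the name text_ratio are
assumptions for the example only.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts the characters that appear as text rather than markup."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0

    def handle_data(self, data):
        # Note: the contents of <script> and <style> also arrive here;
        # a real Factory would probably want to skip them.
        self.text_chars += len(data)


def text_ratio(html):
    """Hypothetical 'quality' signal: share of the document that is text."""
    if not html:
        return 0.0
    counter = _TextCounter()
    counter.feed(html)
    counter.close()  # flush any text still buffered by the parser
    return counter.text_chars / len(html)
```

A Factory could attach the result to the knugget so that a Collector
can rank on it without ever seeing the original markup.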
> Is there an assumption that text higher in the hierarchy may be
> more important? This is questionable especially given the way
> some HTML authors and authoring tools deliberately misuse
> structural elements to achieve presentational effects.
Nope, just that there is a relationship. This isn't a direct
translation of all markup structure either; it's up to the judgement
of the Factory to decide what is worth preserving, which might only be
header and paragraph tags, for instance. Smarter Factories will do a
better job at this, and that's healthy.
> OK. Factories will tokenise, that is sensible. Fuz seems a simple
> sensible "wire" protocol although issues of internationalisation
> need to be addressed properly (i.e. what charset is being used
> for knuggets?)
Fuz is just a light structure; the transport would normally define
the charset the Fuz data is in.
Since knuggets are initially HTTP based I'm not sure there's an
issue, other than that it would be nice to standardize on just a few
charsets that any implementation needs to support.
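
To make the "transport defines the charset" point concrete, here is a
minimal sketch of a receiver decoding a Fuz body fetched over HTTP. The
set of supported charsets, the utf-8 fallback, and the function names
are assumptions for illustration; nothing in this thread fixes them.

```python
from email.message import Message

# Assumption: a short list of charsets every implementation must handle.
SUPPORTED_CHARSETS = {"utf-8", "iso-8859-1", "us-ascii"}


def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset parameter out of an HTTP Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset lowercased, or the default when none is given.
    return msg.get_content_charset(default)


def decode_fuz_body(body, content_type):
    """Decode raw Fuz bytes using the charset the transport declared."""
    charset = charset_from_content_type(content_type)
    if charset not in SUPPORTED_CHARSETS:
        raise ValueError(f"unsupported charset: {charset}")
    return body.decode(charset)
```

Rejecting anything outside the short list keeps the burden on the
sender, which is one way of reading "just a few standard charsets".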
> Not sure I fully understand this. I've always assumed tokenisation
> splits the text into "words" which are pieces of text delimited by
> what is traditionally called white space (spaces, tabs, newlines)
> along with certain HTML mark up elements (such as <p>) if it's HTML.
> What is an "important" white space?
> [There is an interesting point about punctuation embedded in words.
> In the example below "(usually)" seems to be split into 3 tokens.
> This is potentially useful as the separate tokenisation of the full
> stop at the end of a sentence prevents the construction of "false"
> phrases from the last word of one sentence and the first word of the
> next. Would text elements with internal punctuation such as
> "motor-cycle" and "St.Paul" also be split into 3 tokens?
> HTML mark up does not always indicate word boundaries, consider the
> emboldening of the first character of "Certified" using the HTML
> fragment "<b>C</b>ertified"]
All of this I consider the domain of the Factory and really dependent
on however it wants to implement it. A Factory that does a better
job at intelligently tokenizing, both using the markup and reacting to
the inherent punctuation, will produce better knuggets. What Atlas
is standardizing as knuggets here isn't technique, it's just making a
common framing and naming scheme, and letting the quality be up to
the implementor.
> So the sentence:
> Roses are (usually) red and Violets are blue.
> Becomes:
> Roses are
> ( usually ) red and
> Violets are blue .
>
> This is the heart of the knugget, and being so, it has an addressing
> scheme, with the first token being 0 and each individual one just the
> count after that. Other parts of the knugget will refer to the
> tokens based on their address.
>
> How does this numbering relate to the hierarchy of knuggets? I.e.
> are these word numbers within the document or word number within
> the knugget?
Within the knugget. Starts at 0 for that knugget alone.
> A knugget can (should) contain much richer semantics than simply the
> list of individual tokens, as deemed appropriate by the Factory and
> the content (such as performing NLP, NER, TC, and including that
> meta-data for the relevant tokens). The most common way of doing this is
> the knugget can include one or more meta-token (mt) definitions, each
> of which is any single or grouping of tokens. These meta-tokens are
> defined simply by a list of addresses with numbers, dashes, and
> commas. The dash means a series, and comma allows a series of
> specific addresses. These can be combined, such that the mt 12-16
> for the above example is "Violets are blue" and mt 0-3,8 is "Roses
> are red".
>
> Unless I've seriously misunderstood what's going on "Violets are
> blue" is mt 7-9 and "Roses are red" is mt 0,1,5. Some more examples
> please!
The whitespace is still in there as tokens of its own, so by simply
removing the tab characters you get the original content back exactly.
So it's "Roses-TAB-SPACE-TAB-are-TAB-SPACE-TAB" and so on. Then 0-3 is
"Roses-SPACE-are".
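
The addressing can be checked with a small sketch. It assumes a
tokenizer that emits each run of word characters, each run of
whitespace, and each punctuation character as its own token, joins
tokens with tabs on the wire, and treats a dash range as inclusive at
both ends. The regex and the function names are illustrative only; a
real Factory is free to tokenize differently.

```python
import re

# Assumption: word runs, whitespace runs, and single punctuation
# characters are each one token. The input must not itself contain tabs.
_TOKEN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping whitespace as tokens."""
    return _TOKEN.findall(text)


def to_wire(tokens):
    """Join tokens with tabs; removing the tabs restores the text."""
    return "\t".join(tokens)


def resolve_mt(tokens, spec):
    """Resolve a meta-token spec such as '0-3,8' to its text."""
    parts = []
    for piece in spec.split(","):
        first, _, last = piece.partition("-")
        start = int(first)
        end = int(last) if last else start
        parts.extend(tokens[start:end + 1])  # dash ranges are inclusive
    return "".join(parts)


text = "Roses are (usually) red and Violets are blue."
tokens = tokenize(text)
```

With these assumptions resolve_mt(tokens, "12-16") gives "Violets are
blue" and resolve_mt(tokens, "0-3,8") gives "Roses are red", matching
the examples in the original description. Under the inclusive reading,
0-3 covers Roses, the space, are, and the space that follows it.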
> Any meta-token should be treated generically by a Collector as an
> individual keyword and indexed as such for the Brokers. A Broker can
> then ask specifically for a compound phrase like "Apple Computer" and
> when the Factory had highlighted that as a mt, it would be matched
> exactly.
>
> So is a meta-token equivalent to a phrase? Are factories expected to
> identify all phrases in a document?
Right, a phrase. This is optional for a Factory, and just a great way
to add value. It only has to identify as many phrases as it can, or
has the resources to.
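
A rough sketch of what "treated as an individual keyword" could look
like on the Collector side follows. The in-memory index, the
lowercasing of keys, and the function names are all assumptions for
illustration; how a Collector actually stores or matches keywords is
not specified here.

```python
from collections import defaultdict


def index_knugget(index, knugget_id, words, phrases):
    """Index plain word tokens and meta-token phrases the same way."""
    for key in list(words) + list(phrases):
        # Assumption: keywords are matched case-insensitively.
        index[key.lower()].add(knugget_id)


def lookup(index, keyword):
    """Return the ids of knuggets indexed under this exact keyword."""
    return index.get(keyword.lower(), set())


index = defaultdict(set)
index_knugget(index, "k1", ["Apple", "Computer", "ships"], ["Apple Computer"])
index_knugget(index, "k2", ["apple", "pie"], [])
```

Here lookup(index, "Apple Computer") matches only k1, because only the
Factory that produced k1 marked the phrase as a meta-token, while
lookup(index, "apple") matches both knuggets.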
> Some years ago exploring the capabilities of search I came across
> what I called the "Spice Girls" problem. This, as some will recall,
> was the name of an immensely popular band. How would a factory
> recognise "Spice Girls" as a significant phrase? I toyed with some
> ideas about a text parser that would spot such phrases in distinctive
> contexts and automatically become aware of such phrases and their
> special meanings but didn't get too far with the idea.
It's rather content-specific IMO. There are numerous techniques, but
ultimately this is a really big advantage for a Factory that knows a
certain domain of content and can be an expert on it.
> How important is it that all factories tokenise in a consistent
> fashion?
That's really up to the judgement of whoever operates a Collector,
and what their standards are for accepting content :)
> Wild inconsistencies will cause big problems further down the line
> unless there is some way in which collectors/brokers can be aware
> of the tokenisation strategies used by individual factories. Of
> course they could just examine the tokenised material but this is
> probably neither easy nor efficient.
It could be pretty involved, but of course starting out it'll likely
be entirely open source projects, so it will be a lot easier to sort
out these kinds of things.
> This structure is all a bit unusual sounding for a search engine, but
> I believe this approach an important one for search as a platform.
>
> The nugget as a sequence of "words" is excellent and the general
> structure, especially in a distributed context, looks really nice but
> some details probably need to be focussed. I used to teach comms and
> one of the rules of protocol design seems to be "if a protocol can be
> interpreted differently by two implementers, then it will be
> interpreted differently".
I consider that a feature! I'm hoping Atlas can be a model of a more
"organic" protocol, where the differences are expected and even
welcomed, and the act of connecting to other entities and competing
for obtaining/providing resources will result in a basic survival-of-
the-fittest.
Jer