<br><br><div><span class="gmail_quote">On 11/09/2007, <b class="gmail_sendername">jer</b> &lt;<a href="mailto:jeremie@jabber.org">jeremie@jabber.org</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
I've been toying with my simple Factory prototype, and need to iron<br>out some more details on knuggets :)<br><br>One bold assertion I'm trying to stay pure to is that the original<br>"document" is as removed from a knugget as possible. No byte counts,
<br>physical locations, or other inner attributes of the document. The<br>grounding for this is the fundamental division of responsibility<br>between Factories and Collectors: Factories must know the content,<br>Collectors just rank the pieces of knowledge.
</blockquote><div><br>If I recall the original scheme correctly, the collectors perform the ranking based<br>on information they receive from the factories, which perform the "spidering"<br>function. I can imagine many ranking schemes that would use some
<br>of the "inner" or "physical" attributes of a document, such as the ratio of<br>text data to total markup, age, in and out link counts, word length / readability,<br>and so on. To exclude this information from the collector is to throw away a
<br>potentially significant amount of useful information. The knowledge in<br>a document is not just a function of the text; many other aspects play<br>a role.<br><br>I.e. collectors are ranking documents, not pieces of knowledge, since,
<br>ultimately, users are searching for knowledge that is embedded in<br>documents and users expect to be presented with documents. Of course<br>if the system is intelligent enough to distill knowledge from documents<br>that's a different story.
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">A simple hierarchy of the original content *is* preserved though,<br>such as the relationship between a header of a section and its
<br>contents, or the title of a table and its rows, but it is never<br>labeled "header" or "rows" and is only encoded as a parent/child<br>relationship between the resulting knuggets.</blockquote>
<div><br>Is there an assumption that text higher in the hierarchy may be<br>more important? This is questionable, especially given the way<br>some HTML authors and authoring tools deliberately misuse<br>structural elements to achieve presentational effects.
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Another assertion is that all the content analysis must be done by<br>the Factory, specifically tokenization. No Collector should
<br>have to implement this, and if possible, I would like to say that a<br>Collector could even be language neutral (although some Collectors<br>may rank better with some localization).</blockquote><div><br>OK. Factories will tokenise; that is sensible. Fuz seems a simple,
<br>sensible "wire" protocol, although issues of internationalisation<br>need to be addressed properly (e.g. what charset is being used<br>for knuggets?)<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
The string of content that is the heart of each knugget is therefore<br>defined by the Factory as a series of tokens. This series is<br>encoded in Fuz and has two restrictions on whitespace: no newlines<br>and no tabs. The tab character is used simply as a
<br>delimiter between tokens. All other white space (important ones;<br>extra white spaces should be normalized to just single ones) is<br>preserved and treated as individual tokens as well. This adds some<br>verbosity, but it's worth it for its simplicity, as this tokenized
<br>tab-delimited sequence will become the full original string by simply<br>removing the tabs.</blockquote><div><br>Not sure I fully understand this. I've always assumed tokenisation splits the<br>text into "words" which are pieces of text delimited by what is traditionally
<br>called white space (spaces, tabs, newlines) along with certain HTML markup<br>elements (such as &lt;p&gt;) if it's HTML. What is an "important" white space?<br>[There is an interesting point about punctuation embedded in words.
<br>In the example below "(usually)" seems to be split into 3 tokens. This is<br>potentially useful as the separate tokenisation of the full stop at the end<br>of a sentence prevents the construction of "false" phrases from the last
<br>word of one sentence and the first word of the next. Would text elements with<br>internal punctuation such as "motor-cycle" and "St.Paul" also be split into 3 tokens?<br>HTML markup does not always indicate word boundaries; consider the
<br>emboldening of the first character of "Certified" using the HTML fragment<br>"&lt;b&gt;C&lt;/b&gt;ertified".]<br><br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
So the sentence:<br> Roses are (usually) red and Violets are blue.<br>Becomes:<br> Roses are ( usually ) red and Violets are blue .
<br><br>This is the heart of the knugget, and being so, it has an addressing<br>scheme, with the first token being 0 and each individual one just the<br>count after that. Other parts of the knugget will refer to the<br>tokens based on their address.
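<br><br>As a rough illustration of the encoding described above, here is a minimal Python sketch. It is not the Factory prototype's code: the token rules (a run of word characters, a single punctuation mark, or a single space) and the names tokenize, encode and decode are assumptions made for the example, since the exact rules are not spelled out here.

```python
import re

# Assumed token rules: a run of word characters, one punctuation mark,
# or one space. The actual Factory rules may well differ.
TOKEN_RE = re.compile(r"\w+|[^\w\s]| ")


def tokenize(text):
    """Split text into tokens, keeping every single space as a token."""
    # Collapse runs of white space (tabs and newlines included) into
    # one space, so the encoded form can never contain a tab or newline.
    normalised = re.sub(r"\s+", " ", text.strip())
    return TOKEN_RE.findall(normalised)


def encode(tokens):
    """Join tokens with tabs, the delimiter described above."""
    return "\t".join(tokens)


def decode(line):
    """Recover the original string by simply removing the tabs."""
    return line.replace("\t", "")


sentence = "Roses are (usually) red and Violets are blue."
tokens = tokenize(sentence)
line = encode(tokens)

# Addresses start at 0 and count upwards, one per token.
for address, token in enumerate(tokens):
    print(address, repr(token))

print(decode(line) == sentence)  # True
```

Under these assumed rules the sentence yields 18 tokens, with spaces counted, and removing the tabs gives back the normalised string unchanged.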
</blockquote><div><br>How does this numbering relate to the hierarchy of knuggets? I.e.<br>are these word numbers within the document or word numbers within<br>the knugget?<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
A knugget can (should) contain much richer semantics than simply the<br>list of individual tokens, as deemed appropriate by the Factory and<br>the content (such as performing NLP, NER, TC, and including that<br>metadata for the relevant tokens). The most common way of doing this is
<br>for the knugget to include one or more meta-token (mt) definitions, each<br>of which is any single token or group of tokens. These meta-tokens are<br>defined simply by a list of addresses with numbers, dashes, and<br>commas. The dash denotes a range, and the comma allows a list of
<br>specific addresses. These can be combined, such that the mt 12-16<br>for the above example is "Violets are blue" and mt 0-3,8 is "Roses<br>are red".</blockquote><div><br>Unless I've seriously misunderstood what's going on, "Violets are blue" is mt 7-9
<br>and "Roses are red" is mt 0,1,5. Some more examples please!<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Any meta-token should be treated generically by a Collector as an
<br>individual keyword and indexed as such for the Brokers. A Broker can<br>then ask specifically for a compound phrase like "Apple Computer" and,<br>if the Factory had highlighted that as an mt, it would be matched
<br>exactly.</blockquote><div><br>So is a meta-token equivalent to a phrase? Are factories expected to<br>identify all phrases in a document? Some years ago, while exploring the capabilities<br>of search, I came across what I called the "Spice Girls" problem. This, as
<br>some will recall, was the name of an immensely popular band. How would<br>a factory recognise "Spice Girls" as a significant phrase? I toyed with<br>some ideas about a text parser that would spot such phrases in distinctive
<br>contexts and automatically become aware of such phrases and their<br>special meanings but didn't get too far with the idea.<br><br>How important is it that all factories tokenise in a consistent fashion?<br>Wild inconsistencies will cause big problems further down the line
<br>unless there is some way in which collectors/brokers can be aware<br>of the tokenisation strategies used by individual factories. Of course they<br>could just examine the tokenised material but this is probably neither
<br>easy nor efficient.<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">This structure is all a bit unusual sounding for a search engine, but
<br>I believe this approach is an important one for search as a platform.</blockquote><div><br>The knugget as a sequence of "words" is excellent and the general structure,<br>especially in a distributed context, looks really nice, but some details
<br>probably need to be tightened up. I used to teach comms and one of the rules<br>of protocol design seems to be "if a protocol can be interpreted differently<br>by two implementers, then it will be interpreted differently".
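<br><br>The meta-token addresses discussed earlier look like a case in point. The Python sketch below is only an illustration: parse_addresses and resolve are hypothetical names, and both tokenisation rules are assumptions rather than anything specified for Fuz. It suggests that 12-16 and 0-3,8 fit a reading in which each single space is itself a token, while 7-9 and 0,1,5 fit a reading in which white space merely separates tokens.

```python
import re


def parse_addresses(spec):
    """Expand an address list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    addresses = []
    for part in spec.split(","):
        if "-" in part:
            # A dash gives an inclusive range of addresses.
            first, last = part.split("-")
            addresses.extend(range(int(first), int(last) + 1))
        else:
            addresses.append(int(part))
    return addresses


def resolve(spec, tokens, separator=""):
    """Return the text that a meta-token spec selects from tokens."""
    return separator.join(tokens[a] for a in parse_addresses(spec))


sentence = "Roses are (usually) red and Violets are blue."

# Reading A (assumed): each single space is a token in its own right.
with_spaces = re.findall(r"\w+|[^\w\s]| ", sentence)
# Reading B (assumed): white space only separates tokens.
without_spaces = re.findall(r"\w+|[^\w\s]", sentence)

print(resolve("12-16", with_spaces))         # Violets are blue
print(resolve("0-3,8", with_spaces))         # Roses are red
print(resolve("7-9", without_spaces, " "))   # Violets are blue
print(resolve("0,1,5", without_spaces, " ")) # Roses are red
```

If that is right, two careful readers have already arrived at different addresses for the same phrase, so the treatment of white space is one of the details worth pinning down.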
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Jer<br><br><br>_______________________________________________<br>Atlas-l mailing list
<br><a href="mailto:Atlas-l@wikia.com">Atlas-l@wikia.com</a><br><a href="http://lists.wikia.com/mailman/listinfo/atlas-l">http://lists.wikia.com/mailman/listinfo/atlas-l</a><br></blockquote></div><br>