[atlas-l] Collector Query (CQ), <strike>SQL</strike>
jer
jeremie at jabber.org
Sat Aug 25 06:23:10 UTC 2007
I'm a dork, I totally have to backpedal on suggesting SQL as a basis
for the CQ :)
Since I began very prelim prototyping of the various systems, I
quickly realized that SQL is sorely deficient for doing exactly the
type of queries that a Collector has to handle. My initial impetus
towards SQL was based on my desire to keep the understanding of a
Collector as a very simple table-like system, keywords, knuggets, and
ranking.
As I was warned and as is absolutely true, SQL is a mess. You can't
choose just a subset of it without aligning yourself with a
particular implementation, and that's just not what the CQ should be
like.  The goal of the CQ is to have the simplest
compatible-across-all format, and allow that to be extended based on either common
features or ones specific to a particular Collector. It needs to
evolve, and that evolution will be more than just new tables and
functions as you might have when boxed into SQL.
So I need to talk a little more about a Collector, and what its job
is, since that is what is driving how to query it.  A Collector's
only purpose is to *rank* knuggets by keyword. It is not a general
purpose system for data mining, it's not for rich semantic queries,
it doesn't do spelling corrections or even stemming, and it doesn't
do any general purpose query processing.  In fact, a Collector does
not by default even return unordered results; it simply produces a
ranked list of knuggets.
Given the above, a CQ is going to consist of only a few general
filters to reduce the ordered list result.  All of these filters are
by default inclusive (and) when specified, and optionally exclusive
(not):
by keyword
  - exact keyword only (no stemming)
  - case sensitive (yes, really)
by date
  - this is the age of the knugget
  - valid formats are YEAR-MM-DD / YEAR-MM / YEAR
by Factory
  - the domain name of the Factory
by URL
  - any length of prefix of the URL
  - reverse-host format: com.website.www/path/name
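To make the filter list above concrete, here is a rough Python sketch of how a Collector might apply these filters to an already-ranked list. Everything named here (Knugget, Filter, apply_filters, reverse_host_url) is made up for illustration and is not part of any spec. It also assumes that a shorter date such as YEAR-MM matches every knugget dated within that month, which the list above doesn't actually pin down, and that filter values have already been validated.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


def reverse_host_url(url: str) -> str:
    """Turn http://www.website.com/path/name into com.website.www/path/name."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


@dataclass
class Knugget:
    # Hypothetical record shape: only the fields the filters need.
    keywords: tuple   # exact, case-sensitive keywords
    date: str         # YEAR-MM-DD
    factory: str      # domain name of the Factory
    url: str          # reverse-host format


@dataclass
class Filter:
    field: str              # "keyword", "date", "factory" or "url"
    value: str
    exclude: bool = False   # False = inclusive (and), True = exclusive (not)

    def matches(self, k: Knugget) -> bool:
        if self.field == "keyword":
            hit = self.value in k.keywords          # exact match, no stemming
        elif self.field in ("date", "url"):
            # Assumption: dates and URLs are both matched as prefixes.
            hit = getattr(k, self.field).startswith(self.value)
        else:
            hit = getattr(k, self.field) == self.value
        return hit != self.exclude


def apply_filters(ranked, filters):
    """Keep ranking order; a knugget survives only if every filter agrees."""
    return [k for k in ranked if all(f.matches(k) for f in filters)]
```

The filters only ever remove entries from the ranked list, so the Collector's ordering is untouched no matter how many are combined.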
Some additional optional fields that aren't for filtering but
instructions for the result set:
skip
  - how many knuggets from the results to skip
limit
  - max number to send
UDP
  - optional for Brokers and Collectors that support it
  - IP:Port of a UDP target to send the results to instead of in-line
  - Collector should provide a list of source IP/ports out of band so
    a NAT'd Broker can punch holes first
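As a small illustration of skip and limit, the following Python sketch treats them as a plain window over the already-ranked (and already-filtered) list. The function name window is hypothetical, and the UDP delivery option is left out entirely.

```python
def window(ranked, skip=0, limit=None):
    """Return the part of an already-ranked list selected by skip and limit."""
    if skip < 0 or (limit is not None and limit < 0):
        raise ValueError("skip and limit must be non-negative")
    # limit=None means "send everything after the skipped knuggets".
    end = None if limit is None else skip + limit
    return ranked[skip:end]
```

A Broker paging through a large result could then ask for skip=0, limit=100, then skip=100, limit=100, and so on, until a short or empty page comes back.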
The most obvious missing things are OR and grouping. While more
advanced Collectors can support a more rich query, I'm not ready to
make those a base or core requirement.  In fact, I think the
*core* requirement is by keyword only, and the other filters above
would even be a secondary, higher level.  I would prefer to have a
Collector be very fast and very good at returning simple queries, in
bulk, and let the Brokers get smart about making multiple and/or
larger requests and performing their own advanced logic on the result
sets.
Also, I need to explain the lack of stemming and case sensitivity.
It's the job of a Broker to be the expert on the query, to understand
the semantics around the keywords, and to determine the possible
stemming and correct case possibilities. This is a function that I
believe needs to specifically be part of the Broker, and not inherent
in any way in the CQ.  The Broker should be the expert on keywords,
the Collector the expert on ranking a specific keyword.
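Since the Collector only does exact, case-sensitive matching, a Broker has to decide which spellings to ask for. A very naive Python sketch of the case half of that job is below; case_variants is a made-up helper, and a real Broker would presumably use much better knowledge of the keyword (and handle stemming too) than these three mechanical variants.

```python
def case_variants(keyword: str) -> list:
    """Candidate spellings a Broker might query, as typed first, no duplicates."""
    candidates = [keyword, keyword.lower(), keyword.upper(), keyword.capitalize()]
    unique = []
    for candidate in candidates:
        if candidate not in unique:
            unique.append(candidate)
    return unique
```

Each variant would be sent as its own keyword query, and the Broker would combine the result sets itself.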
What's this actually going to look like on the wire? Fuz is
important to me, so I'm probably going to push hard to use that for
the actual CQ encoding. Like everything else though, the wrapper is
HTTP based. My inclination is to use a simple POST with the CQ as
the body.  Any Collector can provide any URL to POST to, and apply
any kind of normal access restrictions to it as necessary; cookies
should be supported as well.  The HTTP redirects should be adhered to
as well, giving a Collector an easy and standard means of managing or
distributing requests from a Broker. Caching should also work like
normal, so that a Broker can cache result sets. Keep-alives are also
really important for busy Broker/Collector relationships.
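To sketch the HTTP side, here is roughly what a Broker's request could look like using only Python's standard library. The CQ body is treated as opaque bytes because the Fuz encoding isn't defined here. The function names and the Content-Type value are placeholders I made up, not anything agreed on.

```python
import http.cookiejar
import urllib.request


def build_cq_request(collector_url: str, cq_body: bytes) -> urllib.request.Request:
    """Wrap an already-encoded CQ in an HTTP POST to the Collector's URL."""
    return urllib.request.Request(
        collector_url,
        data=cq_body,
        method="POST",
        # Placeholder media type: no real type is defined for a CQ body.
        headers={"Content-Type": "application/octet-stream"},
    )


def make_opener() -> urllib.request.OpenerDirector:
    """Opener that keeps cookies between requests to the same Collector."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def send_cq(opener, request, timeout=10.0) -> bytes:
    """Send the CQ and return the raw response body."""
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

Two caveats on this sketch: urllib's default redirect handling re-issues a redirected POST as a GET without the body, and it opens a new connection per request, so a real Broker would need its own redirect handling and a client with persistent connections to get the redirect and keep-alive behavior described above.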
Part of the result set (also formatted as Fuz) is an optional "local"
ranking number for any knugget.  This number is 0 by default (if not
specified), and can be either positive (higher value) or negative.
While all results are ordered, the distribution of value is never
even, and this local ranking is assigned by the Collector to indicate
the distribution of value in the ordered results. A Collector can
use a global standard system or generate this local ranking for each
result, but a Broker *must* assume that the number only has meaning
within that result and can't be compared across result sets.
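A rough Python sketch of how a Broker might hold the local ranking follows. The names (RankedKnugget, parse_results, rank_gaps) and the use of integers are assumptions for illustration only; the actual Fuz result format isn't shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedKnugget:
    knugget_id: str       # hypothetical identifier
    local_rank: int = 0   # 0 when the Collector does not specify one


def parse_results(rows):
    """rows: ordered (knugget_id, local_rank or None) pairs from ONE result set."""
    return [RankedKnugget(kid, 0 if rank is None else rank) for kid, rank in rows]


def rank_gaps(results):
    """Value drop between neighbours, meaningful only within this result set."""
    return [a.local_rank - b.local_rank for a, b in zip(results, results[1:])]
```

The gaps show where the value falls off inside a single ordered result, which is all the number is good for; nothing here tries to compare ranks from two different result sets.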
Hmm... I think that's enough for now, sorry if this sounds kind of
rambling, this was supposed to be a short note two days ago :)
Jer