[Search-l] Grub + Wiki = Open Crawler
peter burden
peter.burden at gmail.com
Mon Aug 13 23:44:13 UTC 2007
jer wrote:
>
> I'm not talking about having a wiki page for every single URL or
> every domain (that doesn't scale well), but the option to have one
> _if_needed_ is important. The wiki would host a special "Site" page
> that anyone can create, and that uses a regex to match single or any
> grouping of URLs. These Site pages can host human-guided hints to a
> crawler about that set of URLs. I don't have a fixed and all-
> encompassing list of what these hints are or what attributes are
> valuable, I only have a list of suggestions based on my experience,
> and I expect this to change a lot as we go :)
>
> Some quick ideas on what a wiki "Site" page would enable humans to
> contribute:
> - block domains, IPs, paths
> - spider traps, accidental or intentional
> - identify throwaway parts of URLs (like query strings with session
> identifiers)
> - identical content that might be obfuscated for auto-detection
> - override hostname aliasing to the right hostname
> - suggest update frequencies or best times to check
> - discoverability (are new urls trustworthy)
> - url models (blog indexes vs. permalinks, mailing list and board
> styles)
> - override titles (suggest where to get the titles in the content)
> - influence-ability (open wikis and comment areas that get spammy)
> - a robots and sitemap definition (if the domain is missing them)
>
Unless I'm more confused than usual, this sounds like what I'd call site
metadata. Focussing on sites rather than pages is, IMHO, essential to a
good, efficient spider. To the above list I'd add (since I use them in
my spider) things such as DNS information (TTL, number of IPs) and
server identification (remember that many sites can share the same
server, and traffic control really ought to be applied on a per-server
basis).
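
To make the per-server point concrete, here is a minimal sketch in Python. It assumes hostnames have already been resolved to IPs; the ServerThrottle class, the 5 second delay and the example addresses are all made up for illustration and are not taken from my spider.

```python
from dataclasses import dataclass, field


@dataclass
class ServerThrottle:
    """Books fetch slots per server (keyed by IP), not per hostname."""
    min_delay: float = 5.0  # seconds between requests to one server (assumed value)
    next_allowed: dict = field(default_factory=dict)

    def reserve(self, server_ip: str, now: float) -> float:
        """Return when a fetch to server_ip may start, and book that slot."""
        start = max(now, self.next_allowed.get(server_ip, now))
        self.next_allowed[server_ip] = start + self.min_delay
        return start


# Hypothetical resolver output: two virtual hosts on one server, one elsewhere.
host_to_ip = {
    "www.example.org": "192.0.2.10",
    "blog.example.org": "192.0.2.10",
    "www.example.net": "198.51.100.7",
}

throttle = ServerThrottle(min_delay=5.0)
schedule = [(host, throttle.reserve(host_to_ip[host], now=0.0))
            for host in host_to_ip]
```

The two hosts that share 192.0.2.10 end up queued behind each other, while the third host is free to start immediately. A throttle keyed on hostname would have hit the shared server twice at once.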

I would hope very much that the vision includes an underlying database
(or some similar mechanism) to store the metadata in a consistent,
machine-accessible fashion. Without that it's all a bit of a waste of
time, as it'll have to be done again when we do start using the data.
So let's try to get it right first time; this means getting down to
defining the semantics and syntax of the metadata.
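
As a strawman for what "machine accessible" might mean, here is a rough Python sketch of one site record built around the regex idea quoted above. The SiteHints name, its fields and the "sid" parameter are assumptions of mine for illustration only; nothing here is an agreed format.

```python
import re
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass(frozen=True)
class SiteHints:
    """One hypothetical site record: a URL regex plus human-supplied hints."""
    pattern: str                  # regex selecting the URLs this record covers
    blocked: bool = False         # True if the crawler should skip these URLs
    throwaway_params: tuple = ()  # query parameters to discard (e.g. session IDs)
    revisit_hours: int = 24       # suggested update frequency (assumed unit)

    def matches(self, url: str) -> bool:
        """True if the URL starts with a match for the record's regex."""
        return re.match(self.pattern, url) is not None

    def canonical(self, url: str) -> str:
        """Return the URL with the throwaway query parameters removed."""
        parts = urlsplit(url)
        kept = [(key, value)
                for key, value in parse_qsl(parts.query, keep_blank_values=True)
                if key not in self.throwaway_params]
        return urlunsplit(parts._replace(query=urlencode(kept)))


hints = SiteHints(pattern=r"https?://forum\.example\.org/",
                  throwaway_params=("sid",))
url = "http://forum.example.org/view?topic=42&sid=abc123"
clean = hints.canonical(url) if hints.matches(url) else url
```

Stored as rows in a database, records like this could be edited by humans through the wiki and read directly by the crawler, which is the point of agreeing on the syntax up front.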

[Page metadata is, perhaps, easier, but again should be agreed. Here's
my suggestion:
URL
Date last modified (if known)
Date last accessed
Date for next scheduled access
HTTP status code (I use some extra "made-up" codes > 500 for things like duplicate and robot excluded)
Character set
Checksum (used for duplicate detection)
Word count
Size (total byte count)
Total bytes in words ("payload")
Transfer time
Outlink count - to same site
Outlink count - to other sites in same institution
Outlink count - others
[All of the above are for "regular" pages rather than images etc.]
Inlink count - from same site
Inlink count - from other sites in same institution
Inlink count - others
----------- more speculative ------------
Flesch-Kincaid readability score (or similar)
Dewey decimal classification (or other similar scheme)
Total bytes in included stuff (images and media)
Outlink count - to known advertising sites
Google PageRank
All this information would be useful grist for the ranking engine's mill.
]
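
For discussion, the core (non-speculative) fields above could be written down as a record type along these lines. This is only a sketch in Python: the field names, the types and the two status values (901 and 902) are placeholders I have invented here, chosen to sit clear of the real HTTP range.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical "made-up" status codes; the actual values are an assumption.
STATUS_DUPLICATE = 901
STATUS_ROBOT_EXCLUDED = 902


@dataclass
class PageRecord:
    url: str
    last_accessed: datetime
    next_access: datetime
    status: int                                # HTTP status or a made-up code
    last_modified: Optional[datetime] = None   # only if the server reported it
    charset: Optional[str] = None
    checksum: Optional[str] = None             # used for duplicate detection
    word_count: int = 0
    size_bytes: int = 0                        # total byte count
    payload_bytes: int = 0                     # total bytes in words
    transfer_seconds: float = 0.0
    outlinks_same_site: int = 0
    outlinks_same_institution: int = 0
    outlinks_other: int = 0
    inlinks_same_site: int = 0
    inlinks_same_institution: int = 0
    inlinks_other: int = 0

    def payload_ratio(self) -> float:
        """Fraction of the page's bytes that are actual words."""
        return self.payload_bytes / self.size_bytes if self.size_bytes else 0.0

    def has_made_up_status(self) -> bool:
        """True if the status is one of the crawler's own codes, not HTTP's."""
        return self.status in (STATUS_DUPLICATE, STATUS_ROBOT_EXCLUDED)


page = PageRecord(
    url="http://www.example.org/",
    last_accessed=datetime(2007, 8, 13),
    next_access=datetime(2007, 8, 20),
    status=STATUS_DUPLICATE,
    size_bytes=2000,
    payload_bytes=500,
)
```

A derived value such as payload_ratio() is the sort of thing a ranking engine could compute from the stored fields rather than needing its own column.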
> I would eventually like to see it advance even further and provide
> some mechanisms for providing input to understanding the content as
> well, finding dates and common patterns in the URLs, hints about what
> parts of the content are dynamic or static, etc.
>
I really like the idea of deriving and storing this sort of information:
things such as the normal style used for dates, telephone numbers,
postal codes, language/spelling (US or UK English etc.), typical page
generator software, content management system used, etc.

The implications of site-wide style sheets and scripts could also be
handled. The only problem is that some (especially larger) sites are not
necessarily consistent across all documents. Although it's terribly
Web 1.0, I can think of University departments in the good old days
(~1995) in which everybody created their own personal home page using
whatever tools they felt like using, resulting in a glorious
hodge-podge.

Rather than a strict idea that a site is defined in terms of a DNS name
(or possibly names), it may be better to take a more relaxed view that
we work with a "group" of pages whose (full) URLs share a common
initial string.
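
One way this relaxed grouping might work is a longest-prefix lookup, sketched below in Python. The find_group helper and the example prefixes are hypothetical, and a real crawler would want something faster than a linear scan (a trie, say).

```python
from typing import Optional, Sequence


def find_group(url: str, prefixes: Sequence[str]) -> Optional[str]:
    """Return the longest known prefix that the URL starts with, if any."""
    candidates = [prefix for prefix in prefixes if url.startswith(prefix)]
    return max(candidates, key=len, default=None)


# Hypothetical groups: a whole department site plus one personal home page
# that was built with different tools and so needs its own metadata.
groups = [
    "http://www.example.ac.uk/",
    "http://www.example.ac.uk/dept/cs/",
    "http://www.example.ac.uk/~alice/",
]

alice_group = find_group("http://www.example.ac.uk/~alice/cv.html", groups)
other_group = find_group("http://www.example.ac.uk/news/today.html", groups)
```

Here the personal page falls into its own group while everything else under the same DNS name falls back to the shorter prefix, which would cope with the hodge-podge case.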
> I don't expect this to be the *authority* on any of these points, but
> instead be the human override, the guide, for when automated systems
> fail or make mistakes.
>
> It will be treated like any other wiki, with full transparency.
> Anyone can create a Site page and suggest all kinds of hints, and all
> their actions are visible. Even Grub will get tied directly into
> this, so that the content a user crawls is verifiable by others.
>