[Search-l] Grub + Wiki = Open Crawler

jer jeremie at jabber.org
Wed Aug 15 17:59:23 UTC 2007


I love all your suggestions, peter, thanks.  What I was originally
outlining is the *human*-contributed attributes (what value a human
can provide or override in the crawling process).  What you started
outlining is a meta-index of the output of a crawler, and I
absolutely agree that all those attributes and more should be available.

I want to use a wiki with grub.org as an input for the human side,  
and the output of the crawler will be an open DB with the kinds of  
discovered attributes you described.
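
As a rough sketch of the human side, here is one way a crawler might
consume the "Site" pages quoted below.  Nothing here is an agreed
format: the SitePage class, its field names and apply_hints() are all
made up for illustration, and only three of the hints (blocking,
throwaway query parameters, hostname override) are covered.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass
class SitePage:
    """Hypothetical in-memory form of one wiki "Site" page."""
    pattern: str                  # regex matching a single URL or a group of URLs
    blocked: bool = False         # human hint: do not crawl these URLs
    throwaway_params: set = field(default_factory=set)  # e.g. session identifiers
    canonical_host: str = ""      # hostname-alias override; empty means "keep as is"

    def matches(self, url: str) -> bool:
        return re.match(self.pattern, url) is not None


def apply_hints(url, pages):
    """Apply the first matching Site page; return None if the URL is blocked."""
    for page in pages:
        if not page.matches(url):
            continue
        if page.blocked:
            return None
        parts = urlsplit(url)
        # Drop query parameters that humans flagged as throwaway.
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in page.throwaway_params]
        host = page.canonical_host or parts.netloc
        return urlunsplit((parts.scheme, host, parts.path,
                           urlencode(query), parts.fragment))
    return url  # no Site page applies, so the automated crawler decides


# Example hints (example.org is a placeholder domain).
pages = [
    SitePage(r"https?://(www\.)?example\.org/forum/",
             throwaway_params={"sid"}, canonical_host="example.org"),
    SitePage(r"https?://example\.org/calendar/", blocked=True),
]
cleaned = apply_hints("http://www.example.org/forum/view?id=7&sid=abc123", pages)
```

First match wins here purely to keep the sketch short; how overlapping
Site pages should combine is an open question.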

On Aug 13, 2007, at 6:44 PM, peter burden wrote:

> jer wrote:
>>
>> I'm not talking about having a wiki page for every single URL or
>> every domain (that doesn't scale well), but the option to have
>> one _if_needed_ is important.  The wiki would host a special
>> "Site" page that anyone can create, and that uses a regex to
>> match a single URL or any grouping of URLs.  These Site pages can
>> host human-guided hints to a crawler about that set of URLs.  I
>> don't have a fixed and all-encompassing list of what these hints
>> are or what attributes are valuable, I only have a list of
>> suggestions based on my experience, and I expect this to change a
>> lot as we go :)
>>
>> Some quick ideas on what a wiki "Site" page would enable humans
>> to contribute:
>> 	- block domains, IPs, paths
>> 	- spider traps, accidental or intentional
>> 	- identify throwaway parts of URLs (like query strings with session identifiers)
>> 	- identical content that might be obfuscated for auto-detection
>> 	- override hostname aliasing to the right hostname
>> 	- suggest update frequencies or best times to check
>> 	- discoverability (are new urls trustworthy)
>> 	- url models (blog indexes vs. permalinks, mailing list and board styles)
>> 	- override titles (suggest where to get the titles in the content)
>> 	- influence-ability (open wikis and comment areas that get spammy)
>> 	- a robots and sitemap definition (if the domain is missing them)
>>
> Unless I'm more confused than usual, this sounds like what I'd call
> site metadata. Focussing on sites rather than pages is, IMHO, essential
> to a good, efficient spider. To the above list I'd add (since I use them
> in my spider) things such as DNS information (TTL, number of IPs) and
> server identification (remember that many sites can share the same
> server, and traffic control really ought to be applied on a per-server
> basis).
> I would hope very much that the vision includes an underlying database
> (or some similar mechanism) to store the metadata in a consistent,
> machine-accessible fashion. Without that it's all a bit of a waste of
> time, as it'll have to be done again when we do start using the data.
> So let's try and get it right first time; this means getting down to
> defining the semantics and syntax of the metadata.
>
> [Page metadata is, perhaps, easier, but again should be agreed.
> Here's my suggestion:
> URL
> Date last modified (if known)
> Date last accessed
> Date for next scheduled access
> HTTP status code (I use some extra "made-up" codes > 500 for things like duplicate and robot excluded)
> Character set
> Checksum (used for duplicate detection)
> Word count
> Size (total byte count)
> Total bytes in words ("payload")
> Transfer time
> Outlink count - to same site
> Outlink count - to other sites in same institution
> Outlink count - others
> [All above are for "regular" pages rather than images etc.]
> Inlink count - from same site
> Inlink count - from other sites in same institution
> Inlink count - others
> ----------- more speculative ------------
> Flesch-Kincaid readability score (or similar)
> Dewey decimal classification (or other similar scheme)
> Total bytes in included stuff (images and media)
> Outlink count to known advertising sites.
> Google Page Rank
>
> All this information would be useful grist for the ranking engine's mill.
> ]
>> I would eventually like to see it advance even further and
>> provide some mechanisms for providing input to understanding the
>> content as well, finding dates and common patterns in the URLs,
>> hints about what parts of the content are dynamic or static, etc.
>>
> I really like the idea of deriving and storing this sort of
> information: things such as the normal style used for dates, telephone
> numbers, postal codes, language/spelling (US or UK English etc.),
> typical page generator software, content management system used, etc.
> The implications of site-wide style sheets and scripts could also be
> handled. The only problem is that some (especially larger) sites are
> not necessarily consistent across all documents. Although it's terribly
> Web 1.0, I can think of University departments in the good old days
> (~1995) in which everybody created their own personal home page using
> whatever tools they felt like using, resulting in a glorious
> hodge-podge.
>
> Rather than a strict idea that a site is defined in terms of a DNS
> name (or possibly names), it may be better to take a more relaxed view
> that we work with a "group" of pages whose (full) URLs share a common
> initial string.
>> I don't expect this to be the *authority* on any of these points,
>> but instead be the human override, the guide, for when automated
>> systems fail or make mistakes.
>>
>> It will be treated like any other wiki, with full transparency.
>> Anyone can create a Site page and suggest all kinds of hints, and
>> all their actions are visible.  Even Grub will get tied directly
>> into this, so that the content a user crawls is verifiable by
>> others.
>>
>
>
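
For the crawler-output side, part of the page metadata list quoted
above could be sketched as a record like the one below.  This is only
an illustration under loud assumptions: PageRecord, build_record() and
the two status values are invented here (the list only says the made-up
codes are > 500), SHA-1 stands in for whatever checksum is actually
used, the date and link-count fields are left out, and the body is
treated as plain text with no markup stripping.

```python
import hashlib
import re
from dataclasses import dataclass

# Made-up values: the quoted list gives no specific numbers, only "> 500".
STATUS_DUPLICATE = 901
STATUS_ROBOT_EXCLUDED = 902


@dataclass
class PageRecord:
    """Hypothetical subset of the per-page metadata fields."""
    url: str
    status: int          # HTTP status, or one of the made-up codes above
    charset: str
    checksum: str        # used for duplicate detection
    word_count: int
    size_bytes: int      # total byte count
    payload_bytes: int   # total bytes in words


def build_record(url, status, body, charset, seen_checksums):
    """Build a record from a fetched body (bytes); flag exact duplicates."""
    text = body.decode(charset, errors="replace")
    words = re.findall(r"\w+", text)
    checksum = hashlib.sha1(body).hexdigest()
    if checksum in seen_checksums:
        status = STATUS_DUPLICATE  # same bytes already stored under another URL
    seen_checksums.add(checksum)
    payload = sum(len(w.encode(charset, errors="replace")) for w in words)
    return PageRecord(url=url, status=status, charset=charset,
                      checksum=checksum, word_count=len(words),
                      size_bytes=len(body), payload_bytes=payload)


seen = set()
first = build_record("http://example.org/a", 200, b"hello open crawler", "utf-8", seen)
second = build_record("http://example.org/b", 200, b"hello open crawler", "utf-8", seen)
```

A byte-exact checksum only catches identical copies; near-duplicates
would need something fuzzier, which is outside this sketch.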



