[Search-l] What Is Wikia and How Real Is It?

John McCormac jmcc at hackwatch.com
Tue Aug 7 17:08:17 UTC 2007


peter burden wrote:
> Well the sums aren't that difficult. Assuming 10,000,000,000 pages total,
> an average size of 10kBytes and an average "half-life" of 6 months you
> need about 50 Mbd bandwidth. The size and "half-life" figures were derived
> from researches I did some 3-4 years ago. The increasing number of Web 2.0
> sites infested with Ajax etc., make it harder but it still looks doable
> with a particularly well-endowed personal set-up.

It is a good starting point. However, the web is larger than 10B pages 
and pages are increasing in size. The average on the .eu survey I did in 
June was about 7KB but some of the larger pages were nearly 900KB in 
size. When you add in the ccTLDs, you could well be looking at a 
multiple of the 10B starting point.
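
As a rough check on the quoted sums, here is a minimal Python sketch. It 
assumes the six month "half-life" is treated as the interval in which 
every page is refetched once, and that 10 kBytes means 10,000 bytes. The 
function name and parameters are only illustrative.

```python
SECONDS_PER_DAY = 86_400

def crawl_bandwidth_mbps(pages, avg_page_bytes, refresh_days):
    """Average bandwidth in megabits/s needed to refetch every page
    once per refresh period (assumption: one full pass per period)."""
    total_bits = pages * avg_page_bytes * 8
    return total_bits / (refresh_days * SECONDS_PER_DAY) / 1e6

# Figures from the quoted message: 10B pages, 10 kBytes, six months.
quoted = crawl_bandwidth_mbps(10_000_000_000, 10_000, 182.5)
print(f"quoted estimate: {quoted:.1f} Mbit/s")

# Hypothetical variation: three times as many pages at the 7KB .eu average.
larger = crawl_bandwidth_mbps(30_000_000_000, 7_000, 182.5)
print(f"3x pages at 7KB: {larger:.1f} Mbit/s")
```

With the quoted figures this comes out at roughly 50 Mbit/s, which 
agrees with the estimate above. The second call only shows how the 
number scales; the 3x multiplier is a guess, not a measurement.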

A lot of the web is static so the six month average half-life would work 
out for those pages. The dynamic web is not so easy to deal with. It 
takes at least a year or so of continual spidering to detect the 
necessary patterns to identify sites that should be spidered more 
frequently.
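
One simple way to turn spidering history into a revisit schedule is 
sketched below in Python. This is not the method I use, just an 
illustration: it assumes a content hash is stored for each fetch of a 
site, and the 1 day and 180 day limits are arbitrary assumptions.

```python
def change_fraction(hashes):
    """Fraction of consecutive fetches whose content hash differed.
    Returns None when there is too little history to judge."""
    if len(hashes) < 2:
        return None
    changes = sum(1 for a, b in zip(hashes, hashes[1:]) if a != b)
    return changes / (len(hashes) - 1)

def revisit_days(hashes, min_days=1, max_days=180):
    """Interpolate linearly between max_days (never changed) and
    min_days (changed on every fetch)."""
    frac = change_fraction(hashes)
    if frac is None:
        return max_days  # no evidence yet, treat as static
    return max_days - frac * (max_days - min_days)

print(revisit_days(["a", "a", "a", "a"]))  # static site
print(revisit_days(["a", "b", "c", "d"]))  # changes on every fetch
```

The longer the history, the more reliable the fraction becomes, which 
is why a short spidering run cannot separate the static sites from the 
dynamic ones.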

> Is the "search index" here just a list of sites/URLs or all the parsed 
> content and

The search index (the way I am using it here) is effectively the target 
list but it tends to be interchangeable with the search database.

> metadata? Is the concern those numerous but small sites that aren't
> linked from anywhere else? What about persuading the domain registrars
> to let us do periodic DNS zone transfers? Then you get a list of all of
> the sites, of course

The gTLDs are easy. The ccTLDs do not permit zone transfers and do not 
make their zone files available. In the EU, it is a data privacy thing. 
The .us zone file is available though. If you tell them that the 
resultant data is to be open sourced, the ccTLD registries will probably 
just laugh.
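
For the gTLDs, building the target list from a zone file can be 
sketched roughly as follows in Python. The sketch assumes a plain 
master file layout with one record per line and the owner name in the 
first column. The function name, the default origin and the sample 
records are all made up for illustration.

```python
def domains_from_zone(lines, origin="com"):
    """Collect the names that have NS records (the delegated domains)."""
    domains = set()
    for raw in lines:
        line = raw.split(";", 1)[0]          # drop comments
        if not line.strip() or line[0].isspace():
            continue                         # blank line or inherited owner
        fields = line.split()
        owner = fields[0].lower()
        if owner.startswith("$") or owner == "@":
            continue                         # directives and the zone apex
        # Accept "<owner> [ttl] [class] NS <target>".
        middle = [f.upper() for f in fields[1:-1]]
        if len(fields) >= 3 and "NS" in middle:
            if owner.endswith("."):
                domains.add(owner[:-1])      # already fully qualified
            else:
                domains.add(f"{owner}.{origin}")
    return domains

sample = [
    "$ORIGIN COM.",
    "EXAMPLE NS NS1.EXAMPLE.COM.",
    "EXAMPLE NS NS2.EXAMPLE.COM.",
    "NS1.EXAMPLE A 192.0.2.1",
    "other.net. 172800 IN NS ns.other.net.",
]
print(sorted(domains_from_zone(sample)))
```

The set removes the duplicates that arise because each domain normally 
has two or more NS records.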

Detecting the unlinked web is a bit more complex but you can never get 
100% of it.

> you've still got to sort the wheat from the chaff. The current debates 
> on this

This is the part that has to be highly automated to be efficient. Human 
intervention should only be necessary to build the set of rules for 
parsing. However, the amount of programming required is considerable, 
and it is an ongoing task.

> list don't seem to suggest any real answer to this problem other than
> getting "the community" to look at each one. [Assuming the aforementioned
> 10,000,000,000 pages to be checked every 6 months and assuming it
> takes a human 30 seconds to check and that that human will be able to
> devote 2 hours a day to the task - that's a community of about 250,000
> people.]

Are there 250K people out there who have the background and knowledge to 
make those decisions? I think that such a process (as opposed to a 
proper search index quality assurance system) would be a waste of 
resources. Perhaps it might be better to use this theoretical community 
for some kind of classification work.
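
For reference, the community size in the quoted message can be 
rechecked with a few lines of Python. The sketch assumes a 182.5 day 
cycle and reviewers who work every day of it. The function name is 
only illustrative.

```python
def reviewers_needed(pages, seconds_per_page, hours_per_day, cycle_days):
    """People needed to check every page once per cycle."""
    seconds_per_reviewer = hours_per_day * 3600 * cycle_days
    return pages * seconds_per_page / seconds_per_reviewer

# Quoted figures: 10B pages, 30 seconds each, 2 hours a day, six months.
needed = reviewers_needed(10_000_000_000, 30, 2, 182.5)
print(f"{needed:,.0f} reviewers")
```

That gives a figure a little under 230,000, so the quoted 250,000 is 
the right order of magnitude and leaves some slack for days off.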

> What suggestions and ideas are there for telling the wheat from the chaff?

Each TLD/ccTLD requires its own set of rules. The complexity of those 
rules increases with the number of languages being used in the 
TLD/ccTLD. The most important step is learning the strings "coming soon" 
and "under construction" in all known languages. :)
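
A minimal Python sketch of that kind of rule is below. The phrase 
table is a tiny illustrative sample that I have assumed for the 
example. A real rule set would be built per TLD and would be far 
longer.

```python
import re

# Illustrative sample only: placeholder phrases keyed by language code.
PLACEHOLDER_PHRASES = {
    "en": ["coming soon", "under construction"],
    "de": ["im aufbau"],
    "fr": ["en construction"],
    "es": ["en construcción", "próximamente"],
}

def placeholder_language(text):
    """Return the language code of the first placeholder phrase found
    in the page text, or None if no phrase matches."""
    # Collapse whitespace and ignore case so line breaks and
    # capitalisation in the page do not hide a phrase.
    normalized = re.sub(r"\s+", " ", text).casefold()
    for lang, phrases in PLACEHOLDER_PHRASES.items():
        if any(phrase in normalized for phrase in phrases):
            return lang
    return None

print(placeholder_language("This site is UNDER\nCONSTRUCTION"))
print(placeholder_language("Welcome to our shop"))
```

Plain substring matching like this will produce false positives (a 
builder's site that mentions construction, for example), so it is only 
one signal among the rules for a TLD.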

Regards...jmcc
-- 
******************************************************
John McCormac  *  e-mail: jmcc at whoisireland.com
MC2            *  voice:  +353-51-873640
22 Viewmount   *  web:  http://www.whoisireland.com/
Waterford      *  blog: http://blog.whoisireland.com
Ireland        *  Irish Domain Stats & Market Research
******************************************************


