[Search-l] call to action....

peter burden peter.burden at gmail.com
Mon Jun 11 15:04:17 UTC 2007


Sami M wrote:
>
> Hi Folks,
>
> I've been following this list with interest. Internet search is my 
> passion and I'd been following it since 1998 starting with some 
> projects in grad school.
>
> I'd been working on a prototype of a large scale search engine the 
> last 2 years. This has been done mostly moonlighting nights and 
> weekends except for the last three months when I left my day job to 
> work on it fulltime. In summary… I've built a working implementation 
> of the original Google prototype according to their Stanford paper all 
> pushing approximately 60K lines of mostly C code. It is scalable on a 
> cluster of commodity linux boxes with some additional work. A single 
> server in this case can crawl, index, and serve 50M documents.
>

That's most interesting, but I'd echo the concerns about whether this
would violate anybody's IPR, software patents etc. A quick check reveals
50 Google patents.

See

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=47&f=G&l=50&co1=AND&d=ptxt&s1=Google.ASNM.&OS=AN/Google&RS=AN/Google

(comments about the length of the URL should be directed to the US 
Patent Office)

I haven't yet fully analysed these, and the diagrams seem to use an image
format my browser doesn't know how to handle. About 20% of the patents
look as if they may be relevant to a search engine. Some are clearly to
do with extra bells and whistles on page ranking; others are more
interesting. For example, 6,658,423 (December 2003) relates to detection
of near-identical duplicates and looks very similar to the techniques I
use for the same purpose in my crawler (Grr!)
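
For readers unfamiliar with near-duplicate detection, here is a minimal
sketch of one common textbook approach: word shingling compared by
Jaccard similarity. It is an illustration only. It is not the method
claimed in patent 6,658,423, nor the code used in the crawler mentioned
above; the function names, the 3-word shingle size and the 4096-word cap
are all assumptions made for the example.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

#define SHINGLE_WORDS 3      /* words per shingle (arbitrary choice) */
#define MAX_WORDS     4096   /* cap on words examined per document (assumption) */
#define FNV_OFFSET    14695981039346656037ULL
#define FNV_PRIME     1099511628211ULL

/* FNV-1a 64-bit hash of n bytes, folded to lower case. */
static uint64_t fnv1a(const char *s, size_t n)
{
    uint64_t h = FNV_OFFSET;
    for (size_t i = 0; i < n; i++) {
        h ^= (uint64_t)tolower((unsigned char)s[i]);
        h *= FNV_PRIME;
    }
    return h;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Write the sorted, de-duplicated shingle hashes of text into out[]
 * (room for MAX_WORDS entries) and return how many were written. */
static size_t shingles(const char *text, uint64_t *out)
{
    uint64_t words[MAX_WORDS];
    size_t nw = 0, n = 0, u = 0;
    const char *p = text;

    /* Split on non-alphanumeric characters and hash each word. */
    while (*p && nw < MAX_WORDS) {
        while (*p && !isalnum((unsigned char)*p)) p++;
        const char *start = p;
        while (isalnum((unsigned char)*p)) p++;
        if (p > start) words[nw++] = fnv1a(start, (size_t)(p - start));
    }
    /* Combine each run of SHINGLE_WORDS consecutive word hashes. */
    for (size_t i = 0; i + SHINGLE_WORDS <= nw; i++) {
        uint64_t h = FNV_OFFSET;
        for (size_t j = 0; j < SHINGLE_WORDS; j++)
            h = (h ^ words[i + j]) * FNV_PRIME;
        out[n++] = h;
    }
    qsort(out, n, sizeof out[0], cmp_u64);
    for (size_t i = 0; i < n; i++)
        if (u == 0 || out[i] != out[u - 1]) out[u++] = out[i];
    return u;
}

/* Jaccard similarity of the two shingle sets, in [0, 1].
 * Returns 0 when both texts are too short to yield any shingle. */
double jaccard(const char *a, const char *b)
{
    uint64_t sa[MAX_WORDS], sb[MAX_WORDS];
    size_t na = shingles(a, sa), nb = shingles(b, sb);
    size_t i = 0, j = 0, common = 0;

    /* Merge-style walk over the two sorted sets to count the overlap. */
    while (i < na && j < nb) {
        if (sa[i] == sb[j]) { common++; i++; j++; }
        else if (sa[i] < sb[j]) i++;
        else j++;
    }
    size_t uni = na + nb - common;
    return uni ? (double)common / (double)uni : 0.0;
}
```

A crawler could treat two pages as near-duplicates when jaccard() exceeds
some threshold, say 0.9; that figure is a tuning choice, not something
taken from this thread. Comparing every pair of pages does not scale, so
real systems usually reduce each page to a small fingerprint first.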

I'm not sure what other expressions of Google IPR we might transgress,
although they may have chosen not to patent some of their best ideas in
the interests of secrecy.

Incidentally, if we do build on Sami's software I can offer a crawler
that will do 50 pages/sec using two very modest domestic PCs (one
crawling/parsing and one saving metadata in a MySQL database). It's
written in C and is multi-threaded.
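
To put the 50 pages/sec figure next to the 50M documents quoted for a
single server, the arithmetic can be sketched as below. It assumes the
rate is sustained around the clock with no re-crawling, which a real
crawl would not quite manage; the function names are made up for the
example.

```c
#define SECONDS_PER_DAY 86400.0

/* Pages fetched per day at a sustained rate of pages_per_sec. */
double pages_per_day(double pages_per_sec)
{
    return pages_per_sec * SECONDS_PER_DAY;
}

/* Days needed to fetch total_pages at a sustained rate of pages_per_sec. */
double days_to_crawl(double total_pages, double pages_per_sec)
{
    return total_pages / pages_per_day(pages_per_sec);
}
```

Under those assumptions pages_per_day(50.0) is 4,320,000 pages, and
days_to_crawl(50e6, 50.0) comes to roughly 11.6 days.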

> It is turning out to be a big task and now I am looking out for 
> options on what direction to take next. This is a call to action. I am 
> open to any suggestions or feedback. If anyone is interested in 
> joining hands, collaborating, or investing in any sort of way I'd be 
> interested in talking about it. I am based in San Francisco bayarea.
>
I'm about 6,000 miles from San Francisco, so I can't really offer much
directly.

> Cheers..
>
> Sami
>
> sami2065 at gmail.com <mailto:sami2065 at gmail.com>
>
>
>     Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine
>     (http://infolab.stanford.edu/~backrub/google.html)
>