[Search-l] call to action....
peter burden
peter.burden at gmail.com
Mon Jun 11 15:04:17 UTC 2007
Sami M wrote:
>
> Hi Folks,
>
> I've been following this list with interest. Internet search is my
> passion, and I've been following it since 1998, starting with some
> projects in grad school.
>
> I've been working on a prototype of a large-scale search engine for
> the last two years. This has been done mostly by moonlighting on nights
> and weekends, except for the last three months, when I left my day job
> to work on it full time. In summary, I've built a working
> implementation of the original Google prototype according to their
> Stanford paper, amounting to approximately 60K lines of mostly C code.
> It is scalable on a cluster of commodity Linux boxes with some
> additional work. A single server in this case can crawl, index, and
> serve 50M documents.
>
That's most interesting, but I'd echo the concerns about whether this
would violate anybody's IPR, software patents, etc. A quick check
reveals 50 Google patents. See
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=47&f=G&l=50&co1=AND&d=ptxt&s1=Google.ASNM.&OS=AN/Google&RS=AN/Google
(comments about the length of the URL should be directed to the US
Patent Office).

I haven't yet fully analysed these, and the diagrams seem to use an
image format my browser doesn't know how to handle. About 20% of the
patents look as if they may be relevant to a search engine. Some are
clearly to do with extra bells and whistles on page ranking; others are
more interesting. For example, 6,658,423 (December 2003) relates to the
detection of near-identical duplicates and looks very similar to the
techniques I use for the same purpose in my crawler (Grr!).
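
To make "near-identical duplicates" concrete, here is a minimal sketch of
one well-known generic approach: word shingling with Jaccard similarity.
It is not the method claimed in patent 6,658,423 and not the code from my
crawler; the function names, the shingle size k=4 and the 0.8 threshold
are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=4):
    """Return the set of hashed k-word shingles for a document."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    # Documents shorter than k words yield one shingle covering all words.
    count = max(1, len(words) - k + 1)
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate(text_a, text_b, k=4, threshold=0.8):
    """Treat two documents as near-duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing every pair of pages this way does not scale to a large crawl;
a real system would need some way of sampling or indexing the shingles
so that candidate pairs can be found without an all-pairs comparison.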

I'm not sure what other expressions of Google IPR we might transgress,
although they may well have left some of their best ideas unpatented in
the interests of secrecy.

Incidentally, if we do build on Sami's software, I can offer a crawler
that will do 50 pages/sec using two very modest domestic PCs (one
crawling/parsing and one saving metadata in a MySQL database). It's
written in C and is multi-threaded.
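
For anyone wondering what that split looks like, here is a rough sketch
of the two-stage shape: several fetch/parse threads feeding a single
writer. It is an illustration only, not my C crawler. It is in Python
for brevity, both stages run in one process rather than on two PCs, the
fetch is a canned stub (fake_fetch is hypothetical, no network access),
and a plain dict stands in for the MySQL metadata store.

```python
import queue
import threading


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP fetch; returns canned HTML."""
    return "<html><title>Page for %s</title></html>" % url


def parse(url, html):
    """Extract minimal metadata (title and size) from a fetched page."""
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return {"url": url, "title": html[start:end], "bytes": len(html)}


def crawl(urls, workers=4):
    todo = queue.Queue()    # URLs waiting to be fetched
    parsed = queue.Queue()  # metadata records waiting to be stored
    store = {}              # stand-in for the metadata database

    def fetch_worker():
        # Stage 1: fetch and parse until a None sentinel arrives.
        while True:
            url = todo.get()
            if url is None:
                break
            parsed.put(parse(url, fake_fetch(url)))

    def writer():
        # Stage 2: a single thread owns all writes to the store.
        while True:
            record = parsed.get()
            if record is None:
                break
            store[record["url"]] = record

    for url in urls:
        todo.put(url)
    for _ in range(workers):
        todo.put(None)  # one sentinel per fetch worker

    fetchers = [threading.Thread(target=fetch_worker) for _ in range(workers)]
    writer_thread = threading.Thread(target=writer)
    for t in fetchers + [writer_thread]:
        t.start()
    for t in fetchers:
        t.join()
    parsed.put(None)  # all records are queued; tell the writer to stop
    writer_thread.join()
    return store
```

Keeping the writer as a separate stage means slow database inserts never
stall the fetchers directly; they only make the queue between the two
stages grow.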
> It is turning out to be a big task, and I am now looking at options
> for what direction to take next. This is a call to action. I am open
> to any suggestions or feedback. If anyone is interested in joining
> hands, collaborating, or investing in any way, I'd be interested in
> talking about it. I am based in the San Francisco Bay Area.
>
I'm about 6,000 miles from San Francisco, so I can't really offer much
directly.
> Cheers..
>
> Sami
>
> sami2065 at gmail.com
>
>
> Re: The Anatomy of a Large-Scale Hypertextual Web Search Engine
> (http://infolab.stanford.edu/~backrub/google.html)