[Search-l] Architecture for Self Scaling Search
Philip Haynes
phaynes at ozemail.com.au
Thu Aug 9 23:56:07 UTC 2007
Hi Gerard,
Whilst I have been working on this architecture style for some time, I
haven't published anything on it.
Instead I have been experimenting with implementations & testing aspects in
industrial usage.
At a high level the approach is as follows (I hope this makes sense):
1. Goal:
a. Treat a network of computers as a single logical network computer.
2. Design Requirements:
a. Each computing node must have the same capability as every other:
presentation, application server, comms, persistence, etc.
b. Reliability achieved through software on an unreliable network.
3. Approach:
a. Software Miniaturisation. The traditional software stack is
separated into discrete applications (e.g. Apache, App Server, MySQL). This
causes significant replication of code & software size. Integrating each of
these layers into a single executable sees a 1000x reduction in application
size, making it feasible for each node on the network to have the same
capability as every other one. Miniaturisation of the software stack also
sees significant performance improvements, since a full application request
is handled entirely in memory.
b. Asynchronous HTTP as the communications protocol. Each node is both
a client & server. An asynchronous design enables connection to 10,000+
other nodes on the network. It also means multi-cast communication behaviour
can be simulated. This has been used to replicate HTTP session state, for
example.
c. Persistence is achieved via transactional HTTP PUTs.
d. Application logic is exposed via REST APIs to enable execution of remote
objects.
e. Reliability is achieved through layering into discrete restartable
services, à la Recovery-Oriented Computing (ROC). Each execution
instance is monitored to ensure it is up, and is automatically micro-rebooted
if required.
f. Shared Spaces. A shared space is a logical execution environment. A
shared file space automatically distributes files across the network.
g. Code distribution. Achieved via shared file spaces; code is loaded
remotely and executed through a service.
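
As a rough illustration of point (b), the sketch below shows how multi-cast behaviour can be simulated by fanning one update out to every peer concurrently. It is not the actual implementation: `send_put` is a hypothetical stand-in for an asynchronous HTTP PUT, peers are plain dictionaries rather than network nodes, and the key and peer names are made up for the example.

```python
import asyncio


async def send_put(peer, key, value):
    """Hypothetical stand-in for an asynchronous HTTP PUT to one peer."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    if peer.get("down"):
        raise ConnectionError(f"{peer['name']} unreachable")
    peer["store"][key] = value
    return peer["name"]


async def simulated_multicast(peers, key, value):
    """Send the same update to all peers at once and report who received it."""
    results = await asyncio.gather(
        *(send_put(p, key, value) for p in peers),
        return_exceptions=True,  # one failed peer must not cancel the others
    )
    delivered = [r for r in results if not isinstance(r, Exception)]
    failed = [p["name"] for p, r in zip(peers, results) if isinstance(r, Exception)]
    return delivered, failed


def demo():
    # Three in-memory "nodes"; node-b is marked as unreachable.
    peers = [
        {"name": "node-a", "store": {}},
        {"name": "node-b", "store": {}, "down": True},
        {"name": "node-c", "store": {}},
    ]
    delivered, failed = asyncio.run(
        simulated_multicast(peers, "session:42", {"user": "u1"})
    )
    return peers, delivered, failed
```

The unreachable peer shows up in the `failed` list instead of aborting the whole replication, which fits requirement 2(b): the network is assumed unreliable and the software has to cope.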
So for Search my very preliminary thinking is:
1. Each node on the network contains a proxy server. All client HTTP
traffic goes through this. Data is cached via a pre-emptive page-loading
mechanism. This enables faster client behaviour and also provides a mechanism
to gather real-time usage statistics (privately, of course).
2. Logic for spidering, parsing, compression, indexing etc., as well as
usage monitoring, is distributed & executed on node computers.
3. Low-cost root computers running ZFS provide a mechanism to keep
index data. As of today it is possible to purchase a server for about US$13K
that stores 12TB of data in a reliable configuration. A relatively small
number of nodes (say 100) are distributed globally. My calculations show
about 900 servers would be required to serve the world's static HTTP
traffic, with search being a logical subset of, say, 10%.
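
The arithmetic behind point 3 can be written out as a quick back-of-envelope check. The input figures are the ones quoted above; the assumption that every root node and every serving machine is one such US$13K, 12TB server is mine, made only for illustration.

```python
# Figures quoted in the message above.
SERVER_COST_USD = 13_000       # approximate price of one server
SERVER_STORAGE_TB = 12         # reliable storage per server
ROOT_NODES = 100               # "say 100" globally distributed nodes
SERVERS_FOR_STATIC_HTTP = 900  # estimate for the world's static HTTP traffic
SEARCH_SHARE_PERCENT = 10      # search as "say 10%" of that traffic


def estimate():
    """Assumes one US$13K / 12TB server per node (an illustrative assumption)."""
    return {
        "root_storage_tb": ROOT_NODES * SERVER_STORAGE_TB,
        "root_cost_usd": ROOT_NODES * SERVER_COST_USD,
        "static_http_cost_usd": SERVERS_FOR_STATIC_HTTP * SERVER_COST_USD,
        "search_servers": SERVERS_FOR_STATIC_HTTP * SEARCH_SHARE_PERCENT // 100,
    }
```

Under those assumptions, 100 root nodes hold 1,200TB for about US$1.3M, 900 servers cost about US$11.7M, and the search share corresponds to roughly 90 servers.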
Phil
_____
From: Gérard Dupont [mailto:ger.dupont at gmail.com]
Sent: Friday, 10 August 2007 1:02 AM
To: search-l at wikia.com
Cc: phaynes at ozemail.com.au
Subject: Re: Search-l Digest, Vol 9, Issue 4
Hi,
Your self-scaling architecture appears promising, and your figure too!
Unfortunately, I can't find any paper on it, and your abstract does not
describe enough to see whether it can be applied to search. Do you have a
link?
G.Dupont
Hi,
I have only recently been lurking in this community. I hope my question
isn't out of school or already well understood.
I have been developing / researching architectures to support very large
scale reliable computing at low cost. It would appear search is a good
candidate application. The purpose of this note is to solicit interest in
collaborating with me to flesh out and prototype a specific architecture
for Search.
The architectural model I have been using is self-scaling (as per the figure
below). The goal is to be able to treat a network of computers as a logical
computing space. The implementation integrates HTTP servers & clients into a
single execution space using an asynchronous I/O engine. With this approach
the capability to serve 10,000+ concurrent HTTP connections and maintain a
multi-gigabyte throughput has been demonstrated on a single device. Combined
with technologies such as ZFS, this means terabytes of content can be cheaply
stored, searched and served. This enables a 100-1000x reduction in the cost to
serve a web request, while keeping a highly reliable configuration. The
capability to serve 1/7th of Australia's internet traffic with a single $2,500
PC was recently demonstrated.
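
To give a feel for why an asynchronous design can hold so many connections open at once, here is a minimal sketch using Python's asyncio. It does not reproduce the engine or the throughput figures described above: each "connection" is simulated with a short sleep in place of real network I/O, and the names `handle`, `serve_all` and `run` are made up for the example.

```python
import asyncio


async def handle(conn_id, state):
    """Simulated connection: mostly waits, so it costs almost no CPU."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for waiting on the network
    state["in_flight"] -= 1
    return conn_id


async def serve_all(n):
    """Run n simulated connections on one event loop, in one thread."""
    state = {"in_flight": 0, "peak": 0}
    done = await asyncio.gather(*(handle(i, state) for i in range(n)))
    return len(done), state["peak"]


def run(n=10_000):
    # Returns (connections completed, peak number open at the same time).
    return asyncio.run(serve_all(n))
```

Because every handler yields while it waits, all 10,000 simulated connections are open simultaneously on a single thread, with no thread or process per connection.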
My thinking for search was that if a proxy server & client were deployed on
each device, content could be pre-downloaded and served locally. Anyone
using the search service would have a much faster internet experience. This
content could also be processed and monitored in real time for actual usage.
The communities accessing sites would do most data aggregation, only finally
transferring source data to a much smaller number of root computers. Network
& computing costs for search are thus mostly offloaded to the edge of the
network. The result is a better service that improves the more people use it,
but at a decreasing or static cost.
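
The edge-aggregation idea can be sketched as follows. This is only an illustration: it assumes the aggregated data is a per-URL hit count, and the function names and sample URLs are hypothetical. The point is that each edge node reduces its own raw traffic and only a compact summary travels to a root computer.

```python
from collections import Counter


def aggregate_on_edge(requested_urls):
    """Runs on an edge node: reduce raw request URLs to per-URL hit counts."""
    return Counter(requested_urls)


def merge_at_root(summaries):
    """Runs on a root computer: combine the compact summaries from the edges."""
    total = Counter()
    for summary in summaries:
        total.update(summary)  # adds counts per URL
    return total


def demo():
    edge_a = aggregate_on_edge(["/a", "/b", "/a"])
    edge_b = aggregate_on_edge(["/a", "/c"])
    return merge_at_root([edge_a, edge_b])
```

The root only ever sees one small summary per edge node, so its load grows with the number of nodes and distinct URLs, not with the raw number of requests.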
If this is an approach worth pursuing, I would appreciate the feedback.
Regards,
Phil Haynes
Figure 1 Self Scaling computing
[Figure 1 attachment (image/jpeg, 15101 bytes): http://lists.wikia.com/pipermail/search-l/attachments/20070809/23461f66/attachment.jpe]