Lecture No. 31
Date :- 7 Nov. 2003
Scribe by Bharat Kumar Jain
WEB CACHING
-------------------------------------------------------------------------------------------------------------------------------------------------------
Web Caching:
Web caching is the caching of documents at proxies, so that frequently
used documents are served from the proxy's cache (main memory) rather
than being fetched from the server each time the document is requested.
The advantages of Web caching are:
1. It reduces network traffic.
2. It decreases response time.
3. It reduces the load on the server.
4. It reduces cost, if the client has to pay for the channel (link)
capacity used.
ICP Web caching Protocol:
ICP stands for Internet Cache Protocol. In ICP, whenever a client
requests a document, the proxy first checks its own cache for the
document. If the document is found, it is given to the client (this is
called a local cache hit). Whenever a local cache miss occurs, the proxy
multicasts the request to all other proxies in the network. Proxies that
have the document send it to the requesting proxy. This is called a
cache hit. If none of the proxies in the network has the document, the
request is sent to the server. This is called a cache miss.
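The lookup order described above can be sketched as follows. This is a
minimal illustration, not an implementation of the real ICP message
format: the names Proxy and icp_lookup are hypothetical, the server is
modelled as a plain dictionary, and the requesting proxy is assumed to
keep a copy of whatever it obtains.

```python
from typing import Dict, List, Tuple


class Proxy:
    """Toy proxy whose cache is a dictionary from URL to document."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache: Dict[str, str] = {}


def icp_lookup(url: str, local: Proxy, peers: List[Proxy],
               server: Dict[str, str]) -> Tuple[str, str]:
    """Return (document, outcome) following the order given in the notes."""
    # Step 1: look in the local cache first.
    if url in local.cache:
        return local.cache[url], "local cache hit"
    # Step 2: on a local miss, ask every other proxy.
    for peer in peers:
        if url in peer.cache:
            local.cache[url] = peer.cache[url]  # assumption: keep a copy
            return local.cache[url], "cache hit"
    # Step 3: no proxy has the document, so go to the server.
    local.cache[url] = server[url]  # assumption: keep a copy
    return local.cache[url], "cache miss"
```

With two proxies where only the peer holds "/x", the first lookup of
"/x" is a cache hit, a repeated lookup is a local cache hit, and a
lookup of a document held by no proxy is a cache miss.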
The advantages of the ICP protocol are:
1. It reduces the cache miss ratio.
2. It decreases response time when more local cache hits occur.
3. It reduces cost, if we need to pay for the channel between the
proxies and the server.
The disadvantages of the ICP protocol are:
1. A huge number of messages is transmitted between proxies whenever a
local cache miss occurs. Hence the protocol does not scale well as the
number of proxies grows.
2. The response time is very high whenever the document is not present
in any of the proxies, because the other proxies are queried before the
request goes to the server.
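To make the first disadvantage concrete, the message count can be
estimated with a few lines of arithmetic. This is only a rough sketch:
the function name icp_messages is hypothetical, and it assumes one
query to and one reply from each of the other proxies per local miss.

```python
def icp_messages(num_proxies: int, local_misses: int) -> int:
    """Estimate inter-proxy messages caused by local cache misses."""
    queries = num_proxies - 1  # one query to every other proxy
    replies = num_proxies - 1  # assumption: every queried proxy replies
    return local_misses * (queries + replies)
```

Under these assumptions the count grows linearly with the number of
proxies for every single local miss, which is the scalability problem
noted above.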
-------------------------------------------------------------------------------------------------------------------------------------------------
To overcome the above disadvantages, the following suggestions were
given in the class.
The first suggestion was to query only the neighbouring proxies rather
than all proxies. If a neighbouring proxy does not have the document,
it queries its own neighbours, and so on. Although this method reduces
network overhead, it increases response time.
The other suggestion, given by a student, was that when a queried proxy
does not have the requested document, instead of generating a NACK it
should fetch the document from the server and pass it to the requesting
proxy. The first disadvantage of this approach is that if the
requesting proxy wants the document again, it has to request it from
the other proxies again. The second disadvantage is that if there are n
proxies and none of them has the requested document, then n-1 requests
will be sent to the server, increasing the network traffic to the
server by a factor of n-1.
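Both disadvantages of the fetch-instead-of-NACK suggestion show up in
a small simulation. This is a sketch under stated assumptions: the
function name fetch_on_miss is hypothetical, caches are modelled as
sets of URLs, each queried proxy is assumed to keep the copy it
fetches, and the requesting proxy is assumed not to cache the result.

```python
from typing import List, Set


def fetch_on_miss(url: str, requester: int, caches: List[Set[str]]) -> int:
    """Return how many requests reach the server for one lookup."""
    server_requests = 0
    for i, cache in enumerate(caches):
        if i == requester:
            continue  # the requester already missed locally
        if url not in cache:
            # Instead of sending a NACK, the queried proxy goes to the server.
            server_requests += 1
            cache.add(url)  # assumption: the queried proxy keeps a copy
    return server_requests
```

With 5 empty caches, one lookup sends 4 (that is, n-1) requests to the
server, and the requester still has no local copy afterwards, so it
must ask the other proxies again next time.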
----------------------------------------------------------------------------------------------------------------------------------------------------
To overcome the above disadvantages, a new protocol called "Summary
Cache" has been proposed. Its main idea is that, instead of querying
all the proxies, a proxy queries only those proxies that are likely to
have the document. The question then arises: how can a proxy know which
other proxies may have a required document? This is done by maintaining
cache directory information (a summary) for each of the other proxies.
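The selection step can be sketched in a few lines. This is only an
illustration: the name candidate_proxies is hypothetical, and each
summary is modelled here as a plain set of URLs rather than the
compact representation discussed later in these notes.

```python
from typing import Dict, List, Set


def candidate_proxies(url: str, summaries: Dict[str, Set[str]]) -> List[str]:
    """Return the proxies whose summary suggests they hold the URL."""
    return sorted(name for name, summary in summaries.items()
                  if url in summary)
```

Only the returned proxies are queried. If the list is empty, the
request goes straight to the server without any inter-proxy messages.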
Before getting into the protocol, let us define two terms frequently
used in it.
False Hit: A request sent to another proxy does not result in a cache
hit. This may happen because the cache directory may not contain
accurate information.
False Miss: A request to another proxy would have resulted in a cache
hit, but the proxy did not send the request because there was no entry
in the cache directory.
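The two definitions amount to comparing what the summary says with what
the other proxy's cache actually contains. A minimal sketch, where the
function name classify and the labels "hit" and "miss" for the two
consistent cases are illustrative choices:

```python
def classify(in_summary: bool, in_peer_cache: bool) -> str:
    """Classify one lookup against another proxy's summary."""
    if in_summary and in_peer_cache:
        return "hit"         # summary was right, the request succeeds
    if in_summary and not in_peer_cache:
        return "false hit"   # a request was sent but found nothing
    if not in_summary and in_peer_cache:
        return "false miss"  # no request was sent, a hit was lost
    return "miss"            # summary was right, nothing to find
```

A false hit costs a wasted inter-proxy message, while a false miss
costs an unnecessary trip to the server.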
The two issues to be resolved for the above approach are:
1. When to update the summary information.
2. How to represent the summary.
The answer to the first question is that, instead of updating after
each change in the cache, a proxy updates the summary only when more
than X% of the cached documents have changed. Trace-driven simulation
has shown that X can take a value of 1%-10%.
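The threshold rule can be sketched as a small counter. The class name
SummaryUpdater is hypothetical, and measuring changes as a percentage
of the current number of cached documents is an assumption made for
this illustration.

```python
class SummaryUpdater:
    """Decide when enough cache changes have built up to send an update."""

    def __init__(self, threshold_percent: float) -> None:
        self.threshold_percent = threshold_percent  # the X in the notes
        self.changes = 0

    def record_change(self, cached_documents: int) -> bool:
        """Record one insertion or eviction; return True if an update is due."""
        self.changes += 1
        percent = 100.0 * self.changes / max(cached_documents, 1)
        if percent > self.threshold_percent:
            self.changes = 0  # the summary is sent, start counting again
            return True
        return False
```

With X = 1 and 1000 cached documents, the first 10 changes trigger
nothing and the 11th triggers an update.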
The two simple approaches for representing the summary are:
Exact Directory: The summary lists every document in the cache. The
disadvantage of this approach is that too much main memory is required
for storing the directories.
Server Name: The summary stores only the names of the servers whose
documents are cached. This may result in many false hits.
The answer to the second question is to use a Bloom filter. A Bloom
filter is a computationally efficient, hash-based probabilistic scheme
that can represent the set of URLs of cached documents with a small
memory requirement while answering queries with zero false negatives
and a small rate of false positives.
Bloom Filter:
A Bloom filter is essentially a data structure for efficient membership
queries. It represents a set A = {a1, a2, ..., an} of n elements so as
to support membership queries. The idea is to allocate a vector v of m
bits, initially all set to 0, and then choose k independent hash
functions h1, ..., hk, each with range {1, ..., m}. For each element a
of A, the bits at positions h1(a), ..., hk(a) in v are set to 1. Note
that a particular bit might be set to 1 multiple times.
Given a query for b, we check the bits at positions h1(b), h2(b), ...,
hk(b). If any one of them is 0, then b is certainly not in the set A.
If all the bits are set, we guess that b is in the set, although there
is a certain probability that this guess is wrong.
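The construction above can be sketched directly. This is a minimal
illustration rather than a tuned implementation: the class name
BloomFilter is hypothetical, the k hash functions are simulated by
prefixing the item with an index before hashing with MD5 (an assumption
made here for convenience), and positions are 0-based rather than
1-based.

```python
import hashlib
from typing import List


class BloomFilter:
    """Vector of m bits with k hash functions, as described in the notes."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all bits are 0

    def _positions(self, item: str) -> List[int]:
        # Assumption: h_i(item) is MD5 of "i:item" reduced modulo m.
        return [
            int.from_bytes(hashlib.md5(f"{i}:{item}".encode()).digest(), "big")
            % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1  # a bit may be set to 1 more than once

    def might_contain(self, item: str) -> bool:
        # Any 0 bit means "certainly absent"; all 1s means "probably present".
        return all(self.bits[p] for p in self._positions(item))
```

Every added item is always reported as present, so there are no false
negatives. An item that was never added is usually reported as absent,
but may occasionally be reported as present.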
False Positive: If we guess that b is present but the guess is wrong,
this is called a false positive. The probability of a false positive
has been found to be about 0.9% for 5 hash functions and m/n = 10,
which is very small.
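The 0.9% figure can be checked against the standard approximation for
the false positive probability of a Bloom filter, (1 - e^(-kn/m))^k.
The formula is not given in the lecture and is used here as a sanity
check. The function name false_positive_rate is hypothetical.

```python
import math


def false_positive_rate(bits_per_element: float, k: int) -> float:
    """Approximate false positive probability for m/n bits per element."""
    # After inserting n elements, a given bit is still 0 with probability
    # about e^(-k*n/m); a false positive needs all k probed bits to be 1.
    return (1.0 - math.exp(-k / bits_per_element)) ** k


p = false_positive_rate(10, 5)  # about 0.0094, i.e. roughly 0.9%
```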
Using Bloom Filter in Summary Representation:
Each proxy maintains a local Bloom filter to represent its own cached
documents. To reflect changes in the set A, an array of counters is
maintained alongside the bit array; each counter keeps track of how
many times the corresponding bit has been set to 1. All counts are
initially zero. When a key a is inserted or deleted, the counts
c(h1(a)), ..., c(hk(a)) are incremented or decremented respectively. A
bit is 1 exactly when its count is greater than zero.
A proxy builds a Bloom filter from the list of URLs of cached documents
and sends the bit array plus the specification of the hash functions to
the other proxies. When updating the summary, the proxy can either
specify which bits in the bit array are flipped or send the whole
array, whichever is smaller.
The number of bits used per document, based on the average number of
documents in the cache, is called the LOAD FACTOR. The average number
of documents is calculated by dividing the cache size by 8 KB. The
advantage of a Bloom filter is that it provides a tradeoff between the
memory requirement and the false positive ratio, simply by changing the
value of m/n.
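The counters and the flipped-bit update can be sketched as follows.
This is an illustration under stated assumptions: the names
CountingBloomFilter and flipped_bits are hypothetical, the hash
functions are simulated with indexed MD5, and counters are ordinary
Python integers rather than fixed-width fields.

```python
import hashlib
from typing import List


class CountingBloomFilter:
    """Bloom filter with one counter per bit so that keys can be deleted."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m  # all counts are initially zero

    def _positions(self, item: str) -> List[int]:
        # Assumption: h_i(item) is MD5 of "i:item" reduced modulo m.
        return [
            int.from_bytes(hashlib.md5(f"{i}:{item}".encode()).digest(), "big")
            % self.m
            for i in range(self.k)
        ]

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.counts[p] += 1

    def delete(self, item: str) -> None:
        # Only delete keys that were inserted earlier.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def bit_array(self) -> List[int]:
        # The summary sent to other proxies: a bit is 1 when its count > 0.
        return [1 if c > 0 else 0 for c in self.counts]

    def might_contain(self, item: str) -> bool:
        return all(self.counts[p] > 0 for p in self._positions(item))


def flipped_bits(old: List[int], new: List[int]) -> List[int]:
    """Positions whose bit differs between two versions of the summary."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
```

An update can then carry either the output of flipped_bits or the whole
bit array, whichever is smaller. Only the bit array is sent; the
counters stay local to the proxy that owns the cache.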
Example :
Assume that 100 proxies, each with 8 GB of cache, would like to
cooperate. Each proxy stores on average about 1M web pages. The Bloom
filter memory needed to represent 1M pages is 2 MB at load factor 16.
Each proxy needs about 200 MB to represent all the summaries, plus
another 1 MB to represent its own counters. The memory requirement for
the ICP protocol to represent 1M pages is 16 MB, so each proxy would
require 1600 MB to represent all the summaries.
Clearly ICP is not scalable.
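The summary sizes in this example follow from simple arithmetic, which
can be reproduced as below. The function names are hypothetical. The
8 KB average document size comes from the load factor definition above,
and the 16 bytes per page used in the comparison is inferred from the
16 MB per 1M pages figure, not stated in the lecture.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3


def bloom_summary_mb(cache_gb: int, avg_doc_kb: int, load_factor: int,
                     num_proxies: int):
    """Return (documents, MB per summary, MB for all summaries)."""
    documents = (cache_gb * GB) // (avg_doc_kb * KB)
    per_summary_mb = documents * load_factor / 8 / MB  # bits to bytes to MB
    return documents, per_summary_mb, per_summary_mb * num_proxies


def per_page_summary_mb(documents: int, bytes_per_page: int,
                        num_proxies: int):
    """Return (MB per summary, MB for all summaries) at a fixed cost per page."""
    per_summary_mb = documents * bytes_per_page / MB
    return per_summary_mb, per_summary_mb * num_proxies


bloom = bloom_summary_mb(8, 8, 16, 100)         # (1048576, 2.0, 200.0)
other = per_page_summary_mb(bloom[0], 16, 100)  # (16.0, 1600.0)
```

The example counts all 100 proxies when multiplying, which is the
convention followed here as well.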
Conclusion:
Thus it can be concluded that Summary Cache reduces the number of
inter-proxy messages, implying a lower bandwidth requirement, and that
its memory requirement is low compared to the currently used ICP
protocol, as can be seen in the above example. These two advantages
help in achieving scalability of the protocol, and they do not affect
the cache hit ratio.