SCRIBE NOTES FOR LECTURE 31:
TOPIC : WEB CACHE SHARING
WEB CACHING
DEFINITION :
The storage of Web files for later re-use at a point more quickly accessed
by the end user.
Caching can happen at many places, including proxies (i.e. the user's
ISP) and the user's local machine. The objective is to make efficient use
of resources and speed the delivery of content to the end user.
The purposes of web caching are :
- Reduce network bandwidth consumption -- If the page is
available in the web cache itself, there is no need to go to the web-server.
- Reduce server load -- Since the number of requests the server must serve decreases.
- Reduce client latency -- Since the required page is available locally.
CACHE SHARING :
The sharing of caches among Web proxies. It reduces Web traffic and alleviates network bottlenecks. There are two protocols for cache sharing :
1. ICP - Internet Cache Protocol.
2. Summary-Cache.
CACHE SHARING VIA ICP :
- When one proxy has a cache miss, it sends queries to all siblings
(and parents), i.e. multicasting, "Do you have the URL ?".
- If some proxy responds with "YES", the proxy sends a request
to that proxy to fetch the file.
- If no proxy responds with "YES" within a certain time limit, the proxy
sends a request to the web server.
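The steps above can be sketched as follows. This is an illustrative simulation of the ICP query flow, not the real ICP wire protocol; the Proxy class, the query/fetch method names, and the TIMEOUT value are all assumptions for this example.

```python
# Hypothetical sketch of ICP-style cache sharing from one proxy's viewpoint.
import time

TIMEOUT = 2.0  # seconds to wait for a "YES" before going to the origin server

class Proxy:
    def __init__(self, name, cache=None):
        self.name = name
        self.cache = cache or {}   # url -> document
        self.siblings = []         # cooperating proxies

    def query(self, url):
        """Answer an ICP query: 'Do you have the URL?'"""
        return url in self.cache

    def fetch(self, url):
        # 1. Local cache hit?
        if url in self.cache:
            return self.cache[url]
        # 2. Cache miss: ask all siblings "Do you have the URL?".
        deadline = time.time() + TIMEOUT
        for sib in self.siblings:
            if time.time() > deadline:
                break                    # time limit exceeded
            if sib.query(url):           # sibling answered "YES"
                doc = sib.cache[url]     # fetch the file from that sibling
                self.cache[url] = doc
                return doc
        # 3. No sibling answered "YES" in time: go to the web server.
        doc = f"<contents of {url} from origin server>"
        self.cache[url] = doc
        return doc
```

Note that every miss costs one query message per sibling, which is exactly the overhead discussed next.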
OVERHEAD OF ICP :
- It is not a scalable protocol. Simulation results show
that ICP incurs considerable overhead even when the number of cooperating
proxies is as low as 4, because a large number of query messages
must be exchanged.
- As the number of proxies increases, the overhead quickly becomes
prohibitive.
Compared with no cache sharing, ICP
- Increases network packets to each proxy by 8-29%.
- Increases CPU overhead by 13-32%.
- Increases user latency by 2-12%.
SUMMARY CACHE :
The compressed directories consisting of the URLs of cached
documents are called "summaries". In the summary cache, each proxy
stores a summary of the URLs of documents cached at every other proxy. When a
user requests a document, the URL of that document is first looked up
in the local cache. If a cache miss occurs in the local cache, the proxy checks
the stored summaries to see if the requested document is present at another
proxy. If it is, the proxy sends requests to the relevant
proxies to fetch the document. If it is not present at any other proxy either,
the proxy sends the request directly to the Web server.
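The lookup order just described can be sketched as below. The summaries here are plain sets of URLs for clarity (the real system uses compressed representations, discussed later); the class and method names are assumptions for this example.

```python
# Illustrative sketch of the summary-cache lookup order.
class SummaryProxy:
    def __init__(self, name):
        self.name = name
        self.cache = {}       # url -> document
        self.summaries = {}   # peer name -> set of URLs believed cached there
        self.peers = {}       # peer name -> SummaryProxy

    def add_peer(self, peer):
        self.peers[peer.name] = peer
        self.summaries[peer.name] = set(peer.cache)  # snapshot of peer's URLs

    def fetch(self, url):
        # 1. First, look up the URL in the local cache.
        if url in self.cache:
            return self.cache[url]
        # 2. On a local miss, check the stored summaries of the other proxies.
        for peer_name, summary in self.summaries.items():
            if url in summary:
                peer = self.peers[peer_name]
                if url in peer.cache:        # summary was right: remote hit
                    self.cache[url] = peer.cache[url]
                    return self.cache[url]
                # else: a "false hit" -- the query message was wasted
        # 3. Not present at any other proxy: go directly to the Web server.
        self.cache[url] = f"<{url} from origin server>"
        return self.cache[url]
```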
Overheads :
1. False Misses : The requested document is cached at some other proxy,
but its summary does not reflect that fact. In this case, a remote cache hit
is lost, and the total hit ratio within the collection of caches is reduced.
2. False Hits : The requested document is not cached at some other
proxy, but its summary indicates that it is. The proxy will send a
query message to the other proxy, only to be notified that the document is
not cached there. In this case a query message is wasted.
3. Stale Hits : The document is stored at some other proxy, but that copy
is stale. The effect is again wasted query messages.
Two issues to resolve :
1. When to do summary updates ?
2. How to summarize ?
If we update the summary database whenever there is a change, the network
overhead increases.
There are two alternative approaches :
i. Periodic summary updates.
ii. Delay summary updates until X% of cached documents are 'new'.
For the second option, trace-driven simulations indicate
that a delay threshold of 1-10% works well in practice. This translates to an
update frequency of about once every 5 minutes.
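The delayed-update policy can be sketched as follows. The 5% threshold is one choice within the 1-10% range that the trace-driven simulations found to work well; the class and counter names are assumptions for this example.

```python
# Minimal sketch of "delay summary updates until X% of cached documents
# are new" (threshold below is an assumed 5%).
class DelayedSummary:
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.cached_docs = 0   # documents covered by the last summary sent
        self.new_docs = 0      # documents cached since the last update

    def document_cached(self):
        self.new_docs += 1
        if self.cached_docs and self.new_docs / self.cached_docs >= self.threshold:
            self.send_update()

    def send_update(self):
        # Broadcast a fresh summary to the other proxies (elided here),
        # then fold the new documents into the covered total.
        self.cached_docs += self.new_docs
        self.new_docs = 0
```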
The second issue to be resolved is how to summarize.
For performance reasons, the summaries are stored in main memory rather than
on hard disk. The memory requirement is determined by the frequency
of summary updates and by the number of cooperating proxies. Since the memory
grows linearly with the number of proxies, it is important to keep the individual
summaries small.
First consider the two summary representations :
i. Exact-directory
ii. Server-name.
In the exact-directory approach, the summary is essentially
the list of URLs, each represented by its 16-byte MD5 signature.
In the server-name approach, the summary is the collection
of Web server names in the URLs of cached documents. Since, on average, the
ratio of distinct URLs to distinct Web server names is about 10 to
1, the server-name approach can cut down the memory requirement by a factor
of 10.
Neither of the above two approaches is good. The
exact-directory approach consumes too much memory.
Consider the following example.
Let the proxy cache size be 8GB.
Let the average file size be 8KB.
Let there be 16 proxies.
The exact-directory approach would then consume (16-1) * 16 bytes * (8GB/8KB) = 240MB of
main memory per proxy.
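Checking the arithmetic: each proxy stores 15 peer summaries, each holding one 16-byte MD5 signature per cached document, and an 8GB cache of 8KB files holds about a million documents.

```python
# Verifying the exact-directory memory figure from the example above.
GB = 2 ** 30
KB = 2 ** 10
MB = 2 ** 20

proxy_size = 8 * GB                   # per-proxy cache capacity
avg_file   = 8 * KB                   # average cached file size
docs       = proxy_size // avg_file   # ~1 million documents per proxy

memory = (16 - 1) * 16 * docs         # 15 peers * 16 bytes * docs
print(memory // MB)                   # -> 240 (MB per proxy)
```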
The server-name approach, though consuming less memory,
generates too many false hits, which significantly increase the network traffic.
BLOOM-FILTERS :
Bloom filters support membership tests for a set of keys.
A Bloom filter is a method for representing a set A = {a1,a2,...,an}
of n elements to support membership queries.
A Bloom filter allocates a vector v of m bits,
initially all set to 0, and chooses k independent hash functions h1,h2,...,hk,
each with range {1,...,m}. For each element a in A, the bits at positions
h1(a),h2(a),...,hk(a) in v are set to 1. (A particular bit might be set to 1 multiple times.)
Given a query for b, we check the bits at positions h1(b),h2(b),...,hk(b).
If any of them is 0, then b is certainly not in the set A. Otherwise we conjecture
that b is in the set, although there is a certain probability that we
are wrong.
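The construction can be sketched directly from the definition. Deriving the k hash positions from MD5 with different salts is an implementation choice for this sketch, not part of the Bloom-filter definition.

```python
# A minimal Bloom filter: an m-bit vector and k hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m   # vector v of m bits, initially all 0

    def _positions(self, key):
        # Derive k hash positions h1(key), ..., hk(key) in [0, m).
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1   # a bit may be set to 1 multiple times

    def __contains__(self, key):
        # If any bit is 0, the key is certainly not in the set; if all
        # are 1, we conjecture it is (with some false-positive chance).
        return all(self.bits[pos] for pos in self._positions(key))
```

A proxy would insert the URLs of its cached documents and ship the bit vector as its summary.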
The parameters k and m should be chosen such that the
probability of a false positive is acceptable.
The probability that a particular bit is still 0 is exactly
p = (1 - 1/m)^(kn), which is approximately e^(-kn/m).
The probability of a false positive is therefore (1-p)^k = (1 - e^(-kn/m))^k.
This value is minimised when k = ln2 * (m/n).
The minimum value is (1/2)^k = (0.6185)^(m/n). Since the base
is less than 1, the false-positive probability decreases exponentially with m/n,
the number of bits per data item.
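The formula above can be evaluated for a few budgets of bits per item, with k rounded to the nearest integer (a practical approximation to the optimal ln2 * (m/n)):

```python
# False-positive rate (1 - e^(-kn/m))^k for several values of m/n.
import math

def false_positive_rate(m_over_n, k):
    # n/m = 1 / (m/n), so the exponent is -k / (m/n)
    return (1 - math.exp(-k / m_over_n)) ** k

for bits_per_item in (4, 8, 16):
    k = round(math.log(2) * bits_per_item)   # near-optimal integer k
    print(bits_per_item, k, false_positive_rate(bits_per_item, k))
```

With 8 bits per item and k = 6 the rate is already around 2%, and doubling m/n drives it down exponentially, which is why a small per-proxy summary suffices.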
Each proxy builds a Bloom filter of its cached URLs and sends it to the other
proxies. This Bloom-filter mechanism scales well: it requires little memory
even for a moderately large number of proxies.