CSci 5131 -- Lab 1: Web Proxy

Due: September 30 in class

Introduction:

     Web proxies sit between a client (normally a browser) and a web server, and perform useful roles such as caching and filtering. They pass requests from the browser through to the server (on a cache miss) and pass responses from the server back to the browser. In this lab, you will implement a simple Web proxy that speaks HTTP on one end (to a web server) and uses a simple message-based interface on the other (to and from a client). You will apply the Java network programming principles discussed in class, implement network protocols including a small portion of HTTP, use multithreading, and develop a client-server architecture (in some sense, your proxy is both a client and a server). We will also suggest some extra-credit options should you want more of a challenge (though the points will be minimal). A simplified proxy architecture is shown in the figure below.

     1) The Web proxy reads a request from a browser and processes the request according to its filter policy.
     2) The Web proxy forwards the modified request to the web server.
     3) The Web server sends the response back to the web proxy.
     4) The Web proxy processes the response and forwards it to the browser.
 
 

Details:

     You will program your proxy in Java (not PERL) and endow it with the ability to gather simple statistics on the traffic flowing through it, filter requests based on some simple criteria, support multiple concurrent connections, and cache documents. You will also evaluate the performance of your proxy. We will give you some skeleton code for the proxy that you may use or ignore.  Either way, you will implement a full-fledged web proxy as specified below.  This lab should be done in a group of 2 or more. See the course overview and intro slides for acceptable bounds of inter-group interaction.

Part 1: Build a simple web proxy


    Simple web client code (HttpClient.java) is provided. When started, it takes three command-line arguments: the proxy host name, the proxy port number, and the URL. You can also test with a simple telnet session to the proxy (the telnet client replaces your Java client). However, to gather performance data it will be easier to use your Java client, since you can insert timers.

 >> HttpClient.java  tera.cs.umn.edu 8887  www-users.itlabs.umn.edu/classes/Fall-2003/csci5131/test.txt

     The web client connects to the web proxy and sends the request (the third argument) over a socket. Your proxy accepts this request and extracts the web server URL. After connecting to the web server, it generates an HTTP request message and passes it to the web server (see the HTTP material discussed in class and/or on the website). When the web proxy gets the response from the web server, it forwards the response to the client.
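
     As an illustration of the request-generation step, here is a minimal sketch of turning the client's URL into an HTTP request. The class name RequestBuilder and its fields are hypothetical (they are not part of the skeleton code), and the sketch assumes a plain HTTP/1.0 GET is enough for this lab.

```java
import java.net.URI;

// Hypothetical helper (not part of the provided skeleton): parses the URL sent
// by the client and builds the HTTP request the proxy sends to the web server.
final class RequestBuilder {
    final String host;
    final int port;
    final String path;

    RequestBuilder(String url) {
        // The sample client passes the URL without a scheme, so add one if missing.
        String full = url.contains("://") ? url : "http://" + url;
        URI uri = URI.create(full); // throws IllegalArgumentException on bad syntax
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = uri.getHost();
        port = uri.getPort() == -1 ? 80 : uri.getPort();
        String rawPath = uri.getRawPath();
        String rawQuery = uri.getRawQuery();
        path = (rawPath == null || rawPath.isEmpty() ? "/" : rawPath)
             + (rawQuery == null ? "" : "?" + rawQuery);
    }

    // Assumption: HTTP/1.0, so the server closes the connection after the
    // response and the proxy can simply read until end of stream.
    String toHttpRequest() {
        String hostHeader = port == 80 ? host : host + ":" + port;
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + hostHeader + "\r\n"
             + "\r\n";
    }
}
```

     The proxy would open a socket to host:port, write the bytes of toHttpRequest(), and copy everything it reads back to the client.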
 

Part 2: Implement a simple caching scheme

Requested files can be temporarily stored on the machine where the web proxy is running. When the web proxy receives a request, it checks whether the requested file is stored in the proxy's disk cache. If it is, the web proxy returns the stored file. If it isn't, the web proxy forwards the request to the web server, stores the file returned by the web server, and then forwards the response to the client. If there is no room in the cache, the web proxy deletes an old file from the cache using a strategy of your choice, e.g. LRU (Least Recently Used).

Use an in-memory hash table that maps the URL to a table entry; the entry can contain residence info and other fields (up to you). This lets the proxy quickly determine whether the file is cached without going to the OS. Such a structure can also support in-memory caching: put files below a certain size in an in-memory file cache. The disk file cache size and the in-memory file cache size should be given as arguments when the proxy is started.
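
One possible shape for the in-memory part is sketched below, assuming LRU is the chosen replacement strategy. The class name LruCache and the byte-based capacity are illustrative choices, not requirements; a disk cache could keep the same table but store file names instead of byte arrays.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory cache keyed by URL, evicting least recently used
// entries when the total size would exceed a byte budget.
final class LruCache {
    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    LruCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    // Returns the cached body, or null on a miss. A hit marks the entry as recent.
    synchronized byte[] get(String url) {
        return entries.get(url);
    }

    synchronized boolean contains(String url) {
        return entries.containsKey(url); // does not change the LRU order
    }

    synchronized void put(String url, byte[] body) {
        if (body.length > capacityBytes) {
            return; // larger than the whole cache: do not store it
        }
        byte[] old = entries.remove(url);
        if (old != null) {
            usedBytes -= old.length;
        }
        // Evict least recently used entries until the new body fits.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (usedBytes + body.length > capacityBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
        entries.put(url, body);
        usedBytes += body.length;
    }
}
```

The methods are synchronized so that the same object can later be shared by several proxy threads.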

Part 3: Implement a multi-threaded web proxy handling multiple connections

     The first step is implementing a web proxy that can handle multiple connections from clients. The given skeleton code (HttpProxy.java) is a single-threaded process that can handle only one connection at a time. This may not be a big problem if there are few requests from clients, but it is unacceptable in a real proxy that might have hundreds or thousands of clients contacting it simultaneously. To prepare your proxy for real-world use, you need to modify the skeleton code to handle more than one connection at a time using Java threads. Be careful: threads may be sharing the caching structures implemented in Part 2!
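
     A thread-per-connection accept loop is one simple way to do this; the sketch below is an assumption about structure, not the layout of the skeleton code. The names ThreadedAcceptLoop and Handler are hypothetical, and the handler stands in for your per-request proxy logic.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical accept loop: each accepted client connection is served on its
// own thread so that one slow request does not block the others.
final class ThreadedAcceptLoop {
    // Stand-in for the per-connection proxy logic.
    interface Handler {
        void handle(Socket client) throws IOException;
    }

    // Blocks until the server socket is closed.
    static void serve(ServerSocket server, Handler handler) {
        try {
            while (true) {
                Socket client = server.accept();
                new Thread(() -> runHandler(client, handler)).start();
            }
        } catch (IOException e) {
            // accept() fails once the server socket is closed: stop serving.
        }
    }

    private static void runHandler(Socket client, Handler handler) {
        try (Socket c = client) { // always close the client socket
            handler.handle(c);
        } catch (IOException e) {
            System.err.println("connection failed: " + e.getMessage());
        }
    }
}
```

     Any state the handler shares between threads (cache tables, log files, counters) needs synchronization; a thread pool is a reasonable alternative if creating one thread per connection proves too expensive.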

Part 4: Performance Evaluation

Evaluate the performance of your proxy. Devise experiments to examine the benefits of caching and multithreading. You should present performance as seen from the client as a function of the number of concurrent connections and the size of the files retrieved. For caching experiments, you may want to consider both "local" web servers and more "distant" ones. Submit a short description of your experimental setup and your performance results (tables or graphs are fine).
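
One way to collect client-side timings is sketched below. It is only a sketch: the class name LoadTimer is hypothetical, and the Runnable passed in is assumed to perform one complete fetch through the proxy (for example, by calling into your client code).

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical timing harness: starts `clients` threads at the same moment,
// lets each run one request, and records the elapsed time per request.
final class LoadTimer {
    static long[] measureNanos(int clients, Runnable request) {
        long[] elapsedNanos = new long[clients];
        Thread[] threads = new Thread[clients];
        CountDownLatch startSignal = new CountDownLatch(1);
        for (int i = 0; i < clients; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    startSignal.await(); // wait so all clients start together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long begin = System.nanoTime();
                request.run();
                elapsedNanos[id] = System.nanoTime() - begin;
            });
            threads[i].start();
        }
        startSignal.countDown();
        for (Thread t : threads) {
            try {
                t.join(); // join makes each thread's result visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return elapsedNanos;
    }
}
```

Repeating the measurement for several client counts and file sizes, with the cache warm and cold, gives the data points for the tables or graphs.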

Part 5: Acquire statistics

    Your proxy should keep statistics on requests that go through the proxy. Again, be careful about the operations of concurrent threads. Your proxy has to open a log file and save the statistics in the following format:

            Date :: ClientHostName :: URL  ::  FileName  ::  MIME_Type  ::  Size  ::  Status
            Tue 28 September 2003 12:45:00 :: 128.101.35.159 :: http://www-users.itlabs.umn.edu/classes/Fall-2003/csci5131/  :: test.txt  :: text/plain  :: 123  :: Allowed

        Date: Date when the request is received
        ClientHostName: Client host name that issues a request
        URL: requested URL
        FileName: requested file name
        MIME_Type: MIME type of requested file
        Size: the size of file sent back to the client
        Status: Allowed/Denied
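
    A minimal sketch of a thread-safe logger for this format follows. The class name StatsLog is hypothetical, and two details are assumptions: fields are separated by a single " :: ", and the date pattern is chosen to resemble the sample line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical statistics logger shared by all proxy threads.
final class StatsLog {
    private final Writer out;

    StatsLog(Writer out) {
        this.out = out;
    }

    // Builds one log line: Date :: ClientHostName :: URL :: FileName :: MIME_Type :: Size :: Status
    static String format(Date when, String clientHost, String url, String fileName,
                         String mimeType, long size, boolean allowed) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        SimpleDateFormat fmt =
                new SimpleDateFormat("EEE dd MMMM yyyy HH:mm:ss", Locale.ENGLISH);
        return String.join(" :: ", fmt.format(when), clientHost, url, fileName,
                mimeType, Long.toString(size), allowed ? "Allowed" : "Denied");
    }

    // synchronized: lines from concurrent threads are never interleaved.
    synchronized void record(Date when, String clientHost, String url, String fileName,
                             String mimeType, long size, boolean allowed) {
        try {
            out.write(format(when, clientHost, url, fileName, mimeType, size, allowed));
            out.write(System.lineSeparator());
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Passing a FileWriter opened in append mode as the Writer would send the lines to a log file.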
 

Grading Criteria

1) Basic Web proxy (10 points)
2) Multiple connections (25 points)
3) Statistics (20 points)
4) Caching (35 points)
5) Other Criteria: quality of the solution, including cleanliness of the code, documentation provided, and examples of the program in operation demonstrating all features (10 points).
 

Submission

    You must submit all files related to your full-fledged web proxy. In addition, you need to submit a README file that explains how to test your program. This file should also list the names of the files submitted.

Extra Credit (5 points each)

There is no partial credit here: either you have something working up to the level described or you do not. The README must also indicate how we can run/test the extra credit options.

Filtering by type of file / Contents(body) of file

Your proxy should be able to deny or allow access to a file according to its MIME type and its content. For example, your proxy may want to prevent any picture files (jpg or gif) from going through the connection. Your proxy should also be able to block access to files containing certain key words (for example, "top secret" or "proposal"). For this purpose, you should keep a file (filter.conf) that contains the list of MIME types that should be refused and the list of key words that cause a requested file to be refused if they appear in it. When your proxy is started, it should read this configuration file and filter requests accordingly. The format of "filter.conf" is up to you.
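
Since the format is left open, the sketch below simply assumes one: each line of filter.conf is either "type <mime-type>" or "word <key phrase>", with "#" starting a comment. The class name ContentFilter and the case-insensitive matching are also assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter built from the lines of filter.conf (assumed format:
// "type image/gif" or "word top secret" per line, '#' for comments).
final class ContentFilter {
    private final Set<String> blockedTypes = new HashSet<>();
    private final List<String> blockedWords = new ArrayList<>();

    ContentFilter(List<String> configLines) {
        for (String raw : configLines) {
            String line = raw.trim().toLowerCase(Locale.ROOT);
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (line.startsWith("type ")) {
                blockedTypes.add(line.substring(5).trim());
            } else if (line.startsWith("word ")) {
                blockedWords.add(line.substring(5).trim());
            }
        }
    }

    // contentType is the value of the Content-Type header; body is the file text.
    boolean isAllowed(String contentType, String body) {
        // Drop parameters such as "; charset=..." before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (blockedTypes.contains(type)) {
            return false;
        }
        String text = body.toLowerCase(Locale.ROOT);
        for (String word : blockedWords) {
            if (text.contains(word)) {
                return false;
            }
        }
        return true;
    }
}
```

The configuration lines could be read at startup with java.nio.file.Files.readAllLines, and the result of isAllowed maps directly onto the Allowed/Denied status in the statistics log.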
 

Cooperative Caching

Extend your caching scheme to allow a "web" of proxies to be used or shared. There are many ways in which proxy caches can cooperate: they can be networked in a hierarchical topology such that if one proxy does not hold a document in its cache, another can be tried; proxy caches can learn about the existence of documents in other caches and store links to those caches; and so on. Investigate a cooperative caching scheme from the literature (do a Google search on those keywords) and implement the scheme. See this interesting paper for limits on the benefits of such caching schemes.