For this lab, you will implement a distributed system that we will design in class. The system will host a chat service, where multiple users can log in and exchange public messages over the internet. Our system must meet these two requirements, which will significantly impact our design:
In addition to the requirements for scalability and reliability, we will make a few simplifying assumptions about the system:
By “tolerating a failure,” we mean that existing users can continue to exchange messages. New users may not be able to connect, at least during some types of failures.
The system will not keep a log of previous messages. Users will only see messages that were received after they connected, and only if they keep a local copy of those messages.
We will trust users to participate fairly and honestly in this system; we could add a cryptographic protocol for verifying messages in the future.
We will begin on lab day by discussing these requirements and proposing systems that could satisfy them.
Group information is no longer available for this course.
ui_add_messagefrom multiple threads?
writefrom multiple threads?
You will be implementing a peer-to-peer system, where clients connect to each other rather than to a central server. This approach limits the number of connections any one node in the network needs to support, and makes the system relatively tolerant to failures. The overall system will have two types of nodes: a directory server, and clients.
The directory server keeps track of clients who are currently active in the peer-to-peer network. This server doesn’t need a perfectly-accurate list of clients. When a new user wants to join the network, it contacts the directory server to request a list of possible clients it can connect to. Once the client has this list, it chooses one of the existing clients to connect to and joins the network. At this point the client can close its connection to the directory server.
As part of the request to the directory server, the client will need to send the port where it is listening. The directory server will then record this port and the client’s IP so it can be returned to future clients.
If a client ever loses its connection to another client, or if all of the addresses returned from the directory server are unreachable, it should be able to request a new list of potential peers from the directory server. You will probably want clients to identify themselves when they send these requests, so it would be a good idea to assign each client a unique ID when they make their first request.
The directory server will keep a list of clients that are currently in the network, but this list may include some clients who have disconnected. There are a number of strategies you could use to keep the list up to date, but we can use a simple approach for this lab; when a client is exiting, it should send a message to the directory server to report that it will not longer be accepting incoming connections. This does not handle cases where clients crash or lose their network connection unexpectedly, but that’s an issue for a more sophisticated implementation, not this lab.
Here is a summary of the messages that the directory server should accept from clients:
Connections to the directory server will be short-lived. Clients will connect, send a request of some sort, and then wait for a response. Once they’ve received their response, the connection is closed. Because of this communication pattern, you can get away with a single-threaded server for the directory server.
Sockets do not have built-in support for sending and identifying different types of messages.
You will need to develop a scheme to attach identifying information to a message that tells the directory server what kind of message it is.
This could be as simple as an
int at the start of the message.
The client is the more complex part of this system; clients will interact with users using the provided user interface, and will both accept incoming connections and initiate connections to other clients. The description of the directory server gives a sense of the start-up process for a client, but I will repeat some of that information here.
The client connects to a directory server (passed in on the command line) and sends a Client Join message. The client then receives a list of other clients (or an empty list if it is the only client). The client should choose a random other client from this list to connect to. If the connection fails, try another client. If all the client connections fail, ask the directory server for a new list.
Eventually this client will establish a connection to what we’ll call its parent. At this point, the client should begin listening for incoming connections from other clients.
When a client receives an incoming connection from another peer, it should accept that connection. We’ll call the clients that connect to this client its children. Unlike the connection to the directory server, client connections will continue as long as both clients are active in the network. Because of this, you’ll need to start a new thread to handle the connection to each child.
When the local user types in a new message, that client should send the message to all of its neighbors; that includes both its parent and all of its children. The client can trust that its neighbors will relay those messages to their neighbors, eventually distributing the message throughout the entire network.
When a client receives a message from one of its neighbors, it should display that message to the local user and relay it to its other neighbors. You’ll need to be careful not to send a message back to the client it was received from, otherwise you’ll end up with duplicate messages.
If a client loses a connection to one of its children, you can simply proceed without doing anything special. You do need to handle the case where a client loses its connection to its parent. In this case, the client should contact the directory server to request a new list of potential peers. It is possible you will see some lost of duplicated messages during client recovery. That’s fine for this lab, but we can discuss strategies for avoiding these problems in a more full-featured system.
One issue that came up in class is the possibility for creating circular networks. During normal operation the network will have a tree structure (hence the parent/child names), but recovery could lead a node to connect to one of its grandchildren. This same issue could also lead to a network partition. To avoid this problem the directory server should only suggest potential parents to a client that have been in the network longer than that client. That means the oldest client in the network will be the root of the tree. If the root client disconnects, the older of its two children will become the new root. In this case, the directory server will suggest zero potential parents to the new root node. All other children of the old root (the client that disconnected) will receive at least one suggested parent.
Clients should report shutdowns to the directory server, then close all of their connections. Any children of the client will recover from the loss of connection to their parent, so there’s no need to notify them of the connection loss.
Clients will only exchange chat messages with one another; all the other mechanisms revolve around opening and closing TCP connections. Chat messages only need to contain the username and message.
You can begin by implementing clients without a directory server. Instead of passing in a directory server on the command line, provide a client IP address and port at startup. That will allow you to test everything except the recovery mechanisms before implementing a directory server.
We’ve discussed a number of useful features or potential points of failure that could come up in a distributed system, but you do not need to handle all of them. Here is an exhaustive list of the requirements for your implementation.
Any failure recovery mechanisms beyond the above list are not required. Here are a few cases you specifically do not need to handle:
If you are concerned about any other potential failures I am happy to discuss them.