Lab: Distributed Systems

Assigned:
Tuesday, Mar 13, 2018
Due:
Tuesday, Apr 17, 2018 by 10:30pm
Collaboration:
Work with your assigned partner for this lab. You can use online resources for your lab, provided they do not provide complete answers and you cite them in your code. If you do not know whether it is acceptable use a specific resource you can always ask me.

Overview

For this lab, you will implement a distributed system that we will design in class. The system will host a chat service, where multiple users can log in and exchange public messages over the internet. Our system must meet these two requirements, which will significantly impact our design:

Scalability
The system should scale to many thousands of users without the need for a massive server or high-end internet connection. That means clients should not all connect to a single central server.
Reliability
The system should tolerate failure of any individual node in the service; there should be no single point of failure.

In addition to the requirements for scalability and reliability, we will make a few simplifying assumptions about the system:

  1. By “tolerating a failure,” we mean that existing users can continue to exchange messages. New users may not be able to connect, at least during some types of failures.

  2. The system will not keep a log of previous messages. Users will only see messages that were received after they connected, and only if they keep a local copy of those messages.

  3. We will trust users to participate fairly and honestly in this system; we could add a cryptographic protocol for verifying messages in the future.

We will begin on lab day by discussing these requirements and proposing systems that could satisfy them.

Groups

Group information is no longer available for this course.

Questions & Answers

Is it safe to call ui_add_message from multiple threads?
Yes, but make sure only one thread calls it at a time. Just acquire a lock first, and release it when you’re done.
Is it safe to call read and write from multiple threads?
It probably works, but you should use locks to make sure only one thread is writing to a file descriptor at a time. It is safe to read and write to one file descriptor at the same time.

System Design

You will be implementing a peer-to-peer system, where clients connect to each other rather than to a central server. This approach limits the number of connections any one node in the network needs to support, and makes the system relatively tolerant to failures. The overall system will have two types of nodes: a directory server, and clients.

Directory Server

The directory server keeps track of clients who are currently active in the peer-to-peer network. This server doesn’t need a perfectly-accurate list of clients. When a new user wants to join the network, it contacts the directory server to request a list of possible clients it can connect to. Once the client has this list, it chooses one of the existing clients to connect to and joins the network. At this point the client can close its connection to the directory server.

As part of the request to the directory server, the client will need to send the port where it is listening. The directory server will then record this port and the client’s IP so it can be returned to future clients.

If a client ever loses its connection to another client, or if all of the addresses returned from the directory server are unreachable, it should be able to request a new list of potential peers from the directory server. You will probably want clients to identify themselves when they send these requests, so it would be a good idea to assign each client a unique ID when they make their first request.

The directory server will keep a list of clients that are currently in the network, but this list may include some clients who have disconnected. There are a number of strategies you could use to keep the list up to date, but we can use a simple approach for this lab; when a client is exiting, it should send a message to the directory server to report that it will not longer be accepting incoming connections. This does not handle cases where clients crash or lose their network connection unexpectedly, but that’s an issue for a more sophisticated implementation, not this lab.

Messages

Here is a summary of the messages that the directory server should accept from clients:

Client Join
The new client sends the port it is listening on. The directory server returns a unique identifier for the client and a list of several existing clients the new client could connect to.
Request New Peers
The client sends its unique identifier with a request for new peers. The directory server returns several clients this client could connect to.
Client Exit
The client sends its unique identifier to report that it is exiting. After this point, the directory server should not return this client to new clients as a potential peer.

Implementation Hints

Connections to the directory server will be short-lived. Clients will connect, send a request of some sort, and then wait for a response. Once they’ve received their response, the connection is closed. Because of this communication pattern, you can get away with a single-threaded server for the directory server.

Sockets do not have built-in support for sending and identifying different types of messages. You will need to develop a scheme to attach identifying information to a message that tells the directory server what kind of message it is. This could be as simple as an int at the start of the message.

Client

The client is the more complex part of this system; clients will interact with users using the provided user interface, and will both accept incoming connections and initiate connections to other clients. The description of the directory server gives a sense of the start-up process for a client, but I will repeat some of that information here.

Client Start-Up

The client connects to a directory server (passed in on the command line) and sends a Client Join message. The client then receives a list of other clients (or an empty list if it is the only client). The client should choose a random other client from this list to connect to. If the connection fails, try another client. If all the client connections fail, ask the directory server for a new list.

Eventually this client will establish a connection to what we’ll call its parent. At this point, the client should begin listening for incoming connections from other clients.

Adding Peers

When a client receives an incoming connection from another peer, it should accept that connection. We’ll call the clients that connect to this client its children. Unlike the connection to the directory server, client connections will continue as long as both clients are active in the network. Because of this, you’ll need to start a new thread to handle the connection to each child.

Sending Messages

When the local user types in a new message, that client should send the message to all of its neighbors; that includes both its parent and all of its children. The client can trust that its neighbors will relay those messages to their neighbors, eventually distributing the message throughout the entire network.

Receiving and Relaying Messages

When a client receives a message from one of its neighbors, it should display that message to the local user and relay it to its other neighbors. You’ll need to be careful not to send a message back to the client it was received from, otherwise you’ll end up with duplicate messages.

Recovering from Failure

If a client loses a connection to one of its children, you can simply proceed without doing anything special. You do need to handle the case where a client loses its connection to its parent. In this case, the client should contact the directory server to request a new list of potential peers. It is possible you will see some lost of duplicated messages during client recovery. That’s fine for this lab, but we can discuss strategies for avoiding these problems in a more full-featured system.

Avoiding Cycles

One issue that came up in class is the possibility for creating circular networks. During normal operation the network will have a tree structure (hence the parent/child names), but recovery could lead a node to connect to one of its grandchildren. This same issue could also lead to a network partition. To avoid this problem the directory server should only suggest potential parents to a client that have been in the network longer than that client. That means the oldest client in the network will be the root of the tree. If the root client disconnects, the older of its two children will become the new root. In this case, the directory server will suggest zero potential parents to the new root node. All other children of the old root (the client that disconnected) will receive at least one suggested parent.

Client Shutdown

Clients should report shutdowns to the directory server, then close all of their connections. Any children of the client will recover from the loss of connection to their parent, so there’s no need to notify them of the connection loss.

Messages

Clients will only exchange chat messages with one another; all the other mechanisms revolve around opening and closing TCP connections. Chat messages only need to contain the username and message.

Hints

You can begin by implementing clients without a directory server. Instead of passing in a directory server on the command line, provide a client IP address and port at startup. That will allow you to test everything except the recovery mechanisms before implementing a directory server.

Implementation Requirements

We’ve discussed a number of useful features or potential points of failure that could come up in a distributed system, but you do not need to handle all of them. Here is an exhaustive list of the requirements for your implementation.

  1. Clients should be able to initiate a connection to the network by first contacting the directory server.
  2. Clients in the network should be able to exchange messages through the network, avoiding partitions and cycles (see the “Avoiding Cycles” hint above).
  3. When a client loses its parent connection, it should reestablish a new parent connection (unless it becomes the new root client)
  4. If the directory server crashes, existing clients should be able to continue exchanging messages.

Any failure recovery mechanisms beyond the above list are not required. Here are a few cases you specifically do not need to handle:

  1. You do not need to recover from a directory server crash. Existing clients should continue functioning, but a new directory server would effectively be a new chat network.
  2. You do not need to handle simultaneous failures of multiple nodes in the network, including the directory server.
  3. You do not need to handle cases where a client becomes entirely inaccessible (e.g. total network or power failure).

If you are concerned about any other potential failures I am happy to discuss them.