Gabe's Musings

A blog about my research and nonsense thoughts

Kademlia (My Implementation)

May 21, 2020 — Gabe Appleton

If you haven't yet read it, please see my overview of Kademlia.

Kademlia is obviously an idea that I have a fairly deep interest in at this point, so I spent the last couple weeks writing up another implementation of it. This one managed to solve many of the problems I had in the past by using a slightly different concurrency model, by focusing narrowly on UDP, and by using equivalent alternatives to some structures defined in the Kademlia specification.

What I want to do in this post is walk through how this implementation works, what progress it makes on my larger goal of developing Retrion, and what problems it has that remain to be solved.

Code can be found here, in the files that use protocol version 4. Version 5 is my current work to try and extend it.

How It Works

Concurrency Model

The node's world is divided into four threads (plus the main thread, but that doesn't count). Threads 1 and 2 listen for incoming messages on UDP/IPv4 and UDP/IPv6 respectively. Thread 3, the reaction thread, is responsible for reacting and responding to messages. Thread 4, the heartbeat thread, is responsible for scheduled events in the future. It implements a sort of event loop using the sched module.
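
To make the heartbeat thread a bit more concrete, here is a minimal sketch of that pattern using sched. This is not the project's actual code, just the shape of the event loop it describes:

import sched
import threading
import time

# A sketch of the heartbeat thread: a sched-based event loop that other
# threads feed via schedule.enter(delay, priority, callable).
schedule = sched.scheduler(time.monotonic, time.sleep)

def heartbeat_loop():
    while True:
        schedule.run(blocking=False)  # fire anything that is due
        time.sleep(0.1)               # then nap briefly

threading.Thread(target=heartbeat_loop, daemon=True).start()

# For example, schedule a PING 60 seconds out, the way react() does later:
schedule.enter(60, 2, lambda: print("send PING here"))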

The listener threads exist primarily out of laziness, since select would have easily let me consolidate them. It should be noted, however, that having multiple threads like this will allow for future abstraction. If each listening socket gets its own thread, then you don't need to deal as much with how those transport methods differ. A websocket, for example, would have a very different interface when compared with a UDP one or a Bluetooth one.

Serialization

Serialization was done with u-msgpack-python, which is a particularly nice implementation of the MessagePack serialization scheme. It's easiest to think of it as "JSON in binary, with some customizability".

Of particular usefulness, MessagePack allows you to define how custom objects should be serialized and deserialized using their concept of "extension objects". Basically they throw a header onto whatever bytestream you hand them and they can use that header to send the stream to the correct factory function. I'm going to borrow their notation for how to represent these bytestreams:

one byte:
+--------+
|        |
+--------+

a variable number of bytes:
+========+
|        |
+========+

variable number of objects stored in MessagePack format:
+~~~~~~~~~~~~~~~~~+
|                 |
+~~~~~~~~~~~~~~~~~+

For an example of how this looks, let's show the PING message:

+========+--------+--------+--------+--------+--------+========+========+
| header |    0   |    4   |compress|10010011|    1   |sequence|senderID|
+========+--------+--------+--------+--------+--------+========+========+
                                    |-------------compressed------------|

Let's look at that cell-by-cell. The header is a few bytes that MessagePack tacks on to the beginning. It essentially says "I am an extension object __ bytes in length."

The first value we actually give it is that 0, which is the extension typecode I defined for Message objects.

The next cell indicates the compression method used for the remainder of the bytestream. This is negotiated between nodes in a HELLO message. The ones currently defined are none, ZLIB, GZIP, LZMA, and BZ2, taking their values in that order.
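
As a rough illustration, the negotiated compression code could map onto the standard library something like this (the numbering follows the order above; treat the table itself as a sketch rather than the exact encoding used in my code):

import bz2
import gzip
import lzma
import zlib

# Compression methods in the order listed above: none, ZLIB, GZIP, LZMA, BZ2.
COMPRESSORS = {
    0: (lambda data: data, lambda data: data),  # none
    1: (zlib.compress, zlib.decompress),        # ZLIB
    2: (gzip.compress, gzip.decompress),        # GZIP
    3: (lzma.compress, lzma.decompress),        # LZMA
    4: (bz2.compress, bz2.decompress),          # BZ2
}

def decompress_payload(method: int, payload: bytes) -> bytes:
    """Undo whatever compression the sender negotiated."""
    _, decompress = COMPRESSORS[method]
    return decompress(payload)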

The next cell is a header for an array. The first four bits indicate that it is an array, and the second four bits indicate the length of the array.

The next cell is a 1, which is the message type assigned to PING.

The next cell is the sequence number of the packet. This number should monotonically increase and is assigned by the sending node. It is what gets referred to if a response gets sent back.

The next cell is the sender ID. By default this is just 20 random bytes. In future implementations it will likely be a hash of some node information in a canonical order.

This entire array is unpacked and then handed to a factory function to regenerate that message object.
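
To give a feel for what that looks like in code, here is a hedged sketch of packing and unpacking the PING layout above, assuming u-msgpack-python's Ext helper. The typecode 0 and the "compression byte followed by a packed array" body follow the diagram; the helper names and the assumption of no compression are mine:

import umsgpack  # u-msgpack-python

MESSAGE_EXT_TYPE = 0  # the extension typecode defined for Message objects
PING = 1              # the message type assigned to PING

def pack_ping(seq: int, sender: bytes, compress: int = 0) -> bytes:
    # compression byte, then the msgpack array [type, sequence, senderID]
    body = bytes([compress]) + umsgpack.packb([PING, seq, sender])
    return umsgpack.packb(umsgpack.Ext(MESSAGE_EXT_TYPE, body))

def unpack_ping(raw: bytes):
    ext = umsgpack.unpackb(raw)            # comes back as an umsgpack.Ext
    compress, payload = ext.data[0], ext.data[1:]
    assert compress == 0                   # this sketch only handles "none"
    msg_type, seq, sender = umsgpack.unpackb(payload)
    return msg_type, seq, sender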

Reactions

When a node reconstructs a message, it finds that every message type has a react() method. For the simplest example, let's look at the IDENTIFY message:

class Message(BaseMessage):
    ...

    def react(self, node, addr, sock):
        """Delay sending a PING to the sending node, since the connection is clearly active if you got a message."""
        try:
            event = node.timeouts.pop((self.sender, ))
            node.schedule.cancel(event)
        except (KeyError, ValueError):
            pass

        def ping():
            node._send(sock, addr, PingMessage())

        node.timeouts[(self.sender, )] = node.schedule.enter(60, 2, ping)
        # enter() takes in delay, priority, callable

@Message.register(int(MessageType.IDENTIFY))
class IdentifyMessage(PingMessage):
    def __init__(self, compress: int = 0, seq: Optional[int] = None, sender: bytes = b''):
        super().__init__(compress, seq, sender)
        self.message_type = MessageType.IDENTIFY

    def react(self, node, addr, sock):
        """Send a HELLO back in leiu of an ACK."""
        node.logger.debug("Got an IDENTIFY request from %r (%r)", addr, self)
        node._send_hello(sock, addr)
        Message.react(self, node, addr, sock)

A node that sends IDENTIFY is saying "hey, I have your address but I don't know who you are. Would you mind introducing yourself?" So this event gets processed in two phases. The first is just the node sending a HELLO message back. The second is to modify the node's schedule. By default, nodes will send a PING after 60 seconds of dead air to make sure that their peer is still alive. If you get a message, this timer needs to be reset, which is what happens in Message.react().

Responses

Several messages also have a method called react_response(). The idea here is that if you receive an ACK, it probably means something to some earlier message. The way this gets triggered looks like:

@Message.register(int(MessageType.ACK))
class AckMessage(Message):
    ...

    def react(self, node, addr, sock):
        """Clear the message from node.awaiting_ack and call the relevant react_response() method."""
        node.logger.debug("Got an %sACK from %r (%r)", 'N' if self.status else '', addr, self)
        super().react(node, addr, sock)
        try:
            node.routing_table.member_info[self.sender].local.hits += 1
        except KeyError:
            pass
        if self.resp_seq in node.awaiting_ack:
            node.awaiting_ack[self.resp_seq].react_response(self, node, addr, sock)
            del node.awaiting_ack[self.resp_seq]

Let's break that down.

The first step is just like above, where it resets that PING timer for that peer.

After that, we'd like to record that a message successfully went through. So, if we have the sender in our routing table, we look up our local information on that node (as opposed to public data like their ID or listening addresses) and record that there was a hit. This information can be used when pruning peers (which I haven't implemented yet).

The last step is to see if the node was expecting this acknowledgement for something. If it was, we call the message in question's react_response() method. Currently there are only two messages that use this feature: FIND_NODE and FIND_KEY. Let's look at the former:

@Message.register(int(MessageType.FIND_NODE))
class FindNodeMessage(Message):
    ...

    def react_response(self, ack, node, addr, sock, message_constructor=PingMessage):
        """Attempt to connect to each of the nodes you were told about."""
        node.logger.debug("Got a response to a FIND_NODE from %r (%r, %r)", addr, self, ack)
        for info in ack.data:
            name = info.name
            if name != node.id and name not in node.routing_table:
                for address in info.addresses:
                    try:
                        if node.routing_table.add(name, address.args, address.addr_type):
                            node._send(address.addr_type, address.args, message_constructor())
                            node.routing_table.member_info[name].public = info
                        break
                    except Exception:
                        continue

A FIND_NODE message expects to get back an array of GlobalPeerInfo objects, the details of which don't matter very much here. The only important thing for our purposes is that each one contains a list of listening addresses and a list of supported compression methods in order of preference.

For each of these objects, we check that it's not ourself and not someone we already know about. If it isn't, we try to send a message to each of its addresses in sequence, ignoring errors.

Design Problems

The math on $b$

The idea behind $b$, Kademlia's acceleration parameter, is that you can trade between routing efficiency and routing table size by looking at $b$-bit symbols rather than 1-bit ones. So if your ID is 101010..., $b=1$ would have you keeping buckets for IDs with prefixes like:

  • 0...
  • 11...
  • 100...
  • 1011...
  • 10100...
  • 101011...
  • etc.

If $b=2$, however, you would end up with a routing table like:

  • 00...
  • 01...
  • 11...
  • 1000...
  • 1001...
  • 1011...
  • 101000...
  • 101001...
  • 101011...
  • etc.

The trade-off you make here is that the number of queries you need to find a node is $O(\log_{2^b}(n))$, whereas routing table size is $O(2^b \cdot \log_{2^b}(n))$. A more precise guess for the routing table is that it would scale as $\lceil\tfrac{(2^b - 1)}{b}\rceil \cdot \log_{2^b}(n)$, but I haven't actually sat down to check that math, so take it with a grain of salt.
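
To put some rough numbers on that trade-off, here is a quick back-of-the-envelope script using the formulas above (treat the output with the same grain of salt as the formulas themselves):

import math

def lookup_hops(n: int, b: int) -> float:
    """Expected query count, per the O(log_{2^b}(n)) figure above."""
    return math.log(n, 2 ** b)

def bucket_guess(n: int, b: int) -> float:
    """The rough routing-table guess above: ceil((2^b - 1) / b) * log_{2^b}(n)."""
    return math.ceil((2 ** b - 1) / b) * math.log(n, 2 ** b)

n = 1_000_000
for b in (1, 2, 4, 5):
    print(f"b={b}: ~{lookup_hops(n, b):.1f} hops, "
          f"~{bucket_guess(n, b):.0f} buckets of k entries")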

The problem is, I'm doing the math wrong in my routing table for $b > 1$, and I haven't been able to identify where. All I know is that occasionally I get an IndexError and my routing table has to fall back to searching over the set of all known peers instead of a particular bucket.

The Crappy Broadcast Algorithm

The broadcast method I used in this implementation is naive at best. It uses a simple flooding model using the following rules:

  1. If you have already seen this broadcast (identified by (sender, sequence) pairs), ignore it
  2. For each peer in your routing table, echo this message to them unless
    1. They are the original sender, as identified by the sender field
    2. They are the node that relayed it to you, as identified by the (addr, sock) pair in react()

This ends up with $O(n^2)$ messages sent, and given the topology of a Kademlia network, $O(\log(n))$ hops before reaching the final node.
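
In sketch form, the reaction for a broadcast message follows the two rules above roughly like this (the bookkeeping names here are illustrative, not the real routing table interface):

def handle_broadcast(state, sender, seq, payload, received_from, send):
    """Flood a broadcast: state holds a 'seen' set and a 'peers' {id: addr} map,
    and send(addr, message) is whatever the transport layer provides."""
    if (sender, seq) in state["seen"]:       # rule 1: drop duplicates
        return
    state["seen"].add((sender, seq))
    for peer_id, addr in state["peers"].items():
        if peer_id == sender:                # rule 2.1: skip the original sender
            continue
        if addr == received_from:            # rule 2.2: skip whoever relayed it here
            continue
        send(addr, (sender, seq, payload))   # echo to everyone else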

This is obviously sub-optimal, and is a topic I will explore further when I manage to get a write-up for Solution for the broadcasting in Kademlia peer-to-peer overlay by Czirkos and Hosszú.

What's This $\alpha$ Parameter?

Kademlia defines a concurrency parameter called $\alpha$, which essentially says "thou shalt send $\alpha$ messages per request." Confusingly, some messages use $k$ for their concurrency parameter instead. The main culprit there is STORE, but rather than figure out where to do what, I just assumed that $k=\alpha$. It's a relatively small change to fix that, so presumably I will before I do anything major with this.

Retrion Progress

Recall the initial goals of Retrion:

  1. An object model and serialization standard that can be used easily in most languages
  2. A simple high-level API (along with a fine-grained one)
  3. $\log(n)$ distributed hash table get/sets
  4. Support for disabling DHT support network-wide
  5. Support for "edge nodes" which do not store DHT data except as a cache
  6. $\log(n)$ broadcast delivery time
  7. $\log(n)$ routed message delivery
  8. Support for multicast routed messages
  9. Support for in-protocol group messages
  10. Transparent (to the developer) compression
  11. Transparent (to the user) encryption
  12. Support for multiple transport protocols (TCP, UDP, etc.)
  13. Support for "virtual transports" (using other nodes as a bridge)
  14. A well-defined standard configuration space
  15. Support for extension subprotocols and implementation-specific configurations

I would say that this implementation likely satisfies 1, 3, 6, 10, 14, and 15.

It certainly does not satisfy 4, 5, 7, 8, 9, 11, 12, or 13.

That means we need some kind of roadmap for delivering these other properties and addressing the remaining deficiencies.

4. Support for disabling DHT support network-wide

This is probably the easiest to do, as it just puts you into a mode similar to how Ethereum already works. It's probably as simple as adding a flag to the NetworkConfiguration object and being done with it.
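
Something like the following is probably all it takes; the exact shape of NetworkConfiguration is still up in the air, so treat this as a sketch:

from dataclasses import dataclass

@dataclass
class NetworkConfiguration:
    k: int = 20               # bucket size
    alpha: int = 3            # concurrency parameter
    dht_enabled: bool = True  # network-wide switch; refuse STORE/FIND_KEY when False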

5. Support for "edge nodes" which do not store DHT data except as a cache

This one is harder, as it presents some design challenges. My initial thought would be to have edge node status indicated by having the most significant bit of your ID be 1, but this presents a few problems:

  1. How do you guarantee that every edge node is in someone's routing table? Otherwise no broadcast support
  2. This means that nodes have to decide that at ID-assignment time, unless we allow nodes to make new IDs
  3. The use case for edge nodes is something that runs in a browser or phone. That means we need support for

    1. roaming addresses, and
    2. websocket or webrtc connections

For now, I think this is off the table. It definitely should be revisited in the future, but there are too many problems with the concept as it exists now to implement it.

7. $\log(n)$ routed message delivery

This one is trivial to implement; I just haven't gotten to it yet. It's almost a duplicate of the work I already do in FIND_NODE and STORE anyway; I just need to change the side effect.

There are probably better-optimized routing algorithms that I'll need to look for.

8. Support for multicast routed messages

A naive implementation of this should be fairly easy. I would just need to change the target field in a message like the above to hold an array of targets instead of a single target.

9. Support for in-protocol group messages

Leveraging the above, it should be possible to make a chatroom context or the like in-protocol. The real question is whether it should use a name registry at the network level which references some key in the DHT, or whether individual nodes should track their own rooms and be able to attach a room name to a multicast routed message. I kinda lean towards the latter, but we'll see.

11. Transparent (to the user) encryption

I have no idea how to approach this yet beyond broad concepts. It would be fantastic if you could have a bit in your GlobalPeerInfo object that says "hey, use the OWS protocol with me now", or something similar. The only question is going to be how many methods are easily implementable in many languages. Probably the languages I will choose to care about here are C/C++, JavaScript, Python, and possibly Java or Kotlin.

12. Support for multiple transport protocols (TCP, UDP, etc.)

I have no idea yet how to approach this in a way that limits its downsides. I'll need a confident plan for handling the local resource overhead of connection-based transports (the number of sockets the system lets you have, for example) before this is really approachable. For now, I think the plan is two-pronged.

First I implement it under the assumption of UDPv4 being the minimum requirement to avoid network fragmentation.

In parallel I need to do some kind of formal analysis on network fragmentation risk if you allow some nodes to have incompatible lists of supported transports. It would be useful if your sandboxed browser code could interact with a large network of external nodes that communicate with more convenient transports than WebSockets, but I'm fairly worried about how many risks and inefficiencies that would expose.

13. Support for "virtual transports" (using other nodes as a bridge)

This is an idea that I might want to scrap. If routed messages are workable, then virtual transports are probably trivial to implement. The only question is gonna be how inefficient it is without explicitly setting up a bridge route.

Conclusions

I've got a long road ahead of me, but this is a pretty good start!

tags: research-progress, hobby-horse, retrion

Kademlia (The Protocol)

May 17, 2020 — Gabe Appleton

Kademlia describes a protocol for implementing a Distributed Hash Table (DHT) and is described in these two papers. It is a widely used and widely extended protocol that serves to enable or improve many other systems, so it's worth taking some time to dive into it.

Some Background

There are many DHTs that came before this, and there will likely be many which come after. I want to first describe Chord as background to show why this was a useful idea, and to build some intuitions for how DHTs work.

Chord

Chord is a precursor to Kademlia. What it does is picture its network as a circle.

Each node in this circle is assigned an ID by some agreed-upon process. All nodes are required to hold a connection with the nodes immediately before and after them.

To make a request, whether that is to get or set an entry, you pass that request from one node to the next along that circle until you arrive at the node "nearest" to that entry's key by some agreed-upon method. To take an example, let's look at a network that supports a max of 8 nodes and uses the last 3 bits of MD5¹ to decide who owns what. All of these distances are done by "Price is Right" rules, regardless of method. Distances are only ever counted going clockwise around the circle.

What this toy model might look like

This model is missing a lot, though. For instance, this setup would yield you $O(n)$ lookup time, which is not at all ideal. Chord therefore imposes one further rule: each node must keep a "finger table" of connections to the nodes responsible for the IDs at each power of two away from it. So if you're Node 0, you have a connection to the nodes closest to 1, 2, and 4.

What this toy model might look like from the perspective of Node 0

Note that in the above image, the green arrow corresponds to a connection required by Node 6. All the other connections are required by Node 0.

Doing it this way, you end up with an $O(\log_2(n))$ lookup time. The worst case is when you are requesting something owned by your direct predecessor. Node 0 forwards the request to Node 4, who forwards to Node 6, who forwards to Node 7.
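
A toy version of that 3-bit example, with the clockwise distance rule and Node 0's finger targets, looks like this (the helper names are mine):

BITS = 3
RING = 1 << BITS  # 8 possible IDs

def clockwise_distance(a: int, b: int) -> int:
    """Distance from a to b, only ever counted clockwise around the circle."""
    return (b - a) % RING

def finger_targets(node_id: int) -> list:
    """The IDs whose owners node_id must connect to: node_id + 2^i."""
    return [(node_id + (1 << i)) % RING for i in range(BITS)]

print(finger_targets(0))         # [1, 2, 4]
print(clockwise_distance(1, 2))  # 1
print(clockwise_distance(2, 1))  # 7 -- decidedly not symmetric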

There are some additional rules for adding nodes, but much of it just follows from trying to hold the above properties, so I won't talk about it here for either Chord or Kademlia.

BitTorrent

Most BitTorrent clients have used Kademlia to find peers since about 2005.

Ethereum

The Ethereum network uses Kademlia for peer selection rather than as the backbone of its whole system. This is useful because it allows you to make assumptions in performance analysis that you would not be able to make in an unstructured network, and because it provides some assurances about Denial of Service resistance.

The Main Ideas

Kademlia largely brought together innovations present in other projects. Exponential hopping came from Chord, and XOR distance comes, in part, from Pastry². The main benefits they got came from putting a much larger emphasis on parallelism and from making a logical leap that the Pastry authors did not.

XOR Distance

At first glance, Kademlia looks a lot like Chord. All nodes are assigned an ID using an agreed-upon method, and all requests are routed towards nodes that are closer to the target.

The biggest difference here is that instead of using a circular distance metric, Kademlia uses the XOR function. This seems like a strange decision at first, or even an invalid one, but XOR is a perfectly valid distance metric. In fact, it has more useful properties than Chord's circular one.

A general distance metric must satisfy the following four properties:

  1. $d(x, x) = 0$
  2. $d(x, y) > 0$ where $x \ne y$
  3. $d(x, y) = d(y, x)$
  4. It satisfies the triangle inequality, so $d(x, z) \le d(x, y) + d(y, z)$

The Chord metric violates rule 3 (in our example $d(1, 2) = 1$ but $d(2, 1) = 7$), whereas XOR satisfies all of them.
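
A quick brute-force check over 3-bit IDs bears this out (a sketch; xor_distance is just a name for the ^ operator):

ids = range(8)  # every 3-bit ID

def xor_distance(a: int, b: int) -> int:
    return a ^ b

assert all(xor_distance(x, x) == 0 for x in ids)
assert all(xor_distance(x, y) > 0 for x in ids for y in ids if x != y)
assert all(xor_distance(x, y) == xor_distance(y, x) for x in ids for y in ids)
assert all(xor_distance(x, z) <= xor_distance(x, y) + xor_distance(y, z)
           for x in ids for y in ids for z in ids)
print("XOR satisfies all four properties on 3-bit IDs")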

Exponential Hopping

Using XOR as a metric allows you to have exponential hopping in a much more intuitive way. This part is where Kademlia looks most like Pastry, using a prefix-based routing scheme.

All Kademlia nodes maintain a routing table organized into "k-buckets" of $\le k$ nodes whose IDs share the first $x$ bits of your ID (or, put another way, where the first bit they disagree on is the $(x+1)$th). When trying to decide on a node to send your request to, you pick from the bucket whose prefix matches the key you're requesting. This leads to exponential hopping in a way that is both more flexible and less artificial-feeling than Chord's structure.
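
In code, the bucket a peer belongs to is usually just the length of the shared prefix, which falls straight out of the XOR distance. A sketch, assuming Kademlia's standard 160-bit IDs:

ID_BITS = 160  # Kademlia's standard ID length

def shared_prefix_length(my_id: int, other_id: int) -> int:
    """How many leading bits the two IDs agree on (this picks the k-bucket)."""
    distance = my_id ^ other_id
    return ID_BITS - distance.bit_length()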

Emphasis on Parallelism and Replication

Because Kademlia's routing table is organized into groups, that structure can be used productively. Many implementations essentially say "screw trying to find the optimal node, let's send the request to everyone in that bucket." Indeed, this is a strategy actually pointed out by the paper.

Kademlia [...] can send a query to any node within an interval, allowing it to select routes based on latency or even send parallel, asynchronous queries to equally appropriate nodes.

They even give a system parameter, $\alpha$, to describe how many nodes get sent non-STORE messages.

The Protocol

Kademlia consists entirely of 4 RPCs. All of them share some base properties. For instance, all of these messages share the property that if the sender doesn't receive a reply then that is noted in the routing table. If a node misses too many it will get replaced when another suitable peer is found.

PING

This probes a node to see if it's online. Fairly self explanatory.

FIND_NODE

This RPC is the heart of the Kademlia protocol. It takes a node ID as an argument, and the node which receives it will send back information for the $k$ closest nodes it knows of. For a well-saturated node, this will usually come from a single bucket in its routing table. If a node doesn't know $k$ nodes yet, it sends as many as it does know of.

Using this, Kademlia nodes can implement what they call a node lookup. The gist of it is:

  1. Pick the $\alpha$ nodes closest to your target
  2. Send each of them a FIND_NODE for your target
  3. If you receive a reply that includes your target, break
  4. If this round doesn't get you closer to your target, break
  5. Otherwise, go to 1
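
Here is a toy, self-contained version of that loop. The "network" is just a dict mapping node IDs to the IDs they already know about, and find_node() stands in for the actual RPC; none of these names come from the paper:

K, ALPHA = 4, 2

def find_node(network, queried_id, target):
    """Simulate a FIND_NODE reply: the k closest IDs the queried node knows of."""
    return sorted(network[queried_id], key=lambda n: n ^ target)[:K]

def node_lookup(network, start_id, target):
    best = sorted(network[start_id], key=lambda n: n ^ target)[:ALPHA]
    while True:
        learned = {n for peer in best for n in find_node(network, peer, target)}
        if target in learned:                 # step 3: someone told us about the target
            return target
        closer = sorted(set(best) | learned, key=lambda n: n ^ target)[:ALPHA]
        if closer == best:                    # step 4: no progress this round
            return best[0]
        best = closer                         # step 5: go around again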

FIND_KEY

Retrieving a key is very similar to finding a node. When receiving a FIND_KEY message, Kademlia nodes may respond in one of two ways: If they don't know of $k$ closer nodes to the target or they have that key in their local cache, they simply return the answer. Otherwise, they respond exactly as if they had gotten a FIND_NODE message, and the peer will use this to send further FIND_KEY requests.

Kademlia calls out one additional step that I want to highlight verbatim:

For caching purposes, once a lookup succeeds, the requesting node stores the (key, value) pair at the closest node it observed to the key that did not return the value.

STORE

To store a value, you first look up the closest $k$ nodes to that key. Once they are found, you send a STORE message to each of them with the (key, value) pair. It is important that you get the closest $k$ nodes, because all others will drop it from their local storage at a rate inversely proportional to their distance to the key.

Those nodes that deem themselves responsible for a key will republish it as needed to keep it alive. Usually this includes the one who set it and the $k$ closest nodes, but there could be other interested parties as well.

Why it works

Structurally, this works for a lot of the same reasons Chord does. It can guarantee logarithmic lookup times with its hopping mechanism. In fact, it can do even better because it implements this in parallel.

Persistence of keys relies on some assumptions, though, so I want to outline those here. The only way for a key to die would be if the $k$ closest nodes to it all die or leave the network with no lookups or republishing between them, and for the value to not have been cached by any other node on the search path. Thus, the assumption is that the value $k$ is somewhat tuned to how many nodes you expect to leave in the longest expected refresh period.

The authors include in the longer version of the paper some analysis they did on Gnutella networks to justify their belief that this is a reasonable assumption. The gist of it is that the longer any particular node has been online, the more likely they will be online an hour from now. They also use this to provide some Denial of Service resistance by requiring that nodes have a preference for older peers over younger ones when selecting peers for their routing tables.

My Thoughts

This protocol is super interesting to me, especially considering how small it is. It seems to offer a lot of room for extension into more general-use protocols, which I'll be trying to do this summer. In the next week, expect an update on an implementation of Kademlia that I've been working on.

Footnotes

1: Yes, this is bad practice, but it's a toy model.

2: The most major difference being that Pastry uses two routing functions, falling back to numeric distance for last-mile deliveries.

tags: research-review, research-progress, hobby-horse, retrion

Hybrid Logical Clocks: Summary

March 02, 2020 — Gabe Appleton

This paper focuses on the idea of time synchronization. They compare several different methods, and ultimately introduce a new one which they term "Hybrid Logical Clocks". While the original title of the paper is "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases", I find this a bit too wordy to say more than once.

Some Background

Physical Clocks and NTP

Many distributed systems developers might be tempted to just use the physical clock on their local host. However, as anyone who has actually tried this knows, there is often significant drift between the clocks of any two systems, especially as the distance between them increases. The Network Time Protocol (NTP) can usually keep nodes within ~10ms of each other, but that still leaves large windows of uncertainty when ordering events for, say, a video game. You also have to worry about things like leap seconds, or sysadmins changing the time on you.

TrueTime

To try to address these problems, Google recently proposed a protocol called TrueTime (PDF warning, I wasn't able to find a useful article elsewhere, see page 5). What this allows is for you to get a strict bound on clock error by using a combination of GPS and local atomic clocks to synchronize a "global clock". The problem with this, of course, is that you need to have your own personal atomic clock. That is highly problematic for the average user. An alternative might be to have a receiver for some of the publicly-broadcasting atomic clocks and just store some offset to it, but this is also problematic because it both requires extra hardware and would reduce TrueTime's error guarantees.

So while this is helpful, it is not generally useful.

Lamport Clocks

One of the earliest methods to address the problems of using physical time was the Lamport Clock (LC). The idea here is to keep a simple counter on each node. Start at 0, and with each event that node can see (purely local or otherwise), increment the counter. If it receives a message, that message will always come with an LC timestamp from the originating node, and so you take max(yours, theirs) + 1 as your new LC time.

This makes two guarantees:

  • Event $X$ happens before event $Y$ $\implies$ $X.lc < Y.lc$
  • If $Y$ is directly related to event $X$, $X.lc < Y.lc$ $\implies$ $X$ happens before $Y$
    • $Y$ is directly related to $X$ if they are a send/receive pair, or
    • $Y$ is directly related to $X$ if they are directly next to each other on some node's timeline

Example:
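
In code, the whole scheme fits in a few lines (a minimal sketch, not taken from any particular paper):

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        """Every event this node can see bumps the counter."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned value is stamped on the message."""
        return self.local_event()

    def receive(self, msg_time: int) -> int:
        """Take max(yours, theirs) + 1 as the new LC time."""
        self.time = max(self.time, msg_time) + 1
        return self.time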

Vector Clocks

Vector Clocks are an extension of Lamport Clocks which allows you to make strong guarantees about ordering.

The idea behind it is that timestamps, rather than being a single number, are a vector of numbers where each item corresponds to the last known Lamport time of some other process.

Example:

The major benefit here is that event $X$ happens before event $Y$ $\iff$ $X.vc < Y.vc$, so long as you define comparisons by the rules given by their algorithm. They would specify: $X.vc < Y.vc$ $\iff$ $[(\forall i \; X.vc_i \le Y.vc_i)$ $\land$ $(\exists i \; X.vc_i < Y.vc_i)]$.
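
That comparison rule is short enough to write out directly. A sketch with vectors as plain lists:

def vc_less(x: list, y: list) -> bool:
    """X.vc < Y.vc iff every component is <= and at least one is strictly <."""
    assert len(x) == len(y)
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

# vc_less([1, 2, 0], [2, 2, 1]) -> True. Neither order holds between
# [1, 0, 2] and [0, 1, 2], so those two events are concurrent.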

This, however, comes with a number of drawbacks. The biggest of these is that you need to know how many nodes your network will have before runtime. A more minor one is that the size of timestamps grows proportional to the number of nodes in your network.

HybridTime

This is a method which combines Vector Clocks with a set of optimizations based on loose synchronization of physical clocks. I can't find any info on this, except that I think they are referencing this article that I can't get access to. I'll ask my advisor if I remember to. They reference this in the paper, but not for much more than to say "hey, people have tried this".

The Main Ideas

By the time we're done, the authors will have shown us an algorithm that:

  1. Guarantees $X$ h.b. $Y$ $\implies$ $X.hlc < Y.hlc$
  2. Can be used in a backwards-compatible way with NTP timestamps
  3. Uses $O(1)$ memory
  4. Is resistant both to "stragglers" and "rushers" who are significantly behind or ahead of the rest of the network
  5. Is directly useful in identifying distributed snapshots

The Algorithm

The Components

A Hybrid Logical Clock consists of three elements:

  • $l$, which represents the highest physical time this node has seen, truncated to $2^{-16}$ second precision
  • $c$, which is a Lamport Clock that resets whenever $l$ increases
  • $pt$, which is a local property representing the given node's physical time

On a Send or Local Event

When a node has a local event or sends a message, they should do the following:

def on_local(l, c):
    # l and c are this node's current HLC; the physical clock is read fresh.
    new_l = max((l, truncate(physical_clock())))
    if new_l == l:     # physical time hasn't passed l yet, so bump c
        new_c = c + 1
    else:              # l advanced to the physical time, so c resets
        new_c = 0
    return HybridLogicalClock(new_l, new_c)

On Receiving a Message

When a node receives a message, they should do the following:

def on_receive(l, c, m_l, m_c):
    # m_l and m_c come from the timestamp carried by the incoming message.
    new_l = max((l, m_l, truncate(physical_clock())))
    if new_l in (l, m_l):
        # new_l matches one (or both) of the logical clocks: take the c that
        # accompanies the larger l and bump it.
        new_c = max([(l, c), (m_l, m_c)])[1] + 1
    else:
        new_c = 0  # physical time is strictly ahead, so c resets
    return HybridLogicalClock(new_l, new_c)

Example Trace

Note that the paper has a typo in this image. The cell which reads as "3, 13" should read as "3, 10, 3". This is because they recycled another figure from their paper but forgot to re-annotate that cell.

Why it works

Frozen PT = Lamport Clock

Let's suppose for a moment that $pt$ never increments for any node. In such a scenario, this will act exactly like a Lamport Clock. Fortunately, most systems don't have permanently frozen clocks.

PT and L are Tightly Coupled

If $pt$ ever gets past $l$, then $l$ will change and $c$ will reset. So the way we compare two timestamps is just by using tuple comparisons of the form $\left\langle X.l, X.c\right\rangle$. For review, the way that works is that you first compare the first element. If they are not equal, return the result of the comparison. If they are equal, move on to the next element and repeat until the end. If you get past the end, they are equal (and in this case, explicitly concurrent). From this follows a very easy implication: $X$ h.b. $Y$ $\implies$ $\left\langle X.l, X.c\right\rangle$ $<$ $\left\langle Y.l, Y.c\right\rangle$.

This paper also shows that the gap between $pt$ and $l$ is bounded in any system without adversarial sysadmins. To get there we have to go through another theorem though. $X.l > X.pt$ $\implies$ $(\exists Y : Y$ h.b. $X$ $\land$ $Y.pt = X.l)$. What they do next is introduce a system parameter $\epsilon$, which is defined as the maximum clock drift in the system (the difference between the smallest and largest $pt$s on the network).

Given that, it is impossible to find some pair of events such that $X$ h.b. $Y$ and $X.pt > Y.pt + \epsilon$. Because of that, it must follow that $(\forall X)(|X.l - X.pt| \le \epsilon)$.

C has a Maximum Value

This also means that we can put bounds on the value of $c$ (assuming a system where sysadmins don't intervene). If we assume that the minimum gap between events is some timestep $d$, then $(\forall X)(X.c \le \tfrac{\epsilon}{d} + 1)$.

Encoding as an NTP Timestamp

So this ends up creating a system with the causal guarantees of a Lamport Clock, but can we encode it as a traditional timestamp? It turns out the answer is yes, but mostly because NTP provides some absurd precision in its timestamps.

NTP timestamps are 64 bits, divided into two 32-bit sections. The first portion holds a whole number of seconds (counted from the NTP epoch). The second portion is for fractional seconds. To be clear here, that means that NTP allows for $2^{-32}$ second precision. This ends up being ~$\tfrac{1}{4}$ns precision, which is just silly. It's a little easier to show the change they made when expressed as a C struct:

#define FRAC_SECONDS_DENOM (1ULL << 32)

struct NTP_stamp {  // note: NTP is big-endian
    uint32_t seconds;
    uint32_t frac_seconds;
};

What this paper does is instead allow for $2^{-16}$ second precision in the physical time portion by saying:

#define FRAC_SECONDS_DENOM (1 << 16)

struct HLC_stamp {  // still big-endian
    uint32_t seconds;       // corresponds to l
    uint16_t frac_seconds;  // corresponds to l
    uint16_t c;
};

Or, making use of anonymous struct and union members:

#define FRAC_SECONDS_DENOM (1ULL << 32)

struct HLC_stamp {  // still big-endian
    union {
        struct {
            uint64_t l : 48;
            uint16_t c : 16;
        };
        struct {
            uint32_t seconds;
            uint32_t frac_seconds;  // always assign this *before* c, never after
        };
    };
};

Note that the latter version is more faithful to the original paper, since $c$ is fully incorporated into the fractional seconds field. This, in effect, means that the fractional seconds field has an actual precision of $2^{-16}$, but a perceived precision of $2^{-32}$, with the least significant bits corresponding entirely to the $c$ value.

From here they perform some experiments to show that there is essentially no overhead to doing this, as compared to physical time synchronization.

My Thoughts

This is a very interesting way to go about things. Coming out of it, my biggest question is whether, given enough additional information, one could use this to replace Vector Clocks. As it currently stands, that would be a poor decision in small systems, but it looks like a very promising addition to peer-to-peer protocols.

This is a topic that my group in CSE 812 will be incorporating into our research. Sometime in the next few days I hope to have a post up about the other paper we are basing things on. It is significantly longer, but also can be condensed more easily if I approach things right.

Let me know what you think below!

tags: research-review, cse-812, advisee-group, msu

Hello World!

February 21, 2020 — Gabe Appleton

Who are you?

My name is Gabe Appleton. Currently I'm a first-year PhD student at Michigan State University with a focus in distributed systems. I also am an amateur photographer and astronomer, and an avid player of D&D.

Why should I care?

Probably you don't. But on the off chance you like reading random blogs, I'd like to think that I will write interesting things. My posts will probably break down into four categories:

Research Review

Probably the most frequent posts will be about my research. The first of these post types is the Research Review. This is an experiment that my advisor and I are doing where we read a research paper each week, and whoever presented it writes a blog post summarizing and reviewing it. Because these posts are meant to reflect our thoughts on the matter, and not, say, get into a peer-reviewed journal, the standards here are going to be lower. Not everything will be cited or neutrally framed. Not everything will be objective or unopinionated. That's not the goal with this format.

A lot of these, especially in the early days, will be about time synchronization in distributed systems. In the next week or two I hope to get a post up about Hybrid Logical Clocks, which is my advisor's current focus area.

Research Progress

The second research category will be about my personal research. This can be further divided into my "hobby horse" and my formal research.

Hobby Horse

The idea I have been playing around with for a long time at this point is to try and make the networking end of distributed systems much easier. Basically to do SocketIO but for peer-to-peer networking.

This is a long-term goal of mine, and progress may be sporadic. For now I am writing this under the name Retrion, but a better name might come along at some point. I have a veritable manifesto in my head about how this should work, but the main goal is to generate a network protocol with the following properties:

  1. An object model and serialization standard that can be used easily in most languages
  2. A simple high-level API (along with a fine-grained one)
  3. log(n) distributed hash table get/sets
  4. Support for disabling DHT support network-wide
  5. Support for "edge nodes" which do not store DHT data except as a cache
  6. log(n) broadcast delivery time
  7. log(n) routed message delivery
  8. Support for multicast routed messages
  9. Support for in-protocol group messages
  10. Transparent (to the developer) compression
  11. Transparent (to the user) encryption
  12. Support for multiple transport protocols (TCP, UDP, etc.)
  13. Support for "virtual transports" (using other nodes as a bridge)
  14. A well-defined standard configuration space
  15. Support for extension subprotocols and implementation-specific configurations

Formal Research

This will be progress on my academic research. Since the topic of this is not yet decided-upon (except for a small project on machine learning for a class) this category will be largely empty for a while.

Thoughts

This category will be me musing on things. I don't expect it to happen much, but when it does it will probably be hot takes about current events or about my personal life (with context stripped as needed).

Photography

This is where I will post the pictures I take, when I'm happy with them. Backstory will usually be provided.

tags: research-review, research-progress, thoughts, photography, misc