In what seems like a lifetime ago, I was tasked with load testing a series
of application clusters to understand how the clusters behave under certain
load profiles. The aim was to identify what happens with each load profile
in a deterministic way, so that we could repeat the tests when new changes
or releases of software land. This sounds like a relatively easy challenge,
especially from the outside; instead I'm left longing for a better
understanding of some scenarios, so that I can reason about potential
problems in the future.
At the heart of the cluster is an internal gossip members list: when any new
core service comes online, it gossips to the other gossip services. Each
gossip member internally tracks what it thinks is the state of affairs, i.e.
who has joined, who has left and who is leaving. This slightly oversimplifies
the actual interaction, but it's enough to describe the journey.
Sitting in front of the gossip members list is a hash-ring, or more
specifically a consistent hash-ring. The basic premise of a consistent
hash-ring in this usage is the ability to query the ring with a key (the key
in this sense can be anything, as long as we can hash it: a URL, an asset
name or a username) and get back a member from the members list. The member
returned is the same member if the same key is supplied to the ring again.
Each member is also replicated multiple times when added to the ring, to
give a better distribution around the ring.
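
To make that behaviour concrete, here's a minimal sketch of a consistent hash-ring with each member replicated at several points. The names (`Ring`, `AddMember`, `Lookup`), the SHA-1 hashing and the replica scheme are illustrative assumptions, not the actual implementation.

```go
package ring

import (
	"crypto/sha1"
	"encoding/binary"
	"sort"
	"strconv"
)

// Ring is a minimal consistent hash-ring. Each member is inserted
// `replicas` times at different points to even out the distribution.
type Ring struct {
	replicas int
	points   []uint64          // sorted hash points on the ring
	owners   map[uint64]string // hash point -> member name
}

func New(replicas int) *Ring {
	return &Ring{replicas: replicas, owners: map[uint64]string{}}
}

func hashKey(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// AddMember inserts a member at `replicas` points around the ring.
func (r *Ring) AddMember(name string) {
	for i := 0; i < r.replicas; i++ {
		p := hashKey(name + "#" + strconv.Itoa(i))
		r.owners[p] = name
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// Lookup hashes the key and walks clockwise to the first member point.
// The same key resolves to the same member until the ring changes.
func (r *Ring) Lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.points[i]]
}
```

Replicating each member at several points smooths out the gaps between points, so keys spread more evenly and removing one member only re-maps the keys that landed on its points.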
The ring should be able to tolerate additions to the gossip members list, as
long as the underlying members don't diverge too much; what counts as too
much depends on the size of the ring and the replication factor. Deletion of
members should then attempt to redistribute keys evenly across the ring, so
that hot spots don't occur. All the while the ring should stabilise and
become eventually consistent between the internal members. By that I mean:
if there were no additions or deletions over a period of time, the state
(checksum) of the ring should be identical between members. Without ring
stabilisation, each member will have a different view of the members list
(we assume some sort of sorting has happened to prevent ordering
inconsistencies) and will therefore return inconsistent answers when the
ring is queried with a key.
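
As a sketch of what "identical state" could mean in practice, assume each member summarises its sorted members list into a single checksum and compares that against its peers; the helper below is illustrative, not the cluster's actual mechanism.

```go
package ring

import (
	"crypto/sha1"
	"encoding/binary"
	"sort"
)

// Checksum summarises one member's view of the ring: sort the member
// names (to remove ordering differences), hash them, and compare the
// digest against other members. Equal checksums mean identical views.
func Checksum(members []string) uint64 {
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)
	h := sha1.New()
	for _, m := range sorted {
		h.Write([]byte(m))
		h.Write([]byte{0}) // separator so {"ab","c"} differs from {"a","bc"}
	}
	return binary.BigEndian.Uint64(h.Sum(nil)[:8])
}
```

If two members report the same checksum, their rings agree; watching the checksums converge after a change is effectively watching the ring stabilise.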
Inconsistencies in the ring after changes to the members list don't matter
in the immediate term, because items stored in each member are relatively
short lived, and re-routing happens if nothing is found. As long as results
converge over time and become eventually consistent, we're happy. Re-routing
can be an expensive exercise though, especially if it requires going
cross-region.
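
The re-routing itself can be pictured as a simple fallback loop: try the member the ring resolves to first, and only on a miss fall over to the remaining candidates, which may sit in another region. The function and its `fetch` callback below are hypothetical stand-ins for the real transport.

```go
package ring

// rerouteFetch sketches the miss-handling described above: ask the member
// the ring resolved first, then fall back to the remaining candidates,
// which may live in another region (so every extra hop gets more
// expensive). `fetch` is a stand-in for whatever actually retrieves items.
func rerouteFetch(key string, candidates []string, fetch func(member, key string) ([]byte, bool)) ([]byte, bool) {
	for _, member := range candidates {
		if item, ok := fetch(member, key); ok {
			return item, true
		}
	}
	return nil, false // no member had the item; it likely expired
}
```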
Before asking when a ring might become unstable, it's best to describe what
we mean by ring instability. Given a time X, what is the state of the ring,
and given a later time Y, what is the state of the ring at Y? As long as X
and Y aren't so far apart that comparing the two states becomes nonsensical,
the difference between them tells us how unstable the ring is.
Any change to the members list will cause ring instability. Given that this
is likely to happen, each member also gossips other information between
nodes. That information might include its own ring state, the latency
between members and what sort of load or data is on each. This builds better
heuristics about the members within the list and hopefully allows each
member's view of the ring to converge a little faster.
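
As a rough sketch of the kind of extra state that might ride along with the gossip (the field names here are assumptions, not the actual wire format):

```go
package gossip

import "time"

// MemberMeta sketches the extra state a member could piggyback on its
// gossip messages: its own ring checksum, observed latency to peers, and
// a rough load figure, all of which peers can feed into their heuristics.
type MemberMeta struct {
	Name         string
	RingChecksum uint64                   // this member's view of the ring
	PeerLatency  map[string]time.Duration // smoothed RTT to each known peer
	LoadPercent  float64                  // coarse signal of how busy this member is
	UpdatedAt    time.Time                // when this snapshot was taken
}
```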
Some of the load profiling tests were designed to see how well the clusters
can scale (up and down). The question we wanted to ask ourselves was: given
a series of load profiles, what happens to both the members list and the
hash-ring state? Additionally, do they converge over time, and how long does
that take?
We started simply, adding one additional node to a single cluster of 10
nodes; each node ran 5 services, and each service registered with the
members list. The test ran successfully, adding the members in a timely
manner and giving us a good convergence rate. Deletion of that node was even
better, as a member un-registering itself also notifies the rest of the
members that it's going offline gracefully and should no longer be queried.
Any information about that node gathered from the members list after the
deletion can be assumed to be stale and safely ignored. A clean exit is our
best case: it's simple, and the ring state converges to consistency in a
short time frame.
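
That "ignore anything after a clean exit" rule can be sketched as a tiny piece of bookkeeping per peer; the types and method names below are illustrative, not the cluster's actual code.

```go
package gossip

import "time"

// peerState tracks what this node believes about one peer. Once a peer
// announces a graceful leave, later updates about it are treated as stale.
type peerState struct {
	leftAt time.Time // zero value means the peer has not announced a leave
}

// View is one member's local picture of the cluster.
type View struct {
	peers map[string]*peerState
}

// HandleLeave records that a peer announced a clean exit.
func (v *View) HandleLeave(name string, at time.Time) {
	if v.peers == nil {
		v.peers = map[string]*peerState{}
	}
	if s, ok := v.peers[name]; ok {
		s.leftAt = at
		return
	}
	v.peers[name] = &peerState{leftAt: at}
}

// HandleUpdate reports whether gossip about a peer should be applied;
// anything heard about a peer that has already left is ignored as stale.
func (v *View) HandleUpdate(name string) bool {
	if v.peers == nil {
		v.peers = map[string]*peerState{}
	}
	s, ok := v.peers[name]
	if !ok {
		v.peers[name] = &peerState{}
		return true
	}
	return s.leftAt.IsZero()
}
```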
Repeating that setup multiple times showed that, given a low-traffic ambient
environment, addition and deletion of single nodes is deterministic and can
even be consistently timed for completion.
What was somewhat surprising when repeating the exercise with more load was
that the time to stabilise became harder to determine. The load in question
wasn't heavy, and nothing matching peak observed load, but it was a nice
consistent flat 100MBps after an initial ramp up. The earlier ambient-load
tests showed that we could approximate when a ring would become stable, but
that wasn't so with the new tests. Although we could put an upper bound on
it, sometimes the addition of a member would take almost twice as long.
Deletion wasn't hit as badly, but nevertheless it wasn't as deterministic.
There was no way to judge when the ring would be stable, only to say that it
became stable within X time frame.
Instead of stopping with what we had, we wanted to see what would happen
if we ramped up the cluster, going multi-region and using 100 nodes, each
with 5 services on them, and then adding 50 more nodes. How fast could we
add the nodes, and how long did they take to become stable? From our
previous tests we knew the fastest time we could expect this to happen in,
so we weren't expecting anything less than that.
Not surprisingly, it was almost impossible to stabilise with this many
nodes. The time taken to add nodes just caused a cyclic regurgitation of
never-ending ring changes. Stabilisation sometimes happened, but most
often it didn't. In fact we left a ring to attempt to stabilise for over an
hour (we went for lunch) and when we got back, it was still churning away.
There was something inherently wrong with the system! The design of the
system took into account not only its own view of the world, but also the
view of others. Each participant's ring state would be broadcast to the
other ring members (not via gossip) to allow a comparison check to happen.
The whole aim of this secondary setup was to check for a couple of things:
split brains or network partitions, and whether a node that is unreachable
from one node is still available from elsewhere in the ring.
All this trafficking of ring state between nodes simply created too much
load for each node to handle. The all-to-all broadcasts, growing roughly
with the square of the cluster size, would cause a node to become
unavailable while it tried to handle the burst of traffic, which in turn
made it unreachable from other nodes. You end up with nodes flickering on
and off, available to the ring one second and unavailable the next, with
the end result that the hash-ring can never stabilise.
There is a lot of ongoing work to remedy this, including reducing the
number of broadcasts that any one node can send and spacing out the timings
between sends. Additionally, reducing the payload by sending a Bloom filter
or cuckoo filter to each node and using that to check whether a node exists,
instead of the original large JSON payload.
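
The filter idea is roughly this: instead of shipping the full JSON member list, a node ships a small bit set that peers can probe for membership. The toy Bloom filter below is only a sketch of the approach under discussion (sizes, hash counts and names are made up), not the eventual implementation.

```go
package gossip

import "hash/fnv"

// bloom is a toy Bloom filter: a bit set peers can probe with
// "might your view contain member X?". False positives are possible,
// false negatives are not.
type bloom struct {
	bits   []byte
	hashes int
}

func newBloom(sizeBits, hashes int) *bloom {
	return &bloom{bits: make([]byte, (sizeBits+7)/8), hashes: hashes}
}

// positions derives `hashes` bit positions for a member name by seeding
// a single FNV hash with the hash-function index.
func (b *bloom) positions(member string) []uint32 {
	pos := make([]uint32, b.hashes)
	for i := 0; i < b.hashes; i++ {
		h := fnv.New32a()
		h.Write([]byte{byte(i)})
		h.Write([]byte(member))
		pos[i] = h.Sum32() % uint32(len(b.bits)*8)
	}
	return pos
}

// Add marks a member as present before the filter is broadcast.
func (b *bloom) Add(member string) {
	for _, p := range b.positions(member) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MightContain reports whether the sender's view might include member.
// A false answer is definitive; a true answer may be a false positive.
func (b *bloom) MightContain(member string) bool {
	for _, p := range b.positions(member) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}
```

A false positive only means a peer double-checks a node that isn't actually there; a negative answer is definitive, which is the property that matters for the existence check, and the filter is a fraction of the size of the JSON payload it replaces.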