*Disclaimer: None of this is meant as a slight against any client in particular. There is a high likelihood that each client and possibly even the specification has its own oversights and bugs. Eth2 is a complicated protocol, and the people implementing it are only human. The point of this article is to highlight how and why the risks are mitigated.*
With the launch of the Medalla testnet, people were encouraged to experiment with different clients. And right from genesis, we saw why: Nimbus and Lodestar nodes were unable to cope with the workload of a full testnet and got stuck. [0][1] As a result, Medalla failed to finalise for the first half hour of its existence.
On the 14th of August, Prysm nodes lost track of time when one of the time servers they were using as a reference suddenly jumped one day into the future. These nodes then started making blocks and attestations as though they were also in the future. When the clocks on these nodes were corrected (either by updating the client, or because the time server returned to the correct time), those that had disabled the default slashing protection found their stakes slashed.
Exactly what happened is a bit more subtle, so I highly recommend reading Raul Jordan’s write-up of the incident.
Clock Failure – The enworsening
The moment Prysm nodes started time traveling, they made up ~62% of the network. This meant that the threshold for finalising blocks (>2/3 on one chain) could not be met. Worse still, these nodes couldn’t find the chain that they were expecting (there was a 4-hour “gap” in the history and they all jumped ahead to slightly different times), and so they flooded the network with short forks as they guessed at the “missing” data.
Prysm currently makes up 82% of Medalla nodes 😳! [ethernodes.org]
At this point, the network was flooded with thousands of different guesses at what the head of the chain was and all the clients started to buckle under the increased workload of figuring out which chain was the right one. This led to nodes falling behind, needing to sync, running out of memory, and other forms of chaos, all of which worsened the problem.
Ultimately this was a good thing, as it allowed us not only to fix the root problem relating to clocks, but also to stress test the clients under conditions of mass node failure and heavy network load. That said, this failure need not have been so extreme, and the culprit in this case was Prysm’s dominance.
Shilling Decentralisation – Part I, it’s good for eth2
As I’ve discussed previously, 1/3 is the magic number when it comes to safe, asynchronous BFT algorithms. If more than 1/3 of validators are offline, epochs can no longer be finalised. So while the chain still grows, it is no longer possible to point to a block and guarantee that it will remain a part of the canonical chain.
Shilling Decentralisation – Part II, it’s good for you
To the maximum possible extent, validators are incentivised to do what is good for the network, rather than simply trusted to do something because it is the right thing to do.
If more than 1/3 of nodes are offline, then penalties for the offline nodes start ramping up. This is called the inactivity penalty.
This means that, as a validator, you want to try to ensure that if something is going to take your node offline, it is unlikely to take many other nodes offline at the same time.
The same goes for being slashed. While there’s always a chance that your validators are slashed due to a spec or software mistake/bug, the penalties for single slashings are “only” 1 ETH.
However, if many validators are slashed at the same time as you, then penalties go up to as high as 32 ETH. The point at which this happens is again the magic 1/3 threshold. [An explanation of why this is the case can be found here].
These incentives are called liveness anti-correlation and safety anti-correlation respectively, and are very intentional aspects of eth2’s design. Anti-correlation mechanisms incentivise validators to make decisions that are in the best interest of the network, by tying individual penalties to how much each validator is impacting the network.
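To make the safety anti-correlation concrete, here is a minimal sketch of how such a penalty could scale. It is an illustrative model, not the specification's exact formula: it assumes the penalty grows linearly with the fraction of validators slashed alongside you, never drops below the 1 ETH quoted above for an isolated slashing, and is capped at the full 32 ETH once the 1/3 threshold is reached. The function name `slashing_penalty_eth` and the constants are hypothetical.

```python
from fractions import Fraction

# Figures taken from the text; the linear scaling between them is an assumption.
STAKE_ETH = 32                # maximum penalty quoted in the text
MIN_PENALTY_ETH = 1           # penalty quoted for an isolated slashing
THRESHOLD = Fraction(1, 3)    # fraction slashed at which the penalty maxes out


def slashing_penalty_eth(fraction_slashed: Fraction) -> Fraction:
    """Illustrative penalty for one validator, given the fraction of the
    network slashed in the same period (hypothetical linear model)."""
    if not 0 <= fraction_slashed <= 1:
        raise ValueError("fraction_slashed must be between 0 and 1")
    # Scale linearly up to the threshold, then cap at the full stake.
    scaled = STAKE_ETH * min(fraction_slashed / THRESHOLD, Fraction(1))
    # An isolated slashing still costs the minimum penalty.
    return max(Fraction(MIN_PENALTY_ETH), scaled)


if __name__ == "__main__":
    for f in (Fraction(0), Fraction(1, 100), Fraction(1, 6), Fraction(1, 3)):
        print(f"{float(f):.2%} slashed -> {float(slashing_penalty_eth(f)):.2f} ETH")
```

Under this model, a validator slashed on its own loses 1 ETH, while one slashed together with a third of the network loses its entire stake. The only thing that differs between the two cases is how many others failed in the same way.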
Shilling Decentralisation – Part III, the numbers
Eth2 is being implemented by many independent teams, each developing an independent client according to the specification written primarily by the eth2 research team. This ensures that there are multiple beacon node and validator client implementations, each making different decisions about the technology, languages, optimisations, trade-offs, etc. required to build an eth2 client. This way, a bug in any layer of the system will only impact those running a specific client, and not the whole network.
If, in the example of the Prysm Medalla time-bug, only 20% of eth2 nodes were running Prysm and 85% of people were online, then the inactivity penalty wouldn’t have kicked in for Prysm nodes and the problem could have been fixed with only minor penalties and some sleepless nights for the devs.
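The arithmetic behind this scenario can be checked with a short sketch. It rests on several simplifying assumptions that are not stated in the text: node share is treated as a proxy for validator weight, every node of the failing client counts as offline, and the 85% figure is read as the share of the *remaining* validators that are online. The finality condition is the ">2/3 on one chain" threshold mentioned earlier; the function names are hypothetical.

```python
from fractions import Fraction

# Finality needs more than 2/3 of validators agreeing on one chain.
FINALITY_THRESHOLD = Fraction(2, 3)


def online_fraction(failing_client_share: Fraction, others_online: Fraction) -> Fraction:
    """Fraction of the network still online when every node running one
    client fails and only `others_online` of the rest are up."""
    return (1 - failing_client_share) * others_online


def can_finalise(online: Fraction) -> bool:
    """True if the online validators exceed the finality threshold."""
    return online > FINALITY_THRESHOLD


if __name__ == "__main__":
    # What happened: the failing client held ~62% (others assumed fully online).
    actual = online_fraction(Fraction(62, 100), Fraction(1))
    # The hypothetical: the failing client holds 20%, 85% of the rest online.
    hypothetical = online_fraction(Fraction(20, 100), Fraction(85, 100))
    print(float(actual), can_finalise(actual))
    print(float(hypothetical), can_finalise(hypothetical))
```

With a 62% share failing, at most 38% of the network remains, far short of the threshold. With a 20% share failing and 85% of the rest online, 68% remains, which just clears 2/3, so the chain keeps finalising and the inactivity penalty never starts.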
In contrast, because so many people were running the same client (many of whom had disabled slashing protection), somewhere between 3500 and 5000 validators were slashed in a short period of time.* The high degree of correlation means that slashings were ~16 ETH for these validators because they were using a popular client.
* At the time of writing, slashings are still pouring in, so there is no final number yet.
Try something new
Now is the time to experiment with different clients. Find a client that a minority of validators are using (you can look at the distribution here). Lighthouse, Teku, Nimbus, and Prysm are all reasonably stable at the moment, while Lodestar is catching up fast.
Most importantly, TRY A NEW CLIENT! We have an opportunity to create a healthier distribution on Medalla in preparation for a decentralised mainnet.