hide
Free keywords:
Computer Science, Distributed, Parallel, and Cluster Computing, cs.DC
Abstract:
The vast majority of hardware architectures use a carefully timed reference
signal to clock their computational logic. However, standard distribution
solutions are not fault-tolerant. In this work, we present a simple grid
structure as a more reliable clock propagation method and study it by means of
simulation experiments. Fault-tolerance is achieved by forwarding clock pulses
on arrival of the second of three incoming signals from the previous layer.
A key question is how well neighboring grid nodes are synchronized, even
without faults. Analyzing the clock skew under typical-case conditions is
highly challenging. Because the forwarding mechanism involves taking the
median, standard probabilistic tools fail, even when modeling link delays just
by unbiased coin flips.
Our statistical approach provides substantial evidence that this system
performs surprisingly well. Specifically, in an "infinitely wide" grid of
height~$H$, the delay at a pre-selected node exhibits a standard deviation of
$O(H^{1/4})$ ($\approx 2.7$ link delay uncertainties for $H=2000$) and skew
between adjacent nodes of $o(\log \log H)$ ($\approx 0.77$ link delay
uncertainties for $H=2000$). We conclude that the proposed system is a very
promising clock distribution method. This leads to the open problem of a
stochastic explanation of the tight concentration of delays and skews. More
generally, we believe that understanding our very simple abstraction of the
system is of mathematical interest in its own right.