Free keywords:
Computer Science, Distributed, Parallel, and Cluster Computing, cs.DC
Abstract:
Incidents in production systems are common and downtime is expensive.
Applying an appropriate mitigating action quickly, such as changing a specific
firewall rule, reverting a change, or diverting traffic to a different
availability zone, saves money. Incident localization is time-consuming since a
single failure can have many effects, extending far from the site of failure.
Knowing how different system events relate to each other is necessary to
quickly identify where to mitigate. Our approach, Aggregate Comparison
of Traces (ACT), localizes incidents by comparing sets of traces (which capture
events and their relationships for individual requests) sampled during the most
recent steady-state operation and during an incident. In our quantitative
experiments, ACT effectively localizes more than 99% of incidents.
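
The abstract does not spell out how ACT compares the two sets of traces, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes, purely for illustration, that each trace is a list of (caller, callee) edges and that a large change in how often an edge appears between the steady-state set and the incident set hints at where to look. The names edge_frequencies and rank_changed_edges are hypothetical.

```python
from collections import Counter


def edge_frequencies(traces):
    """Return the fraction of traces in which each (caller, callee) edge appears."""
    if not traces:
        return {}
    counts = Counter()
    for trace in traces:
        # Count each edge at most once per trace.
        counts.update(set(trace))
    total = len(traces)
    return {edge: count / total for edge, count in counts.items()}


def rank_changed_edges(baseline_traces, incident_traces):
    """Rank edges by how much their frequency differs between the two sets.

    This is an illustrative heuristic (assumption), not the ACT method.
    """
    baseline = edge_frequencies(baseline_traces)
    incident = edge_frequencies(incident_traces)
    edges = set(baseline) | set(incident)
    diffs = {
        edge: abs(incident.get(edge, 0.0) - baseline.get(edge, 0.0))
        for edge in edges
    }
    # Largest change first; ties broken by edge name for determinism.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical data: during the incident, most requests never reach "db".
    healthy = [("frontend", "auth"), ("auth", "db")]
    failing = [("frontend", "auth")]
    steady_state = [healthy] * 4
    during_incident = [healthy] + [failing] * 3
    for edge, change in rank_changed_edges(steady_state, during_incident):
        print(edge, change)
```

In this toy input the edge from "auth" to "db" shows the largest change, which is the kind of signal an aggregate comparison can surface without inspecting individual traces one by one.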