Resilient Overlay Networks
by David G. Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris
Details
type: misc
booktitle: SOSP
year: 2001
month: August 13
series: ACM SIGOPS Operating Systems Review
journal: Computer Communication Review
volume: 35, 5
number: 1
pages: 131--145
publisher: ACM Press
address: New York
editor: Greg Ganger
school: Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science
annote: David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris (all MIT Laboratory for Computer Science)
abstract: A Resilient Overlay Network (RON) is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within several seconds, improving over today's wide-area routing protocols that take at least several minutes to recover. A RON is an application-layer overlay on top of the existing Internet routing substrate. The RON nodes monitor the functioning and quality of the Internet paths among themselves, and use this information to decide whether to route packets directly over the Internet or by way of other RON nodes, optimizing application-specific routing metrics.
The Internet, in its current form, is designed for wide-area scalability. For example, the information shared between ASes is heavily filtered and summarized by BGP-4. This ability to scale comes at the cost of reduced fault tolerance. This paper describes Resilient Overlay Networks (RON), an application-layer overlay on top of the existing Internet routing substrate that aims to improve the reliability and performance of end-to-end paths.
Design:
Each node in a RON periodically exchanges latency, loss, and throughput metrics with *every* other node in the RON. This information is disseminated over the overlay network itself. Which metric to optimize depends on the chosen policy; RON allows for much more expressive policy routing than Internet policy routing. While RON does not find the route with the best throughput, it does attempt to avoid low-throughput paths whenever possible.
When an overlay link fails, RON attempts to deliver the packet by routing through an intermediate RON node that is still reachable. While multiple such hops over RON nodes are possible, the authors limit themselves to a single intermediate node and optimize some of their algorithms for this setting.
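The single-intermediate restriction above can be sketched as a small path-selection routine. This is my own illustration, not the paper's implementation: `latency` is a hypothetical matrix of measured pairwise latencies, with `float('inf')` marking an outage.

```python
# Sketch of RON-style single-intermediate path selection (illustrative,
# not the paper's actual code). An entry of float('inf') in the latency
# matrix models a failed direct Internet path between two overlay nodes.

def best_path(latency, src, dst):
    """Return (cost, path) for the best of the direct path and all
    one-intermediate-hop paths, mirroring RON's single-hop restriction."""
    best = (latency[src][dst], [src, dst])  # direct Internet path
    for mid in range(len(latency)):
        if mid in (src, dst):
            continue
        via = latency[src][mid] + latency[mid][dst]  # detour through mid
        if via < best[0]:
            best = (via, [src, mid, dst])
    return best

# Direct path 0 -> 2 has failed, but routing via node 1 still works:
lat = [[0, 10, float('inf')],
       [10, 0, 15],
       [float('inf'), 15, 0]]
print(best_path(lat, 0, 2))  # → (25, [0, 1, 2])
```

With n nodes this scan is O(n) per destination, which is only feasible because RON deliberately stays small; the same restriction is what lets the authors simplify their routing algorithms.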
Advantages:
Robustness, improved throughput, expressive policies, tighter integration with applications.
Validation:
The authors deployed two small overlay networks (the paper refers to them as RON1 and RON2) with 12 and 16 nodes respectively. Most of these hosts were on university networks.
Results:
Robustness: While the direct Internet paths suffered multiple failures over the evaluation period, RON was able to route around *all* of them and never experienced significant loss rates (>30%).
Loss rate: RON improved the loss rate by more than 0.05 for a small fraction of the time, primarily due to its ability to detect outages and aggressively reroute packets.
Latency and throughput: On slow paths, RON improved latency by tens to hundreds of milliseconds; however, in many cases it degraded throughput, albeit not appreciably.
Points to discuss
Scalability: Each node in a RON maintains state about every other node, making it difficult to scale. Is there a middle ground between BGP, which heavily summarizes information in order to scale, and the design philosophy of RON?
Experimental validation: How realistic is it? For example, note that RON was able to circumvent all failures, implying that none of them occurred at the end hosts. This might not be the typical failure mode for average home users.
I think this paper was well written and the authors were honest about their results. RON addresses fault tolerance and tries to recover faster than BGP.
One interesting aspect of RON is that each application can choose its own metrics, creating different classes of service (this is similar to one of our previous readings about QoS, which addressed how to control fairness for different types of applications).
I don't know much about the service agreements between ASes. Could this technology provoke legal issues between providers and consumers if consumers (lower-level ISPs) forward traffic in a way that does not conform to their providers' routing policies? In any case, this is an interesting idea, and there are ongoing open-source projects in a similar spirit, like Tor (onion routing, though it focuses on anonymity rather than performance).
An interesting paper describing overlay networks that 1) can detect and recover from link failures quickly, 2) incorporate application-specific metrics in path selection, and 3) enable expressive policy routing.
Minor question: how does the "hysteresis" mentioned in section 4.2.2 help avoid path oscillations?
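My reading of the hysteresis idea, sketched as toy code (this is my interpretation, not the paper's implementation; the 5% threshold is a made-up constant): only abandon the current path when an alternative beats it by a clear margin, so small metric fluctuations between two near-equal paths cannot cause constant flapping.

```python
# Toy sketch of hysteresis in path switching. HYSTERESIS is a hypothetical
# improvement threshold; the paper's actual rule and constant may differ.

HYSTERESIS = 0.05  # require a >5% improvement before switching paths

def choose_path(current, candidates, cost):
    """Keep `current` unless some candidate beats it by more than HYSTERESIS."""
    best = min(candidates, key=cost)
    if cost(best) < cost(current) * (1 - HYSTERESIS):
        return best  # clearly better: switch
    return current   # within the noise band: stay put, avoiding oscillation

costs = {'direct': 100.0, 'via_B': 98.0}   # within 5%: no switch
print(choose_path('direct', costs, costs.get))  # → direct
costs['via_B'] = 80.0                           # clear win: switch
print(choose_path('direct', costs, costs.get))  # → via_B
```

Without the threshold, two paths whose costs straddle each other probe-to-probe would make the router alternate between them on nearly every update.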
I liked the paper, as it seems to provide an effective solution to an important problem. I have a few comments about section 4.2.2 (Path Evaluation and Selection). First, they use an EWMA to estimate latency (as RED does to estimate the average queue length), but to estimate loss rates they use the average of the last k = 100 probe samples. Is there any theoretical motivation for using these different update rules? For the loss rate they claim "we found this to be a better estimator than EWMA"; I would be interested in some more discussion, and in why they restricted themselves to these two options. Did they iterate over all k's and find k = 100 to be best? And where does alpha = 0.9 in the EWMA rule come from? Are they using some sort of optimization, or just guessing constants that seem reasonable?
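For concreteness, the two update rules being compared above can be sketched as follows. The constants alpha = 0.9 and k = 100 are the paper's; the code itself, including the exact weighting of new versus old samples in the EWMA, is my assumption.

```python
# Sketch of the two estimators discussed in section 4.2.2 (my code, not
# the paper's): an EWMA for latency and a k-sample window for loss rate.
from collections import deque

class LatencyEWMA:
    """EWMA latency estimate; old value keeps weight alpha (assumed form)."""
    def __init__(self, alpha=0.9):
        self.alpha, self.value = alpha, None
    def update(self, sample):
        if self.value is None:
            self.value = sample  # seed with the first sample
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * sample
        return self.value

class LossWindow:
    """Loss rate = fraction of the last k probes that were lost."""
    def __init__(self, k=100):
        self.window = deque(maxlen=k)  # old samples fall off automatically
    def update(self, lost):
        self.window.append(1 if lost else 0)
        return sum(self.window) / len(self.window)
```

One plausible reason for the split: an EWMA of a 0/1 loss indicator reacts sluggishly to a burst of losses with alpha = 0.9, while a sliding window gives every one of the last k probes equal say, which may be why the authors found it the better loss estimator.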
The paper is basically an extension of a well-known idea, overlay networks, to a particular problem: fault tolerance and recovery time.
The authors have mentioned some potential applications in their introduction section. One application that RON could specifically be used for is cyber-physical systems.
Although deploying RONs might be useful for certain applications and organizations, overall I was not very impressed by this paper. I did not see any revolutionary ideas behind RON; it seemed to me that most of the observed benefits come from sacrificing scalability to get a more precise picture of a smaller overlay subnetwork and then exploiting its specific features.
My major concern was the placement of RON nodes. In the paper they seem to have been strategically placed at key locations within different ASes, which does not seem very realistic to me. My understanding is that a RON deployed by a smaller organization (e.g. across different CMU campuses) probably wouldn't have access to so many different ASes and thus wouldn't see much actual benefit. Moreover, as stated in the paper, RON could potentially violate BGP policies, which I think would make it less attractive for most ASes.
This paper presents an interesting network architecture that enables high-performance Internet-based distributed systems. Its discussion section already addresses possible criticisms of RON well. It seems fine for a small fraction of the population to benefit by slightly changing the rules of the game, but what if a significant portion of members did so? It would be interesting to see the interactions between co-existing RONs once they become popular.
The design of RON seeks to meet several goals. The first is fast failure detection and recovery: RON aims to detect and recover from failures within several seconds, compared to several minutes for wide-area routing protocols, which is attractive. Another goal is tighter integration with applications, and the last is expressive policy routing. This paper would be most beneficial to streaming applications like P2P TV. Is there any evaluation of such an application?
Vijay (comment #19): Yes, in that case I agree. I was thinking of a scenario where there is a small number of RON nodes that DO connect through different ASes, but only a handful. Would building a RON provide any benefits in this case or would it require a larger network (with more ASes involved) to perform well?
The paper presents RON and its three design goals, of which I think the main contribution is fast re-routing of traffic around broken or poorly performing links. I like the way the introduction and background material are presented. As for the O(n^2) probing overhead, even if it can be reduced, does it still generate a lot of traffic at a high information-exchange rate?
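To put the O(n^2) concern in rough numbers, here is a back-of-the-envelope sketch. The probe size and probing interval below are assumptions of mine for illustration, not values taken from the paper.

```python
# Back-of-the-envelope probing overhead for full-mesh monitoring of an
# n-node overlay. probe_bytes and interval_s are assumed parameters,
# not the paper's actual values.

def probe_overhead_bps(n, probe_bytes=69, interval_s=12):
    """Aggregate probing traffic across the whole overlay, in bits/second.
    Each ordered pair sends one probe per interval: n*(n-1) probes total."""
    return n * (n - 1) * probe_bytes * 8 / interval_s

for n in (16, 50, 100):
    print(n, round(probe_overhead_bps(n)))
# 16 nodes → ~11 kbps total; 100 nodes → ~455 kbps, growing quadratically
```

Under these assumptions the aggregate overhead is modest at RON's deployed sizes (12-16 nodes) but grows quadratically, which is exactly why the per-node full-mesh state limits how far the design can scale.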