
Cementing High Availability in OpenFlow with RuleBricks

In this study, the authors propose RuleBricks, a data structure that resides in the controller and represents (wildcard) forwarding rules, to address the limitations caused by the churn of replicas running "behind" OpenFlow switches. By modelling rules with RuleBricks, they also leverage the Chord protocol to define how such data structures are applied to elements in an SDN network.

Problems

How can HA policies be added to OpenFlow's forwarding rules?

  • When a replica fails, incoming flows should be reassigned to a designated backup (of course without corrupting session state).
  • Forwarding rules become fragmented as replicas are created or destroyed
Solution - RuleBricks

Each brick corresponds to a segment of the address space. All flows with source IP in the address range covered by a brick will be forwarded to the same replica.

RuleBricks operates via three primitives: 
  1. Drop: add new active rules
  2. Insert: add new backup rules => planning for failure
  3. Reduce: make rules more efficient
As in a Chord ring, each replica is responsible for the objects (flows) that map to a portion of the address space. When a new replica is introduced, it typically requires dropping new bricks to reassign flows to the corresponding replicas.
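
As a rough illustration of how bricks might be represented, the sketch below maps contiguous slices of the source-address space to replicas, with drop laying an active brick over a range and insert stacking a backup under it; the class and method names are illustrative, not the authors' implementation.

```python
# Illustrative sketch of RuleBricks-style bricks (not the authors' code).
# A brick covers a contiguous slice of the source-address space and names
# the replica that handles flows falling into that slice.
from dataclasses import dataclass, field

@dataclass
class Brick:
    lo: int          # inclusive start of covered address range
    hi: int          # exclusive end of covered address range
    replica: str     # active replica for this range
    backups: list = field(default_factory=list)  # ordered backups (insert primitive)

class BrickWall:
    def __init__(self, space_size: int, initial_replica: str):
        # Start with one brick covering the whole address space.
        self.bricks = [Brick(0, space_size, initial_replica)]

    def drop(self, lo: int, hi: int, replica: str):
        """Drop primitive: lay a new active brick over [lo, hi)."""
        new = []
        for b in self.bricks:
            if b.hi <= lo or b.lo >= hi:      # no overlap, keep as-is
                new.append(b)
                continue
            if b.lo < lo:                      # left remainder of the old brick
                new.append(Brick(b.lo, lo, b.replica, list(b.backups)))
            if b.hi > hi:                      # right remainder of the old brick
                new.append(Brick(hi, b.hi, b.replica, list(b.backups)))
        new.append(Brick(lo, hi, replica))
        self.bricks = sorted(new, key=lambda b: b.lo)

    def insert(self, lo: int, hi: int, backup: str):
        """Insert primitive: register a backup replica under [lo, hi)."""
        for b in self.bricks:
            if b.lo < hi and b.hi > lo:
                b.backups.append(backup)

    def lookup(self, addr: int) -> str:
        """Forwarding decision: the active replica owning this address."""
        for b in self.bricks:
            if b.lo <= addr < b.hi:
                return b.replica
        raise KeyError(addr)

wall = BrickWall(256, "replica-A")
wall.drop(128, 256, "replica-B")   # new replica takes over half the space
wall.insert(0, 256, "replica-C")   # plan for failure with a backup layer
print(wall.lookup(42), wall.lookup(200))   # replica-A replica-B
```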

B4: Experience with a Globally-Deployed Software Defined WAN

Google's solution: build a WAN connecting multiple data centers, which faces the following overheads:
- WAN links are very expensive, and WAN routers consist of high-end, specialized equipment that places a premium on high availability.
- The WAN treats all bits the same => all applications are treated equally regardless of whether they deserve it.

Why use SDN and OpenFlow for B4 to provide connectivity among data centers?
  • Unique characteristics of data center WAN
    • Centralized control over applications, servers and LANs
    • Elastic bandwidth demand by applications
    • Moderate number of data centers (large forwarding tables are not required)
    • 30-40% utilization of WAN link entails unsustainable cost
  • Could not achieve the level of scale, fault tolerance, cost efficiency and control required for their network using traditional WAN architectures
  • Desire to deploy novel routing, scheduling, monitoring and management functionality and protocols more simply
  • Others (out of scope): rapid iteration on novel protocols, simplified testing environments, improved capacity planning available from a deterministic central TE server rather than capturing the asynchronous routing behavior of distributed protocols, simplified management through a fabric-centric rather than router-centric WAN view

B4 Architecture

Composed of three layers: 
  • Global layer: logically centralized applications, enabling central control of the entire network
  • Site controller: network control applications (NCA) and Openflow controllers (maintain network state based on NCA directives and switch events)
  • Switch hardware: B4 switches peer with traditional BGP routers => SDN-based B4 had to support interoperability with non-SDN WAN implementations.
    • Deploy routing protocols and traffic engineering as independent services
How to integrate existing routing protocols running on separate control servers with OpenFlow-enabled hardware switches?

Switch Design

Properties:
  • Be able to adjust transmission rates to avoid the need for deep buffers while avoiding expensive packet drops
  • Don't need large forwarding tables, because the WAN connects a relatively small set of data centers
  • Switch failures are usually caused by software rather than hardware issues => by moving software functionality off the switch hardware, software fault tolerance becomes easier to manage
Develop an OpenFlow Agent:
  • Runs as a user-level process on the switch hardware
  • Connects to an OpenFlow controller hosted on the network control servers (NCS)
Network Control Functionality

Routing

To integrate existing routing protocols with the OpenFlow-based switches, they implemented a Routing Application Proxy (RAP) to provide connectivity between Quagga and the OF switches:
  • BGP/ISIS route updates
  • routing-protocol packets flowing between OF switches and Quagga
  • interface updates from switches to Quagga
Quagga acts as the control plane, running BGP/ISIS on the NCS (control plane only; there is no data plane there)

RAP bridges Quagga and the OF switches. RAP caches the Quagga RIB and translates it into NIB entries for use by Onix (the platform the OpenFlow controllers are built on).
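
A minimal sketch of the kind of translation RAP performs, caching Quagga RIB updates and re-expressing them as NIB entries for the controller; the entry formats and field names here are invented for illustration.

```python
# Hypothetical illustration of RAP-style translation: Quagga RIB entries
# cached on the control server and re-expressed as NIB entries the
# OpenFlow controller can consume. Field names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class RibEntry:           # what Quagga's RIB conceptually holds
    prefix: str           # e.g. "10.1.0.0/16"
    next_hop_ip: str      # BGP/ISIS next hop

@dataclass(frozen=True)
class NibEntry:           # what the controller-side NIB conceptually holds
    prefix: str
    out_port: int         # switch port toward the next hop

class RoutingApplicationProxy:
    def __init__(self, port_of_next_hop: dict):
        self.rib_cache = {}                 # prefix -> RibEntry
        self.port_of_next_hop = port_of_next_hop

    def on_rib_update(self, entry: RibEntry) -> NibEntry:
        """Cache the RIB update and translate it into a NIB entry."""
        self.rib_cache[entry.prefix] = entry
        return NibEntry(entry.prefix, self.port_of_next_hop[entry.next_hop_ip])

rap = RoutingApplicationProxy({"192.168.0.1": 3})
print(rap.on_rib_update(RibEntry("10.1.0.0/16", "192.168.0.1")))
```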

Traffic Engineering

Centralized TE architecture is composed of:

  • Network topology representing sites as vertices and site to site connectivity as edges.
  • A Flow Group (FG) is defined as a (source site, destination site, QoS) tuple
  • Tunnel represents a site-level path, implemented as IP encapsulation
  • Tunnel Group maps FGs to a set of tunnels and weights

Bandwidth Functions and TE Optimization Algorithm
A bandwidth function specifies the bandwidth allocated to an application given the flow's relative priority on an arbitrary, dimensionless scale, called fair share.
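
A toy rendering of a piecewise-linear bandwidth function: given a fair-share value on a dimensionless scale, it returns the bandwidth an application would be allocated. The breakpoints below are made up purely for illustration.

```python
# Toy piecewise-linear bandwidth function (made-up breakpoints): maps a
# dimensionless fair-share value to an application's bandwidth allocation.

def bandwidth_function(breakpoints):
    """breakpoints: sorted list of (fair_share, bandwidth_mbps) pairs."""
    def allocate(fair_share: float) -> float:
        (s0, b0) = breakpoints[0]
        if fair_share <= s0:
            return b0
        for (s1, b1) in breakpoints[1:]:
            if fair_share <= s1:
                # linear interpolation between the surrounding breakpoints
                return b0 + (b1 - b0) * (fair_share - s0) / (s1 - s0)
            s0, b0 = s1, b1
        return b0   # saturates at the last breakpoint
    return allocate

# Example: an app that ramps up quickly, then flattens out.
app_fn = bandwidth_function([(0.0, 0.0), (1.0, 100.0), (2.0, 120.0)])
print(app_fn(0.5), app_fn(1.5), app_fn(5.0))   # 50.0 110.0 120.0
```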

TE Protocol and OpenFlow

A switch can play one of 3 roles, each of which involves corresponding OF messages.
  • an encapsulating switch initiates tunnels and splits traffic between them
  • a transit switch forwards packets based on the outer header
  • a decapsulating switch terminates tunnels and then forwards packets using regular routes
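
A condensed sketch of how the three roles might treat a packet, with the encapsulating switch hashing each flow onto one of the weighted tunnels in its tunnel group; this is illustrative Python, not B4's actual OpenFlow rule set.

```python
# Illustrative handling of the three switch roles (not B4's actual rules).
import hashlib
from bisect import bisect_left
from itertools import accumulate

def pick_tunnel(flow_key: str, tunnels, weights):
    """Encapsulating switch: hash the flow onto one tunnel, proportionally
    to the tunnel-group weights, so a given flow always uses one tunnel."""
    h = int(hashlib.sha256(flow_key.encode()).hexdigest(), 16) % 1000
    cumulative = list(accumulate(w * 1000 // sum(weights) for w in weights))
    return tunnels[min(bisect_left(cumulative, h + 1), len(tunnels) - 1)]

def encapsulate(packet: dict, tunnel_dst_ip: str) -> dict:
    """Wrap the packet in an outer IP header pointing at the tunnel end."""
    return {"outer_dst": tunnel_dst_ip, "inner": packet}

def transit_forward(packet: dict) -> str:
    """Transit switch: forwarding looks only at the outer header."""
    return packet["outer_dst"]

def decapsulate(packet: dict) -> dict:
    """Decapsulating switch: strip the outer header, then route normally."""
    return packet["inner"]

pkt = {"src": "siteA-host", "dst": "siteB-host"}
tunnel = pick_tunnel("siteA-host->siteB-host", ["T1", "T2"], [3, 1])
outer = encapsulate(pkt, f"{tunnel}-endpoint")
print(tunnel, transit_forward(outer), decapsulate(outer))
```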

ATM - Concepts and Architecture

ATM is a connection-oriented network (vs. the connection-less IP network).

Protocol reference model:
  • Physical layer
    • Physical Medium Dependent: responsible for the transmission and reception of individual bits on a physical medium. These responsibilities encompass bit timing, signal encoding, interacting with the physical medium, and the cable or wire itself.
    • Transmission Convergence: functions as a converter between the bit stream of ATM cells and the PMD sublayer. When transmitting, the TC sublayer maps ATM cells onto the format of the PMD sublayer.
  • ATM layer
    • ATM layer multiplexing blends all the different input types so that the connection parameters of each input are preserved. This process is known as traffic shaping.
    • ATM layer demultiplexing takes each cell from the ATM cell stream and, based on the VPI/VCI, either routes it (for an ATM switch) or passes the cell to the ATM Adaptation Layer (AAL) process that corresponds to the cell (for an ATM endpoint).
    • Supervises the cell flow to ensure that all connections remain within their negotiated cell throughput limits.
  • ATM Adaptation Layer (AAL): 
    • Convergence sublayer - specifies type of services for higher layers (transmission timing synchronization, connection-oriented or connection-less, constant/variable bit rate)
    • Segmentation and Reassembly sublayer - segment to or reassemble ATM cells
The AAL is only present in end systems, not in ATM switches

An AAL-layer segment is analogous to a TCP segment carried across many IP packets
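
A minimal sketch of the SAR idea: a higher-layer payload is chopped into fixed 48-byte cell payloads (each ATM cell being a 5-byte header plus a 48-byte payload) and reassembled at the far end; cell headers and AAL-type specifics are omitted.

```python
# Sketch of the SAR sublayer idea: segment a payload into 48-byte ATM cell
# payloads and reassemble them. Cell headers (5 bytes) are omitted here.

CELL_PAYLOAD = 48

def segment(payload: bytes) -> list:
    """Split a payload into 48-byte chunks, padding the last one."""
    cells = []
    for i in range(0, len(payload), CELL_PAYLOAD):
        chunk = payload[i:i + CELL_PAYLOAD]
        cells.append(chunk.ljust(CELL_PAYLOAD, b"\x00"))
    return cells

def reassemble(cells: list, original_length: int) -> bytes:
    """Concatenate cell payloads and strip the padding."""
    return b"".join(cells)[:original_length]

data = b"A" * 130
cells = segment(data)
print(len(cells), reassemble(cells, len(data)) == data)   # 3 True
```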

Short papers

Towards SmartFlow: Case Studies on Enhanced Programmable Forwarding in OpenFlow Switches

Problems
The limited capabilities of the switches make implementing unorthodox routing and forwarding mechanisms in OpenFlow a hard task => the goal is to explore the possibilities of slightly smartening up the OpenFlow switches.

Case studies
Add new features (matching mechanism, extra action) to flow tables

  • Using Bloom filter for stateless multicast 
  • Greedy routing => performed at switch rather than at controller
  • Network coding
An OpenFlow-Based Energy-Efficient Data Center Approach

Problem
IaaS providers suffer from the inherent heterogeneity of systems and applications from different customers => different loads and traffic patterns have to be handled

Solutions
  1. Overprovision to sustain a constant service quality => only feasible with a huge budget and a lot of resources
  2. Smart resource management => ECDC = machine information + network devices + environment data
Plug-n-Serve: Load-Balancing Web Traffic using OpenFlow

Problems
In a data center or a dedicated web-hosting service, the HTTP servers are connected by a regular, over-provisioned network, so the load-balancer usually does not consider the network state when load-balancing across servers => this assumption does not hold for unstructured networks (enterprise, campus) => traffic affects load-balancing performance and increases the response time of HTTP requests

Solutions - Plug-n-Serve
Load-balances over arbitrary unstructured networks and minimizes the average response time by considering network congestion and the load on both the network and the servers.
  • It determines the current state of the network and the servers, including the network topology, network congestion, and load on the servers.
  • It chooses the appropriate server to direct requests to, and controls the path taken by packets in the network, so as to minimize the response time
OpenFlow-Based Server Load Balancing Gone Wild

Problem
The switch used for load balancing in an SDN network becomes overloaded by a huge number of forwarding rules if one rule is installed per connection.

The Plug-n-Serve approach intercepts the first packet of each connection and uses the network topology and load to determine the target replica before forwarding the traffic => many rules (as noted above) and extra delay (since the controller is involved in every connection). This approach is called "reactive".

Solutions
Use wildcard rules to direct requests for larger groups of clients instead of per client/connection.
  • Based on the share of requests handled by each replica (called its weight), the authors proposed a partition algorithm to divide client traffic efficiently (a sketch follows after this list).
    • Build an IP-prefix tree whose height is the log of the sum of the replica weights
    • Assign leaf nodes to replicas in proportion to their weights.
    • Reduce the number of rules by using wildcards (for example, 01* can replace the two leaf nodes 010* and 011* to create a single rule for a replica)
  • How to handle moving traffic from one replica to another => note: existing connections should complete at the original replica => two ways:
    • The controller inspects each incoming packet; if it is a non-SYN packet, it keeps being sent to the old replica. Otherwise, the controller installs a rule to forward the packet (and subsequent ones belonging to the same connection) to the new replica
    • The controller installs high-priority rules on the switch to forward traffic to the old replica, and low-priority ones to forward it to the new replica. After a soft deadline with no traffic, the high-priority rule at the switch is deleted.
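
The sketch below illustrates the partition idea referenced above, assuming the replica weights sum to a power of two so the prefix tree is complete; the merging step is a simplification, not the paper's exact algorithm.

```python
# Rough sketch of wildcard partitioning (assumes weights sum to a power of two).
import math

def partition(weights: dict) -> list:
    """Assign leaf prefixes to replicas in proportion to their weights,
    then merge aligned runs of identical leaves into shorter wildcard rules.
    Returns a list of (prefix_bits, replica) wildcard rules."""
    total = sum(weights.values())
    height = int(math.log2(total))           # tree height = log2(sum of weights)

    # Step 1: hand out leaves left-to-right, weight-many per replica.
    leaves = []
    for replica, w in weights.items():
        leaves.extend([replica] * w)

    # Step 2: greedily merge aligned blocks of identical leaves into wildcards.
    rules, i = [], 0
    while i < total:
        replica = leaves[i]
        size = 1
        # grow the block while it stays aligned and uniform
        while (size * 2 <= total - i and i % (size * 2) == 0
               and all(l == replica for l in leaves[i:i + size * 2])):
            size *= 2
        prefix_len = height - int(math.log2(size))
        prefix = format(i >> int(math.log2(size)), f"0{prefix_len}b") + "*" if prefix_len else "*"
        rules.append((prefix, replica))
        i += size
    return rules

# Example: replicas weighted 4:2:2 over an 8-leaf tree (height 3).
print(partition({"R1": 4, "R2": 2, "R3": 2}))
# [('0*', 'R1'), ('10*', 'R2'), ('11*', 'R3')]
```
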
The authors also consider non-uniform traffic, as well as the case in which the network is composed of multiple switches rather than two (one gateway receiving client traffic and one for load balancing).

SDN-based Application-Aware Networking on the Example of YouTube Video Streaming

Problem
The northbound API enables information exchange between applications and the network plane => determine how different kinds of information (such as per-flow parameters, application signatures, and application quality parameters) can support more effective network management in an SDN-enabled network

Solution
Conduct 6 experiments: pure (baseline), with interfering traffic, round-robin path selection (the controller has no external information and just changes switch ports sequentially for incoming packets), DPI experiments (network information), and application-aware path selection (application state).

VL2: A Scalable and Flexible Data Center Network

General Problems
  • Cloud services require agility from the data center
  • A data center with a conventional network architecture can't fulfill that demand
    • Different branches of the network tree require different capacities (core-layer switches are oversubscribed by factors of 1:80 to 1:240, while lower-layer switches are around 1:5 or more)
    • It does not prevent a traffic flood from one service from affecting the others (services commonly suffer collateral damage)
    • Conventional networks achieve scale by assigning servers IP addresses and dividing them into VLANs => migrating VMs requires reconfiguration and human involvement => limits the speed of deployment
Realizing this vision concretely translates into building a network that meets the following three objectives:
  • Uniform high capacity
  • Performance isolation
  • Layer-2 semantics
For compatibility, changes to the current network hardware are limited; only the software and operating system on data center servers change.

Using a layer 2.5 shim in server's network stack to work around limitations of network devices.

VL2 consists of a network built from low-cost switch ASICs arranged into a Clos topology [2] that provides extensive path diversity between servers. To cope with traffic volatility, VL2 adopts Valiant Load Balancing (VLB) to spread traffic across all available paths without any centralized coordination or traffic engineering.

Problems in production data centers

To limit overheads (packet flooding, ARP broadcasts) => use VLANs for servers. However, this suffers from three limitations:
  • Limited server-to-server capacity (because servers are located in different VLANs): idle servers cannot be assigned to overloaded services
  • Fragmentation of resources: spreading a service outside a single layer-2 domain frequently requires reconfiguring IP addresses and VLAN trunks => avoided by reserving resources for each service to respond to overload (demand spikes, failures). This in turn incurs significant cost and disruption
  • Poor reliability and utilization: there must be sufficient remaining idle capacity on a counterpart device to carry the load if an aggregation switch or access router fails => each device and link is run at no more than 50% of its maximum utilization
Analysis and Comments

Traffic: 1) The ratio of traffic between servers inside the data center to traffic entering/leaving the data center is about 4:1. 2) Computation is placed where high-speed access to data is fast and cheap, even though data is distributed across multiple data centers (due to the cost of long-haul links). 3) Demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. 4) The network is a bottleneck to computation.

Flow distribution: most flows are around 100 MB even though total data sizes reach GBs, because files are broken into chunks and stored on various servers. A machine has more than 80 concurrent flows at least 5% of the time, and about 10 concurrent flows more than 50% of the time.

Traffic matrix: N/A

Failure Characteristics: a failure is defined as an event logged when a function stays pending for more than 30 s. Most failures are small in size (involving only a few devices), but downtime can be significant (95% of failures are resolved within 10 minutes, yet 0.09% last more than 10 days). VL2 moves from 1:1 redundancy to n:m redundancy.

VL2

Design principles:
  • Randomize to cope with volatility: using VLB to do destination-independent (e.g. random) traffic spreading across multiple intermediate nodes
  • Building on proven networking technology: using ECMP forwarding with anycast address to enable VLB with minimal control plane messaging or churn.
  • Separate names from locators: same as Portland
  • Embracing end systems

Scale-out topology
- Add intermediate switches between the aggregation switches => increases the bandwidth. This is an example of a Clos network.

- VLB: take a random path up to a random intermediate switch and a random path down to a destination ToR switch
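
A minimal sketch of VLB path selection under these assumptions (switch names are illustrative): each flow bounces off a randomly chosen intermediate switch on its way to the destination ToR.

```python
# Minimal sketch of VLB path selection (illustrative names): traffic is
# bounced off a randomly chosen intermediate switch, then forwarded down
# to the destination's ToR switch.
import random

INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]

def vlb_path(src_tor: str, dst_tor: str) -> list:
    """Random up-path to an intermediate switch, then down to the dst ToR."""
    intermediate = random.choice(INTERMEDIATE_SWITCHES)
    return [src_tor, intermediate, dst_tor]

# Each new flow may take a different bounce point, spreading load without
# any centralized coordination.
print(vlb_path("tor-5", "tor-9"))
print(vlb_path("tor-5", "tor-9"))
```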

VL2 Addressing and Routing
  • Packet forwarding, Address resolution, Access control via the directory service
  • Random traffic spreading over multiple paths: VLB distributes traffic across a set of intermediate nodes and ECMP distributes across equal-cost paths
    • ECMP problems: only 16-way splitting => define several anycast addresses; a switch cannot retrieve the five-tuple values when a packet is encapsulated with multiple IP headers => use a hash of the five-tuple instead
  • Backwards compatibility
VL2 directory system
Stores, looks up and updates AA-to-LA mappings
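
A bare-bones sketch of what the directory conceptually does: store, look up, and update AA-to-LA mappings so an application address can move without changing. The addresses below are made up.

```python
# Bare-bones sketch of the directory idea: AA (application address) ->
# LA (locator address of the ToR behind which the AA currently lives).

class Directory:
    def __init__(self):
        self.aa_to_la = {}

    def update(self, aa: str, la: str):
        """Register or move a server: its AA now lives behind locator LA."""
        self.aa_to_la[aa] = la

    def lookup(self, aa: str) -> str:
        """Resolve an application address to the locator used for tunneling."""
        return self.aa_to_la[aa]

d = Directory()
d.update("20.0.0.55", "10.1.1.1")     # AA hosted behind ToR with LA 10.1.1.1
print(d.lookup("20.0.0.55"))
d.update("20.0.0.55", "10.2.2.2")     # server (or VM) moved; AA unchanged
print(d.lookup("20.0.0.55"))
```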

Evaluation
  • Uniform high bandwidth: using goodput, efficiency of goodput
  • VLB fairness: evaluate effectiveness of VL2's implementation of VLB in splitting traffic evenly across the network.

Portland: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Problem: the routing, forwarding, and management protocols that we run in data centers were designed for the general LAN setting and are proving inadequate along a number of dimensions.

With that in mind, there are requirements for future scenarios:
  1. Any VM may migrate to any physical machine without changing its IP address (otherwise pre-existing TCP connections and application-level state would break)
  2. An administrator should not need to configure any switch before deployment (otherwise reconfiguration is required whenever a switch is migrated)
  3. Any end host may communicate with any other host along any communication path (fault tolerance)
  4. No forwarding loops (especially in a data center carrying a huge amount of data)
  5. Failure detection should be rapid and efficient
R1 and R2 require a single layer-2 fabric => IP addresses are not affected when migrating VMs
R3 requires a MAC forwarding table with a very large number of entries => impractical with switch hardware.
R5 requires an efficient routing protocol

Forwarding
Layer 3: small forwarding tables (thanks to hierarchically assigned IPs), failures are easily detected, but adding a new switch carries an administrative burden
Layer 2: less administrative overhead, but poor scalability

PortLand => Ethernet-compatible forwarding, routing and ARP, with the goal of meeting R1-R5.
- Scalable layer-2 routing, forwarding, addressing
- Uses a fabric manager that holds PMAC-to-IP mapping entries. A Pseudo MAC (PMAC) is a hierarchical address => efficient forwarding and routing, as well as easier VM migration.

How to work?
Case 1: A packet with an unknown MAC address from a host arrives at the ingress switch (IS)
1 - The IS creates an entry in its local PMAC table mapping the host's IP and actual MAC to a PMAC derived from the switch's position and the port the host is attached to
2 - Send this mapping to the fabric manager

The egress switch replaces the destination PMAC with the host's actual MAC, maintaining the illusion of unmodified MAC addresses at the destination host: the switch to which a host is connected rewrites the PMAC destination address to the real MAC for any traffic destined to that host (a sketch of PMAC encoding and rewriting follows below).
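
The sketch below renders the PMAC idea using the paper's 48-bit pod.position.port.vmid layout, with the ingress/egress rewriting shown as simple dictionary lookups; the table contents are illustrative, not a switch implementation.

```python
# Sketch of PortLand-style PMAC handling: a 48-bit pseudo MAC encodes
# pod.position.port.vmid (16:8:8:16 bits), so forwarding can be done on
# the hierarchical prefix. Table contents here are illustrative.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    raw = value.to_bytes(6, "big")
    return ":".join(f"{b:02x}" for b in raw)

# Ingress switch: rewrite the source AMAC to the PMAC so the rest of the
# fabric only ever sees hierarchical addresses.
def ingress_rewrite(frame: dict, amac_to_pmac: dict) -> dict:
    return {**frame, "src_mac": amac_to_pmac[frame["src_mac"]]}

# Egress switch: rewrite the destination PMAC back to the host's real AMAC,
# preserving the illusion of unmodified MAC addresses at the destination.
def egress_rewrite(frame: dict, pmac_to_amac: dict) -> dict:
    return {**frame, "dst_mac": pmac_to_amac[frame["dst_mac"]]}

pmac = encode_pmac(pod=2, position=1, port=3, vmid=1)
print(pmac)                                    # 00:02:01:03:00:01
frame = {"src_mac": "aa:bb:cc:dd:ee:01", "dst_mac": pmac}
print(egress_rewrite(frame, {pmac: "aa:bb:cc:dd:ee:02"}))
```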

Case 2: ARP broadcast to retrieve the MAC address for a given IP address
1 - The IS intercepts the broadcast request and forwards it to the fabric manager
2 - The fabric manager returns the PMAC if the IP exists in its tables
3 - If the IP doesn't exist in the fabric manager, the request is broadcast to all other pods.
4 - The reply sent by the owning host is then rewritten by its ingress switch (MAC replaced with PMAC) and forwarded to the fabric manager and the requesting host

Case 3: a newly migrated VM sends a gratuitous ARP with its new IP-to-MAC mapping. This ARP is forwarded to the fabric manager.
1 - Other hosts are unable to communicate with the VM because the PMAC they hold for it no longer exists.
2 - The fabric manager sends an invalidation message to the VM's previous switch, which then traps and handles subsequent packets destined to the invalidated PMAC

Gratuitous ARP: a packet whose source and destination IP are both set to the IP of the issuing host, sent to the broadcast MAC address (a byte-level sketch follows below). It is used for:

  • When a machine receives an ARP request containing a source IP that matches its own, it knows there is an IP conflict.
  • A host broadcasts a gratuitous ARP reply so that other hosts update their ARP tables.
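
The sketch below shows, at the byte level, what makes an ARP packet gratuitous: the sender IP equals the target IP and the frame is broadcast. Using the ARP reply opcode and a broadcast target MAC is one common convention among several, not the only valid encoding.

```python
# Sketch of a gratuitous ARP frame: sender IP == target IP (the host's own
# address) and a broadcast Ethernet destination, so every host on the
# segment can refresh its ARP table.
import struct

def gratuitous_arp(own_mac: bytes, own_ip: bytes) -> bytes:
    eth = b"\xff" * 6 + own_mac + b"\x08\x06"          # dst=broadcast, type=ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,              # hardware type: Ethernet
        0x0800,         # protocol type: IPv4
        6, 4,           # hardware / protocol address lengths
        2,              # opcode: reply (gratuitous ARPs are often replies)
        own_mac, own_ip,        # sender MAC / sender IP
        b"\xff" * 6, own_ip,    # target MAC (broadcast), target IP == sender IP
    )
    return eth + arp

pkt = gratuitous_arp(bytes.fromhex("aabbccddee01"), bytes([10, 0, 0, 5]))
print(len(pkt), pkt.hex())   # 42-byte frame
```
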
Distributed Location Discovery

PortLand switches use their position in the global topology to perform more efficient forwarding and routing.

PortLand switches periodically send a Location Discovery Message (LDM) out all of their ports, both to set their positions and to monitor liveness in steady state => helps detect switch and link failures

Packets will always be forwarded up to either an aggregation or core switch and then down toward their ultimate destination => avoids forwarding loops

Comments:
- Changes the way conventional switches work
- The fabric manager is a centralized point of failure, given the number of mapping entries it must hold
- Converting MAC to PMAC at the switch may increase delay