Introduction to network traffic modeling

This post is intended as an overview of network traffic modeling, which I read about in this article. I think it will be useful for my later work, which may involve analyzing, evaluating, or simulating network flows.

Some new concepts I have learned through this investigation are noted in this post.

Basic concepts

There are two common representations for modeling network traffic [1].

  1. Counting process: how many packets arrive during a given period of time
  2. Inter-arrival time process: how much time passes between two successive arrivals (denoted An)
In the case of compound traffic, where arrivals may happen in batches (several arrivals at the same moment), an additional sequence is introduced whose elements give the number of arrivals in each batch. I won't go into it here.

Poisson Model

The traffic arrivals are assumed to have the following properties:
  1. Independence
  2. Exponentially distributed inter-arrival times (note that the exponential distribution describes the time between events - the moments packets arrive - in a Poisson process); a small simulation sketch follows below
The Poisson process exhibits an important property:
  • It is "stateless", or memoryless; in other words, the numbers of arrivals in disjoint intervals are statistically independent.
Traffic Burstiness: the limitations of the Poisson model

The authors refer to the autocorrelation function of the arrival sequence {An}. As far as I understand, in a Poisson (renewal) process the arrivals have no relationship with the earlier ones because of the stateless (memoryless) property, whereas burstiness can give rise to correlations among arrivals => bursty traffic does not fit the Poisson model.

Let's continue with a simpler traffic model that overcomes some limitations of the Poisson model before considering the self-similar one.

Packet train model

The authors in [2] proposed an arrival model that introduces dependence between packets traveling between the same pair of end-nodes.

In the study, the authors used standard statistical tools, namely the mean, standard deviation, and coefficient of variation of time series, for evaluation. They also used the auto-correlation function (ACF) to show the relationship between an element of a time series, tn, and the previous one, tn-1. Another tool is the kth percentile, the value that splits a data set into two pieces: the lower piece contains k percent of the elements and the upper piece contains the remaining 100 - k percent. "90th percentile = a" means that 90% of the elements of the data set have a value less than or equal to a.
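Just to pin these tools down, here is a small sketch of my own (not from the paper; the sample values are made up) that computes the mean, standard deviation, coefficient of variation, lag-1 autocorrelation, and the kth percentile of a series of inter-arrival times.

import math

def describe(xs, k=90):
    """Basic descriptive statistics for a time series xs (list of numbers)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    cv = std / mean                                   # coefficient of variation
    # Lag-1 autocorrelation: how t_n relates to t_{n-1}.
    acf1 = sum((xs[i] - mean) * (xs[i - 1] - mean) for i in range(1, n)) / (n * var)
    # kth percentile: the value below which k% of the samples fall.
    kth = sorted(xs)[min(n - 1, int(round(k / 100 * (n - 1))))]
    return {"mean": mean, "std": std, "cv": cv, "acf1": acf1, "p%d" % k: kth}

print(describe([0.8, 1.1, 0.9, 3.2, 0.2, 1.0, 0.7, 2.9]))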

The model introduces new parameters:

  • inter-train time: a user parameter, depends on the frequency with which applications use the network
  • inter-car time: a system parameter, depends on the network hardware and software
In the Poisson model, these are merged into a single parameter: mean inter-arrival time.


In my opinion, even with the introduction of inter-car (and inter-train) times, this model does not directly address burstiness of network traffic across multiple time scales. Those fixed inter-arrival parameters make the proposed model too regular to capture the random occurrence of bursts.

Self-similar model

Now let's move to another model. I mostly read a number of references just to get a basic understanding of it. Its name comes directly from an intuitive property of fractal geometry, which is a great tool for imitating nature or "chaos" such as network traffic.

Many studies have shown that data network traffic has very high variability in terms of time and "space" (honestly, I don't get what space means here). [3] is one such example.

Because network data usually have a long-range-dependent temporal structure, conventional time series models are not appropriate (this is the claim as I understand it; I don't know the details). The self-similar model is a candidate for modeling network dynamics with such long-range correlations.

The authors in [4] also pointed out three implications of the self-similar nature of network traffic for traffic engineering purposes: modeling individual Ethernet sources, the inadequacy of conventional notions of burstiness, and effects on congestion management for packet networks.

References
[1] Traffic Modeling for Telecommunications Networks
[2] From Poisson Model to Self-Similarity: a Survey of Network Traffic Models
[3] High time-resolution measurement and analysis of LAN traffic: Implications for LAN interconnection
[4] On the Self-Similar Nature of Ethernet Traffic

So amazing, this mathematics-related stuff

Today it took me a whole day just to figure out what fractal geometry is. It is related to a model used for modeling network traffic. This fractal-based model is better than the Poisson one since it imitates traffic arrivals more accurately. Quite interesting, with many novel concepts. I'm listing them here and hope that one day I'll have the opportunity to dig into them deeply.
  • Fractal (for sure)
  • Chaos theory
  • Peano & Hilbert curve: damn, my love with Mathematics is mostly reborn when taking a look at this concepts. Intuitively and mathematically, a 1-dimensional curve can fill up a 2-dimensional space. The proposed solution is likely not one-to-one correspondence (mapping) between n-dimensional space and (n+1)-dimensional one. Really excited about it. It is stated that such space filling curve have non-integer dimension (or the number of dimension is between n and n+1).
  • Mandelbrot is the first one defined the concept of non-integer dimension in his study, namely "Fractals and the Geometry of Nature".
  • Expected value: the value you can expect for some kind of event.
  • Variance: intuitively, it measures how far a data set is spread out. Technically, it is the average of the squared differences from the mean.
  • Standard deviation: While variance gives you a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean. The standard deviation tells you how tightly your data is clustered around the mean. In order to get the standard deviation, take the square root of the sample variance.
  • Coefficient of variation: the ratio between standard deviation and mean, can be used to compare variability between different measures, especially useful to apply for measures but with various mechanisms.
Some definitions related to statistics are referred from this website.
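For my own reference, the formulas behind these definitions, for a discrete random variable X and a sample x_1,\ldots,x_n with sample mean \bar{x}, are:

\mathbb{E}[X]=\sum_i x_i p_i, \quad \mathrm{Var}(X)=\mathbb{E}\left[(X-\mathbb{E}[X])^2\right], \quad s^2=\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2, \quad s=\sqrt{s^2}, \quad \mathrm{CV}=\frac{s}{\bar{x}}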

Cementing High Availability in OpenFlow with RuleBricks

In this study, the authors propose a data structure called RuleBricks, which resides in the controller and represents (wildcard) forwarding rules, to address the limitations caused by the churn of replicas running "behind" OpenFlow switches (controller replicas). By modeling rules with RuleBricks, they also leverage the Chord protocol to define how such data structures apply to elements in an SDN network.

Problems

How can HA policies be added to Openflow's forwarding rules?

  • When a replica fails, the incoming flow should be reassigned to a designated backup (of course without corrupting session state).
  • Fragmentation of forwarding rules as replicas are created or destroyed
Solution - RuleBricks

Each brick corresponds to a segment of the address space. All flows with source IP in the address range covered by a brick will be forwarded to the same replica.

RuleBricks operates via three primitives: 
  1. Drop: add new active rules
  2. Insert: add new backup rules => planning for failure
  3. Reduce: make rules more efficient
As in a Chord ring, each replica is responsible for the objects (flows) that map to a portion of the address space. When a new replica is introduced, it usually requires dropping new bricks to reassign flows to the corresponding replicas.
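To get a feel for these primitives, here is a toy sketch of my own (a gross simplification, not the paper's actual data structure or API; all names and ranges are made up): each brick covers an address range and points to a replica, Drop puts active bricks on top of the stack, Insert slips backup bricks underneath, and a lookup returns the highest non-failed brick covering an address.

class Bricks:
    """Toy brick stack: later (higher) bricks take precedence over earlier ones."""
    def __init__(self):
        self.stack = []                          # list of (lo, hi, replica)

    def drop(self, lo, hi, replica):             # Drop: add an active rule on top
        self.stack.append((lo, hi, replica))

    def insert(self, lo, hi, replica):           # Insert: add a backup rule underneath
        self.stack.insert(0, (lo, hi, replica))

    def lookup(self, addr, failed=()):
        # The highest brick covering addr whose replica is still alive wins.
        for lo, hi, replica in reversed(self.stack):
            if lo <= addr <= hi and replica not in failed:
                return replica
        return None

b = Bricks()
b.drop(0, 127, "R1")
b.drop(128, 255, "R2")
b.insert(0, 255, "R3")                           # backup covering the whole space
print(b.lookup(50))                              # R1 (active brick)
print(b.lookup(50, failed={"R1"}))               # R3 (backup takes over)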

Algorithm Explanation

Consistent hashing
  • Context and problem
Suppose we have n computers and want to distribute some data across them. With a naive hashing method, we might come up with a very simple rule, hash(data) mod n, to determine which computer should store each item.

If we add an extra computer, then nearly all of the stored data (a fraction of about n/(n+1)) has to be relocated => very expensive
  • Solution - Consistent Hashing (a small sketch follows below)
  1. Hash the computers and the data items onto a circle.
  2. For each data item, move clockwise around the circle and place it on the next computer encountered.
  3. If a computer goes down, just move its data to the next computer on the circle => only a fraction of roughly 1/n of all the data needs to be moved
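Here is a minimal sketch of the idea in code (my own illustration; node names and keys are made up): both node names and keys are hashed onto the same ring, and a key is stored on the first node found moving clockwise.

import bisect, hashlib

def ring_pos(s):
    """Hash a string onto the ring [0, 2^32)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHash:
    def __init__(self, nodes):
        self.ring = sorted((ring_pos(n), n) for n in nodes)

    def lookup(self, key):
        # Move clockwise from the key's position to the next node on the ring.
        points = [p for p, _ in self.ring]
        i = bisect.bisect_right(points, ring_pos(key)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["node-0", "node-1", "node-2"])
print(ch.lookup("some-data"))    # the node sitting clockwise of hash("some-data")
# If that node goes down, only its keys move to the next node on the ring.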
Rendezvous hashing
  • Context and problem
How can a set of clients agree on which site, out of a set of sites (servers), to place a piece of data?

  • Solution - Highest Random Weight (HRW) Hashing

Each client computes the hash weights w1=hash(site1, data), w2=hash(site2, data), ..., wn=hash(siten, data) and selects the site that yields the highest weight. If a site fails, the client simply picks the site that yields the next highest value.

Note: with the consistent hashing method, if a server fails, all of its data is moved to the next server on the circle => that server may become overloaded. HRW does not suffer from this issue, because the hash-based selection spreads a failed site's data across the remaining sites.
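A tiny sketch of HRW (my own illustration; the site names and key are made up): each candidate site gets a weight from hashing (site, data) together, and the client simply picks the maximum; removing a failed site only re-decides the keys that site was winning.

import hashlib

def hrw_pick(sites, key):
    """Highest Random Weight: choose the site with the largest hash(site, key)."""
    def weight(site):
        return int(hashlib.md5(("%s:%s" % (site, key)).encode()).hexdigest(), 16)
    return max(sites, key=weight)

sites = ["siteA", "siteB", "siteC"]
winner = hrw_pick(sites, "object-42")
print(winner)
# If the winning site fails, drop it and the next-highest site is chosen.
print(hrw_pick([s for s in sites if s != winner], "object-42"))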

Tree of Caches - Harvest Cache

Cache is organized hierarchically. A user obtains a page by asking a nearby leaf cache. If neither this cache nor its siblings have the page, the request is forwarded to the cache's parent. If a page is stored by no cache in the tree, the request eventually reaches the root and is forwarded to the home site of the page.
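A small sketch of that lookup path (my own illustration, not Harvest's actual protocol; the URL and fetch function are placeholders): a cache checks itself, then its siblings, then escalates to its parent, and the root finally asks the page's home site.

class Cache:
    def __init__(self, parent=None, siblings=None):
        self.parent = parent
        self.siblings = siblings or []
        self.store = {}                              # url -> page

    def get(self, url, fetch_from_home_site):
        if url in self.store:                        # hit in this cache
            return self.store[url]
        for sib in self.siblings:                    # then ask the siblings
            if url in sib.store:
                return sib.store[url]
        if self.parent is not None:                  # then forward to the parent
            page = self.parent.get(url, fetch_from_home_site)
        else:                                        # root: go to the home site
            page = fetch_from_home_site(url)
        self.store[url] = page
        return page

root = Cache()
leaf = Cache(parent=root)
print(leaf.get("http://example.com/x", lambda u: "<page %s>" % u))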

Chord - P2P

Despite my love of researching P2P-related topics, I don't have a deep understanding of the Chord algorithm. So I decided to learn it gradually, beginning with a very short sentence: Chord specifies how keys are assigned to nodes and how a node can discover the value for a given key by first locating the node responsible for that key.

B4: Experience with a Globally-Deployed Software Defined WAN

Google's solution: build a WAN connecting multiple data centers, while facing the following overheads:
- WAN links are very expensive, and WAN routers consist of high-end, specialized equipment that places a premium on high availability.
- The WAN treats all bits the same => all applications are treated equally regardless of whether they deserve it.

Why use SDN and OpenFlow in B4 to provide connectivity among data centers?
  • Unique characteristics of data center WAN
    • Centralized control of applications, servers and LANs
    • Elastic bandwidth demand by applications
    • Moderate number of data centers (large forwarding tables are not required)
    • 30-40% utilization of WAN link entails unsustainable cost
  • Could not achieve the level of scale, fault tolerance, cost efficiency and control required for their network using traditional WAN architectures
  • Desire to deploy novel routing, scheduling, monitoring and management functionality and protocols more simply
  • Others (out of scope): rapid iteration on novel protocols, simplified testing environments, improved capacity planning available from a deterministic central TE server rather than capturing the synchronous routing behavior of distributed protocols, simplified management through a fabric-centric rather than router-centric WAN view

B4 Architecture

B4 is composed of three layers: 
  • Global layer: logically centralized applications, enable the central control of the entire network
  • Site controller: network control applications (NCA) and Openflow controllers (maintain network state based on NCA directives and switch events)
  • Switch hardware: B4 switches peer with traditional BGP routers => SDN-based B4 had to support interoperability with non-SDN WAN implementation.
    • Deploy routing protocols and traffic engineering as independent services
How to integrate existing routing protocols running on separate control servers with OpenFlow-enabled hardware switches?

Switch Design

Properties:
  • Be able to adjust transmission rates to avoid the need for deep buffers while avoiding expensive packet drops
  • Don't need large forwarding tables because B4 connects a relatively small set of data centers
  • Switch failures are usually caused by software rather than hardware issues => by moving software functionality off the switch hardware, software fault tolerance can be managed more easily
Develop an OpenFlow Agent:
  • Running as a user-level process on switch hardware
  • Connect to OpenFlow controller hosted by NCS
Network Control Functionality

Routing

To integrate existing routing protocols with Openflow-based switch, they implemented a Routing Application Proxy (RAP) to provide connectivity between Quagga and OF Switch:
  • BGP/ISIS route updates
  • routing-protocol packet flowing between OF switches and Quagga
  • interface update from switches to Quagga
Quagga acts as the control plane, running BGP/ISIS on the NCS (control plane only; there is no data plane)

RAP bridges Quagga and the OF switches. RAP caches the Quagga RIB and translates it into NIB entries for use by Onix (the platform for the OF switches?)

Traffic Engineering

Centralized TE architecture is composed of:

  • Network topology representing sites as vertices and site to site connectivity as edges.
  • Flow Group is defined as a (source site, destination site, QoS) tuple
  • Tunnel represents a site-level path, implemented as IP encapsulation
  • Tunnel Group maps FGs to a set of tunnels and weights

Bandwidth functions, TE Optimization Algorithm
A bandwidth function specifies the bandwidth allocation to an application given the flow's relative priority on an arbitrary, dimensionless scale called fair share.

TE Protocol and OpenFlow

A switch can play three roles, each of which involves corresponding OpenFlow messages (a toy sketch of the weighted split follows below).
  • an encapsulating switch initiates tunnels and splits traffic between them
  • a transit switch forwards packets based on the outer header
  • a decapsulating switch terminates tunnels and then forwards packets using regular routes
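To tie the TE pieces together, here is a toy sketch of my own (not B4's actual implementation; tunnel names, weights and flow IDs are made up): a Tunnel Group maps a Flow Group onto tunnels with weights, and an encapsulating switch can approximate the split by hashing each flow onto the weighted set.

import hashlib

def pick_tunnel(flow_id, tunnel_group):
    """tunnel_group: list of (tunnel_name, weight). Hash-based weighted split."""
    total = sum(w for _, w in tunnel_group)
    point = int(hashlib.md5(flow_id.encode()).hexdigest(), 16) % total
    edge = 0
    for name, w in tunnel_group:
        edge += w
        if point < edge:
            return name

# A Flow Group (source site, destination site, QoS) mapped onto two tunnels, 3:1.
tunnel_group = [("tunnel-1", 3), ("tunnel-2", 1)]
flows = ["flow-%d" % i for i in range(1000)]
share = sum(pick_tunnel(f, tunnel_group) == "tunnel-1" for f in flows) / 1000.0
print(share)        # roughly 0.75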

ATM - Concepts and Architecture

ATM is connection-oriented network (vs connection-less IP network).

Protocol reference model:
  • Physical layer
    • Physical Medium Dependent: responsible for the transmission and reception of individual bits on a physical medium. These responsibilities encompass bit timing, signal encoding, interacting with the physical medium, and the cable or wire itself.
    • Transmission Convergence: functions as a converter between the bit stream of ATM cells and the PMD sublayer. When transmitting, the TC sublayer maps ATM cells onto the format of the PMD sublayer
  • ATM layer
    • ATM layer multiplexing blends all the different input types so that the connection parameters of each input are preserved. This process is known as traffic shaping.
    • ATM layer demultiplexing takes each cell from the ATM cell stream and, based on the VPI/VCI, either routes it (for an ATM switch) or passes the cell to the ATM Adaptation Layer (AAL) process that corresponds to the cell (for an ATM endpoint).
    • Supervises the cell flow to ensure that all connections remain within their negotiated cell throughput limits.
  • ATM Adaptation Layer (AAL): 
    • Convergence sublayer - specifies type of services for higher layers (transmission timing synchronization, connection-oriented or connection-less, constant/variable bit rate)
    • Segmentation and Reassembly sublayer - segment to or reassemble ATM cells
The AAL is present only in end systems, not in ATM switches

An AAL-layer segment is analogous to a TCP segment carried in many IP packets

Short papers

Towards SmartFlow: Case Studies on Enhanced Programmable Forwarding in OpenFlow Switches

Problems
The limited capabilities of the switches make the implementation of unorthodox routing and forwarding mechanisms in OpenFlow a hard task => the goal is to inspect the possibilities of slightly smartening up the OpenFlow switches.

Case studies
Add new features (matching mechanism, extra action) to flow tables

  • Using Bloom filter for stateless multicast 
  • Greedy routing => performed at switch rather than at controller
  • Network coding
An OpenFlow-Based Energy-Efficient Data Center Approach

Problem
IaaS providers suffer from the inherent heterogeneity of systems and applications from different customers => different load and traffic patterns have to be handled

Solutions
  1. Overprovision to sustain a constant service quality => only feasible with a huge budget and a lot of resources
  2. Smart resource management => ECDC = machine information + network devices + environment data
Plug-n-Serve: Load-Balancing Web Traffic using OpenFlow

Problems
In a data center or a dedicated web-hosting service, the HTTP servers are connected by a regular, over-provisioned network, and the load balancer usually does not consider the network state when balancing load across servers => this assumption does not hold for unstructured networks (enterprise, campus) => traffic affects the performance of load balancing and increases the response time of HTTP requests

Solutions - Plug-n-Serve
Plug-n-Serve load-balances over arbitrary unstructured networks and minimizes the average response time by considering network congestion and the load on the network and the servers.
  • It determines the current state of the network and the servers, including the network topology, network congestion, and load on the servers.
  • It chooses the appropriate server to direct requests to, and controls the path taken by packets in the network, so as to minimize the response time
OpenFlow-Based Server Load Balancing Gone Wild

Problem
A switch in an SDN network that is used for load balancing becomes overloaded by a huge number of forwarding rules if one rule is installed per connection.

The Plug-n-Serve approach intercepts the first packet of each connection and uses the network topology and load to determine the target replica before forwarding the traffic => many rules (as said above) and extra delay (since the controller is involved for each connection). This approach is called "reactive".

Solutions
Use wildcard rule to direct requests for larger groups of clients instead of each client/connection.
  • Based on the share of requests among replicas (called the weight), the authors proposed a partition algorithm to divide client traffic efficiently (see the sketch after this list).
    • Build an IP-prefix tree whose height is the log of the sum of the replica weights
    • Assign leaf nodes to replicas in proportion to their weights.
    • Reduce the number of rules by using wildcards (for example, we can use 01* instead of the two leaf nodes 010* and 011* to create a corresponding rule for a replica)
  • How do we handle moving traffic from one replica to another? Note: existing connections should complete at the original replica => two ways:
    • The controller inspects the incoming packet; if it is a non-SYN packet, it keeps being sent to the old replica. Otherwise, the controller installs a rule to forward the packet (and the ones belonging to the same connection) to the new replica
    • The controller installs high-priority rules on the switch to forward traffic to the old replica, and low-priority ones to forward it to the new replica. After a soft deadline with no traffic, the high-priority rule at the switch is deleted.
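Here is my rough sketch of the partitioning idea (not the paper's exact algorithm; replica names and weights are made up): spread the leaf prefixes of a small IP-bit tree over the replicas in proportion to their weights, then merge sibling prefixes that point at the same replica into one shorter wildcard rule.

import math

def wildcard_rules(weights):
    """weights: dict replica -> integer weight; the sum should be a power of two."""
    total = sum(weights.values())
    depth = int(math.log2(total))
    # Assign the 2^depth leaf prefixes to replicas in proportion to their weights.
    leaves = (format(i, "0%db" % depth) for i in range(total))
    rules = {}
    for replica, w in weights.items():
        for _ in range(w):
            rules[next(leaves)] = replica
    # Merge sibling prefixes owned by the same replica into shorter wildcards.
    merged = True
    while merged:
        merged = False
        for p in list(rules):
            if not p or p not in rules:
                continue
            sibling = p[:-1] + ("1" if p[-1] == "0" else "0")
            if rules.get(sibling) == rules[p]:
                rules[p[:-1]] = rules.pop(p)
                rules.pop(sibling)
                merged = True
    return {prefix + "*": r for prefix, r in rules.items()}

print(wildcard_rules({"replica-1": 3, "replica-2": 1}))
# e.g. {'10*': 'replica-1', '11*': 'replica-2', '0*': 'replica-1'}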
The authors also consider non-uniform traffic, as well as the case in which the network is composed of multiple switches rather than two (one gateway switch receiving client traffic and the other doing load balancing)

SDN-based Application-Aware Networking on the Example of YouTube Video Streaming

Problem
The northbound API enables information exchange between applications and the network plane => the question is how different kinds of information (such as per-flow parameters, application signatures, and application quality parameters) can support more effective network management in an SDN-enabled network

Solution
They conduct six experiments: pure streaming, streaming with interfering traffic, round-robin path selection (the controller has no external information and just changes switch ports sequentially for incoming packets), DPI experiments (network information), and application-aware path selection (application state).

VL2: A Scalable and Flexible Data Center Network

General Problems
  • Cloud services require agility from the data center
  • Data centers with a conventional network architecture can't fulfill that demand
    • Different branches of the network tree require different capacities (switches at the core layer are oversubscribed by factors of 1:80 to 1:240, while those at lower layers are oversubscribed by 1:5 or more)
    • The network does not prevent a traffic flood from one service from affecting the others (services commonly suffer collateral damage)
    • Conventional networks achieve scale by assigning servers IP addresses and dividing them into VLANs => migrating VMs requires reconfiguration, and reconfiguration requires human involvement => limits the speed of deployment
Realizing this vision concretely translates into building a network that meets the following three objectives:
  • Uniform high capacity
  • Performance isolation
  • Layer-2 semantics
For compatibility, changes to current network hardware are kept to a minimum; only the software and operating system on data center servers are changed.

Using a layer 2.5 shim in server's network stack to work around limitations of network devices.

VL2 consists of a network built from low-cost switch ASICs arranged into a Clos topology [2] that provides extensive path diversity between servers. To cope with this volatility, we adopt Valiant Load Balancing (VLB) to spread traffic across all available paths without any centralized coordination or traffic engineering.

Problems in production data centers

To limit overheads (packet flooding, ARP broadcasts), conventional designs use VLANs to group servers. However, this suffers from three limitations:
  • Limited server-to-server capacity (when servers are located in different VLANs): idle servers cannot be assigned to overloaded services
  • Fragmentation of resources: spreading a service outside a single layer-2 domain frequently requires reconfiguring IP addresses and VLAN trunks => this is avoided by reserving resources for each service to respond to overload cases (demand spikes, failures), which in turn incurs significant cost and disruption
  • Poor reliability and utilization: there must be sufficient remaining idle capacity on a counterpart device to carry the load if an aggregation switch or access router fails => each device and link can be run at no more than about 50% of its maximum utilization
Analysis and Comments

Traffic: 1) The ratio of traffic volume inside the data center to traffic entering/leaving it is about 4:1. 2) Computation is focused where high-speed access to data is fast and cheap, even though data is distributed across multiple data centers (because of the cost of long-haul links). 3) Demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. 4) The network is a bottleneck to computation.

Flow distribution: Flows are around 100 MB even though total file sizes reach GBs, because files are broken into chunks and stored on various servers. At least 5% of the time a machine has more than 80 concurrent flows, but more than 50% of the time it has about 10.

Traffic matrix: N/A

Failure Characteristics: a failure is defined as an event in which a system or component is unable to perform its required function for more than 30 s. Most failures are small in size (they involve few devices), but downtime can be significant (95% of failures are resolved within 10 min, but 0.09% last more than 10 days). VL2 moves from 1:1 redundancy to n:m redundancy.

VL2

Design principles:
  • Randomize to cope with volatility: using VLB to do destination-independent (e.g. random) traffic spreading across multiple intermediate nodes
  • Building on proven networking technology: using ECMP forwarding with anycast address to enable VLB with minimal control plane messaging or churn.
  • Separate names from locators: same as Portland
  • Embracing end systems

Scale-out topology
- Add intermediate nodes between the aggregation switches => increases the bandwidth. This is an example of a Clos network.

- VLB: take a random path up to a random intermediate switch and a random path down to a destination ToR switch

VL2 Addressing and Routing
  • Packet forwarding, Address resolution, Access control via the directory service
  • Random traffic spreading over multiple paths: VLB distributes traffic across a set of intermediate nodes and ECMP distributes across equal-cost paths
    • ECMP problems: switches typically support only 16-way ECMP => define several anycast addresses; also, a switch cannot retrieve the five-tuple when a packet is encapsulated with multiple IP headers => use a hash of the five-tuple instead
  • Backwards compatibility
VL2 directory system
Store, lookup and update AA-to-LA mapping

Evaluation
  • Uniform high bandwidth: using goodput, efficiency of goodput
  • VLB fairness: evaluate effectiveness of VL2's implementation of VLB in splitting traffic evenly across the network.

Portland: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Problem: the routing, forwarding, and management protocols that we run in data centers were designed for the general LAN setting and are proving inadequate along a number of dimensions.

With that in mind, there're requirements for future scenarios:
  1. Any VM may migrate to any physical machine without changing its IP address (otherwise migration would break pre-existing TCP connections and application-level state)
  2. An administrator should not need to configure any switch before deployment (otherwise switches would have to be reconfigured whenever they are moved)
  3. Any end host may communicate with any other host along any of the available communication paths (fault tolerance)
  4. No forwarding loop (especially in data center with a huge amount of data)
  5. Failure detection should be rapid and efficient
R1 and R2 require a single layer-2 fabric => IP addresses are not affected when migrating VMs
R3 requires a large MAC forwarding table with a large number of entries => impractical with switch hardware.
R5 requires efficient routing protocol

Forwarding
Layer 3: small forwarding tables (thanks to hierarchically assigned IP addresses), failures are easily detected, but adding a new switch requires administrative effort
Layer 2: less administrative overhead, but poor scalability

Portland => Ethernet-compatible forwarding, routing and ARP with the goal of meeting R1 -> R5.
- Scalable layer-2 routing, forwarding, addressing
- Uses a fabric manager that holds PMAC-to-IP mapping entries. A Pseudo MAC (PMAC) is a hierarchical address => efficient forwarding and routing, as well as VM migration.

How does it work? (a small sketch follows below)
Case 1: A packet with an unknown source MAC address from a host arrives at an ingress switch (IS)
1 - The IS creates an entry in its local PMAC table mapping the IP and MAC of that host to a PMAC assigned by the IS
2 - It sends this mapping to the fabric manager

The ingress switch rewrites the source MAC to the PMAC for packets entering the fabric.
The egress switch rewrites the PMAC destination address back to the actual MAC for any traffic destined to a host connected to that switch, maintaining the illusion of an unmodified MAC address at the destination host.
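My own toy sketch of Case 1 and the MAC rewriting (heavily simplified; the real PMAC encodes pod.position.port.vmid, the real tables are OpenFlow flow entries, and all addresses here are made up):

fabric_manager = {}                         # IP -> PMAC, learned from edge switches

class EdgeSwitch:
    def __init__(self, pod, position):
        self.pod, self.position = pod, position
        self.local = {}                     # actual MAC -> (PMAC, IP)
        self.next_port = 0

    def learn_host(self, mac, ip):
        """Case 1: assign a PMAC to an unknown source host and report the mapping."""
        self.next_port += 1
        pmac = "%02x:%02x:%02x:%02x" % (self.pod, self.position, self.next_port, 1)
        self.local[mac] = (pmac, ip)
        fabric_manager[ip] = pmac           # step 2: tell the fabric manager
        return pmac

    def ingress_rewrite(self, pkt):         # source MAC -> PMAC on the way in
        pkt["src_mac"] = self.local[pkt["src_mac"]][0]
        return pkt

    def egress_rewrite(self, pkt):          # destination PMAC -> actual MAC on the way out
        for mac, (pmac, _) in self.local.items():
            if pmac == pkt["dst_mac"]:
                pkt["dst_mac"] = mac
        return pkt

sw = EdgeSwitch(pod=1, position=2)
sw.learn_host("aa:bb:cc:dd:ee:01", "10.0.0.5")
print(fabric_manager)                       # {'10.0.0.5': '01:02:01:01'}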

Case 2: ARP broadcast to retrieve MAC address of corresponding IP address
1 - The IS intercepts the broadcast request and forwards it to the fabric manager
2 - The fabric manager returns the PMAC if the IP exists in its tables
3 - If the IP doesn't exist in the fabric manager, the request is broadcast to all the other pods.
4 - The reply sent by the matching host is then rewritten by its ingress switch (MAC replaced with PMAC) and forwarded to the fabric manager and the requesting host

Case 3: a newly migrated VM sends a gratuitous ARP with its new IP-to-MAC address mapping. This ARP is forwarded to the fabric manager.
1 - Other hosts may be unable to communicate because the host behind the PMAC they expect no longer exists at that location.
2 - The fabric manager sends an invalidation message for that PMAC so that subsequent packets destined to the invalid PMAC are trapped and handled

Gratuitous ARP: a packet whose src_ip and dst_ip are both set to the IP of the host issuing it, sent to the broadcast MAC address. This is used for:

  • When a machine receives an ARP request containing a source IP that matches its own, it knows there is an IP conflict.
  • A host broadcasts a gratuitous ARP to other hosts so that they update their ARP tables.
Distributed Location Discovery

PortLand switches use their position in the global topology to perform more efficient forwarding and routing.

PortLand switches periodically send a Location Discovery Message (LDM) out all of their ports, both to establish their positions and to monitor liveness in steady state => helps detect switch and link failures

Packets will always be forwarded up to either an aggregation or core switch and then down toward their ultimate destination => avoid forwarding loop

Comments:
- Change the way that conventional switches work
- Fabric manager => centralized point of failure due to the number of mapping entries
- Converting MAC to PMAC at switch may increase the delay

Convex Function - Definition and Properties

Definition (convex set): A set X\subset \mathbb{R}^n is called convex if

x,y\in X\Rightarrow \lambda x+(1-\lambda)y\in X, \forall \lambda\in [0,1]

That is, if x,y\in X then the segment \left[x,y\right]\subset X.

Definition (affine set): A set X\subset \mathbb{R}^n is called affine if

x,y\in X\Rightarrow \lambda x+(1-\lambda)y\in X, \forall \lambda\in\mathbb{R}

That is, if x,y\in X then the whole line through x and y also lies in X.

Theorem (properties of affine sets):
  1. Every affine set is convex.
  2. If X is affine and a\in X, then the set L=\{y:\exists x\in X, y=x-a\}=X-a is a subspace of \mathbb{R}^n; moreover, L is uniquely determined by X and does not depend on a. We also write X=a+L.
  3. X is affine if and only if \exists A,b such that X=\{x:Ax=b\}.
Definition (convex combination): The linear combination \displaystyle\sum_{i=1}^n\lambda_i x_i is called a convex combination of x_1,\ldots,x_n\in\mathbb{R}^n if
\lambda_i\geq 0,\forall i, \displaystyle\sum_{i=1}^n\lambda_i=1

Definition (convex function): A function f on \mathbb{R}^n is convex if
  • its domain \mathrm{Dom}f is convex, and
  • for all x,y\in \mathrm{Dom}f and \lambda\in[0,1]: f(\lambda x+(1-\lambda)y)\leq\lambda f(x)+(1-\lambda)f(y) (*)
Theorem (Jensen's inequality): If f is convex and x_i\in\mathrm{Dom}f,\lambda_i\geq 0,\sum_{i=1}^n\lambda_i=1, then
f\left(\sum_{i=1}^n \lambda_i x_i\right)\leq\sum_{i=1}^n \lambda_i f(x_i)
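As a quick sanity check of definition (*), take f(x)=x^2 on \mathbb{R} (my own worked example): for any \lambda\in[0,1],

\lambda f(x)+(1-\lambda)f(y)-f(\lambda x+(1-\lambda)y)=\lambda x^2+(1-\lambda)y^2-(\lambda x+(1-\lambda)y)^2=\lambda(1-\lambda)(x-y)^2\geq 0

so f is convex. Applying Jensen's inequality with \lambda_i=1/n then gives \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2\leq\frac{1}{n}\sum_{i=1}^n x_i^2.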


Ref: http://csstudyfun.wordpress.com

How do debuggers work? - P.3

OK, as mentioned in the previous entry, let me continue discussing a problem a debugger faces - threads, or more exactly, multithreading.

When you want to stop a debuggee that's churning like mad, you need to get a breakpoint jammed into the CPU instruction stream so that you can stop in the debugger. The question is, what do you need to do to get the instruction in there? If a thread is running, the only thing you can do to get it to a known point is to suspend it by using the SuspendThread API function. Once the thread is suspended, you can look at it with the GetThreadContext API function and determine the current instruction pointer. Once you have the instruction pointer, you're back to setting simple breakpoints. After you set the breakpoint, you need to call the ResumeThread API function so that you can let the thread continue execution and have it hit your breakpoint.

Although breaking into the debugger is fairly simple, you still need to think about a couple of issues. The first issue is that your breakpoint might not trigger. If the debuggee is processing a message or doing some other work, it will break. If the debuggee is sitting there waiting for a message to arrive, however, the breakpoint won't trigger until the debuggee receives a message. Although you could require the user to move the mouse over the debuggee to generate a WM_MOUSEMOVE message, the user might not be too happy about this requirement.

Exploring Windows Memory Management - Virtual Memory


In this entry, I'll try to explain more the other strategy for managing memory - virtual memory.

Virtual Memory

The basic idea behind virtual memory is that the combined size of the program, data, and stack may exceed the amount of physical memory available for it. The operating system keeps those parts of the program currently in use in main memory, and the rest on the disk. For example, a 512-MB program can run on a 256-MB machine by carefully choosing which 256 MB to keep in memory at each instant, with pieces of the program being swapped between disk and memory as needed.

Virtual memory can also work in a multiprogramming system, with bits and pieces of many programs in memory at once. While a program is waiting for part of itself to be brought in, it is waiting for I/O and cannot run, so the CPU can be given to another process, the same way as in any other multiprogramming system.

  • Paging 
Most virtual memory systems use a technique called paging, which we will now describe. On any computer, there exists a set of memory addresses that programs can produce. When a program uses an instruction like

MOV REG, 1000 

It does this to copy the contents of memory address 1000 to REG (or vice versa, depending on the computer). Addresses can be generated using indexing, base registers, segment registers, and other ways.

These program-generated addresses are called virtual addresses and form the virtual address space. On computers without virtual memory, the virtual address is put directly onto the memory bus and causes the physical memory word with the same address to be read or written. When virtual memory is used, the virtual addresses do not go directly to the memory bus. Instead, they go to an MMU (Memory Management Unit) that maps the virtual addresses onto the physical memory addresses.

A very simple example of how this mapping works is shown in the following figure. In this example, we have a computer that can generate 16-bit addresses, from 0 up to 64K. These are the virtual addresses. This computer, however, has only 32 KB of physical memory, so although 64-KB programs can be written, they cannot be loaded into memory in their entirety and run. A complete copy of a program's memory image, up to 64 KB, must be present on the disk, however, so that pieces can be brought in as needed.

The virtual address space is divided up into units called pages. The corresponding units in the physical memory are called page frames. The pages and page frames are always the same size. In this example they are 4 KB, but page sizes from 512 bytes to 1 MB have been used in real systems. With 64 KB of virtual address space and 32 KB of physical memory, we get 16 virtual pages and 8 page frames. Transfers between RAM and disk are always in units of a page. When the program tries to access address 0, for example, using the instruction

MOV REG, 0

virtual address 0 is sent to the MMU. The MMU sees that this virtual address falls in page 0 (0 to 4095), which according to its mapping is page frame 2 (8192 to 12287). It thus transforms the address to 8192 and outputs address 8192 onto the bus. The memory knows nothing at all about the MMU and just sees a request for reading or writing address 8192, which it honors. Thus, the MMU has effectively mapped all virtual addresses between 0 and 4095 onto physical addresses 8192 to 12287. Similarly, an instruction

MOV REG, 8192

is effectively transformed into

MOV REG, 24576

because virtual address 8192 is in virtual page 2 and this page is mapped onto physical page frame 6 (physical addresses 24576 to 28671). As a third example, virtual address 20500 is 20 bytes from the start of virtual page 5 (virtual addresses 20480 to 24575) and maps onto physical address 12288 + 20 = 12308.

By itself, this ability to map the 16 virtual pages onto any of the eight page frames by setting the MMU's map appropriately does not solve the problem that the virtual address space is larger than the physical memory. Since we have only eight physical page frames, only eight of the virtual pages are mapped onto physical memory. The others, shown as crosses in the figure, are not mapped. In the actual hardware, a present/absent bit keeps track of which pages are physically present in memory.

What happens if the program tries to use an unmapped page, for example, by using the instruction

MOV REG, 32780

which is byte 12 within virtual page 8 (starting at 32768)? The MMU notices that the page is unmapped (indicated by a cross in the figure) and causes the CPU to trap to the operating system. This trap is called a page fault. The operating system picks a little-used page frame and writes its contents back to the disk. It then fetches the page just referenced into the page frame just freed, changes the map, and restarts the trapped instruction.

For example, if the operating system decided to evict page frame 1, it would load virtual page 8 at physical address 4K and make two changes to the MMU map. First, it would mark virtual page 1's entry as unmapped, to trap any future accesses to virtual addresses between 4K and 8K. Then it would replace the cross in virtual page 8's entry with a 1, so that when the trapped instruction is re-executed, it will map virtual address 32780 onto physical address 4108.

Now let us look inside the MMU to see how it works and why we have chosen to use a page size that is a power of 2. In the next figure we see an example of a virtual address, 8196 (0010000000000100 in binary), being mapped using the MMU map of the previous one. The incoming 16-bit virtual address is split into a 4-bit page number and a 12-bit offset. With 4 bits for the page number, we can have 16 pages, and with 12 bits for the offset, we can address all 4096 bytes within a page.



The page number is used as an index into the page table, yielding the number of the page frame corresponding to that virtual page. If the present/absent bit is 0, a trap to the operating system is caused. If the bit is 1, the page frame number found in the page table is copied to the high-order 3 bits of the output register, along with the 12-bit offset, which is copied unmodified from the incoming virtual address. Together they form a 15-bit physical address. The output register is then put onto the memory bus as the physical memory address.
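To convince myself of the mechanics, here is a small sketch of that 16-bit example in code (my own illustration; only the page-table entries mentioned in the text are filled in, and every other virtual page is treated as absent):

PAGE_SIZE = 4096                       # 4 KB pages -> 12-bit offset, 4-bit page number

# Virtual page -> page frame, for the mappings mentioned in the text.
page_table = {0: 2, 1: 1, 2: 6, 5: 3}

def translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:         # present/absent bit says "absent"
        raise RuntimeError("page fault at virtual address %d (page %d)" % (vaddr, page))
    return page_table[page] * PAGE_SIZE + offset

for vaddr in (0, 8192, 20500, 32780):
    try:
        print(vaddr, "->", translate(vaddr))   # 0->8192, 8192->24576, 20500->12308
    except RuntimeError as fault:
        print(fault)                           # 32780 lies on unmapped page 8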

Segmentation

This is another scheme for managing memory based on the virtual memory concept. Here, the virtual address space is a collection of segments. Each segment has a name and a length. Thus the addresses specify both the segment name and the offset within that segment. The user therefore specifies each address by two quantities: a segment name and an offset. Contrast this scheme with the paging one, in which the user specifies only a single address which is partitioned by the hardware into a page number and an offset, all invisible to the programmer.

For simplicity of implementation, segments are numbered and are referred to by a segment number rather than by a segment name. That is why a logical address in the segmentation approach consists of a two-tuple:

<segment-number, offset>

Although the user can refer to objects in the program by a two-dimensional address, the actual physical address is still, of course, a one-dimensional sequence of bytes. Thus, segment table is defined as a mapping from two-dimensional user-defined address to one-dimensional physical address. Each entry in the segment table has a segment base and a segment limit. The segment base contains the starting physical address where the segment resides in memory, whereas the segment limit specifies the length of the segment.
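A minimal sketch of that mapping (my own illustration; the base and limit values are made up): the segment number indexes the segment table, the offset is checked against the limit, and the physical address is just base + offset.

# Segment table: segment number -> (base, limit); the values are illustrative.
segment_table = {0: (1400, 1000), 1: (6300, 400), 2: (4300, 1100)}

def translate(segment, offset):
    base, limit = segment_table[segment]
    if offset >= limit:                # trap: offset beyond the segment's length
        raise RuntimeError("addressing error in segment %d" % segment)
    return base + offset

for seg, off in ((2, 53), (1, 500)):
    try:
        print((seg, off), "->", translate(seg, off))   # (2, 53) -> 4353
    except RuntimeError as err:
        print(err)                                     # (1, 500) exceeds limit 400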

Intel Architecture's strategy
Both paging and segmentation have advantages and disadvantages. In fact, some architectures provide both, and the Intel architecture is one of them. It supports both pure segmentation and segmentation with paging.

In a Pentium system, the CPU generates logical addresses, which are given to the segmentation unit. The segmentation unit produces a linear address for each logical address. The linear address is then given to the paging unit, which in turn generates the physical address in main memory. Thus, the segmentation unit and the paging unit together form the equivalent of the memory-management unit, as follows:
The Pentium architecture allows a segment to be as large as 4 GB, and the maximum number of segments per process is 16 K. The logical address space is divided into two partitions. The first consists of up to 8 K segments that are private to the process. The second consists of up to 8 K segments that are shared among all processes. As for paging, the page size is either 4 KB or 4 MB.

References:
  • Andrew S. Tanenbaum, Operating Systems Design and Implementation, Third Edition.
  • Abraham Silberschatz, Peter Baer Galvin, Greg Gagne, Operating System Concepts, 7th Edition

Exploring Windows Memory Management - Swapping

Before continuing with the second part about debuggers, there are some things associated with the way in which Windows manages memory. Therefore, I decided to bookmark that second part and have begun exploring the memory architecture of Windows.

This entry will take a closer look at two memory management approaches: swapping and virtual memory. The difference between them is the way in which a process is loaded into main memory. I know that fully understanding the memory architecture can't be achieved in just a few entries. Anyway, I hope that I can get at least a somewhat better knowledge of this concept than before :-)

Swapping

The operation of a swapping system is illustrated in the following figure. Initially, only process A is in memory. Then processes B and C are created or swapped in from disk. In (d) A is swapped out to disk. Then D comes in and B goes out. Finally A comes in again. Since A is now at a different location, addresses contained in it must be relocated, either by software when it is swapped in or (more likely) by hardware during program execution.



When swapping creates multiple holes in memory, it is possible to combine them all into one big one by moving all the processes downward as far as possible. This technique is known as memory compaction. It is usually not done because it requires a lot of CPU time. For example, on a 1-GB machine that can copy at a rate of 2 GB/sec (0.5 nsec/byte) it takes about 0.5 sec to compact all of memory. That may not seem like much time, but it would be noticeably disruptive to a user watching a video stream.

A point that is worth making concerns how much memory should be allocated for a process when it is created or swapped in. If processes are created with a fixed size that never changes, then the allocation is simple: the operating system allocates exactly what is needed, no more and no less.

If, however, processes' data segments can grow, for example, by dynamically allocating memory from a heap, as in many programming languages, a problem occurs whenever a process tries to grow. If a hole is adjacent to the process, it can be allocated and the process can be allowed to grow into the hole. On the other hand, if the process is adjacent to another process, the growing process will either have to be moved to a hole in memory large enough for it, or one or more processes will have to be swapped out to create a large enough hole. If a process cannot grow in memory and the swap area on the disk is full, the process will have to wait or be killed.

If it is expected that most processes will grow as they run, it is probably a good idea to allocate a little extra memory whenever a process is swapped in or moved, to reduce the overhead associated with moving or swapping processes that no longer fit in their allocated memory. However, when swapping processes to disk, only the memory actually in use should be swapped; it is wasteful to swap the extra memory as well. In the following figure we see a memory configuration in which space for growth has been allocated to two processes.


If processes can have two growing segments, for example, the data segment being used as a heap for variables that are dynamically allocated and released and a stack segment for the normal local variables and return addresses, an alternative arrangement suggests itself, namely that of the figure (b). In this figure we see that each process illustrated has a stack at the top of its allocated memory that is growing downward, and a data segment just beyond the program text that is growing upward. The memory between them can be used for either segment. If it runs out, either the process will have to be moved to a hole with sufficient space, swapped out of memory until a large enough hole can be created, or killed.

Ok, that's enough for one concept :-) The remaining memory management strategy - virtual memory - will be covered in the next entry.

References

  • Andrew S. Tanenbaum, Operating Systems Design and Implementation, Third Edition.

    How do debuggers work? - P.2

    I'll continue to talk about the way a debugger works after investigating memory management.


    Reading and Writing Memory

    Reading from a debuggee's memory is simple. ReadProcessMemory takes care of it for you. A debugger has full access to the debuggee if the debugger started it, because the handle to the process returned by the CREATE_PROCESS_DEBUG_EVENT debug event has PROCESS_VM_READ and PROCESS_VM_WRITE access. If the debugger attaches to the process with DebugActiveProcess, OpenProcess must be used to get a handle to the debuggee, and you need to specify both read and write access.

    Before I can talk about writing to the debuggee's memory, I need to briefly explain an important concept: copy-on-write. When Windows loads an executable file, Windows shares as many mapped memory pages of that binary as possible with the different processes using it. If one of those processes is running under a debugger and one of those pages has a breakpoint written to it, the breakpoint obviously can't be present in all the processes sharing that page. As soon as any process running outside the debugger executed that code, it would crash with a breakpoint exception. To avoid that situation, the operating system sees that the page changed for a particular process and makes a copy of that page that is private to the process that had the breakpoint written to it. Thus, as soon as a process writes to a page, the operating system copies the page.

    An interesting detail about the Win32 Debugging API is that the debugger is responsible for getting the string to output when an OUTPUT_DEBUG_STRING_EVENT comes through. The information passed to the debugger includes the location and the length of the string. When it receives this message, the debugger goes and reads the memory out of the debuggee. Note that multiple trace statements could easily change an application's behavior when running under a debugger: because all threads in the application stop when the debug loop is processing an event, calling OutputDebugString in the debuggee means that all your threads stop. Take a look at the following code to see how a debugger (WDBG) handles the OUTPUT_DEBUG_STRING_EVENT. Notice that the DBG_ReadProcessMemory function is the wrapper function around ReadProcessMemory from LOCALASSIST.DLL.

    static DWORD OutputDebugStringEvent ( CDebugBaseUser *           pUserClass  ,
                                          LPDEBUGGEEINFO             pData       ,
                                          DWORD                      dwProcessId ,
                                          DWORD                      dwThreadId  ,
                                          OUTPUT_DEBUG_STRING_INFO & stODSI       )
    {
        TCHAR  szBuff [ 512 ] ;
        HANDLE hProc = pData->GetProcessHandle ( ) ;
        DWORD  dwRead ;

        // Read the debug string out of the debuggee's memory.
        BOOL bRet = DBG_ReadProcessMemory ( hProc                           ,
                                            stODSI.lpDebugStringData        ,
                                            szBuff                          ,
                                            min ( sizeof ( szBuff )         ,
                                                  stODSI.nDebugStringLength ) ,
                                            &dwRead                          ) ;
        ASSERT ( TRUE == bRet ) ;
        if ( TRUE == bRet )
        {
            // Always NULL terminate the string.
            szBuff [ dwRead + 1 ] = _T ( '\0' ) ;
            // Convert CR/LFs if I'm supposed to.
            pUserClass->ConvertCRLF ( szBuff , sizeof ( szBuff ) ) ;
            // Send the converted string on to the user class.
            pUserClass->OutputDebugStringEvent ( dwProcessId , dwThreadId , szBuff ) ;
        }
        return ( DBG_CONTINUE ) ;
    }

    Breakpoint and Single Stepping 


    Most engineers don't realize that debuggers use breakpoints extensively behind the scenes to allow the debugger to control the debuggee. Although you might not directly set any breakpoints, the debugger will set many to allow you to handle tasks such as stepping over a function call. The debugger also uses breakpoints when you choose to run to a specific source file line and stop. Finally, the debugger uses breakpoints to break into the debuggee on command (via the Debug Break menu option in WDBG, for example).

    The concept of setting a breakpoint is simple. All you need to do is have a memory address where you want to set a breakpoint, save the opcode (the value) at that location, and write the breakpoint instruction into the address. On the Intel Pentium family, the breakpoint instruction mnemonic is "INT 3" or an opcode of 0xCC, so you need to save only a single byte at the address you're setting the breakpoint. Other CPUs, such as the Intel Merced, have different opcode sizes, so you would need to save more data at the address.

    Take a look at the SetBreakPoint function in the following code:

    int CPUHELP_DLLINTERFACE __stdcall SetBreakpoint ( PDEBUGPACKET dp      ,
                                                       ULONG        ulAddr  ,
                                                       OPCODE *     pOpCode  )
    {
        DWORD dwReadWrite = 0 ;
        BYTE  bTempOp     = BREAK_OPCODE ;
        BOOL  bReadMem ;
        BOOL  bWriteMem ;
        BOOL  bFlush ;
        MEMORY_BASIC_INFORMATION mbi ;
        DWORD dwOldProtect ;

        ASSERT ( FALSE == IsBadReadPtr ( dp , sizeof ( DEBUGPACKET ) ) ) ;
        ASSERT ( FALSE == IsBadWritePtr ( pOpCode , sizeof ( OPCODE ) ) ) ;
        if ( ( TRUE == IsBadReadPtr ( dp , sizeof ( DEBUGPACKET ) ) ) ||
             ( TRUE == IsBadWritePtr ( pOpCode , sizeof ( OPCODE ) ) )   )
        {
            TRACE0 ( "SetBreakpoint : invalid parameters\n!" ) ;
            return ( FALSE ) ;
        }

        // If the operating system is Windows 98 and the address is above 2 GB,
        // just leave quietly.
        if ( ( FALSE == IsNT ( ) ) && ( ulAddr >= 0x80000000 ) )
        {
            return ( FALSE ) ;
        }

        // Read the opcode at the location.
        bReadMem = DBG_ReadProcessMemory ( dp->hProcess    ,
                                           (LPCVOID)ulAddr ,
                                           &bTempOp        ,
                                           sizeof ( BYTE ) ,
                                           &dwReadWrite     ) ;
        ASSERT ( FALSE != bReadMem ) ;
        ASSERT ( sizeof ( BYTE ) == dwReadWrite ) ;
        if ( ( FALSE == bReadMem ) || ( sizeof ( BYTE ) != dwReadWrite ) )
        {
            return ( FALSE ) ;
        }

        // Is this new breakpoint about to overwrite an existing breakpoint opcode?
        if ( BREAK_OPCODE == bTempOp )
        {
            return ( -1 ) ;
        }

        // Get the page attributes for the debuggee.
        DBG_VirtualQueryEx ( dp->hProcess                        ,
                             (LPCVOID)ulAddr                     ,
                             &mbi                                ,
                             sizeof ( MEMORY_BASIC_INFORMATION )  ) ;

        // Force the page to copy-on-write in the debuggee.
        if ( FALSE == DBG_VirtualProtectEx ( dp->hProcess           ,
                                             mbi.BaseAddress        ,
                                             mbi.RegionSize         ,
                                             PAGE_EXECUTE_READWRITE ,
                                             &mbi.Protect            ) )
        {
            ASSERT ( !"VirtualProtectEx failed!!" ) ;
            return ( FALSE ) ;
        }

        // Save the opcode I'm about to whack.
        *pOpCode = (void*)bTempOp ;

        bTempOp = BREAK_OPCODE ;
        dwReadWrite = 0 ;

        // The opcode was saved, so now set the breakpoint.
        bWriteMem = DBG_WriteProcessMemory ( dp->hProcess     ,
                                             (LPVOID)ulAddr   ,
                                             (LPVOID)&bTempOp ,
                                             sizeof ( BYTE )  ,
                                             &dwReadWrite      ) ;
        ASSERT ( FALSE != bWriteMem ) ;
        ASSERT ( sizeof ( BYTE ) == dwReadWrite ) ;
        if ( ( FALSE == bWriteMem ) || ( sizeof ( BYTE ) != dwReadWrite ) )
        {
            return ( FALSE ) ;
        }

        // Change the protection back to what it was before I blasted the
        // breakpoint in.
        VERIFY ( DBG_VirtualProtectEx ( dp->hProcess    ,
                                        mbi.BaseAddress ,
                                        mbi.RegionSize  ,
                                        mbi.Protect     ,
                                        &dwOldProtect    ) ) ;

        // Flush the instruction cache in case this memory was in the CPU cache.
        bFlush = DBG_FlushInstructionCache ( dp->hProcess    ,
                                             (LPCVOID)ulAddr ,
                                             sizeof ( BYTE )  ) ;
        ASSERT ( TRUE == bFlush ) ;

        return ( TRUE ) ;
    }

    After you set the breakpoint, the CPU will execute it and will tell the debugger that an EXCEPTION_BREAKPOINT (0x80000003) occurred—that's where the fun begins. If it's a regular breakpoint, the debugger will locate and display the breakpoint location to the user. After the user decides to continue execution, the debugger has to do some work to restore the state of the program. Because the breakpoint overwrote a portion of memory, if you, as the debugger writer, were to just let the process continue, you would be executing code out of sequence and the debuggee would probably crash. What you need to do is to move the current instruction pointer back to the breakpoint address and replace the breakpoint with the opcode you saved when you set the breakpoint. After restoring the opcode, you can continue executing.

    There's only one small problem: How do you reset the breakpoint so that you can stop at that location again? If the CPU you're working on supports single-step execution, resetting the breakpoint is trivial. In single-step execution, the CPU executes a single instruction and generates another type of exception, EXCEPTION_SINGLE_STEP (0x80000004). Fortunately, all CPUs that Win32 runs on support single-step execution. For the Intel Pentium family, setting single-step execution requires that you set bit 8 on the flags register. The Intel reference manual calls this bit the TF, or Trap Flag. The following code shows the SetSingleStep function and the work needed to set the TF. After replacing the breakpoint with the original opcode, the debugger marks its internal state to reflect that it's expecting a single-step exception, sets the CPU into single-step execution, and then continues the process.

    BOOL CPUHELP_DLLINTERFACE __stdcall SetSingleStep ( PDEBUGPACKET dp )
    {
        BOOL bSetContext ;
        ASSERT ( FALSE == IsBadReadPtr ( dp , sizeof ( DEBUGPACKET ) ) ) ;
        if ( TRUE == IsBadReadPtr ( dp , sizeof ( DEBUGPACKET ) ) )
        {
            TRACE0 ( "SetSingleStep : invalid parameters\n!" ) ;
            return ( FALSE ) ;
        }
        // For the i386, just set the TF bit.
        dp->context.EFlags |= TF_BIT ;
        bSetContext = DBG_SetThreadContext ( dp->hThread , &dp->context ) ;
        ASSERT ( FALSE != bSetContext ) ;
        return ( bSetContext ) ;
    }
    After the debugger releases the process by calling ContinueDebugEvent, the process immediately generates a single-step exception after the single instruction executes. The debugger checks its internal state to verify that it was expecting a single-step exception. Because the debugger was expecting a single-step exception, it knows that a breakpoint needs to be reset. The single step caused the instruction pointer to move past the original breakpoint location. Due to that, the debugger can set the breakpoint opcode back at the original breakpoint location without any impact to the execution process. The operating system automatically clears the TF each time the EXCEPTION_SINGLE_STEP exception occurs, so there's no need for the debugger to clear it. After setting the breakpoint, the debugger releases the debuggee to continue running.

    One more thing to consider when writing a debugger concerns threads, especially in a multithreaded environment. But it seems that this entry is long enough, so I've decided to leave that for the third part.

    Great thanks to:
    • John Robbins, Debugging Applications, 2000.