Virtual Machine Checkpointing in High-Availability Environments

Guest post by: Eric Wheeler

Article Overview: In commercial environments, integrity is key to business operation; in many cases, confidentiality is ignored for the assurance of integrity. Integrity is the basis for different disk management schemes, which in turn lead to assurance. Assurance in terms of service availability is important for the mitigation of failure. By holding a strong base in integrity, systems can be made available quickly even after catastrophic failure. Virtual machine monitors offer separation between virtual machines to offer confidentiality; properly configured, two virtual machines are unable to communicate with each other outside of their design (barring unknown covert channels). The compromise of one virtual machine within a system does not imply the compromise of another.


Introduction

In commercial environments, integrity is key to business operation; in many cases, confidentiality is ignored for the assurance of integrity. Integrity is the basis for different disk management schemes, which in turn lead to assurance. Assurance in terms of service availability is important for the mitigation of failure. By holding a strong base in integrity, systems can be made available quickly even after catastrophic failure. Virtual machine monitors offer separation between virtual machines to offer confidentiality; properly configured, two virtual machines are unable to communicate with each other outside of their design (barring unknown covert channels). The compromise of one virtual machine within a system does not imply the compromise of another.

Over the past several years, academia has been abuzz with fresh ideas and research directions in virtualization. Much of this is due to virtualization technology reaching commodity hardware. New desktop systems are fast enough and sufficiently parallel to support virtualization out of the box. Business, academic and government organizations are excited about consolidating leading-edge and legacy systems onto the same physical hardware.

The goal of this paper is to discuss virtual machine checkpointing and how it relates to high availability environments. To reach this goal, background information about virtualization methods and data representation is necessary. Once these two topics have been covered, virtualization as it applies to security in terms of assurance, integrity and confidentiality will be explored.

Virtual Machine Monitor (VMM)

The virtual machine monitor, or hypervisor, is a system which mediates the simultaneous execution of multiple operating systems on the same physical hardware. In general, the operating systems may be unrelated in terms of implementation or architecture and are unaware that they are virtualized. The VMM holds direct access rights over the hardware and grants access to each virtual machine by explicitly allocating hardware resources. Most importantly, the VMM partitions memory and hardware resources among the virtual machines. In addition, the VMM schedules execution so that all virtualized operating systems are guaranteed CPU time and resource starvation is avoided.

Virtualization Techniques

All virtualization techniques described in this paper create the illusion of a 32-bit physical memory range 0x00000000 - 0xFFFFFFFF for the guest virtual machine. In addition, the physical hardware resources such as network, disk, video and serial interfaces are abstracted by the VMM. This abstraction allows the virtual machine to believe that it is running directly on physical hardware.

Modern processors are designed with several rings of execution. Typically ring-0 is the most privileged and has direct access to the hardware. Rings 1, 2, ... n are lesser-privileged modes of execution which cause a trap when privileged instructions are executed. For example, if a ring-3 user-space application attempts to access memory which is not loaded into physical memory, the processor traps to the privileged ring, which may map physical memory into the virtual address space of the process. In an unvirtualized (native) environment, the operating system receives the trap and services the memory request. If memory was allocated to the user-space process and the memory read/write was authorized by the operating system, the memory is swapped from disk into physical memory. Once the memory is loaded, the operating system maps its physical address into the virtual address space of the user-space process which was executing [10].

Similar operating system traps exist for accessing physical hardware. From the perspective of a user-space application, IO operations are requests to the operating system via software interrupts to perform the IO request on its behalf, then passing the result to the user-space application. This mechanism isolates the physical hardware from the user space process and implies a concept of privilege separation. The goal of any virtualization environment is to allow the execution of these processes without modification to user-space software and with minimal or no modification to the operating system being virtualized. In 1974, Popek and Goldberg [9] discussed three essential aspects of a well-behaved virtual machine:

1. Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects.

2. Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.

3. Safety: The VMM manages all hardware resources.

At the time of Popek and Goldberg's writing, the trap-emulate-return approach described above was the only implemented form of virtualization, though other alternatives such as software emulation were theoretically possible [1]. The return to hardware-assisted virtualization is discussed in the section below on fully virtualized environments.

Full Virtualization

Full virtualization is a mechanism which executes unmodified operating systems natively on the physical processor, yet still under the management of a VMM. Fully virtualized VMM implementations virtualize the hardware which is allocated to the virtual machine, though the virtual machine is unaware the hardware devices it is accessing are virtual [8]. To support full virtualization, the system processor must be designed to support it. The implementation of Intel-VT hardware virtualization technology will be used as an example. The trick of full virtualization is to allow traps such as page faults and software interrupts to take place without the virtualized operating system detecting that it is being virtualized. A common way to implement full virtualization is known as ring deprivileging. This refers to running the VMM in the most privileged ring, the virtualized operating system on another ring and the user-mode applications on a third ring. One proposed implementation places the VMM in ring-0, virtualized operating system in ring-1, and user-mode applications in ring-3. The problem that occurs in this scenario is known as ring aliasing [10,12].

Ring aliasing occurs when the operating system is executing in a ring which it was not written to execute in. For example, if the operating system is able to sense that it is actually running in ring-1 and expects to be in ring-0, it may abort execution. The VMM must emulate every check by the operating system of the current privilege ring and let it think it is running directly on hardware and within the ring privilege it intends to execute. Traditional 32-bit Intel Architecture (IA32) causes a trap only when privileged registers are written to, but not read from; some privileged instructions do not trap at all, are simply ignored, or are masked off. It is not possible to hide the ring of execution from software running within any ring of IA32 hardware, nor is it possible for the VMM to emulate these checks [10].

To address ring aliasing, Intel added VT-x extensions to IA32 to resolve the issues which prevented virtualization in previous processor generations. Prior to VT-x extensions, not all privileged instructions would cause exceptions to be handled by a VMM. As a solution, Intel created two classes of execution: VMX-root and VMX-non-root. In the VMX-root class of operation, the CPU behaves in much the same way as older generations: the VMM operates within the context of VMX-root. To start a virtual machine, the VMM configures a VMX-non-root context. From the perspective of an operating system running in a VMX-non-root context, the rings of privilege appear identical to those of a native processor. This allows a virtualized operating system to run in any ring it chooses since they are all available to the virtual machine. What the virtualized machine is unable to detect is that it is actually executing within a VMX-non-root environment: to the virtualized operating system, it is running directly on hardware [10].

An exception bitmap allows the VMM to configure which instructions executed within a VMX-non-root context should trap to the VMM and which should not. The VMM can also configure each VMX-non-root context to trap on specific IO writes, memory reads and memory writes. This allows the VMM to virtualize the hardware which the virtualized operating system is operating on. Details and information about implementation of Intel's VT-x and VT-i (Itanium) extensions are available in [10]. Software known to support Intel VT-x includes VMware 5.5, Xen 3.x, Kernel-based Virtual Machine (KVM), Parallels Virtual Machine, and Virtual Logix VLX [2].

Emulated and Binary Translated Virtualization

Virtualization environments where the hardware does not handle the trap-emulate-return virtualization are said to be emulated. Emulators allow operating systems with different processor architecture to execute on the same machine. For many years, Apple used processor emulators to dynamically translate x86 instructions to PowerPC in order to run Windows under MacOS. In "A Comparison of Software and Hardware Techniques for x86 Virtualization" [1], Adams and Agesen discuss the differences between software and hardware-assisted virtualization. VMware uses a form of virtualization known as binary translation to virtualize operating systems during runtime. The instruction stream is broken into intermediate objects and analyzed much like a regular expression. VMware acts on the way the expression is matched to manage rewriting the instruction stream during realtime execution. It is interesting to note that page faults, page table modifications and IO requests are emulated in software. In addition, VMware incorporates adaptive binary translation to cache and reduce the emulation code necessary for commonly executed procedures, such as page faults and system calls.

By contrast, hardware-assisted virtualization such as VT-x creates a new set of rings of execution as described in [10]. Intel's implementation creates a virtual page table and virtual machine control block which holds the virtualized state and control registers for the virtual machine. Page faults, IO operations, context switches and page table modifications must trap into the VMM, have the trap analyzed by the VMM, and return to the guest machine (note that syscalls do not require a trap). This trap-and-return mechanism (vmrun/vmexit instructions) is expensive and is invoked frequently by modern operating systems. However, execution which does not require a trap to the VMM runs at full native speed. Adams and Agesen note that the expensive vmrun/vmexit calls for kernel-mode routines actually cost more than the software emulated equivalent. In page fault scenarios, software emulation by adaptive binary translation is shown to be ~30x faster than hardware-assisted virtualization [1].

Software emulation is currently slower for system calls since they cannot be executed directly on hardware as they can in the case of hardware virtualization. For syscall-intensive applications, hardware virtualization is far more efficient. Binary translation incurs additional overhead because the intermediate objects must be checked against expressions. Software emulation only runs at full native speed for code which does not branch or access memory. Branches and memory read/write operations must be checked by the software emulator to ensure that they stay within the virtual machine space and do not reach into the VMM space, as both are running in the same context. As an optimization, VMware assumes that user-mode ring-3 code is well-behaved and will not access memory outside of the memory space allocated to it by the operating system. To enforce this assumption, memory addresses are truncated to force them into the address space of the virtual machine during binary translation [1].

Based on this analysis, it becomes clear that a hybrid approach offers the best of both worlds. Hardware virtualization offers full native speed for code with syscalls and user-mode code when executing context-switch free. Software emulation allows us to emulate expensive vmrun/vmexit instructions, IO operations, context switches and page faults with much less overhead. This is currently being researched by VMware. The goal is to eventually replace vmrun/vmexit instructions with less expensive routines and allow full native execution performance with hardware assisted virtualization technology [1].

Paravirtualized

Some implementations of virtual machines change the kernel of the guest operating system to streamline communication between the VMM (hypervisor) and virtualized OS. Paravirtualized systems typically perform much faster than emulated or fully virtualized implementations. The basic premise for paravirtualization is that all communication channels are well-defined, known and implemented by the guest kernel running under the VMM's control. By defining interfaces for disk, network, serial and video, emulating physical hardware devices is unnecessary. Most overhead in virtualization is due to IO operations. If a virtual machine runs with zero-IO, we would see 100% efficiency when comparing native and virtualized environments [10].

Current paravirtualized environments include Xen [7,12,14], User-Mode-Linux [5,11], and Denali [6]. Xen implements paravirtualization with a hypervisor (VMM) and a privileged virtual machine which handles drivers and IO (domain0). Xen refers to a virtual machine as a "domain"; unprivileged domains are generally referred to as domU, for user-mode domain. The privileged domain allocates (virtual) resources such as network and disk to the domU's and handles domain scheduling to allocate CPU time. Physical devices such as network adapters and disk controllers may also be allocated to domU's.

Other paravirtualized environments such as User-Mode-Linux (UML) implement paravirtualization by porting the kernel to a new "user-mode" architecture, implementing all hardware IO as syscalls to the parent operating system. For instance, new processes are implemented as threads and block read/writes are implemented as read() and write() syscalls. This allows executing the kernel as a user-mode application, thus sandboxing the kernel and all applications within it into the privileges of the user executing the kernel [11].
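As a rough illustration of block IO expressed as ordinary file operations on the parent operating system, the following Python sketch backs a toy block device with a regular host file. The class name, method names and 512-byte block size are assumptions made for this example only; UML's actual block driver lives inside the kernel and is not reproduced here.

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumed block size for this sketch


class FileBackedBlockDevice:
    """Toy block device whose blocks live in an ordinary host file."""

    def __init__(self, path, num_blocks):
        # Create the backing file at its full size if it does not exist yet.
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.truncate(num_blocks * BLOCK_SIZE)
        self.num_blocks = num_blocks
        self._file = open(path, "r+b")

    def read_block(self, n):
        # A block read becomes a seek plus read() on the host file.
        self._check(n)
        self._file.seek(n * BLOCK_SIZE)
        return self._file.read(BLOCK_SIZE)

    def write_block(self, n, data):
        # A block write becomes a seek plus write() on the host file.
        self._check(n)
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._file.seek(n * BLOCK_SIZE)
        self._file.write(data)
        self._file.flush()

    def close(self):
        self._file.close()

    def _check(self, n):
        if not 0 <= n < self.num_blocks:
            raise IndexError("block number out of range")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "disk0.raw")
    dev = FileBackedBlockDevice(path, num_blocks=8)
    dev.write_block(3, b"\x01" * BLOCK_SIZE)
    print(dev.read_block(3)[:4], dev.read_block(0)[:4])
    dev.close()
```

Because every operation goes through the host's normal file interface, the process running this code needs no more privilege than the user who owns the backing file, which mirrors the sandboxing property described above.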

Virtual Machine Disk Images

From the perspective of a virtual machine, a disk is a physical block device much like a SCSI or IDE device. This allows the virtual machine to format the device with its native filesystem as if it were on physical hardware. For example, Xen exports its virtual disks to virtual domains as /dev/xvdX where X is a letter a-z; other operating systems use their native disk naming scheme, which varies by operating system. Common virtual disk images exported to virtual machines include flat files, block devices and network block devices. This disk abstraction allows simple management from the perspective of the VMM.

Flat File Images

The VMM may be configured to manage virtual disks in many different ways. Most commonly, virtual disks are exported to virtual machines as flat files. To express this in terms of the VMM, we may have a 4GB boot filesystem named /var/vm/windows/disk0.raw. To the guest operating system, this same file may appear to be a 4GB hard drive on IDE Channel 0/Master. Since the disk image is a normal file just like any other on the filesystem, backing up the file and duplicating virtual machines is trivial. In addition, sparse files may be used to save space. If two virtual machines need to be configured nearly identically, one can be installed and configured as /var/vm/system1/disk0.raw. The second system can be duplicated by simply copying /var/vm/system1/disk0.raw to /var/vm/system2/disk0.raw and by making appropriate configuration changes in the new virtual machine. Manageability is extended further by the use of copy-on-write disk images, which are discussed below.
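The flat-file workflow above can be sketched in a few lines of Python. The helper names are hypothetical and the demo uses a temporary directory and a small 4MB image rather than /var/vm and 4GB; whether the file is actually stored sparsely depends on the host filesystem.

```python
import os
import shutil
import tempfile


def create_sparse_image(path, size_bytes):
    """Create a raw disk image with the given apparent size."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        # truncate() extends the file without writing data; filesystems that
        # support sparse files allocate no blocks for the unwritten range.
        f.truncate(size_bytes)


def clone_image(src, dst):
    """Duplicate a flat-file image for a second virtual machine."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    # Note: a plain copy may fill in the holes of a sparse file. Tools such as
    # GNU cp with --sparse=always can preserve them.
    shutil.copyfile(src, dst)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    system1 = os.path.join(root, "system1", "disk0.raw")
    system2 = os.path.join(root, "system2", "disk0.raw")
    create_sparse_image(system1, 4 * 1024 * 1024)
    clone_image(system1, system2)
    print(os.path.getsize(system1), os.path.getsize(system2))
```

After cloning, the second machine would still need its own configuration changes (hostname, network addresses and so on), as noted above.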

When performance is critical, file-based images are suboptimal as they are a filesystem operating within the VMM's filesystem. Using file-based disk images increases the latency of read/write operations since they must write through a second filesystem. This may cause commit-delay problems if the parent filesystem caches writes to the file-based disk image. If the parent filesystem caches critical writes from the guest operating system, the guest's assumptions about data commit to disk may be inaccurate. Data corruption is probable if these critical writes, which the guest assumes to be committed to disk, are still in the VMM's filesystem cache upon power failure.
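To make the commit-delay problem concrete, the sketch below shows one way a host-side process could avoid leaving a critical write in the filesystem cache: it asks the operating system to flush the file before reporting the write as complete. The function name is hypothetical, and this is only an illustration of the idea; how a particular VMM handles guest flush requests depends on its implementation and configuration.

```python
import os
import tempfile


def durable_write(path, offset, data):
    """Write data at offset and return only after requesting a flush to disk."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()             # push Python's buffer to the operating system
        os.fsync(f.fileno())  # ask the OS to commit its cached data to the device


if __name__ == "__main__":
    image = os.path.join(tempfile.mkdtemp(), "disk0.raw")
    with open(image, "wb") as f:
        f.truncate(4096)
    durable_write(image, 512, b"journal commit record")
    with open(image, "rb") as f:
        f.seek(512)
        print(f.read(21))
```

The trade-off is latency: each flushed write waits for the underlying storage, which is part of why file-based images are slower when the guest issues many synchronous writes.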

Block Device Images

One method for increasing the performance and reliability of virtual machine disk images is to directly export a block device from the VMM to the guest operating system. A block device may be an entire physical disk, disk partition (slice), network attached disk or a volume exported by a logical volume manager. In terms of management, the latter option is easiest to manipulate. If the disk needs to be grown or shrunk, its logical volume can be resized online, and many modern filesystems support online resizing (typically grow only) as well. Since the VMM has exported a block device to the guest operating system, writes from the guest operating system are committed directly to the exported block device. The guest virtual machine performance may also benefit in terms of caching at the hardware level since IO operations are writing to a real block device. Using a block device resolves the commit-delay problem described above.

One drawback of using block device-based virtual disks is volume management. When the virtual disk that a virtual machine uses is a block device, it is much more difficult to move, copy or duplicate the disk image. To duplicate a virtual disk image, it would be necessary to block-copy the entire volume to a file, transfer the file to the destination host and then block-copy the file back onto the destination volume. Because of the difficulty in copying logical volumes, modern logical volume managers support snapshots and volume cloning, which can make manipulating virtual disks easier; copying, duplicating and snapshotting to other hosts or physical volumes still requires block-by-block transfers of the logical volume.

Storage Area Networks (Network Block Devices)

Data centers and organizations with large data storage needs have started the move toward Storage Area Networks (SAN) to separate data nodes from compute nodes. A SAN typically incorporates some concept of logical volume management and exports volumes to the network which can be connected to and used as if they were physically attached block storage devices. SAN devices often masquerade as a SCSI device to the operating system, thus providing additional options for VMMs to allocate for guest virtual machines. In the world of virtual machines, SANs offer an interesting twist on how virtual machines can access disks. Since a virtual machine can connect directly to the SAN via network (iSCSI uses TCP/IP), the virtual machine monitor does not need to manage the virtual disk images for the guest. In effect, the guest has access to real disk volumes via the network and the VMM only has to pass network traffic. Thus, the overhead on the virtual machine monitor is reduced by moving the burden of disk IO to a SAN device. Interestingly, when a virtual machine does not have a disk exported to it by the VMM, the VMM is free to migrate the virtual machine state between physical virtual machine monitors. This is performed without disturbing the virtual machine since the network stack is carried with it. This concept is explored further in the Live Migration section of this paper.

Copy-on-Write Disk Images

When executing multiple virtual machines on a single VMM, a disk image is generally required for each virtual machine in execution. In systems where the differences between one machine and another are simple configuration changes (as was the case in the Flat File Images example above), it is advantageous to have a single base image and boot several virtual machines from it. Consolidating boot images into one base image promotes manageability and saves disk space since the actual changes between the disk images are minimal. Copy-on-write is the concept of leaving the base image untouched and directing each modified block to a separate image or "CoW" file at the time of the write. Subsequent reads of blocks which were previously written to the CoW file are read from the CoW file; all other reads are served from the base image. Therefore, if only 10MB of blocks change between the original base image and the new CoW image, the CoW image is only 10MB in size. We can use the CoW image to form an example where we have a base image and multiple virtual machines with changes directed to their own CoW file [13]:

/var/vm/disk0-base-image.raw

/var/vm/disk0-base-image.raw <== /var/vm/system1/disk0.cow

/var/vm/disk0-base-image.raw <== /var/vm/system2/disk0.cow

/var/vm/disk0-base-image.raw <== /var/vm/system3/disk0.cow

From the illustration above, you can see that systems 1, 2 and 3 each have their own disk0.cow image which holds changes to the virtual disk. To back up the systems in this example, only one copy of the base image would be necessary in addition to the CoW files for each virtual machine.
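The read and write paths of a copy-on-write image can be modelled with a short in-memory sketch. The class, its method names and the 512-byte block size are assumptions for illustration; real CoW formats store the overlay on disk with their own metadata layout.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class CowDisk:
    """A shared read-only base image plus a private overlay of changed blocks."""

    def __init__(self, base):
        self.base = base    # shared bytes object, never modified
        self.overlay = {}   # block number -> replacement block contents

    def read_block(self, n):
        # Blocks written since the overlay was created come from the overlay;
        # everything else is still served from the base image.
        if n in self.overlay:
            return self.overlay[n]
        start = n * BLOCK_SIZE
        return self.base[start:start + BLOCK_SIZE]

    def write_block(self, n, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.overlay[n] = bytes(data)

    def overlay_bytes(self):
        """Size of the changes held by this machine's CoW layer."""
        return len(self.overlay) * BLOCK_SIZE


if __name__ == "__main__":
    base_image = bytes(BLOCK_SIZE * 8)   # stands in for disk0-base-image.raw
    system1 = CowDisk(base_image)        # stands in for system1/disk0.cow
    system2 = CowDisk(base_image)        # stands in for system2/disk0.cow
    system1.write_block(2, b"\x01" * BLOCK_SIZE)
    print(system1.read_block(2)[:2], system2.read_block(2)[:2])
    print(system1.overlay_bytes(), system2.overlay_bytes())
```

Each machine sees its own changes while the base image stays identical for all of them, which is why a backup needs only one copy of the base plus the per-machine overlays.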

High Availability

In high availability environments, long-running systems with high uptimes are necessary. Today's hardware contains many indicators for impending failure of a device. These failure indicators are included in memory, hard drives, processors, power supplies, fans, voltage regulators and other components as well. Temperature sensors are found in abundance on most any device with a heat sink, a fan or battery. These early-warning systems are designed to alert system administrators that a server, component or device on their network needs attention. Given the availability of monitoring devices to notify administrators, we are able to schedule maintenance on hardware in advance rather than managing hardware failures on an emergency basis. If a system is designed such that any node can fail and the operation of the system continues, even if in lesser capacity, we can be confident in its reliability. To design a system which incorporates failover and migration, this discussion now turns to virtual machine checkpointing [12].

Checkpointing

A complete virtual machine can be defined by the system processing state and its accompanying data storage. We refer to the system processing state as a virtual machine system image, or simply VM image for short. By taking a snapshot of the VM image and disk image simultaneously, we can save the state and restore it to exactly the point of the snapshot. The general process of VM snapshotting is to pause the machine, save the VM image, save the disk image and then resume the virtual machine. The biggest issue of checkpointing is latency, since the entire virtual machine is paused during a checkpoint. If the system is paused long enough, users may notice that resources are no longer available; for extreme delays, network timeouts may cause connection failures [3].

During any virtual machine checkpoint operation, it is necessary to minimize the time a VM is suspended. This can be done by reducing the VM size or disk size to be checkpointed. Minimizing a VM image is not so simple; if a virtual machine is using 1GB of RAM, then its VM image is 1GB in size, at a minimum. Even with arrays which can write to disk at ~125MB/s, writing 1GB takes eight seconds, so the checkpoint pauses the machine for at least that long. This does not take into account the time necessary to checkpoint the disk image [13].
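The eight-second figure follows directly from dividing the image size by the write rate. A minimal sketch of that arithmetic, using decimal units (1GB = 10^9 bytes) as an assumption and a hypothetical function name:

```python
MB = 10**6
GB = 10**9


def checkpoint_pause_seconds(image_bytes, write_bytes_per_sec):
    """Lower bound on pause time if the VM stays suspended while its image is written."""
    return image_bytes / write_bytes_per_sec


if __name__ == "__main__":
    print(checkpoint_pause_seconds(1 * GB, 125 * MB))  # prints 8.0
```

The pause grows linearly with memory size, so a 4GB guest on the same array would be suspended for roughly half a minute under this simple pause-and-copy scheme.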

Virtual Machine System Image Checkpointing

The concept of copy-on-write for disk images discussed above is applied here in terms of the virtual machine system image. At the moment that the VMM checkpoints a machine, it defers all memory writes to a separate memory space by forcing a page fault for each memory write. Memory blocks which have not been written to since the checkpoint can still be read from the original VM image. Since memory writes are deferred to a separate memory region, the VMM can begin a thread lazily writing the VM state to a file. Once the checkpointed VM image is written to disk (eight seconds in the example above), the memory which was being held for the checkpoint can be freed. Most importantly, the VM is in operation for the entire time of checkpoint. Xen uses a form of this checkpoint method which will be discussed below under Live Migration [7]. This form of checkpoint is also used in distributed multi-node applications to checkpoint an entire cluster, provided that all nodes checkpoint within the time of a TCP socket timeout period. This type of technology increases the stability of long-running applications, which are more prone to failure since all nodes are necessary for operation [3].

Disk Checkpointing

Disk checkpoints can be made in the same form by chaining copy-on-write files. Assume there is a disk /var/vm/disk0.raw, and a CoW file /var/vm/disk0-ck0.cow. At the moment a VMM initiates a checkpoint, writes to disk should be deferred to a separate copy-on-write file: /var/vm/disk0-ck1.cow. While the writes are deferred to the second CoW file, /var/vm/disk0-ck0.cow can be lazily copied to the checkpoint repository. Once the checkpoint has been made, the VMM can commit the second CoW file to the first and begin using the first in anticipation of the next checkpoint [13].

1. Normal Operation: Defer writes to disk0-ck0.cow for this system

disk0.raw <== disk0-ck0.cow

2. During Checkpoint: Defer writes to ck1; copy ck0 to backup

disk0.raw <== disk0-ck0.cow

disk0-ck0.cow <== disk0-ck1.cow

3. After Checkpoint: commit disk0-ck1 to disk0-ck0

disk0-ck0.cow <=COMMIT== disk0-ck1.cow

4. Return to Normal Operation

In the example above, the only time the VM must be paused is during the commit of ck1 to ck0. Provided that the number of blocks changed during the last checkpoint interval is small, this process can take place quickly. If ck1 and ck0 are actually held in the same copy-on-write file, a commit may be as simple as changing the block relocation bitmap within the copy-on-write file; changing only metadata would allow a much shorter commit time.
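The four steps above can be sketched with dictionaries standing in for the image files. The class and method names are hypothetical, and the lazy copy to the checkpoint repository is reduced to returning a copy of ck0; a real implementation would stream it to storage in the background.

```python
class CheckpointedDisk:
    """Chained copy-on-write layers: base <== ck0 <== ck1 (only during a checkpoint)."""

    def __init__(self, base_blocks):
        self.base = dict(base_blocks)  # stands in for disk0.raw, read-only
        self.ck0 = {}                  # stands in for disk0-ck0.cow
        self.ck1 = None                # stands in for disk0-ck1.cow

    def write(self, n, data):
        # Step 1: normal operation writes to ck0.
        # Step 2: during a checkpoint, writes are deferred to ck1 instead.
        target = self.ck1 if self.ck1 is not None else self.ck0
        target[n] = data

    def read(self, n):
        # Newest layer wins: ck1, then ck0, then the base image.
        if self.ck1 is not None and n in self.ck1:
            return self.ck1[n]
        if n in self.ck0:
            return self.ck0[n]
        return self.base.get(n)

    def begin_checkpoint(self):
        if self.ck1 is not None:
            raise RuntimeError("checkpoint already in progress")
        self.ck1 = {}
        # ck0 no longer changes, so it can be copied to the repository lazily.
        return dict(self.ck0)

    def commit(self):
        # Step 3: fold ck1 into ck0. This is the only point where the VM
        # would need to be paused. Step 4: return to normal operation.
        if self.ck1 is None:
            raise RuntimeError("no checkpoint in progress")
        self.ck0.update(self.ck1)
        self.ck1 = None


if __name__ == "__main__":
    disk = CheckpointedDisk({0: b"boot", 1: b"data"})
    disk.write(1, b"v1")
    saved = disk.begin_checkpoint()
    disk.write(0, b"v2")          # lands in ck1 while ck0 is being copied
    disk.commit()
    print(saved, disk.ck0, disk.read(0), disk.read(1))
```

The cost of commit() is proportional to the number of blocks written during the checkpoint, which matches the observation that a short checkpoint interval keeps the pause short.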

Continuous Checkpointing

One research project known as ReVirt is working to allow complete instruction-by-instruction replay of a virtual machine. The intent of this project is to travel back in time and replay from checkpoint history for forensic analysis. ReVirt has been implemented by modifying UMLinux. UMLinux is a kernel module which allows the paravirtualization of Linux under Linux. The attractiveness of UMLinux is that few syscalls between the guest and host kernel are required to virtualize a host. ReVirt patches the VMM kernel module to log all system calls and interrupt events. When an interrupt or system call is made, the instruction pointer as well as specific information about the interrupt or system call is saved to the log so that it may be replayed in the future. Network activity sent to the virtual machine is logged, but activity sent from the VM is not as it will be regenerated deterministically during replay. During the replay of a log, ReVirt uses the x86 branch counter to fault when the branch count matches the count recorded in the log. Once reaching a known branch count, ReVirt sets software breakpoints at each instruction it executes until the instruction pointer is exactly at the location of the call. At this point, it passes execution back to the virtual machine. In general, this minimizes the breakpoint-stepping to less than 128 instructions [4].

Continuous checkpoint technology is attractive for many reasons. The virtual machine is never paused during operation for longer than is necessary to record syscall and interrupt information into the VMM's replay log. As noted above, ReVirt can use the replay log for forensic analysis. Using predicates defined for specific exploits, a system administrator can determine if the machine has ever been exploited based on the checkpoint history, even if the exploit was not known at the time of deployment. The paper describes using the ptrace-suid race condition attack of the 2.2.19 kernel to escalate privileges, replace /bin/ls and place a backdoor in xinetd. The ReVirt paper further explains that every step of the attack could be replayed instruction by instruction, and be inspected at each point of the attack. Current research is under way toward implementing ReVirt for Xen [4].

Live Migration

As hardware reports warning of impending failure, system administrators scramble to replace hardware in order to keep systems running with 99.999% uptime. If systems could detect failure and move their executing code from a failing node to a functional node, the job of a system administrator would be simplified. We have seen how virtualization offers assurance and system-level backups by the use of checkpointing; this functionality can be extended by checkpointing a system and starting it on a separate node. One method of such live migration is currently implemented in Xen. Xen touts the ability to transfer a live virtual machine from one physical node to another across the network in as little as 60ms. If we quickly calculate the transfer rate of a 1GB VM system image in 60ms, this would require roughly 16GB/s of throughput; certainly this is not directly possible with network interconnection technology at the time of this writing. Xen implements migration similar to a copy-on-write checkpoint between the source and destination system in order to continue operation as long as possible on one node before finally cutting over to the new node. Thus, the time that the virtual machine is paused is minimal [7].
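The throughput estimate can be checked with a line of arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes) and a hypothetical function name:

```python
GB = 10**9


def required_throughput(image_bytes, window_ms):
    """Bytes per second needed to move the whole image within the given window."""
    return image_bytes * 1000 / window_ms


if __name__ == "__main__":
    rate = required_throughput(1 * GB, 60)
    print(f"{rate / GB:.1f} GB/s")  # prints "16.7 GB/s"
```

For comparison, a gigabit link carries about 0.125GB/s, so copying the full image during the pause would need more than a hundred times that capacity. This is why the image must be copied while the machine is still running.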

During migration, the VM system image is copied from one node to the other. At the time the VMM initiates the migration all pages are flagged read-only, forcing a page fault for each memory write. When a memory write is made, Xen marks the page written to as "dirty." At the moment that all "clean" pages have been transferred to the new node, Xen pauses the virtual machine and copies the (few) remaining pages which were marked "dirty" into the other node. As soon as the new node has incorporated the dirty pages into the VM system image, the virtual machine is resumed. If we recalculate the 60ms transfer time and assume gigabit network interfaces, the data transferred between the two nodes while the virtual machine was paused is only 7.5MB [7, 12].
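The sketch below simulates the single-pass scheme as described in this section: pages are sent while the guest keeps running, writes during the copy mark pages dirty, and only the dirty pages are sent while the guest is paused. The function names and the way guest writes are scheduled are assumptions for illustration, and this is not Xen's implementation.

```python
def live_migrate(source, guest_writes):
    """
    source: list of page contents on the source node (updated by guest writes).
    guest_writes: {step: (page_index, data)}, a guest write that happens just
                  after the page at position `step` has been sent.
    Returns (pages on the destination, number of pages copied while paused).
    """
    dest = [None] * len(source)
    dirty = set()

    # Phase 1: the guest keeps running. Pages are read-only, so each write
    # faults and the written page is marked dirty.
    for i in range(len(source)):
        dest[i] = source[i]
        if i in guest_writes:
            page, data = guest_writes[i]
            source[page] = data
            dirty.add(page)

    # Phase 2: the guest is paused and only the dirty pages are copied again.
    for page in dirty:
        dest[page] = source[page]
    return dest, len(dirty)


def bytes_sent_while_paused(link_bytes_per_sec, pause_ms):
    """Data that fits through the link during the pause."""
    return link_bytes_per_sec * pause_ms / 1000


if __name__ == "__main__":
    pages = [b"p0", b"p1", b"p2", b"p3"]
    writes = {1: (0, b"new0"), 2: (3, b"new3")}
    dest, paused_pages = live_migrate(pages, writes)
    print(dest == pages, paused_pages)
    # Gigabit Ethernet is about 125MB/s; a 60ms pause moves 7.5MB.
    print(bytes_sent_while_paused(125_000_000, 60))  # prints 7500000.0
```

The pause length depends on how many pages the guest dirties during the copy rather than on total memory size, which is how a 1GB machine can be cut over in tens of milliseconds.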

Live data migration is not without its challenges; Xen only transfers the VM system image and does not manage any disk transfer. SAN implementations were discussed earlier under Virtual Machine Disk Images. A SAN allows a virtual machine to use network-attached block devices as boot media, offloading IO operations from the virtual machine monitor. If all disk media are network-attached, migration becomes possible. Since Xen migrates the system state including network devices and the state of the virtual machine's network stack, network block devices are unaffected. This allows the virtual machine to be migrated without requiring Xen to manage migration of the disk image. Current research is directed toward managing disk image migration along with the VM system image; however, the results have not been published at the time of this writing.

Conclusion

Checkpointing and live migration offer concrete mechanisms for implementing assurance and integrity for systems requiring high availability. The built-in segmentation of hardware devices and memory between virtual machines offers separation of duty and helps enforce confidentiality. With the proper architecture of a virtualized environment and networked infrastructure, virtualization can be part of a design to enforce security policies. Most importantly, issues of confidentiality, assurance and integrity are addressed with these advances in technology. The renewed excitement about virtualization among academics and researchers is producing innovative ideas which have become possible due to recent hardware and software developments. Virtualization research has exploded and great momentum is in place. Projects and areas of research such as continuous checkpointing, live virtual machine migration and other checkpoint methods are bolstering assurance and integrity for enterprise, academic and government applications. Strict separation of virtual machines and the ability to "time travel" between checkpoints offer increased security and new forensic methods. The forefront of technology in the world of virtualization is expanding and we are only at the beginning of the wave.

Annotated Bibliography

[1]. K. Adams, O. Agesen. "A Comparison of Software and Hardware Techniques for x86 Virtualization" ASPLOS'06 October 21-25, 2006, San Jose, California ACM 1-59593-451-0/06/0010. Available www.vmware.com/pdf/asplos235_adams.pdf 26 November 2006. In this paper, Adams and Agesen discuss the differences between software and hardware-assisted virtualization; VMware and Intel's VT-x / AMD-V implementations are compared. VMware uses a form of virtualization known as binary translation to virtualize operating systems during runtime. The instruction stream is broken into intermediate objects and analyzed and emulated in terms of regular expression matches. It is interesting to note that page faults, page table modifications and IO requests are emulated in software. In addition, VMware incorporates adaptive binary translation to cache and reduce the emulation code necessary for commonly executed procedures, such as page faults and system calls.

By contrast, hardware-assisted virtualization such as VT-x creates a new set of rings of execution as described in [10]. Intel's implementation creates a virtual page table and virtual machine control block which holds the virtualized state and control registers for the virtual machine. Page faults, IO operations, context switches, and page table modifications must trap into the VMM, have the trap analyzed by the VMM and return back to the guest machine. This trap-and-return (vmrun/vmexit instructions) mechanism is expensive and is invoked many times by modern operating systems. However, execution which does not require a trap to the VMM executes at full native speed. This paper shows how the expensive vmrun/vmexit calls for kernel-mode routines actually cost more than the software-emulated equivalent. In page fault scenarios, software emulation by adaptive binary translation is shown to be ~30x faster than hardware-assisted virtualization. Hybrid hardware-assist and binary translation research is discussed in this paper as well.

[2]. Author Unknown "Comparison of Virtual Machines" Available en.wikipedia.org/wiki/Comparison_of_virtual_machines 26 November 2006. This Wikipedia entry offers a comprehensive matrix of currently available virtualization implementations and the architectures they support.

[3]. Emeneker, Stanzione, "Increasing Reliability through Dynamic Virtual Clustering", High Availability and Performance Computing Workshop, HAPCW'06, Oak Ridge National Lab. Available xcr.cenit.latech.edu/hapcw2006/program/papers/DVC_Reliability.pdf, 26 October 2006. Emeneker and Stanzione discuss the implementation of virtualized clusters built upon physical clusters of many physical nodes. By distributing a virtual cluster of N nodes across M physical nodes where M>N, multiple clusters can be executing in parallel on the same physical cluster without interference from each other. The focus for this research is increasing the reliability of parallel application execution in a cluster. A single application distributed across multiple nodes is more likely to fail as clusters grow larger, since a single node failure will often crash the application, requiring the computation to be restarted. This is particularly expensive when parallel applications are expected to execute for days or weeks at a time.

The solution often implemented by parallel applications is application-level checkpointing with special libraries, which increases development time and cost. Emeneker and Stanzione propose that by simultaneously checkpointing each node in the cluster, a consistent state of the application can be captured in case of node failure. The difficulty in this case is how precisely the checkpoints are synchronized. If the checkpoints are taken too far apart in time, the application will fail and the checkpoint will be invalid. If the application is using a reliable communication transport between nodes (TCP), then packets sent at the moment of the checkpoint may be lost by the recipient. Provided that the checkpoints take place within the time of a TCP session timeout, retransmission will be made when the checkpoint is started and the parallel computation will continue without interruption. It is interesting to note that benchmark parallel applications show lower performance after restoring from a past checkpoint due to the apparent increased wall time; the nodes do not know that they were paused and restarted. Future research is working toward migration of virtual cluster nodes to move from a predicted-to-fail node to a good node without interruption of computation.
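The synchronization condition described above can be expressed as a simple check on checkpoint skew. This is a minimal sketch, not the authors' method: the function name, the timestamps and the 30-second timeout are assumptions chosen for illustration, and real TCP timeout behavior depends on the operating system's configuration.

```python
def checkpoints_are_consistent(checkpoint_times_s, tcp_timeout_s):
    """True if every node was checkpointed within one TCP timeout window.

    checkpoint_times_s: wall-clock time (seconds) at which each node's
    checkpoint was taken. tcp_timeout_s is an assumed value.
    """
    skew = max(checkpoint_times_s) - min(checkpoint_times_s)
    return skew < tcp_timeout_s

# Four nodes checkpointed within 2 seconds of each other: in-flight TCP
# data can still be retransmitted after the nodes resume.
print(checkpoints_are_consistent([100.0, 100.4, 101.1, 102.0], 30.0))  # True

# One node checkpointed a minute late: connections may already have timed out.
print(checkpoints_are_consistent([100.0, 100.4, 101.1, 160.0], 30.0))  # False
```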

[4]. King, Dunlap, Cinar et al. "ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay", Proceedings of the 2002 Symposium on Operating Systems Design and Implementation (OSDI), December 2002. Available www.eecs.umich.edu/virtual/papers/dunlap02.pdf, 26 October 2006. The goal of ReVirt is to replay the entire execution of a system from a checkpoint for forensic analysis. ReVirt is implemented using UMLinux, which uses a kernel module as VMM and passes very few syscalls between the guest and host kernel to virtualize the guest. ReVirt patches the VMM kernel module and logs all system calls and interrupt events. When an interrupt is made, ReVirt saves the program counter and the specific information about the interrupt into the replay log so that it can be replayed later. Network activity being sent to the virtual machine is logged, but network activity returned from the VM is not. It is assumed that if data is going into the system, any data leaving the system was generated by the VM and can be regenerated deterministically. ReVirt uses the x86 performance counters such as branch count, page fault counts and other information which it can use to synchronize events upon replay.

The difficult issue which ReVirt addresses is accurate replay of non-deterministic events to the guest system such as timing interrupts and network traffic. While replaying a log, ReVirt sets breakpoints at each instruction it executes until the program counter is exactly where it was originally before sending the external event to the guest system. After the interrupt is sent, breakpoints are removed and execution continues. The replay can be stopped at any point and inspected either by logging into the guest VM or by looking at the data in the block device that the VM is using as its disk. The paper describes using the ptrace-suid race condition attack of the 2.2.19 kernel to escalate privileges, replace /bin/ls and place a backdoor in xinetd. The paper further explains that every step of the attack could be replayed, instruction by instruction, and inspected at each point of the attack.
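The idea of logging non-deterministic events and re-delivering them at the same point of execution can be illustrated with a toy model. This sketch is not ReVirt's code: the "guest" is just an integer updated by a deterministic formula, the step index stands in for the program counter, and all function names are hypothetical.

```python
import random

def run(steps, deliver):
    """Toy guest: deterministic except for events injected by deliver(pc)."""
    state = 0
    for pc in range(steps):
        state = (state * 31 + pc) % 1_000_003   # deterministic "instruction"
        event = deliver(pc)
        if event is not None:
            state ^= event                       # external event changes state
    return state

def record(steps, seed):
    """Run with random events and log (pc, event) for each one delivered."""
    rng = random.Random(seed)
    log = []
    def deliver(pc):
        if rng.random() < 0.1:                   # an "interrupt" arrives
            event = rng.randrange(1 << 16)
            log.append((pc, event))
            return event
        return None
    return run(steps, deliver), log

def replay(steps, log):
    """Re-run, delivering each logged event at exactly the same pc."""
    pending = dict(log)
    return run(steps, lambda pc: pending.get(pc))

final_state, event_log = record(1000, seed=42)
assert replay(1000, event_log) == final_state    # replay reproduces the run
```

Because everything other than the logged events is deterministic, the log alone is enough to reproduce the original execution, and the replay could be stopped at any step for inspection.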

[5]. King, Dunlap, Joshi et al. "Detecting Past and Present Intrusions through Vulnerability Specific Predicates", Proceedings of the 2005 Symposium on Operating Systems Principles (SOSP), October 2005. Available www.eecs.umich.edu/virtual/papers/joshi05.pdf, 26 October 2006. The IntroVirt project described in this paper explains how to detect attacks and dynamically patch, kill processes or simply notify administrators of the attack. In addition, IntroVirt can use ReVirt [4] to replay and detect if a previously unknown attack had been made on the system. IntroVirt is able to do this by the use of virtual machine introspection. Software breakpoints are set at syscalls such as exec and exit within the kernel, and new process executions are watched for. The memory locations for these function calls are taken from the debug symbols of the introspected kernel or usermode program. Due to demand paging, software breakpoints cannot be placed in code which is paged to disk. IntroVirt resolves this issue by hooking page swapping functions in the guest system and setting breakpoints when the code is placed back into physical memory.

IntroVirt allows the writing of predicates which hook the live-executing code of the guest system and trap to the predicate engine for introspection. In an example from the paper, they were able to detect a heap-extension overflow in the Linux kernel via introspection on the vulnerable line of code within the kernel. Before its execution, the predicate code can check the parameters of the syscall and see if an exploit is being attempted. If so, it can either dynamically patch the kernel, kill the process or simply alert an administrator.

For more complex vulnerabilities, it is necessary for predicates to execute code and call usermode functions or kernel functions within the guest. Since this is done by directly modifying the instruction pointer and calling code within the virtual machine, it is possible to perturb the state of the system such that the changes affect the attack. IntroVirt addresses this by checkpointing the virtual machine before calling guest functions or code, saving any information needed from those calls into the predicate engine and finally rolling back to the checkpoint such that the virtual machine can operate as if introspection had not taken place. Currently IntroVirt supports only C and C++ predicate code. Future research is in adding debug wrappers to interpreted code such as PHP, Java, Python and Perl.
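The checkpoint, call, and roll back sequence described above can be sketched with an in-memory stand-in for the guest. This is an illustration under assumptions, not IntroVirt's implementation: the guest state is a plain dictionary, and `introspect` and `count_open_files` are hypothetical names.

```python
import copy

def introspect(guest_state, guest_function):
    """Call guest code, keep its result, and undo its side effects."""
    checkpoint = copy.deepcopy(guest_state)   # checkpoint before touching the guest
    result = guest_function(guest_state)      # may perturb guest state
    guest_state.clear()
    guest_state.update(checkpoint)            # roll back to the checkpoint
    return result                             # only the result survives

def count_open_files(state):
    """Hypothetical guest function with a side effect."""
    state["calls"] += 1
    return len(state["open_files"])

guest = {"open_files": ["/etc/passwd", "/bin/ls"], "calls": 0}
print(introspect(guest, count_open_files))  # 2
print(guest["calls"])                       # 0, the side effect was rolled back
```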

[6]. Author Unknown "Lightweight virtual machines for distributed and networked systems" Available denali.cs.washington.edu/ 27 November 2006.

[7]. E. Jul et al., "Live migration of Virtual Machines", NSDI'05: 2nd Symposium on Networked Systems Design & Implementation. 2005. Available www.usenix.org/events/nsdi05/tech/full_papers/clark/clark.pdf 27 November 2006.

[8]. King, Chen. "SubVirt: Implementing malware with virtual machines", Proceedings of the 2006 IEEE Symposium on Security and Privacy, May 2006. Available www.eecs.umich.edu/virtual/papers/king06.pdf, 26 October 2006. King and Chen discuss hoisting a direct-on-hardware operating system into a virtualized environment transparently to the user, in what they refer to as a virtual machine based rootkit (VMBR). The effect is that the user's operating system no longer executes directly on the hardware as it had before the attack. The attacker then controls the virtual machine monitor, which can launch processes outside of the attacked OS space such that they cannot be detected. Software such as spam relays, scanners and worms can be installed on the VMM without affecting the stability or operation of the now-guest OS.

The VMBR implemented by King and Chen inserts itself into the system by hiding the VMBR-VMM disk within unallocated disk blocks of the NTFS filesystem. Subsequently, the blocks are marked as used such that the OS will not overwrite the VMM disk. Next the VMBR changes the MBR of the system to point at its own custom boot loader which boots Linux and a modified copy of VMware. When the system boots next, it will transparently boot into Linux, export the network and disk to the virtual machine and launch VMware in full-screen mode. To the user, nothing has changed except that their system takes slightly longer to boot. The VMBR uses introspection techniques to disable software such as redpill which detects whether the system is running under virtualization. Through introspection, the host OS can set breakpoints at system calls within the VM and capture keystrokes, network traffic or any other data passing through the system calls of the guest operating system.

[9]. G. Popek, R. P. Goldberg, "Formal requirements for virtualizable third generation architectures." Communications of the ACM 17, 7 (1974), 412-421. The Popek and Goldberg paper formalizes traps in terms of state transitions. Formal theorems are given that state the conditions under which an architecture is guaranteed to be virtualizable. In addition, the paper discusses concepts of recursive virtualization, hybrid virtual machines and offers directions for future research.

[10]. Uhlig, et al. "Intel Virtualization Technology" IEEE Computer Journal. p48-56, May 2005. Available via ftp: download.intel.com/technology/computing/vptech/vt-ieee-computer-final.pdf Uhlig explains in this paper how Intel's VT-x and VT-i for IA32 and Itanium resolve the virtualization problem. Prior to the VT-x and VT-i extensions to Intel's processors, not all privileged instructions caused exceptions that a VMM could handle in order to allow virtualization. Intel architecture uses a 2-bit privilege level (levels 0-3) to separate levels of execution. Many of the privileged instructions can be executed to read processor status registers at all levels of privilege, allowing the operating system to know that it does not have full control of the hardware. To resolve this issue, Intel added a set of rings for each virtual machine so that all rings are available to the virtual machine. An exception bitmap allows the VMM to configure which exceptions should trap to the VMM and which should be handled natively within the guest. The VT-x implementation uses VMX-root mode for the VMM and a VMX-non-root mode for the guest. VMX-root operates in the same way as the processors without the VT-x extensions; it holds four rings of privilege. The VMX-non-root mode is a duplicate set of privilege rings set up by the VMM. In this environment, privileged instructions trap to VMX-root mode where the VMM operates, and privileged instructions may be emulated by the VMM.
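The role of the exception bitmap can be illustrated with ordinary bit operations. This is a conceptual sketch only and does not touch real hardware: it assumes one bit per x86 exception vector, with a set bit meaning that the exception traps to the VMM; the helper names are hypothetical.

```python
PAGE_FAULT = 14       # x86 exception vector for #PF
INVALID_OPCODE = 6    # x86 exception vector for #UD

def make_bitmap(vectors):
    """Build a bitmap with one bit set per exception vector that should trap."""
    bitmap = 0
    for vector in vectors:
        bitmap |= 1 << vector
    return bitmap

def traps_to_vmm(bitmap, vector):
    """True if the VMM asked to intercept this exception vector."""
    return bool(bitmap & (1 << vector))

bitmap = make_bitmap([PAGE_FAULT])
print(hex(bitmap))                           # 0x4000
print(traps_to_vmm(bitmap, PAGE_FAULT))      # True: the VMM handles page faults
print(traps_to_vmm(bitmap, INVALID_OPCODE))  # False: handled inside the guest
```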

[11]. Author Unknown "The User-mode Linux Kernel Home Page" Available user-mode-linux.sourceforge.net/ 27 November 2006.

[12]. Vallée, Naughton, Ong, Scott et al. "Checkpoint/Restart of Virtual Machines Based on Xen", 2006. Available xcr.cenit.latech.edu/hapcw2006/program/papers/cr-xen-hapcw06-final.pdf, 26 October 2006. Vallée et al. discuss the mechanism for checkpointing and restoring a virtual machine as implemented by the Xen virtual machine monitor. The advantages of system-level virtualization are presented as fault tolerance, through migration and pause/unpause of VMs; load balancing, through migration of VMs from hypervisor node to hypervisor node; and hardware isolation, as only the privileged guest VM has hardware access and the hypervisor is responsible for relaying between guest VMs and the privileged VM.

This paper discusses x86 ring assignment as used by Xen. The hypervisor lives in ring 0 and each host/guest VM (including the privileged VM) operates in ring 1. By placing VMs in ring 1, page faults cause a trap to the hypervisor to enable guest VMs to share a single address space. Migration between hypervisor nodes is accomplished by marking written pages as dirty after the migration process has started. The hypervisor marshals the VM system image to the new node while the current VM is still in operation. When only dirty pages remain, the VM is paused and dirty pages are marshaled to the new hypervisor node. Upon completion of the dirty-page move, the new hypervisor node unpauses the VM from its existing state. It is important to recognize that Xen does not migrate disk images and that network block devices must be used from within the VM to support VM migration.

[13]. Wheeler, Eric. "Checkpointed Failover of Virtual Machines in High Availability Environments." CS510: Malicious Code & Forensics. Portland State University, Portland. 11 Nov 2006. Available: www.media.pdx.edu/fall_2006/Chang/CS410_111606.asx In this lecture, Wheeler presents checkpointed failover of virtual machines between physical systems. Copy-on-write images are used as the checkpoint medium and Heartbeat is used to manage internode communication. Checkpoints are written to a shared disk to be resumed on the other node after failure. A demonstration is attempted.

[14]. Author Unknown "Xen Community" Available www.xensource.com/xen/ 27 November 2006.

About the Author: Eric Wheeler

Eric Wheeler has been involved in computer systems and networking since 1996 offering the business community many years of valuable experience in the field. Eric's focus is in systems, networking and security. In 2008 he was awarded his Master of Science degree in computer science focusing in security in addition to the security certificate covering the NSTISSC-4011 training standard as certified by the National Security Agency (NSA). With this past experience and passion for the industry, Eric has chosen to continue scholastically and earn a Doctor of Philosophy in computer science. The Ph.D. research is Internet redesign: by looking at where we have been and envisioning where we are headed, this research will produce a sound architecture for which to build the next generation of the Internet upon. Eric views systems security as an architectural construct. By designing the system with security in depth from the beginning, the future maintenance and total cost of systems security and IT management can be minimized.
