Intro to VMs
- Language VM, e.g. JVM
- Process VM
- System VM, like VMware
- Co-designed VM
- Process VMs
- Same ISA
- Dynamic bin optimizers
- Diff ISA
- Dyn Translators
- HLL VMs -- Interpreter, Compilers
- Same ISA
- Sys VMs
- Same ISA
- Classic-Sys VMs -- VMM
- Hosted VMs
- Different ISA
- Whole-Sys VMs -- Emulation
- Co-designed VMs -- Hardware optimizations
- Same ISA
Portability? Functionality? Or even better performance?
Emulation: Interpretation and Binary Translation
Startup cost vs. steady-state cost
Interpreter: manages the state (mem, ctx, code ...)
- Decode-and-dispatch interpretation
- Threaded interpretation
- Direct threaded -- code replacing
RISC vs. CISC: Only some instructions in CISC are commonly used? Any pattern or skeleton of decoding (instruction template)? Partial decoding (dispatch on first byte)?
Predecoding and portability?
Binary translation: how does the mapping work? Register mapping? Other special things?
Dynamic: code-discovery problem in static translation vs. dynamic translation (akin to JIT), code cache (hits, replacement algo ...), code map, any inconsistency?
Other issues: self-modifying code, self-ref code, precise traps (debug?)
Same-ISA VM: code management, program shepherding, monitoring (security)
Translation chaining (link translated blocks together)
SPC (Source PC), TPC (Translated PC)
Software indirect jump prediction.
Shadow Stack: optimization for lookup overhead
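A rough sketch of how the SPC-to-TPC map and a shadow stack might fit together (all names, sizes, and the direct-mapped table layout are invented here, not taken from any particular VM): calls push both return addresses, so a return whose source PC matches can bypass the map lookup entirely.

```c
#include <stdint.h>

#define MAP_SIZE 1024
typedef struct { uint64_t spc, tpc; } Entry;

static Entry map[MAP_SIZE];   /* direct-mapped SPC -> TPC table */
static Entry shadow[64];      /* shadow stack of return-address pairs */
static int sp_top = 0;

static void map_put(uint64_t spc, uint64_t tpc) {
    map[spc % MAP_SIZE] = (Entry){ spc, tpc };
}

/* Returns 0 on a miss, meaning the block must be (re)translated. */
static uint64_t map_get(uint64_t spc) {
    Entry e = map[spc % MAP_SIZE];
    return e.spc == spc ? e.tpc : 0;
}

/* On a translated call: push the source AND translated return PCs. */
static void call_push(uint64_t spc_ret, uint64_t tpc_ret) {
    shadow[sp_top++] = (Entry){ spc_ret, tpc_ret };
}

/* On return: if the source return PC matches the shadow entry, use the
 * cached TPC and skip the map lookup; otherwise fall back to the map. */
static uint64_t ret_lookup(uint64_t spc_ret) {
    Entry e = shadow[--sp_top];
    return e.spc == spc_ret ? e.tpc : map_get(spc_ret);
}
```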
ISA issues: number/property of registers (simulated registers), condition code (lazy eval, compatibility), alignment, endianness, addressing modes ...
Shade: A simulation tool
ABI: Application binary interface
Proc VM: Loader, signals, os call emulation, exception emulation, code caching, translation + interpretation, profiling..
Compatibility: state correspondence at control transfers (to and from the user program and the host OS)
runtime binary: RT data, RT code ..
memory mapping: translation tables (indirect v.s. direct), arch(segmentation, page sizes ...), access privilege supporting difference, protection/allocation granularity ..
self-mod, self-ref code: write-protect -> cache flushing
protect runtime -> two modes: RT mode, emu mode
staged emulation: mixed methods, profiling, hotspot etc.
- Linux: OS-app communication through ABI and signals
- Windows: callbacks, async call, exceptions
I/O simulation: side effect, irrelevant to Turing completeness
code cache vs. HW cache: no fixed block size, presence dependencies caused by chaining, no backing store (once deleted, only re-translation)
replacement algos: LRU, flush when full, preemptive flush (detect the program phase change), fine-grained FIFO, coarse-grained FIFO
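Coarse-grained FIFO, for example, can be sketched as follows (region count and sizes invented): the cache is split into a few large regions that are flushed whole, so blocks chained within one region die together and per-block bookkeeping stays cheap.

```c
#include <stddef.h>

#define REGIONS 4
#define REGION_BYTES (256 * 1024)

static unsigned char cache[REGIONS][REGION_BYTES];
static size_t used[REGIONS];
static int cur = 0;

/* Allocate room for a new translation; when the current region is full,
 * advance FIFO-style and flush the next region wholesale. There is no
 * backing store: evicted blocks can only come back via re-translation. */
static void *cache_alloc(size_t n) {
    if (used[cur] + n > REGION_BYTES) {
        cur = (cur + 1) % REGIONS;
        used[cur] = 0;  /* flush oldest region */
        /* ... a real system would also unlink chain pointers and
           invalidate the side-table (SPC->TPC) rows here ... */
    }
    void *p = &cache[cur][used[cur]];
    used[cur] += n;
    return p;
}
```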
FX!32: Transparent execution of IA-32 apps on Alpha platforms running Windows.
Dynamic Binary Translation
Performance becomes the first concern.
Compiler techniques: code motion, reordering, blocked translation (trace, superblock, tree group).
Stages: Interpretation, basic block translation (with chaining or not), optimized translation (larger blocks), highly optimized translation (with profiling information)
Behaviors: backward branch tendency, same-value production ...
Profiling: HW or software? Probes, interrupts. Hotspot detection (region-based), control-flow predictability (edge-based). Instrumentation vs. sampling.
Larger blocks: Locality.
Superblock formation: starting points, continuation, end points. Threshold. -> Code relayout
Compatibility issues: register consistency (renaming ... code motion ...), trap consistency (code reordering ...), compensation code
Example: HP Dynamo system.
HLL VM arch
Think of the process VM as an "after-the-fact" method, while the HLL VM is a well-designed plan aimed at portability from the start.
Virtual ISA + APIs
Metadata in V-ISA => Data Set Architecture: describes the data structures, attributes, and relationships
HLL Program --- compiler ---> Portable Code --- VM loader ---> V-ISA memory image -- Interpreter/Translator --> Execution
- P-code: Pascal IR
- JVM: Spec, designed with Java in mind
- CLI: Spec, more HLL-general (Instruction part: MSIL)
- On-stack ops -- compact code, no assumptions about host register count
- Mem cells
- Mem stack & heap abstraction
- Common interface to OS (hiding diff OS)
OO HLL VM
Sandbox: Managed code, access control, auto GC, ref checking, OS interface, namespacing, etc.. => Loaders + Runtime
Native Interface: Convention, APIs, FFI, e.g. JNI
Performance -- JIT
Operand Stack Tracking: At any given point in the program, the operand stack must have the same number and types of operands in the same order. (path insensitivity)
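A depth-only sketch of that check (invented names; a real verifier tracks types as well as depth): record the stack depth first seen at each branch target, and reject any later path that reaches the same target with a different depth.

```c
#include <string.h>

#define NPC 256
static int seen_depth[NPC];  /* -1 = target not yet visited */

static void reset(void) {
    /* all-ones bytes give -1 in every int slot */
    memset(seen_depth, -1, sizeof seen_depth);
}

/* Called for each edge reaching `pc` with operand-stack depth `depth`.
 * Returns 1 if consistent, 0 if verification fails: the check is
 * path-insensitive, so every path must agree on one depth. */
static int record(int pc, int depth) {
    if (seen_depth[pc] == -1) { seen_depth[pc] = depth; return 1; }
    return seen_depth[pc] == depth;
}
```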
Binary format for distribution (Class in Java, Module in CLI) consists of: magic number, version info, constant pool with size, access flags, this class, super class, interface count, field count, field info, method count, methods, attribute count, attributes.
Java: J2EE, J2SE, J2ME
java.lang -- core package, including types, system, security, process, and Class for reflection;
java.util -- data structures and supporting utils ...;
Module-based programming -- serializability and reflection. RMI, platform-independent format for representing internal data structures. (For the network or persistent storage)
monitor for locking, notify for synchronization.
CLI: more flexible
Verifiable, cross-language interoperation
HLL VM impl
- Class loader subsys
- mem sys
- GC heap
- Native stack
- Java Stack
- Method area
- Execution engine
Dynamic Class Loading: access rights, scoping properties.
Loading: parsing and translating into internal data structures. Format-correctness sanity checking.
Casting: upcast is done statically, downcast is checked dynamically
Malicious resource-demanding detection: undecidable in general (Turing halting problem)
Security with binary hashing & pub/priv signing
Method call -- stack inspection, done by security manager.
JIT -- akin to binary translation.
OO programs -- frequent use of addressing indirection, frequent small methods.
Optimization: Profiling + interpreter/simple-compiler/optimizing-compiler
Howto: code relayout, method inlining (trap: virtual methods => profiling, multi-versioning, specialization), on-stack replacement, scalar replacement, null-check motion
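The virtual-method trap above is commonly handled with a guard: the JIT inlines the profiled-hot target behind a cheap class test and falls back to the normal indirect call otherwise. A minimal sketch in C (all type and function names invented; real JITs do this on compiled method bodies, not source):

```c
typedef struct Obj { const struct VTab *vt; int x; } Obj;
typedef struct VTab { int (*area)(const Obj *); } VTab;

static int square_area(const Obj *o) { return o->x * o->x; }
static const VTab square_vt = { square_area };

/* What the optimizer emits after profiling shows square_vt dominates
 * at this call site: */
static int call_area(const Obj *o) {
    if (o->vt == &square_vt)     /* guard: expected receiver class */
        return o->x * o->x;      /* inlined fast path */
    return o->vt->area(o);       /* slow path: real virtual dispatch */
}
```

If the guard starts failing (a new receiver class becomes common), the multi-versioning/deoptimization machinery recompiles the site.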
Example: Jikes RVM
Whether to compile: cost-benefit analysis, heuristics & experimentally derived parameters.
Jikes three level optimization framework:
- Level 0: conventional, including copy/const prop, common-subexpr elim, dead-code elim, branch opt, trivial inlining/relayout ...
- Level 1: More aggressive inlining/relayout based on profile
- Level 2: global optimization based on SSA form, loop unrolling ...
VM tech => General purpose CPU design
Co-design: VM software and host hardware
IBM Sys/38, IBM Daisy, AS/400, Transmeta Crusoe...
- Source-ISA, visible memory
- Target-ISA, concealed memory
VMM: VM monitor
Code translation methods:
- Context-free translation
- Context-sensitive translation
Register state mapping: guest sys regs + VMM used regs
Memory state mapping: concealed memory (VMM code & data, code cache ..)
VMM control from boot
Memory mapping schemes:
- Shared logical memory space (too large to fit)
- Separate logical memory spaces
- Concealed memory uses real addressing
VMM part: Diskless, no paging, no secondary concealing needed
- Self-modifying code: Fine-grained Write-protection methods for source code regions
- I/O writes to guest code memory: caught, so the code cache stays consistent.
Indirect-jump: Probably the greatest source of performance loss in a software-only code cache system.
JTLB: Jump TLB -- Direct mapping from SPC to TPC.
Procedure return jump: Hard to predict => RAS (Return Address Stack), mimics the software procedure stack => For the VM context, a dual-address RAS (DRAS) is used. (like a hardware implementation of the shadow stack)
Precise trap: Hardware Checkpoints => relax restriction on code motion optimization
- Checkpoints are set at every translation block entry point
- When there is a trap, the checkpoint is restored, and interpretation begins at the start of the source code that formed the trapping translation block
Checkpoint -- gated-stores method
- At a commit point: make a shadow copy, release the gated stores, and establish a new gate for stores
- On an exception: restore from the shadow copy, squash the gated stores, and establish a new gate for stores.
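A software sketch of the gated-stores idea (buffer sizes and names invented; real implementations gate stores in hardware): stores buffer up inside a translation block, reach memory only at commit, and are simply discarded on a trap.

```c
#define MEM  64
#define GATE 16

static int mem[MEM];                       /* committed state */
typedef struct { int addr, val; } Store;
static Store gate[GATE];                   /* pending (gated) stores */
static int ngate = 0;

/* Stores inside the block are held behind the gate. */
static void gated_store(int addr, int val) {
    gate[ngate++] = (Store){ addr, val };
}

/* Commit point: release the gated stores to memory, open a new gate. */
static void commit(void) {
    for (int i = 0; i < ngate; i++) mem[gate[i].addr] = gate[i].val;
    ngate = 0;
}

/* Trap: squash the gated stores; mem still holds the last checkpoint. */
static void squash(void) { ngate = 0; }

/* Loads in the block must see their own pending stores first. */
static int gated_load(int addr) {
    for (int i = ngate - 1; i >= 0; i--)
        if (gate[i].addr == addr) return gate[i].val;
    return mem[addr];
}
```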
Page fault compatibility
- Active page fault detection
- Lazy ...: detect only when the translated code is actually used
Flush: page table mapping is modified. Flush both translation block and related side tables.
Guest's mem-mapped IO
Simplify inst issue logic
- Transmeta Crusoe
- VLIW = 4 insts
- Branch unit, FPU, ALU, load/store unit
- IA-32 => RISC-like micro-ops => parallelize and reschedule
- New ISA
- MI: Machine Interface
- Mem: objects (persistent or temporary)
- sys vm rather than proc vm
Src of sys vm: time-sharing
Host platform --shared by--> guest sys VMs, with a layer of software (VMM)
- Multi-processor virtualization
- Shared-memory multi-processors => Memory model => Memory coherence and memory consistency
Outward appearance: Multiple machine illusion; hardware switch; resource subset replication
reg file state maintenance: copying
Resource control: some resource-sensitive instructions need special treatment (interval timer, etc.); prevent starvation and unfairness -- override the request
system mode vs. user mode
IBM VM/370 -- CP/CMS design -- separation of resource management and service function
Processor: direct native execution + emulation special instruction
- E: Executable image
- M: Mode of operation
- P: Program counter
- R: Memory relocation bounds register
- Privileged instructions: Load PSW / Set CPU Timer
- Sensitive instruction: Control / Behavior
- Critical instruction: sensitive but not privileged (Sens - Priv)
VMM: dispatcher + allocator + interpreter routines
True VMM: efficiency, resource control, and equivalence (efficiency is always compromised for the latter two)
Theorem 1 (Popek & Goldberg): for a third-generation computer, an effective VMM can be constructed iff the sensitive instructions are a subset of the privileged instructions.
Recursive virtualization: timing dep, resource shrinking
critical instructions => patching => caching the actions
memory's two-level mapping: virtual -> real -> physical, shadow page tables
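The two-level mapping can be sketched as table composition (page counts and names invented; a real shadow page table is a hardware-walkable structure that the VMM keeps synchronized, with traps on guest page-table writes triggering the update):

```c
#define PAGES 16

static int guest_pt[PAGES];   /* guest virtual page -> guest real page */
static int vmm_pt[PAGES];     /* guest real page    -> host physical page */
static int shadow_pt[PAGES];  /* composed: virtual  -> host physical,
                                 the table the hardware actually uses */

/* The VMM rebuilds a shadow entry whenever either underlying table
 * changes (e.g. on a trapped guest page-table write). */
static void shadow_update(int vpage) {
    shadow_pt[vpage] = vmm_pt[guest_pt[vpage]];
}
```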
To virtualize the TLB, the VMM maintains a copy of the guest's TLB contents and also manages the real TLB => keeps the copies up to date
ASID: Address Space ID, included in each TLB entry (allows efficient TLB management, compared with rewriting the TLB on every switch)
Difficulty in virtualizing I/O devices: many device types, and the set keeps growing
- Dedicated devices
- Partitioned devices
- Spooled devices
I/O emulation => at the I/O instruction level (e.g. the x86 "out" instruction); at the sys-call interface; at the driver-interface level
VMM-n (native), VMM-u (user), VMM-d (driver)
- Ctx switch
- Decoding of priv instructions
- Virtual interval timer
- Adding special instructions to the ISA
- Non-paged mode
- Pseudo-page-fault handling
- Spool files
- V=R VM
- Shadow table bypass assist
- Preferred-machine assist
- Segment sharing
IBM's IEF (Interpretive Execution Facility): hardware support for VM