Intro to VMs
- Language VM, e.g. JVM
- Process VM
- System VM, like VMware
- Co-designed VM
- Process VMs
- Same ISA
- Dynamic bin optimizers
- Diff ISA
- Dyn Translators
- HLL VMs -- Interpreter, Compilers
- Same ISA
- Sys VMs
- Same ISA
- Classic-Sys VMs -- VMM
- Hosted VMs
- Different ISA
- Whole-Sys VMs -- Emulation
- Co-designed VMs -- Hardware optimizations
- Same ISA
Portability? Functionality? Or even better performance?
Emulation: Interpretation and Binary Translation
Startup cost vs. steady-state cost
Interpreter: manages the state (mem, ctx, code ...)
- Decode-and-dispatch interpretation
- Threaded interpretation
- Direct threaded -- code replacing
RISC vs. CISC: Only some instructions in CISC are commonly used? Any pattern or skeleton of decoding (instruction template)? Partial decoding (dispatch on first byte)?
Predecoding and portability?
Binary translation: how does the mapping work? Register mapping? Other special things?
Dynamic: code-discovery problem in static translation vs. dynamic translation (akin to JIT), code cache (hits, replacement algo ...), code map, any inconsistency?
Other issues: self-modifying code, self-ref code, precise traps (debug?)
Same-ISA VM: code management, program shepherding, monitoring (security)
Translation chaining (link translated blocks together)
SPC (Source PC), TPC (Translated PC)
Software indirect jump prediction.
Shadow Stack: optimization for lookup overhead
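A rough sketch of how the SPC-to-TPC map and a shadow stack might fit together (all names, sizes, and the direct-mapped table layout are invented here, not taken from any particular VM): calls push both return addresses, so a return whose source PC matches can bypass the map lookup entirely.

```c
#include <stdint.h>

#define MAP_SIZE 1024
typedef struct { uint64_t spc, tpc; } Entry;

static Entry map[MAP_SIZE];   /* direct-mapped SPC -> TPC table */
static Entry shadow[64];      /* shadow stack of return-address pairs */
static int sp_top = 0;

static void map_put(uint64_t spc, uint64_t tpc) {
    map[spc % MAP_SIZE] = (Entry){ spc, tpc };
}

/* Returns 0 on a miss, meaning the block must be (re)translated. */
static uint64_t map_get(uint64_t spc) {
    Entry e = map[spc % MAP_SIZE];
    return e.spc == spc ? e.tpc : 0;
}

/* On a translated call: push the source AND translated return PCs. */
static void call_push(uint64_t spc_ret, uint64_t tpc_ret) {
    shadow[sp_top++] = (Entry){ spc_ret, tpc_ret };
}

/* On return: if the source return PC matches the shadow entry, use the
 * cached TPC and skip the map lookup; otherwise fall back to the map. */
static uint64_t ret_lookup(uint64_t spc_ret) {
    Entry e = shadow[--sp_top];
    return e.spc == spc_ret ? e.tpc : map_get(spc_ret);
}
```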
ISA issues: number/property of registers (simulated registers), condition code (lazy eval, compatibility), alignment, endianness, addressing modes ...
Shade: A simulation tool
ABI: Application binary interface
Proc VM: Loader, signals, os call emulation, exception emulation, code caching, translation + interpretation, profiling..
Compatibility: state correspondence at control transfers (to and from the user program and the host OS)
runtime binary: RT data, RT code ..
memory mapping: translation tables (indirect v.s. direct), arch(segmentation, page sizes ...), access privilege supporting difference, protection/allocation granularity ..
self-mod, self-ref code: write-protect -> cache flushing
protect runtime -> two modes: RT mode, emu mode
staged emulation: mixed methods, profiling, hotspot etc.
- Linux: OS-app communication through ABI and signals
- Windows: callbacks, async call, exceptions
I/O simulation: side effect, irrelevant to Turing completeness
code cache vs. HW cache: no fixed block size, presence dependencies caused by chaining, no backing store (once deleted, only re-translation)
replacement algos: LRU, flush when full, preemptive flush (detect the program phase change), fine-grained FIFO, coarse-grained FIFO
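Coarse-grained FIFO, for example, can be sketched as follows (region count and sizes invented): the cache is split into a few large regions that are flushed whole, so blocks chained within one region die together and per-block bookkeeping stays cheap.

```c
#include <stddef.h>

#define REGIONS 4
#define REGION_BYTES (256 * 1024)

static unsigned char cache[REGIONS][REGION_BYTES];
static size_t used[REGIONS];
static int cur = 0;

/* Allocate room for a new translation; when the current region is full,
 * advance FIFO-style and flush the next region wholesale. There is no
 * backing store: evicted blocks can only come back via re-translation. */
static void *cache_alloc(size_t n) {
    if (used[cur] + n > REGION_BYTES) {
        cur = (cur + 1) % REGIONS;
        used[cur] = 0;  /* flush oldest region */
        /* ... a real system would also unlink chain pointers and
           invalidate the side-table (SPC->TPC) rows here ... */
    }
    void *p = &cache[cur][used[cur]];
    used[cur] += n;
    return p;
}
```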
FX!32: Transparent execution of IA-32 apps on Alpha platforms running Windows.
Dynamic Binary Translation
Performance becomes the first concern.
Compiler techniques: code motion, reordering, blocked translation (trace, superblock, tree group).
Stages: Interpretation, basic block translation (with chaining or not), optimized translation (larger blocks), highly optimized translation (with profiling information)
Behaviors: backward branch tendency, same-value production ...
Profiling: HW or software? Probes, interrupts. Hotspot detection (region-based), control-flow predictability (edge-based). Instrumentation vs. sampling.
Larger blocks: Locality.
Superblock formation: starting points, continuation, end points. Threshold. -> Code relayout
Compatibility issues: register consistency (renaming ... code motion ...), trap consistency (code reordering ...), compensation code
Example: HP Dynamo system.
HLL VM arch
Think of the process VM as an "after-the-fact" method, while the HLL VM is a well-designed plan aimed at portability from the start.
Virtual ISA + APIs
Metadata in V-ISA => Data Set Architecture: describes the data structures, attributes, and relationships
HLL Program --- compiler ---> Portable Code --- VM loader ---> V-ISA memory image -- Interpreter/Translator --> Execution
- P-code: Pascal IR
- JVM: Spec, designed with Java in mind
- CLI: Spec, more HLL-general (Instruction part: MSIL)
- On-stack ops -- compact code, no assumptions about host register count
- Mem cells
- Mem stack & heap abstraction
- Common interface to OS (hiding diff OS)
OO HLL VM
Sandbox: Managed code, access control, auto GC, ref checking, OS interface, namespacing, etc.. => Loaders + Runtime
Native Interface: Convention, APIs, FFI, e.g. JNI
Performance -- JIT
Operand Stack Tracking: At any given point in the program, the operand stack must have the same number and types of operands in the same order. (path insensitivity)
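A depth-only sketch of that check (invented names; a real verifier tracks types as well as depth): record the stack depth first seen at each branch target, and reject any later path that reaches the same target with a different depth.

```c
#include <string.h>

#define NPC 256
static int seen_depth[NPC];  /* -1 = target not yet visited */

static void reset(void) {
    /* all-ones bytes give -1 in every int slot */
    memset(seen_depth, -1, sizeof seen_depth);
}

/* Called for each edge reaching `pc` with operand-stack depth `depth`.
 * Returns 1 if consistent, 0 if verification fails: the check is
 * path-insensitive, so every path must agree on one depth. */
static int record(int pc, int depth) {
    if (seen_depth[pc] == -1) { seen_depth[pc] = depth; return 1; }
    return seen_depth[pc] == depth;
}
```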
Binary format for distribution (Class in Java, Module in CLI) consists of: magic number, version info, constant pool with size, access flags, this class, super class, interface count, field count, field info, method count, methods, attribute count, attributes.
Java: J2EE, J2SE, J2ME
java.lang -- core package, including types, system, security, process, and Class for reflection;
java.util -- data structures and supporting utils ...;
Module-based programming -- serializability and reflection. RMI, platform-independent format for representing internal data structures. (For the network or persistent storage)
monitor for locking, notify for synchronization.
CLI: more flexible
Verifiable, cross-language interoperation
HLL VM impl
- Class loader subsys
- mem sys
- GC heap
- Native stack
- Java Stack
- Method area
- Execution engine
Dynamic Class Loading: access rights, scoping properties.
Loading: parsing and translating into internal data structures. Format-correctness sanity checking.
Casting: upcast is done statically, downcast is checked dynamically
Malicious resource-demanding detection: undecidable in general (Turing halting problem)
Security with binary hashing & pub/priv signing
Method call -- stack inspection, done by security manager.
JIT -- akin to binary translation.
OO programs -- frequent use of addressing indirection, frequent small methods.
Optimization: Profiling + interpreter/simple-compiler/optimizing-compiler
Howto: code relayout, method inlining (trap: virtual methods => profiling, multi-versioning, specialization), on-stack replacement, scalar replacement, null-check motion
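The virtual-method trap above is commonly handled with a guard: the JIT inlines the profiled-hot target behind a cheap class test and falls back to the normal indirect call otherwise. A minimal sketch in C (all type and function names invented; real JITs do this on compiled method bodies, not source):

```c
typedef struct Obj { const struct VTab *vt; int x; } Obj;
typedef struct VTab { int (*area)(const Obj *); } VTab;

static int square_area(const Obj *o) { return o->x * o->x; }
static const VTab square_vt = { square_area };

/* What the optimizer emits after profiling shows square_vt dominates
 * at this call site: */
static int call_area(const Obj *o) {
    if (o->vt == &square_vt)     /* guard: expected receiver class */
        return o->x * o->x;      /* inlined fast path */
    return o->vt->area(o);       /* slow path: real virtual dispatch */
}
```

If the guard starts failing (a new receiver class becomes common), the multi-versioning/deoptimization machinery recompiles the site.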
Example: Jikes RVM
Whether to compile: cost-benefit analysis, heuristics & experimentally derived parameters.
Jikes three level optimization framework:
- Level 0: conventional, including copy/const prop, common-subexpr elim, dead-code elim, branch opt, trivial inlining/relayout ...
- Level 1: More aggressive inlining/relayout based on profile
- Level 2: global optimization based on SSA form, loop unrolling ...
VM tech => General purpose CPU design
Co-design: VM software and host hardware
IBM Sys/38, IBM Daisy, AS/400, Transmeta Crusoe...
- Source-ISA, visible memory
- Target-ISA, concealed memory
VMM: VM monitor
Code translation methods:
- Context-free translation
- Context-sensitive translation
Register state mapping: guest sys regs + VMM used regs
Memory state mapping: concealed memory (VMM code & data, code cache ..)
VMM control from boot
Memory mapping schemes:
- Shared logical memory space (too large to fit)
- Separate logical memory spaces
- Concealed memory uses real addressing
VMM part: Diskless, no paging, no secondary concealing needed
- Self-modifying code: Fine-grained Write-protection methods for source code regions
- I/O writes to guest code memory: caught, so the code cache stays consistent.
Indirect-jump: Probably the greatest source of performance loss in a software-only code cache system.
JTLB: Jump TLB -- Direct mapping from SPC to TPC.
Procedure return jump: Hard to predict => RAS (Return Address Stack), mimics the software procedure stack => For the VM context, a dual-address RAS (DRAS) is used. (like a hardware implementation of the shadow stack)
Precise trap: Hardware Checkpoints => relax restriction on code motion optimization
- Checkpoints are set at every translation block entry point
- When there is a trap, the checkpoint is restored, and interpretation begins at the start of the source code that formed the trapping translation block
Checkpoint -- gated-stores method
- At a commit point: make a shadow copy, release the gated stores, and establish a new gate for stores
- On an exception: restore from the shadow copy, squash the gated stores, and establish a new gate for stores.
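A software sketch of the gated-stores idea (buffer sizes and names invented; real implementations gate stores in hardware): stores buffer up inside a translation block, reach memory only at commit, and are simply discarded on a trap.

```c
#define MEM  64
#define GATE 16

static int mem[MEM];                       /* committed state */
typedef struct { int addr, val; } Store;
static Store gate[GATE];                   /* pending (gated) stores */
static int ngate = 0;

/* Stores inside the block are held behind the gate. */
static void gated_store(int addr, int val) {
    gate[ngate++] = (Store){ addr, val };
}

/* Commit point: release the gated stores to memory, open a new gate. */
static void commit(void) {
    for (int i = 0; i < ngate; i++) mem[gate[i].addr] = gate[i].val;
    ngate = 0;
}

/* Trap: squash the gated stores; mem still holds the last checkpoint. */
static void squash(void) { ngate = 0; }

/* Loads in the block must see their own pending stores first. */
static int gated_load(int addr) {
    for (int i = ngate - 1; i >= 0; i--)
        if (gate[i].addr == addr) return gate[i].val;
    return mem[addr];
}
```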
Page fault compatibility
- Active page fault detection
- Lazy ...: detect only when the translated code is actually used
Flush: page table mapping is modified. Flush both translation block and related side tables.
Guest's mem-mapped IO
Simplify inst issue logic
- Transmeta Crusoe
- VLIW = 4 insts
- Branch unit, FPU, ALU, load/store unit
- IA-32 => RISC-like micro-ops => parallelize and reschedule
- New ISA
- MI: Machine Interface
- Mem: objects (persistent or temporary)
- sys vm rather than proc vm
Src of sys vm: time-sharing
Host platform --shared by--> guest sys VMs, with a layer of software (VMM)
- Multi-processor virtualization
- Shared-memory multi-processors => Memory model => Memory coherence and memory consistency
Outward appearance: Multiple machine illusion; hardware switch; resource subset replication
reg file state maintenance: copying
Resource control: some resource-sensitive instructions need special treatment (interval timer, etc.); prevent starvation and unfairness -- override the request
system mode vs. user mode
IBM VM/370 -- CP/CMS design -- separation of resource management and service function
Processor: direct native execution + emulation special instruction
- E: Executable image
- M: Mode of operation
- P: Program counter
- R: Memory relocation bounds register
- Privileged instructions: Load PSW / Set CPU Timer
- Sensitive instruction: Control / Behavior
- Critical instruction: sensitive but not privileged (Sens - Priv)
VMM: dispatcher + allocator + interpreter routines
True VMM: efficiency, resource control, and equivalence (efficiency is always compromised for the latter two)
Theorem 1 (Popek & Goldberg): for a third-generation computer, an effective VMM can be constructed iff the sensitive instructions are a subset of the privileged instructions.
Recursive virtualization: timing dep, resource shrinking
critical instructions => patching => caching the actions
memory's two-level mapping: virtual -> real -> physical, shadow page tables
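The two-level mapping can be sketched as table composition (page counts and names invented; a real shadow page table is a hardware-walkable structure that the VMM keeps synchronized, with traps on guest page-table writes triggering the update):

```c
#define PAGES 16

static int guest_pt[PAGES];   /* guest virtual page -> guest real page */
static int vmm_pt[PAGES];     /* guest real page    -> host physical page */
static int shadow_pt[PAGES];  /* composed: virtual  -> host physical,
                                 the table the hardware actually uses */

/* The VMM rebuilds a shadow entry whenever either underlying table
 * changes (e.g. on a trapped guest page-table write). */
static void shadow_update(int vpage) {
    shadow_pt[vpage] = vmm_pt[guest_pt[vpage]];
}
```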
To virtualize the TLB, the VMM maintains a copy of the guest's TLB contents and also manages the real TLB => keeps the copies up to date
ASID: Address Space ID, included in each TLB entry (allows efficient TLB management, compared with rewriting the TLB on every switch)
Difficulty in virtualizing I/O devices: many device types, and the set keeps growing
- Dedicated devices
- Partitioned devices
- Spooled devices
I/O emulation => at the I/O instruction level (e.g. the x86 "out" instruction); at the sys-call interface; at the driver-interface level
VMM-n (native), VMM-u (user), VMM-d (driver)
- Ctx switch
- Decoding of priv instructions
- Virtual interval timer
- Adding special instructions to the ISA
- Non-paged mode
- Pseudo-page-fault handling
- Spool files
- V=R VM
- Shadow table bypass assist
- Preferred-machine assist
- Segment sharing
IBM's IEF (Interpretive Execution Facility): hardware support for VM