RL-guided

We interpret the state and the current partial program as the environment, and selecting the next line of code as the action.
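A minimal sketch of this framing, assuming a Gym-like `reset`/`step` interface. The class name `PartialProgramEnv`, the `max_lines` cutoff, and the zero placeholder reward are hypothetical; the notes do not specify a reward or a stopping rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

State = Tuple[str, ...]  # the partial program, one string per line


@dataclass
class PartialProgramEnv:
    """Hypothetical environment: the observation is the partial program so far."""
    max_lines: int = 8  # assumed episode length limit
    program: List[str] = field(default_factory=list)

    def reset(self) -> State:
        # Start each episode from an empty program.
        self.program = []
        return tuple(self.program)

    def step(self, action: str) -> Tuple[State, float, bool]:
        # An action is one line of code appended to the partial program.
        self.program.append(action)
        done = len(self.program) >= self.max_lines
        reward = 0.0  # placeholder: a real reward would compare against the spec
        return tuple(self.program), reward, done
```

Returning the program as an immutable tuple keeps each observation a snapshot, so an agent can store past states without them changing on later steps.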

Target language: RISC-V assembly (interesting part -- why not HPL?), restricted to straight-line code: no control flow, no memory reads/writes.
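With no branches and no loads/stores, a program is just a sequence of register-to-register updates, so it can be evaluated in a single pass. The sketch below assumes RV32 (32-bit registers) and a small illustrative opcode subset; the function name `run_straight_line` and the choice of opcodes are assumptions, not part of the notes.

```python
MASK = 0xFFFFFFFF  # assumption: RV32, so results wrap to 32 bits

# Illustrative register-register opcodes; the real instruction set may differ.
BINARY_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "xor": lambda a, b: a ^ b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def run_straight_line(lines, regs=None):
    """Run straight-line code and return the final register file (x0..x31)."""
    regs = list(regs) if regs is not None else [0] * 32
    for line in lines:
        op, rest = line.split(None, 1)
        args = [a.strip() for a in rest.split(",")]
        rd = int(args[0][1:])          # "x5" -> 5
        a = regs[int(args[1][1:])]
        if op == "addi":
            result = a + int(args[2])  # immediate range is not validated here
        elif op in BINARY_OPS:
            result = BINARY_OPS[op](a, regs[int(args[2][1:])])
        else:
            raise ValueError(f"unsupported opcode: {op}")
        if rd != 0:                    # x0 is hardwired to zero
            regs[rd] = result & MASK
    return regs
```

For example, `run_straight_line(["addi x1, x0, 5", "add x2, x1, x1"])` leaves 5 in `x1` and 10 in `x2`.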

Problem: what kinds of programs are used for the benchmark? What will the spec look like? Random programs with specific attributes.
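One possible reading of "random programs with specific attributes" is a seeded generator whose parameters are the attributes. In the sketch below the attributes (program length, number of usable registers, allowed opcodes) and the name `random_program` are all assumptions chosen for illustration; the notes leave the actual attributes open.

```python
import random


def random_program(length, num_regs=4, opcodes=("add", "sub", "xor", "addi"), seed=0):
    """Generate a random straight-line program as a list of assembly lines."""
    rng = random.Random(seed)  # seeded so a benchmark instance is reproducible
    lines = []
    for _ in range(length):
        op = rng.choice(opcodes)
        rd = rng.randrange(1, num_regs + 1)   # destination is never x0
        rs1 = rng.randrange(0, num_regs + 1)  # sources may read x0
        if op == "addi":
            imm = rng.randrange(-2048, 2048)  # 12-bit signed immediate range
            lines.append(f"addi x{rd}, x{rs1}, {imm}")
        else:
            rs2 = rng.randrange(0, num_regs + 1)
            lines.append(f"{op} x{rd}, x{rs1}, x{rs2}")
    return lines
```

The same seed and attributes always yield the same program, so a benchmark entry could be recorded as just those parameters.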