Paper Review SMOKE Scalable Path Sensitive Memory Leak Detection for Millions of Lines of Code

Can I briefly summarize this work and please correct me if anything is not correct?

SMOKE is a memory leak detector can scan millions of lines of code in one hour with a desktop computer. It is path-sensitive, which means that when entering an if-conditional or alike, it will include the corresponding constraint into the logical predicate, so as to prevent false positives. The scalability comes from a new intermediate program representation used, called, “use-flow graph”, which is similar to value-flow graph but also including the use sites of heap-objects. A finite-state analysis is performed on the path in the order of control-flow, which means that, memory leak is encoded as a state machine property and each heap object’s state is encoded as a predicate over the abstract machine state. All relevant statements will be a state-transition operator.

High-level

Summary:

SMOKE is a staged solution towards scalable & precise memory leak detection.

First stage is coarse - based on a new program representation (UFG).

Evaluation:

finish checking industrial-sized projects, up to 8MLoC, in forty minutes with an average false positive rate of 24.4% Besides, SMOKE is significantly faster than the state-of-the-art research techniques as well as the industrial tools, with the speedup ranging from 5.2X to 22.8X. In the twenty-nine mature and extensively checked benchmark projects, SMOKE has discovered thirty previously- unknown memory leaks which were confirmed by developers, and one even assigned a CVE ID.

Takeaways:

we believe that the sparse value-flow analysis, already widely adopted in finding memory leaks [8], [9], is the right direction as it tracks values along the data dependence relations on the value flow graph (VFG) instead of the control flow graph, skipping irrelevant program statements to achieve scalability.

RHS algorithm is IFDS/IDE algorithm by Reps.

Practical Value

What you can learn from this to make your research better?

Extensive evaluation and clearcut key idea is more important than anything.

  • It’s vetted
  • It’s novel
  • I learn something from it

Details and Problems From the presenters’ point of view, what questions might audience ask?

Detecting memory leak at industrial scale is still not well addressed, in spite of the tremendous effort from both industry and academia in the past decades.

It this true that scalability problem is not well-addressed? What else problem (except for ML) has a bad scalability?

???

What is the current state of incremental program analysis? What are other approaches towards the scalability problems?

  1. Novel intermediate representation (this work)
    1. Is this same as the general idea of doing abstraction?
  2. incremental/summarization

What does “@“ mean? It is the program state before S, or after?

Note the definition of “program points”.

There you have before- and after- PP. Which means that this is (1) there is an empty/zero state (2) each PP is a transition

This diagram seems confusing — what is the difference between the dotted blue line and the solid black line on the right?

Encoding the path conditions as logical conditions and find if some is not properly guarded with release op when exiting.

My problem: inter-procedural?

Yes, there is call edge. Well, is summarization done?

We create a function summary for each function in a demand-driven way. That is, when reaching a call site, we check if the callee has a usable summary so that we do not need to reanalyze the callee. Otherwise, we create a summary after the callee is analyzed. In our approach, a summary is a map between the input states of a parameter p and the states of p at the end of a function. Therefore, it is unnecessary to repeat the analysis of a function at different call sites if its summary has been created.

Misc ranting

The UFG slices away the unnecessary program statements

UFG explicitly models the life cycle of a value by creating an out-of-scope node (i.e., p@s8 in Figure 1(c)), which indicates that the heap object pointed to by p is no longer referenced.

branch correlation.

How is use-flow graph better in this example (fig. 3)?

What does "property-related control flows” mean?

Property is a state of program. Proper-related control-flows are CF that might change property. (Definition 2 (Use-Flow Graph (UFG)).)

What is the relationship between this and general program slicing technique?

General program slicing is defined by "value of interest”. However, UFG is a slice of a category of behaviors.

What does path-sensitive mean?

To reduce the false positives, we introduce a path-sensitive verification step where we collect and solve path conditions of each memory-leak candidate path.

As I understand, this is just adding a guarding condition to each state. Essentially, we can filter out infeasible paths. (Analysis is always an approximation to what paths might be possible).

It is fair to compare a specialized tool with general purpose static analyzers?

Why are there two analysis stages? (TABLE IV STATISTICS OF TWO ANALYSIS STAGES)

In the first stage, S M O K E prunes 98.6% (1 − 610/43, 282) of the candidates, leaving twenty-one paths on average for each subject to be checked further. In the second stage, SMOKE further prunes 65.7% (1 − 209/610 ) of the candidates by detecting infeasible paths.

  1. Pruning out irrelevant paths 98.6%
  2. Pruning out infeasible paths 65.7%

Will path prune introduce false negative? Is that discussed?