Paper Review: Predicting Program Properties from “Big Code”

High-level

Summary:

The paper formulates the problem of inferring program properties as structured prediction.

It covers both classic semantic properties of programs (e.g. type annotations) and syntactic program elements (e.g. identifiers or code).

Evaluation:

Real developers
JSNICE successfully predicts correct names for 63.4% of program identifiers, and 81.6% of the guessed type annotations are correct (NOTE: the number of examples is in the thousands).

The way precision and recall are defined is odd. From the paper: “Such a system produces a type for every variable (100% recall), but its precision is only 37.8%.”
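To make these definitions concrete, here is a minimal sketch. It assumes, as the quoted sentence suggests, that recall is the fraction of variables that receive any prediction at all, and that precision is the fraction of emitted predictions that are correct. The function name and the toy data are hypothetical and do not come from the paper.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    predicted: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Precision over emitted predictions; recall as coverage of variables.

    Assumption: None means the system abstained for that variable.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # Keep only the variables for which a prediction was actually emitted.
    emitted = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    correct = sum(1 for p, g in emitted if p == g)
    precision = correct / len(emitted) if emitted else 0.0
    recall = len(emitted) / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): one system always answers, the other abstains.
gold = ["string", "number", "Array", "boolean"]
always_answers = ["string", "string", "string", "string"]
abstains = ["string", None, "Array", None]

print(precision_recall(always_answers, gold))  # (0.25, 1.0)
print(precision_recall(abstains, gold))        # (1.0, 0.5)
```

Under this reading, a system that guesses a type for every variable trivially reaches 100% recall, which is why the precision figure carries most of the information.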

Our models contain 7,627,484 features for names and 70,052 features for types.

Takeaways:

Design Decisions:

Because of the assumption that the property exists in the training data, we create one random variable per local variable of a program with the name to predict and the feature functions as described in Section 4.4 2.

Clustering vs. Probabilistic Models

It is instructive to understand that our approach is fundamentally not based on clustering. Given a program x, we do not try to find a similar (for some definition of similarity) program s in the training corpus and then extract useful information from s and integrate that information into x. Such an approach would be limiting as often there is not even a single program in the training corpus which contains all of the information we need to predict for program x. In contrast, with the approach presented here, it is possible to build a probabilistic model from multiple programs and then use that information to predict properties about a single program x.

A CRF is just a model for describing a conditional probability distribution.
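As a rough illustration of what "a model for describing conditional probability" means here, the sketch below scores an assignment of names by summing weights of (label, label, relation) triples and normalizes by brute force. The labels, the single edge, and the weights are all hypothetical toy values; a real system would learn the weights from a corpus and would never enumerate every assignment.

```python
import itertools
import math
from typing import Dict, List, Tuple

# Hypothetical toy setup: two unknown names linked by one relation.
LABELS = ["i", "len", "step"]
EDGES: List[Tuple[int, int, str]] = [(0, 1, "less_than")]  # (node_a, node_b, relation)

# Hypothetical learned weights for (label_a, label_b, relation) triples.
WEIGHTS: Dict[Tuple[str, str, str], float] = {
    ("i", "len", "less_than"): 2.0,
    ("i", "step", "less_than"): 0.5,
}


def score(assignment: Tuple[str, ...]) -> float:
    """w . f(y, x): sum the weights of the triples the assignment induces."""
    return sum(
        WEIGHTS.get((assignment[a], assignment[b], rel), 0.0)
        for a, b, rel in EDGES
    )


def conditional_probability(assignment: Tuple[str, ...]) -> float:
    """P(y | x) = exp(score(y)) / Z(x), with Z(x) computed by brute force."""
    z = sum(
        math.exp(score(candidate))
        for candidate in itertools.product(LABELS, repeat=len(assignment))
    )
    return math.exp(score(assignment)) / z


# The assignment seen often in the (hypothetical) corpus gets more mass.
print(conditional_probability(("i", "len")) > conditional_probability(("len", "i")))
```

The weights aggregate evidence from many training programs, which is the contrast with looking up a single similar program.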

Practical Value

What can you learn from this to make your research better?

I think a CRF could be very useful for making SeGuard more powerful.

Details and Problems

From the presenter’s point of view, what questions might the audience ask?

What does “joint” mean in “joint prediction”? (CRF)

When predicting facts and properties of programs, it is important to observe that these properties are usually dependent on one another. This means that any predictions of these properties should be done jointly and not independently in isolation.
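A small sketch of why this matters: with the hypothetical scores below, predicting each variable in isolation gives both variables the same name, while scoring the pair together avoids the clash. The per-variable (unary) scores and the pairwise penalty are assumptions made up for illustration; they are not the paper's feature functions.

```python
import itertools
from typing import Dict, Tuple

LABELS = ["i", "j"]

# Hypothetical per-variable scores: each variable alone prefers "i".
UNARY: Dict[Tuple[int, str], float] = {
    (0, "i"): 1.0, (0, "j"): 0.4,
    (1, "i"): 0.9, (1, "j"): 0.6,
}


def pairwise(a: str, b: str) -> float:
    """Hypothetical penalty: two variables in one scope rarely share a name."""
    return -2.0 if a == b else 0.0


def independent_prediction() -> Tuple[str, ...]:
    """Pick the best label for each variable in isolation."""
    return tuple(
        max(LABELS, key=lambda lab: UNARY[(var, lab)]) for var in (0, 1)
    )


def joint_prediction() -> Tuple[str, ...]:
    """Pick the pair of labels with the highest total score."""
    def total(pair: Tuple[str, ...]) -> float:
        return UNARY[(0, pair[0])] + UNARY[(1, pair[1])] + pairwise(*pair)

    return max(itertools.product(LABELS, repeat=2), key=total)


print(independent_prediction())  # ('i', 'i')
print(joint_prediction())        # ('i', 'j')
```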

What is an “arc”?

What is the relationship between MAP and CRF? MAP (maximum a posteriori) inference is just conditional prediction: given x, pick the assignment y that the CRF considers most likely. The name is harder to decode than the idea.
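Written out, the relationship looks like this. The notation is generic CRF notation assumed for illustration (w for the weight vector, f for the feature functions, Z(x) for the normalizer) and may not match the paper's symbols exactly:

```latex
y^{*} \;=\; \arg\max_{y} P(y \mid x)
      \;=\; \arg\max_{y} \frac{1}{Z(x)} \exp\!\big(w^{\top} f(y, x)\big)
      \;=\; \arg\max_{y} \; w^{\top} f(y, x)
```

The last step holds because Z(x) does not depend on y and exp is monotonic, so the CRF defines the distribution and MAP is the query that asks for its single most likely assignment.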

It felt like the features are just distributions over edge types.