Paper Review: A Survey of Machine Learning for Big Code and Naturalness

High-level

Summary:

  • Different characteristics of natural language and source code
    • Executability: code must execute, so small changes can drastically change its meaning, whereas natural language is much more robust to noise
    • Formality: code is written in a formal, artificial language, which encourages reuse and makes code completion easier
    • Cross-channel interaction: code communicates with both machines and humans, and it has a much higher rate of neologisms (newly coined identifier names)
  • Taxonomy of probabilistic models of source code
  • PL/SE applications of these models

Evaluation:

Takeaways:

In addition to generating code, by definition, code generating probabilistic models act as a scoring function, assigning a non-zero probability to every possible snippet of code. This score, sometimes referred to as “naturalness” of the code [87], suggests how probable the code is under a learned model.
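A minimal sketch of this scoring view, assuming a toy add-alpha-smoothed bigram model over code tokens (the corpus and token sequences are made up); the smoothing is what keeps every possible snippet at a non-zero probability:

```python
import math
from collections import Counter, defaultdict

def train_bigram_lm(token_lists, alpha=1.0):
    """Train a bigram model over code tokens with add-alpha smoothing."""
    bigrams, context_counts, vocab = defaultdict(Counter), Counter(), set()
    for tokens in token_lists:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev][cur] += 1
            context_counts[prev] += 1
    v = len(vocab)

    def log_prob(tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(
            math.log((bigrams[prev][cur] + alpha) / (context_counts[prev] + alpha * v))
            for prev, cur in zip(padded, padded[1:])
        )

    return log_prob

# Hypothetical toy corpus of already-tokenized snippets.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "item", "in", "items", ":"],
]
score = train_bigram_lm(corpus)
print(score(["for", "i", "in", "items", ":"]))  # relatively high log-probability
print(score([":", "items", "in", "i", "for"]))  # much lower: "unnatural" ordering
```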

Three kinds of models: language, transducer, and multi-modal models

Code-generating models: token sequence, code tree, semantic graph

About “Distributed vector representations”: http://www.cs.toronto.edu/~bonner/courses/2014s/csc321/lectures/lec5.pdf "“Distributed representation” means a many-to-many relationship between two types of representation (such as concepts and neurons)."

Distributed representations refer to arithmetic vectors or matrices where the meaning of an element is distributed across multiple components.
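A small illustration of the contrast with made-up numbers: in a local (one-hot) representation each token owns a single component, while in a distributed representation each component participates in representing many tokens (the many-to-many relationship from the quote above):

```python
import numpy as np

# Hypothetical vocabulary of code tokens.
vocab = ["get_name", "get_id", "set_name", "open_file"]

# Local (one-hot) representation: each token owns exactly one component.
one_hot = np.eye(len(vocab))

# Distributed representation: each token is a dense vector, and each
# component helps represent many tokens (values below are made up;
# in practice they would be learned).
embeddings = np.array([
    [0.9, 0.1, 0.7],   # get_name
    [0.8, 0.2, 0.6],   # get_id
    [0.1, 0.9, 0.7],   # set_name
    [0.2, 0.3, 0.0],   # open_file
])

print(one_hot[vocab.index("get_name")])      # [1. 0. 0. 0.]
print(embeddings[vocab.index("get_name")])   # [0.9 0.1 0.7]
```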

Generating syntax trees is relatively computationally expensive, especially for neural models, given the variable shape and size of the trees, which inhibits efficient batching.

So it is not that PCFGs do not capture long-range dependencies (n-gram-based models do not either), but that they do not even capture close-range dependencies that matter [29]

To our knowledge, there are no generative models that directly generate graph representations of realistic code (e.g., dataflow graphs).

Cross-entropy: https://towardsdatascience.com/demystifying-cross-entropy-e80e3ad54a8 Essentially, H(p, q) = -Σ_x p(x) log q(x) is the expected encoding size when samples from the true distribution p are encoded using a code optimized for the estimated distribution q; it reaches its minimum, the entropy of p, exactly when q = p.
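A quick numeric check of this reading, with a hypothetical three-event distribution:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x): expected code length in bits when
    samples from the true distribution p are encoded with a code built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

p = [0.5, 0.25, 0.25]          # hypothetical true distribution
q_close = [0.4, 0.3, 0.3]      # estimate close to p
q_far = [0.05, 0.05, 0.9]      # estimate far from p

print(cross_entropy(p, p))        # 1.5 bits: the entropy of p, the minimum
print(cross_entropy(p, q_close))  # ~1.53 bits: slightly above the minimum
print(cross_entropy(p, q_far))    # ~3.28 bits: a poor model costs many bits
```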

Statistical machine translation (SMT)

Structured prediction: there is dependency structure among the outputs, e.g., jointly predicting all variable names in a function, where each choice constrains the others.

As in NLP and NL in general, evaluating pattern-mining models is hard, since the quality of the discovered latent structure is subjective.

Code defect detection is similar to anomaly detection in general ML.

Using distributed vector representations allows them to learn a continuous similarity metric between code locations rather than using edit distance.
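A rough sketch of the contrast, with hypothetical token sequences and made-up encoder outputs: edit distance only compares surface tokens, while learned vectors can place textually different but related code locations close in a continuous space:

```python
import numpy as np

def edit_distance(s, t):
    """Plain Levenshtein distance between two token sequences."""
    d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    d[:, 0] = np.arange(len(s) + 1)
    d[0, :] = np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i, j] = min(d[i - 1, j] + 1,
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + int(s[i - 1] != t[j - 1]))
    return int(d[len(s), len(t)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two code locations that differ textually but do the same kind of thing.
loc_a = ["open", "(", "path", ")"]
loc_b = ["fopen", "(", "filename", ")"]

print(edit_distance(loc_a, loc_b))  # 2: surface tokens simply mismatch

# Vectors a learned encoder might assign to the two locations (made up here);
# the continuous space can reflect that the locations are semantically close.
vec_a = np.array([0.8, 0.1, 0.5])
vec_b = np.array([0.7, 0.2, 0.6])
print(cosine(vec_a, vec_b))         # close to 1.0
```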

The third wave promises new machine-learning models informed by programming language semantics. In some cases, a machine-learning model may not be required (e.g., when the problem is deterministic), and, in other cases, simple models can outperform advanced, off-the-shelf deep-learning methods designed for non-code domains.

Introducing better representations that bridge the gap between machine learning and source code will allow the probabilistic models of code to reason about the rich structure and semantics of code, without resorting to proxies.

Modular neural network architectures: such architectures decompose the network into components that are combined based on the problem instance (sketched below).

An early example is the work of Allamanis et al. [8], who design a neural network based on the output of data flow analysis.
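A rough sketch of the modular idea (not the Allamanis et al. model): one small component per node type, composed according to the structure of each individual instance; node types, dimensions, and the toy AST are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# One small component (weight matrix) per AST node type; the network for a
# given program is assembled by composing these components along the shape
# of that particular instance's tree.
modules = {t: rng.normal(scale=0.1, size=(DIM, 2 * DIM))
           for t in ["While", "Assign", "BinOp"]}
leaf_embedding = {tok: rng.normal(size=DIM) for tok in ["x", "0", "1"]}

def encode(node):
    """Recursively encode a toy binary AST: a node is either a token string
    or a tuple (node_type, left_child, right_child)."""
    if isinstance(node, str):
        return leaf_embedding[node]
    node_type, left, right = node
    children = np.concatenate([encode(left), encode(right)])
    return np.tanh(modules[node_type] @ children)

# Roughly: while (x > 0) { x = x - 1 }, flattened into binary nodes.
tree = ("While", ("BinOp", "x", "0"), ("Assign", "x", ("BinOp", "x", "1")))
print(encode(tree).shape)  # (8,)
```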

Furthermore, some metrics that are widely used in NLP are not suited for source code. For example, BLEU score is not suitable for measuring the quality of output source code (e.g., in transducer models), because it fails to account for the fact that the syntax is known in all programming languages, so the BLEU score may be artificially “inflated” for predicting deterministic syntax.
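A rough illustration using clipped unigram precision (the basic ingredient of BLEU) on a made-up reference/candidate pair; full BLEU also uses higher-order n-grams and a brevity penalty, but the inflation effect is the same:

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Clipped unigram precision, the basic n-gram ingredient of BLEU."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, ref[tok]) for tok, count in cand.items())
    return overlap / sum(cand.values())

reference = ["if", "(", "count", ">", "0", ")", "{", "return", "total", ";", "}"]
candidate = ["if", "(", "flag", ">", "1", ")", "{", "return", "x", ";", "}"]

# 8 of the 11 candidate tokens match purely deterministic syntax (keywords,
# brackets, punctuation), so the score looks high even though every
# identifier and literal is wrong.
print(clipped_unigram_precision(candidate, reference))  # ~0.73
```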

Practical Value

What can you learn from this to make your research better?

Oh et al. [147] and Mangal et al. [127] use machine-learning models to statistically parameterize program analyses to reduce the false-positive ratio while maintaining high precision.

Chae et al. [38] automatically reduce (without machine learning) a program to a set of dataflow graphs and manually extract features from them.

New research idea: giving weight to insufficient code data by global reuse counting

The MSR field’s early successes date back to work by Zimmermann et al. [201], Williams and Hollingsworth [189], Gabel and Su [66], and Acharya et al. [1] on mining API protocols from source code bases

I might be able to use both approaches, or better, actually combine them, during my Amazon internship. However, I should actively think about the access problem.

Details and Problems

From the presenter's point of view, what questions might the audience ask?