Paper Reading: A Language-Agnostic Model for Semantic Source Code Labeling

High-level

Summary:

The authors build a deep convolutional neural network (DCNN) for labeling arbitrary-length code documents with relevant meta-words such as “java” and “parsing” (for indexing purposes).

Evaluation:

On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags.
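
As a sanity check on what this number means, here is a minimal sketch of computing a per-tag mean ROC AUC with scikit-learn; the array names and shapes are hypothetical, not from the paper’s code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical multi-label evaluation: y_true is an (n_samples, n_tags)
# binary indicator matrix, y_score holds the model's per-tag probabilities.
def mean_auc(y_true, y_score):
    aucs = []
    for t in range(y_true.shape[1]):
        pos = y_true[:, t].sum()
        # ROC AUC is undefined for a tag with only one class present.
        if 0 < pos < y_true.shape[0]:
            aucs.append(roc_auc_score(y_true[:, t], y_score[:, t]))
    return float(np.mean(aucs))
```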

The characters from a given code snippet are converted to real-valued vectors using a character embedding
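
A minimal sketch of that step, assuming PyTorch; the vocabulary size and embedding dimension are illustrative, not the paper’s hyperparameters.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 128   # e.g. printable ASCII; an assumption, not from the paper
EMBED_DIM = 16     # illustrative embedding width

embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

snippet = "for i in range(10):"
char_ids = torch.tensor([[min(ord(c), VOCAB_SIZE - 1) for c in snippet]])
vectors = embed(char_ids)   # shape: (1, len(snippet), EMBED_DIM)
```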

Takeaways:

Kuhn, Ducasse, and Gîrba [18] apply Latent Semantic Indexing (LSI) and hierarchical clustering in order to analyze source code vocabulary without the use of external documentation.

We found punctuation to be a good indicator of code usefulness in short snippets. What are other possibilities? How do they compare?

Another option is to use length directly as a threshold, which is the solution the authors eventually picked. I think that while some heuristic might intuitively work, there is no simple metric for evaluating snippet quality with respect to a tag unless that metric is itself trained, and the dimensionality there is too high. A sketch of both heuristics follows.
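
For concreteness, a hedged sketch of the two heuristics discussed above; the threshold values are my illustrative guesses, not numbers from the paper.

```python
import string

PUNCT = set(string.punctuation)

def punctuation_ratio(snippet: str) -> float:
    # Fraction of non-whitespace characters that are punctuation.
    chars = [c for c in snippet if not c.isspace()]
    return sum(c in PUNCT for c in chars) / max(len(chars), 1)

def looks_like_code(snippet: str, min_len: int = 40,
                    min_punct: float = 0.05) -> bool:
    # Length threshold (the eventually picked solution) combined with
    # the punctuation signal, which matters most for short snippets.
    return len(snippet) >= min_len and punctuation_ratio(snippet) >= min_punct
```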

Practical Value

What can you learn from this to make your research better?

How could these semantic labels be adapted for other purposes, such as guiding heuristics for code synthesis?

Details and Problems

From the presenters’ point of view, what questions might the audience ask?

What does this long-tailed list mean? What does it imply?

It means that a few tags are used very frequently, while most tags appear in only a few samples.
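
One way to see this concretely is to count tag frequencies and check how concentrated they are; `samples` below is a hypothetical list of (snippet, tags) pairs, not the paper’s dataset.

```python
from collections import Counter

def tail_stats(samples):
    # Count every tag occurrence across the corpus.
    counts = Counter(tag for _, tags in samples for tag in tags)
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)
    head = sum(freqs[: max(len(freqs) // 100, 1)])  # top 1% of tags
    # In a long-tailed distribution, this tiny head covers a large share.
    return len(freqs), head / total
```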

What kind of knowledge is transferred, and what exactly is it?

The key knowledge is the statistical distribution of co-occurrence between code and labels.

The most questionable design choice is the “sum-over-time” pooling.

I think the key problem in code search is “this is not what I intend to find”. With Google search, you rarely scroll away from the first page, and in many cases the first answer is what you want. Why isn’t this experience possible on code platforms? Aren’t we searching a smaller, lower-dimensional space in this case?

Why deep CNN? How does it help? What are baselines?

CNNs are able to achieve state-of-the-art performance without the training time and data required for LSTMs

The Embedding CNN is compared with Embedding LR and n-gram LR baselines. What is the difference between the embedding LR and the n-gram LR? (I am afraid the n-grams are per word; if the word segmentation is not reasonable, it is hard to motivate character embeddings.) Also, the dimension of comparison is odd: why not an n-gram CNN, given that the proposed solution is an embedding CNN?
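
My reading of the two baselines, written out as a sketch (the paper’s exact feature extraction may differ): the n-gram LR feeds sparse character n-gram counts to a linear classifier, while the embedding LR first averages learned character embeddings into one dense vector. In practice this would be one binary classifier per tag.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# n-gram LR: sparse counts of character n-grams, then logistic regression.
ngram_lr = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# Embedding LR (sketched in pseudocode): average the character embedding
# vectors over the snippet, then apply logistic regression to that
# fixed-length dense vector:
#   x = mean(embed(char_ids), axis=time);  y = LR(x)
```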

Where does language-agnostic property come from?

I think the character embedding is what makes it language-agnostic, since any reasonable parsing of code, even just into sensible words, requires some language-specific engineering. However, that kind of work is done once and for all, so I am skeptical of the value here. (Also, an existing token-stream generator would be sufficient, wouldn’t it?)

While the authors introduce the architecture of the neural network used, the problem is the motivation: why is each layer used?

The convolutions are able to preserve information about words and sequences by sliding over the embedding vectors of consecutive characters.

Using sum-over-time pooling on the stacked convolution matrix allows us to obtain a fixed-length vector regardless of the initial input size.
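
Putting the two quoted pieces together, here is a minimal PyTorch sketch of convolution over character embeddings followed by sum-over-time pooling; all dimensions are illustrative, not the paper’s.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(128, 16)                   # character embedding table
conv = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=5)

char_ids = torch.randint(0, 128, (1, 200))      # any snippet length works
x = embed(char_ids).transpose(1, 2)             # (batch, embed_dim, time)
features = torch.relu(conv(x))                  # (batch, 32, time - 4)
pooled = features.sum(dim=2)                    # (batch, 32): fixed length
```

Summing (rather than max-pooling) over time is what makes the output length-independent, but it also makes the magnitudes grow with snippet length, which is part of why I find this choice suspicious.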