Paper Review: Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code



  1. Normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (i.e., shared embeddings based on word2vec, a neural-network-based technique for producing word embeddings).
  2. Hierarchically, from the bottom up, construct shared embeddings for code elements at higher levels of granularity from the embeddings of their constituents, then build mappings among code elements across languages based on similarities among the embeddings.
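The bottom-up composition step can be sketched as follows. This is a minimal illustration, assuming token embeddings in a shared cross-language space are already trained; all token names and vectors here are hypothetical toy values, not the paper's data.

```python
import numpy as np

def compose(children):
    """Embed a higher-granularity code element by averaging its constituents."""
    return np.mean(children, axis=0)

def cosine(a, b):
    """Cosine similarity, used to match elements across languages."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical shared (cross-language) token embeddings.
tok = {
    "java:List.add": np.array([0.9, 0.1]),
    "java:for":      np.array([0.2, 0.8]),
    "cs:List.Add":   np.array([0.88, 0.12]),
    "cs:foreach":    np.array([0.25, 0.75]),
}

# Tokens -> method-level embeddings: each level averages the level below.
java_method = compose([tok["java:List.add"], tok["java:for"]])
cs_method   = compose([tok["cs:List.Add"], tok["cs:foreach"]])

# Cross-language mapping candidates are ranked by embedding similarity.
print(cosine(java_method, cs_method))
```

The same `compose` step would then be applied again to go from methods to classes and files, so one similarity function serves every level of granularity.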


About 40K Java and C# files from 9 projects. The approach identifies their cross-language mappings with reasonable Mean Average Precision (MAP) scores.

It achieves around 50% precision when recommending top-10 cross-language code mappings at various levels of granularity.
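To make the evaluation metrics concrete, here is a small sketch of precision@k and Mean Average Precision over ranked candidate lists. The query and candidate names are hypothetical examples, not taken from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked candidates that are correct."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def average_precision(ranked, relevant):
    """Average of precision@i over the ranks i where a correct item appears."""
    hits, total = 0, 0.0
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked_candidates, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical query: a Java API with ranked C# candidates.
ranked = ["List.Add", "List.Insert", "Array.Resize"]
relevant = {"List.Add", "List.Insert"}
print(average_precision(ranked, relevant))  # (1/1 + 2/2) / 2 = 1.0
```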


NMT models use distributed vector representations of words as the basic unit to compose representations for more complex language elements, such as sentences and paragraphs.

Skip-gram: before this there was the bi-gram model, which trains on each word and its most adjacent word. In skip-gram, the context word can be any word inside a window around the target, so training pairs can skip over the most adjacent word; hence the name.

How to compose from lower granularity to higher? Simply averaging the word embeddings of all words in a text can be a strong baseline for representing the whole text in short-text similarity comparison. Variants of this simple averaging strategy exist, such as weighting the embeddings by term frequency/inverse document frequency (TF-IDF) to decrease the influence of the most common words.
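The plain-averaging baseline and its TF-IDF-weighted variant can be sketched as below. The corpus and the randomly initialized word vectors are toy stand-ins; in practice the vectors would come from a trained word2vec model.

```python
import math
import numpy as np

# Toy corpus: each "document" is a short token sequence.
corpus = [["open", "file", "read"], ["open", "socket", "read"], ["close", "file"]]

# Hypothetical word embeddings (random stand-ins for trained vectors).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["open", "file", "read", "socket", "close"]}

def idf(word):
    """Inverse document frequency: rarer words get larger weights."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df)

def embed_avg(doc):
    """Baseline: plain average of the word embeddings."""
    return np.mean([emb[w] for w in doc], axis=0)

def embed_tfidf(doc):
    """Variant: TF-IDF-weighted average, down-weighting common words."""
    weights = np.array([doc.count(w) * idf(w) for w in doc])
    vecs = np.stack([emb[w] for w in doc])
    return (weights[:, None] * vecs).sum(axis=0) / (weights.sum() + 1e-12)

print(embed_avg(["open", "file"]))
print(embed_tfidf(["open", "file"]))
```

Note how "open" (appearing in two of three documents) contributes less to the TF-IDF variant than the rarer "file".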

Practical Value

What you can learn from this to make your research better?

Why don’t we combine both the code structures and the API/operator sequence information? A programmer goes back and forth between these two different levels of abstraction.

Details and Problems

From the presenters’ point of view, what questions might the audience ask?

What is the ground-truth?

What does the training data look like? How much of it do we have?

A parallel corpus is a collection of source code in one language and its translation into another language. The authors utilize the similarity among file names to identify files in different languages that implement the same functionality.

They then normalize the token streams in the files to remove semantically irrelevant information (e.g., some variable names) and add more structural and semantic information.

What does “mapping” mean in the evaluation? Does this tool generate translated code from source code? There is no example of a mapping. Why is the mapping only a list of API pairs? Shouldn’t the mapping be source-to-source? Disappointing.