Statistical Machine Translation of English Text to API Code Usages: A comparison of Word Map, Contextual Graph Ordering, Phrase-based, and Neural Network Translations

Palani, Dharani Kumar ORCID: https://orcid.org/0000-0002-7454-6028 (2018) Statistical Machine Translation of English Text to API Code Usages: A comparison of Word Map, Contextual Graph Ordering, Phrase-based, and Neural Network Translations. Masters thesis, Concordia University.

Text (application/pdf)
Palani_MCompSc_S2018.pdf - Accepted Version
Available under License Spectrum Terms of Access.
1MB

Abstract

Statistical Machine Translation (SMT) has gained enormous popularity in recent years as natural
language translations have become increasingly accurate. In this thesis we apply SMT techniques in
the context of translating English descriptions of programming tasks to source code. We evaluate
four existing approaches: maximum likelihood word maps, ContextualExpansion, phrase-based, and
neural network translation. As a training and test (i.e., reference translation) data set, we clean and
align posts from the popular developer discussion forum StackOverflow.
Our baseline approach, WordMapK, uses a simple maximum likelihood word map model whose output
is then ordered using existing code usage graphs. The approach is quite effective, with a precision
and recall of 20 and 50, respectively. Adding context to the word map model, ContextualExpansion,
increases the precision to 25 with a recall of 40. The traditional phrase-based translation
model, Moses, achieves a similar precision and recall while also incorporating the context of the input text
by mapping English sequences to code sequences. The final approach is neural network translation,
OpenNMT. While its median precision is 100, the recall is only 20. When manually examining the
output of the neural translation, we find that the code usages are very small and obvious. Our results represent
an application of existing natural language strategies in the context of software engineering. We
make our scripts, corpus, and reference translations available in the hope that future work will adapt these
techniques to further increase the quality of English to code statistical machine translation.
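
As an illustration of the maximum likelihood word map idea and of the precision and recall figures quoted above, the following is a minimal Python sketch. It is not the thesis's WordMapK implementation: the toy corpus, the function names `train_word_map`, `translate`, and `precision_recall`, the parameter `k`, and the co-occurrence-based probability estimate are all assumptions made for illustration, and the graph-based ordering step is omitted entirely.

```python
from collections import Counter, defaultdict


def train_word_map(pairs):
    """Estimate P(code_element | english_word) by relative co-occurrence frequency.

    `pairs` is an iterable of (english_tokens, code_tokens) aligned examples.
    This is an assumed, simplified estimator, not the thesis's exact model.
    """
    cooccurrence = defaultdict(Counter)
    word_count = Counter()
    for english, code in pairs:
        for word in set(english):
            word_count[word] += 1
            for element in set(code):
                cooccurrence[word][element] += 1
    # Maximum likelihood estimate: count(word, element) / count(word)
    return {
        word: {el: n / word_count[word] for el, n in elements.items()}
        for word, elements in cooccurrence.items()
    }


def translate(word_map, english, k=1):
    """Map each English word to its k most probable code elements (unordered usage)."""
    output = []
    for word in english:
        candidates = word_map.get(word, {})
        # Sort by descending probability, breaking ties alphabetically
        top = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for element, _ in top:
            if element not in output:
                output.append(element)
    return output


def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted code elements against a reference."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical toy corpus of (English tokens, code elements) pairs
pairs = [
    (["read", "file"], ["FileReader.new", "BufferedReader.readLine"]),
    (["read", "line"], ["BufferedReader.readLine"]),
    (["open", "file"], ["FileReader.new"]),
]
word_map = train_word_map(pairs)
predicted = translate(word_map, ["read", "file"], k=1)
print(predicted, precision_recall(predicted, pairs[0][1]))
```

Raising the hypothetical `k` in this sketch returns more candidate code elements per word, which tends to raise recall at the cost of precision; a precise but very short output gives the opposite pattern, as in the neural translation result reported above.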

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Palani, Dharani Kumar
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 30 July 2018
Thesis Supervisor(s): Rigby, Peter C
Keywords: Language Model, n-gram, Graph, Statistical Machine Translation, Neural Networks, StackOverflow, Probability
ID Code: 984139
Deposited By: DHARANI KUMAR PALANI
Deposited On: 02 Nov 2018 20:06
Last Modified: 02 Nov 2018 20:06
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
