Palani, Dharani Kumar ORCID: https://orcid.org/0000-0002-7454-6028 (2018) Statistical Machine Translation of English Text to API Code Usages: A comparison of Word Map, Contextual Graph Ordering, Phrase-based, and Neural Network Translations. Masters thesis, Concordia University.
Text (application/pdf): Palani_MCompSc_S2018.pdf (1MB) - Accepted Version
Available under License Spectrum Terms of Access.
Abstract
Statistical Machine Translation (SMT) has gained enormous popularity in recent years as natural
language translations have become increasingly accurate. In this thesis we apply SMT techniques in
the context of translating English descriptions of programming tasks to source code. We evaluate
four existing approaches: maximum likelihood word maps, ContextualExpansion, phrase-based, and
neural network translation. As a training and test (i.e., reference translation) data set, we clean and
align data from the popular developer discussion forum StackOverflow.
Our baseline approach, WordMapK, uses a simple maximum likelihood word map model which
is then ordered using existing code usage graphs. The approach is quite effective, with a precision
and recall of 20 and 50, respectively. ContextualExpansion, which adds context to the word map model,
increases the precision to 25 with a recall of 40. The traditional phrase-based translation
model, Moses, achieves a similar precision and recall while also incorporating the context of the input
text by mapping English sequences to code sequences. The final approach is neural network translation,
OpenNMT. While the median precision is 100, the recall is only 20. Manual examination of the neural
translation output shows that the code usages are very small and obvious. Our results represent
an application of existing natural language strategies in the context of software engineering. We
make our scripts, corpus, and reference translations available in the hope that future work will adapt
these techniques to further increase the quality of English-to-code statistical machine translation.
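
As a rough illustration of the baseline idea described in the abstract, the following minimal sketch builds a maximum likelihood word map from aligned English and code tokens and scores a translation with set-based precision and recall. It is not the thesis's WordMapK implementation: the function names, the toy corpus, and the API names (`File.open`, `File.read`, `File.close`) are hypothetical, and the ordering step based on code usage graphs is omitted.

```python
from collections import Counter, defaultdict


def train_word_map(pairs):
    """Count how often each English word co-occurs with each code element.

    `pairs` is an iterable of (english_tokens, code_tokens) tuples
    (hypothetical toy format, not the thesis's corpus format).
    """
    cooc = defaultdict(Counter)
    for words, codes in pairs:
        for w in set(words):
            for c in set(codes):
                cooc[w][c] += 1
    return cooc


def translate(cooc, words, k=1):
    """Map each word to its k most frequent code elements.

    For a fixed word, the highest co-occurrence count is also the
    maximum likelihood estimate of P(code | word).
    """
    out = []
    for w in words:
        for c, _ in cooc.get(w, Counter()).most_common(k):
            if c not in out:
                out.append(c)
    return out


def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted code elements."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Toy aligned corpus (hypothetical data).
pairs = [
    (["open", "file"], ["File.open"]),
    (["read", "file"], ["File.open", "File.read"]),
    (["read", "line"], ["File.read"]),
]
cooc = train_word_map(pairs)
predicted = translate(cooc, ["read", "file"], k=1)
precision, recall = precision_recall(
    predicted, ["File.open", "File.read", "File.close"]
)
```

In this toy run every predicted element appears in the reference, so precision is high, while the missing `File.close` lowers recall. This is the same kind of trade-off between precision and recall that the abstract reports across the four approaches.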
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Palani, Dharani Kumar |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 30 July 2018 |
Thesis Supervisor(s): | Rigby, Peter C |
Keywords: | Language Model, n-gram, Graph, Statistical Machine Translation, Neural Networks, StackOverflow, Probability |
ID Code: | 984139 |
Deposited By: | DHARANI KUMAR PALANI |
Deposited On: | 02 Nov 2018 20:06 |
Last Modified: | 02 Nov 2018 20:06 |