As artificial intelligence expands its horizons and breaks new ground, it keeps opening new frontiers for the imagination. As new algorithms and models address a growing number and variety of business problems, advances in natural language processing (NLP) and language models are prompting programmers to think about how they might revolutionize the world of programming.
With the proliferation of programming languages, the programmer’s job has grown increasingly complex. A good programmer may be able to design a sound algorithm, but converting it into a particular programming language requires knowledge of that language’s syntax and available libraries, which limits how effectively a programmer can work across languages.
Traditionally, programmers have relied on their knowledge, experience, and repositories to build these code components across languages. IntelliSense helped them with appropriate syntactical prompts; advanced IntelliSense went a step further with syntax-based auto-completion of statements. Google code search and GitHub code search even surfaced similar code snippets, but the responsibility for finding the right pieces of code or writing code from scratch, assembling the pieces, and contextualizing them for a specific need still rested on programmers alone.
We are now seeing the development of intelligent systems that can understand the goal of an atomic task, understand the context, and generate appropriate code in the required language. This generation of contextual and relevant code can only be done when there is a proper understanding of programming languages and natural language. Algorithms can now understand these nuances across languages, opening up a number of possibilities:
- Code conversion: understand code in one language and generate equivalent code in another.
- Code documentation: generate a textual description of a given piece of code.
- Code generation: generate appropriate code from a text description.
- Code validation: validate that code aligns with a given specification.
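To make the documentation task concrete, here is a deliberately trivial, rule-based sketch in Python (the function names and output format are illustrative assumptions). It walks a module’s syntax tree and emits a one-line English description per function; the hand-written rules here stand in for the mapping that the language models discussed below learn from data instead.

```python
import ast

def summarize(source: str) -> list[str]:
    """Produce a one-line English description of each function in a module."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            doc = ast.get_docstring(node) or "no docstring"
            lines.append(f"Function '{node.name}' takes ({args}); {doc}")
    return lines

code = '''
def add(a, b):
    "Return the sum of a and b."
    return a + b
'''
print(summarize(code))
# → ["Function 'add' takes (a, b); Return the sum of a and b."]
```

A learned model generalizes where these rules cannot: it can describe what a function does, not just its signature.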
The evolution of code conversion can be better understood if we look at Google Translate, which we use quite often for natural language translations. Unlike traditional systems that relied on translation rules between source and target language, Google Translate learned the nuances of translation from a vast corpus of parallel data sets – source language statements and their equivalent target language statements.
Because it is easier to collect data than to write rules, Google Translate has scaled to translating between more than 100 natural languages. Neural machine translation (NMT), a class of machine learning models, enabled Google Translate to learn from a huge dataset of translation pairs. The efficiency of Google Translate inspired the first generation of machine-learning-based translators for programming languages to adopt NMT. However, the success of NMT-based programming language translators has been limited because the large parallel datasets needed for supervised learning are not available for programming languages.
This has led to unsupervised machine translation models that leverage the large monolingual codebases that are publicly available. These models first learn from monolingual code in the source programming language, then from monolingual code in the target programming language, and are then equipped to translate code from the source to the target. Facebook’s TransCoder, built on this approach, is an unsupervised machine translation model, trained on multiple monolingual codebases from open-source GitHub projects, that can efficiently translate functions between C++, Java, and Python.
Code generation is currently evolving in different forms: as a standalone code generator, or as a pair programmer that auto-completes a developer’s code.
The key technique used in these NLP models is transfer learning, in which models are pretrained on large amounts of data and then fine-tuned on small, targeted datasets. Earlier models were largely based on recurrent neural networks; more recently, models based on the Transformer architecture have proved more effective because they lend themselves to parallelization, which speeds up computation. Models optimized in this way for programming language generation can then be used for various coding tasks, including code generation and generating unit test scripts for code validation.
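As a minimal NumPy sketch of why Transformers parallelize well (the shapes and random values here are illustrative assumptions), the snippet below computes scaled dot-product self-attention, the core Transformer operation. Every output position is produced in one batched matrix product rather than the position-by-position loop an RNN requires.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    All sequence positions are handled by two matrix products,
    which is what makes the Transformer easy to parallelize.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 5 tokens, 8-dim embeddings
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)   # → (5, 8)
```

Production models add multiple heads, learned projections, and positional encodings on top of this primitive, but the parallel structure is the same.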
We can also reverse this approach, applying the same algorithms to understand code and generate relevant documentation. Traditional documentation systems focus on translating legacy code into English line by line, producing pseudocode. This new approach, in contrast, can consolidate code modules into comprehensive documentation.
Programming language generation models available today include CodeBERT, CuBERT, GraphCodeBERT, CodeT5, PLBART, CodeGPT, CodeParrot, GPT-Neo, GPT-J, GPT-NeoX, and Codex, among others.
DeepMind’s AlphaCode goes a step further, generating multiple code samples for a given description while ensuring that the generated code passes the given test conditions.
Code auto-completion follows the same approach as Gmail’s Smart Compose, which gives users real-time, context-specific suggestions to help them compose emails faster. Smart Compose is powered by a neural language model trained on a large corpus of emails from the Gmail domain.
Extended to the programming domain, a model that can predict the next lines of a program from the code written so far makes an ideal pair programmer. This can significantly speed up the development lifecycle, increase developer productivity, and improve code quality.
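The prediction interface can be illustrated with a toy bigram model; the tiny “codebase” and whitespace tokenization below are deliberate simplifications. It counts which token follows which and suggests the most frequent continuation; a real pair programmer replaces the counts with a large neural language model, but the predict-the-next-token contract is the same.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_lines):
    """Count which token follows which across a toy 'codebase'."""
    follows = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def suggest(follows, token):
    """Suggest the most frequent continuation of `token`, or None."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

corpus = [
    "for i in range ( n ) :",
    "for k in range ( 10 ) :",
    "if i in seen :",
]
model = train_bigram(corpus)
print(suggest(model, "in"))   # → 'range' (seen twice, vs. 'seen' once)
```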
Not only can GitHub Copilot auto-complete blocks of code, it can also edit or insert content into existing code, making it a very powerful pair programmer with refactoring abilities. Copilot is powered by Codex, a model with billions of parameters trained on a large volume of code from public repositories, including GitHub.
An important point to note is that we are likely to be in a transition phase where pair programming is essentially human-in-the-loop, which in itself is a significant milestone. But the ultimate goal is undoubtedly autonomous code generation. However, the development of AI models that inspire trust and responsibility will define this journey.
Code generation for complex scenarios that demand more problem-solving and logical reasoning is still a challenge, since it can require generating code that has not been seen before.
Understanding enough of the current context to generate appropriate code is limited by the size of the model’s context window. Most current programming language models support a context size of 2,048 tokens; Codex supports 4,096 tokens. The examples provided in few-shot settings consume a portion of these tokens, leaving only the remainder for developer input and model-generated output, whereas zero-shot and fine-tuned models reserve the entire context window for input and output.
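The token budget arithmetic can be sketched as follows; the specific figures (a 50-token task description, 300-token few-shot examples) are illustrative assumptions, not measured values.

```python
def available_tokens(context_window, prompt_tokens, shot_tokens, n_shots):
    """Tokens left over for the developer's input plus the model's output."""
    used = prompt_tokens + n_shots * shot_tokens
    return context_window - used

# A 2,048-token window with three 300-token few-shot examples:
print(available_tokens(2048, 50, 300, 3))   # → 1098
# A zero-shot setup keeps the whole window minus the task description:
print(available_tokens(4096, 50, 0, 0))     # → 4046
```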
Most language models require high computational power because they are built on billions of parameters. Introducing these in different business contexts could place higher demands on computing budgets. There is currently a lot of focus on optimizing these models to allow for easier adoption.
For these code generation models to work in pair-programming mode, their inference must be fast enough that predictions render in under 0.1 seconds, giving developers a seamless experience in their IDE.
Kamalkumar Rathinasamy leads the machine learning-based machine programming group at Infosys and focuses on building machine learning models to extend coding tasks.
Vamsi Krishna Oruganti is an automation enthusiast and leads the deployment of AI and automation solutions for financial services clients at Infosys.
Welcome to the VentureBeat community!
DataDecisionMakers is the place where experts, including technical staff, working with data can share data-related insights and innovations.
If you want to read about innovative ideas and up-to-date information, best practices and the future of data and data technology, visit us at DataDecisionMakers.
You might even consider contributing an article of your own!
Read more from DataDecisionMakers