In June, OpenAI partnered with GitHub to launch Copilot, a service that provides suggestions for entire lines of code in development environments like Microsoft Visual Studio. Driven by an AI model called Codex – which OpenAI later made available via an API – Copilot can translate natural language into code in more than a dozen programming languages, interpret commands in plain English, and execute them.
A community effort is now underway to create a freely available, open source alternative to Copilot and OpenAI’s Codex model. Called GPT Code Clippy, the project aims to build an AI pair programmer and to let researchers examine large AI models trained on code to better understand their capabilities – and limitations.
Open source models
Trained on billions of lines of public code, Codex works with a wide range of frameworks and languages, and adapts to the changes developers make to suit their coding styles. Similarly, GPT Code Clippy learned from hundreds of millions of examples drawn from codebases to generate code resembling that of a human programmer.
The contributors to the GPT Code Clippy project used GPT-Neo as the basis of their AI models. Developed by the grassroots research collective EleutherAI, GPT-Neo is a so-called transformer model, meaning it weighs the influence of different parts of the input data rather than treating all input equally. Transformers don’t need to process the beginning of a sentence before the end; instead, they identify the context that gives a word its meaning, which lets them process input data in parallel.
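The weighting idea at the heart of a transformer can be sketched in a few lines. This is a deliberately stripped-down illustration – identity query/key/value projections, a single head, no causal mask – not GPT-Neo’s actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention over token vectors X of shape (seq_len, d).
    For clarity, the query/key/value projections are the identity."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # pairwise influence of every token on every other
    weights = softmax(scores, axis=-1)  # each row sums to 1: a weighting over the whole input
    return weights @ X                  # every position is computed from all tokens at once

# Three toy token embeddings; no left-to-right pass is needed.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(X)
print(out.shape)  # (3, 2)
```

Because every output row is a weighted mix of all input rows, the whole sequence can be computed in one parallel step, which is what makes transformers efficient to train on accelerators.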
GPT-Neo was “pre-trained” on The Pile, an 825GB collection of 22 smaller datasets spanning academic sources (e.g., Arxiv, PubMed), community sites (Stack Exchange, Wikipedia), code repositories (GitHub), and more. Through fine-tuning, the GPT Code Clippy contributors improved the models’ understanding of code by exposing them to GitHub repositories that met certain search criteria (e.g., more than 10 GitHub stars and two commits), after duplicate files were filtered out.
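The filtering and deduplication steps described above can be sketched roughly as follows. The repository names, fields, and thresholds here are illustrative stand-ins; the project’s real pipeline queried GitHub’s search API rather than a hard-coded list:

```python
import hashlib

# Hypothetical repository metadata standing in for GitHub search results.
repos = [
    {"name": "useful-lib", "stars": 42, "commits": 120},
    {"name": "empty-fork", "stars": 3, "commits": 1},
    {"name": "toy-project", "stars": 15, "commits": 2},
]

def keep_repo(repo, min_stars=10, min_commits=2):
    """Mirror the article's example criteria: more than 10 stars and at least two commits."""
    return repo["stars"] > min_stars and repo["commits"] >= min_commits

def dedup_files(files):
    """Drop byte-identical duplicate files by hashing their contents."""
    seen, unique = set(), []
    for path, contents in files:
        digest = hashlib.sha256(contents.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, contents))
    return unique

selected = [r["name"] for r in repos if keep_repo(r)]
print(selected)  # ['useful-lib', 'toy-project']

files = [("a.py", "print('hi')"), ("copy_of_a.py", "print('hi')"), ("b.py", "x = 1")]
print(len(dedup_files(files)))  # 2
```

Deduplication matters because identical files repeated across forks would otherwise be over-represented in the training data.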
“We used Hugging Face’s Transformers library … to fine-tune our model[s] on various code datasets, including one of our own that we scraped from GitHub,” the contributors explain on the GPT Code Clippy project page. “We decided to fine-tune rather than train from scratch, as OpenAI’s Codex paper reports that training from scratch and fine-tuning [result in equivalent] performance. However, fine-tuning allowed the model[s] to converge faster than training from scratch. Therefore, all versions of our models are fine-tuned.”
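Whether training from scratch or fine-tuning, the quantity being minimized is the same next-token cross-entropy; fine-tuning simply starts from GPT-Neo’s pre-trained weights instead of random ones. Here is a toy sketch of that loss with made-up numbers – not the project’s actual training code, which uses Hugging Face’s Transformers:

```python
import math

def next_token_loss(logits, target_ids):
    """Average cross-entropy of predicting each next token.
    `logits` holds the model's scores over a toy vocabulary at each position;
    `target_ids` holds the tokens that actually came next."""
    total = 0.0
    for step_logits, target in zip(logits, target_ids):
        exps = [math.exp(v) for v in step_logits]
        prob = exps[target] / sum(exps)  # softmax probability of the true next token
        total += -math.log(prob)         # penalize low probability on the truth
    return total / len(target_ids)

# Toy vocabulary of 3 "tokens"; two prediction steps.
logits = [[2.0, 0.1, 0.1], [0.1, 0.1, 2.0]]  # model's scores at each position
targets = [0, 2]                              # the actual next tokens
print(next_token_loss(logits, targets))
```

Fine-tuning converges faster because the pre-trained weights already assign reasonable probabilities to code-like token sequences, so the loss starts lower than it would from random initialization.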
GPT Code Clippy contributors have so far trained several models on third-generation Tensor Processing Units (TPUs), Google’s custom AI accelerator chips, available through Google Cloud. Although the project is just getting started, the team has created a plugin for Visual Studio and plans to extend GPT Code Clippy’s capabilities to other languages, especially underrepresented ones.
“Our ultimate goal is not just to develop an open source version of GitHub’s Copilot, but one with comparable performance and ease of use,” the contributors write. “[We hope to eventually] develop ways of keeping up with new versions and updates of programming languages.”
Promises and setbacks
AI-powered coding models are valuable not only for writing new code, but also for lower-hanging fruit like updating existing code. Migrating an existing codebase to a modern or more efficient language such as Java or C++, for example, requires expertise in both the source and target languages – and is often costly. The Commonwealth Bank of Australia spent approximately $750 million over five years migrating its platform from COBOL to Java.
However, there are many potential pitfalls, such as bias and unwanted code suggestions. In a recent paper, the Salesforce researchers behind CodeT5, a Codex-like system that can understand and generate code, caution that the datasets used to train CodeT5 could encode stereotypes around race and gender drawn from text comments – or even from the source code itself. CodeT5 could also leak sensitive information such as personal addresses and identification numbers, and it could generate vulnerable code that negatively impacts software.
OpenAI similarly found that Codex could suggest compromised packages, invoke functions insecurely, and produce programming solutions that appear correct but don’t actually do the intended job. The model can also be prompted into generating racist and otherwise harmful output as code, such as the words “terrorist” and “violent” when writing code comments in response to the prompt “Islam.”
The GPT Code Clippy team didn’t say how it could mitigate biases that might appear in its open source models, but the challenges are clear. For example, while the models could eventually reduce Q&A sessions and repetitive code review feedback, they could do harm if not carefully scrutinized – especially given research showing coding models fall short of human accuracy.
For AI coverage, send news tips to Kyle Wiggers – and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.
Thanks for reading,

Kyle Wiggers

AI Staff Writer