Google AI introduces ‘FLAN’: an instruction-tuned, generalizable language model for zero-shot NLP tasks


In order to generate meaningful text, a machine learning model needs a great deal of knowledge about the world and the ability to abstract from it. While language models trained for this purpose acquire more of this knowledge automatically as they grow, it is unclear how that knowledge can be elicited and applied to specific real-world tasks.

Fine-tuning is the standard practice for doing this. It involves training a pre-trained model such as BERT or T5 on a labeled dataset in order to adapt it to a downstream task. However, it requires a large number of training examples, as well as stored model weights for each downstream task, which is not always feasible, especially with large models.

A recent Google study examines a simple technique known as instruction fine-tuning, or instruction tuning. It involves fine-tuning a model to make it more amenable to performing natural language processing (NLP) tasks in general, rather than one specific task.

The researchers used instruction tuning to train a model called Fine-tuned LAnguage Net (FLAN). The instruction-tuning phase of FLAN requires only a small number of updates compared to the massive amount of computation involved in pre-training the model. This enables FLAN to perform a wide variety of unseen tasks.

An illustration of how FLAN works: the model is fine-tuned on diverse sets of instructions and generalizes to unseen instructions. The more task types that are added to the fine-tuning mix, the better the performance.

Zero-shot prompting

When using language models to solve problems, zero-shot or few-shot prompting is a popular strategy. This method frames a task as text resembling what the language model encountered during training, and the language model then completes the text to generate the solution. For example, to classify the sentiment of a movie review, the line “The movie review ‘greatest RomCom since Pretty Woman’ is _” could be given to a language model, which completes the sentence with either the word “positive” or “negative”.
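The completion-based setup above can be sketched in a few lines. This is an illustrative toy, not FLAN's actual inference code: `score_continuation` stands in for a real language model's log-likelihood and here is just a keyword heuristic, so only the prompt-framing and candidate-ranking structure is meaningful.

```python
# Sketch of zero-shot prompting: frame the task as text completion and
# pick the candidate continuation the "model" finds most likely.

def build_prompt(review: str) -> str:
    # Frame sentiment classification as a cloze-style completion task.
    return f"The movie review '{review}' is _"

def score_continuation(prompt: str, candidate: str) -> float:
    # Placeholder for a real model's log-probability of `candidate`
    # given `prompt`; here a trivial keyword heuristic for illustration.
    positive_cues = ("greatest", "best", "wonderful")
    hits = sum(cue in prompt.lower() for cue in positive_cues)
    return hits if candidate == "positive" else -hits

def zero_shot_classify(review: str, labels=("positive", "negative")) -> str:
    prompt = build_prompt(review)
    return max(labels, key=lambda label: score_continuation(prompt, label))

print(zero_shot_classify("greatest RomCom since Pretty Woman"))  # positive
```

With a real language model, the scoring step would compare the model's likelihood of each candidate completion rather than matching keywords.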

However, this method requires careful prompt engineering to design tasks that resemble the data the model saw during training. This approach works well for some tasks, but not all, and can also be unintuitive for practitioners interacting with the model. For example, according to the developers of GPT-3 (one of the most widely used language models today), such prompting strategies did not perform well on natural language inference (NLI) tasks.

Instruction tuning

In contrast, FLAN fine-tunes the model on a variety of different instructions that use simple, intuitive task descriptions, such as “Classify this movie review as positive or negative” or “Translate this sentence into Danish”.

The team used templates to turn existing datasets into instructions for fine-tuning the model. They suggest that training models on these instructions improves not only their ability to solve the kinds of instructions they saw during training, but also their general ability to follow instructions.
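The template idea can be sketched as follows. The template strings and field names here are invented for illustration and are not the paper's exact templates; the point is that one labeled record expands into several instruction-style (input, target) pairs.

```python
# Sketch: expand a labeled example into instruction-style training pairs
# via templates, in the spirit of FLAN. Template wording is illustrative.

SENTIMENT_TEMPLATES = [
    "Classify this movie review as positive or negative: {text}",
    "Is the sentiment of the following review positive or negative? {text}",
    "Review: {text}\nWhat is the sentiment of this review?",
]

def to_instruction_examples(record: dict, templates: list) -> list:
    """Expand one labeled record into several (instruction, target) pairs."""
    return [
        {"input": tpl.format(text=record["text"]), "target": record["label"]}
        for tpl in templates
    ]

examples = to_instruction_examples(
    {"text": "greatest RomCom since Pretty Woman", "label": "positive"},
    SENTIMENT_TEMPLATES,
)
print(len(examples))  # one instruction pair per template
```

Using several phrasings per dataset exposes the model to varied instruction wordings, which is what lets it follow instructions it has not seen verbatim.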


Model evaluation and performance

The team used established benchmark datasets to compare FLAN's performance against current models. Importantly, they evaluated FLAN on datasets whose examples it had never seen during training.

The researchers point out that if the training data is too similar to the evaluation data, performance results may be skewed. They therefore group all datasets into task clusters, and when evaluating on a given dataset, they hold out not only that dataset's training data but the entire task cluster to which the dataset belongs.
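This leave-one-cluster-out setup can be sketched as below. The cluster and dataset names are illustrative placeholders, not the paper's exact task list; the point is that evaluating on one dataset excludes its whole cluster from instruction tuning.

```python
# Sketch of leave-one-cluster-out evaluation: when evaluating on a
# dataset, every dataset in its task cluster is excluded from tuning.
# Cluster and dataset names here are illustrative.

TASK_CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "sentiment": ["sst2", "imdb", "yelp"],
    "translation": ["wmt_en_fr", "wmt_en_de"],
}

def training_datasets(eval_dataset: str) -> list:
    """Return the datasets usable for tuning when `eval_dataset` is held out."""
    held_out = next(
        cluster for cluster, members in TASK_CLUSTERS.items()
        if eval_dataset in members
    )
    return [
        name
        for cluster, members in TASK_CLUSTERS.items()
        if cluster != held_out
        for name in members
    ]

print(training_datasets("rte"))  # no NLI dataset appears in the result
```

Holding out the whole cluster, rather than just the evaluation dataset, ensures that good results reflect generalization to a genuinely unseen task type.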


They tested FLAN on 25 different tasks and found that it beats zero-shot prompting on all but four of them. On 20 of the 25 tasks, FLAN outperformed zero-shot GPT-3, and on some tasks it even outperformed few-shot GPT-3.


Their results show that model scale is critical to whether or not a model can benefit from instruction tuning. At smaller scales, the FLAN approach actually degrades performance; only at larger scales can the model generalize from the instructions in the training data to unseen tasks. This may be because models with too few parameters cannot perform many tasks well.

FLAN is the first model to apply instruction tuning at scale and thereby improve a model's generalization ability. The team hopes that their proposed model will stimulate future research on models that can perform unseen tasks and learn from very little input.




