At the time of its installation in summer 2018, Tetralith was more than just the fastest of the six traditional supercomputers in the National Supercomputer Center (NSC) at Linköping University. It was the most powerful supercomputer in the Nordic region.
But just three years later, Tetralith had to be supplemented with a new system – one that is specially tailored to the requirements of the rapidly developing algorithms for artificial intelligence (AI) and machine learning (ML). Tetralith wasn’t designed for machine learning – it didn’t have the parallel computing power needed to process the ever-growing data sets used to train artificial intelligence algorithms.
To support research programs in Sweden that rely on AI, the Knut and Alice Wallenberg Foundation donated 29.5 million euros to build a larger supercomputer. Berzelius was delivered in 2021 and put into operation that summer. The supercomputer, which has more than twice the processing power of Tetralith, takes its name from the renowned scientist Jacob Berzelius, who came from Östergötland, the region of Sweden in which the NSC is located.
Atos supplied and installed Berzelius, which includes 60 of the newest and most powerful servers from Nvidia – the DGX systems, each with eight graphics processors (GPUs). Nvidia networking connects the servers with each other – and with 1.5 PB (petabytes) of storage hardware. Atos also supplied its Codex AI Suite, an application toolset to support researchers. The entire system is housed in 17 racks that extend side by side for about 10 meters.
The system is to be used for AI research – not only by the major programs of the Knut and Alice Wallenberg Foundation, but also by other scientific users who apply for time on the system. Most of the users will be in Sweden, but some will be researchers in other parts of the world working with Swedish scientists. The largest areas of Swedish research that will use the system in the near future are autonomous systems and data-driven life sciences. In both cases, a lot of machine learning with huge data sets is required.
NSC intends to hire people to help users – not so much core programmers, but more to help users put together parts they already have. There are many software libraries for AI and they need to be understood and used properly. The researchers who use the system usually either program themselves, have assistants do it for them, or simply adapt good open source projects to their needs.
“So far, around 50 projects have been approved on Berzelius,” says Niclas Andresson, NSC’s technology manager. “The system is not yet fully utilized, but the utilization is increasing. Some problems use a large part of the system. For example, we had a hackathon on NLP [natural language processing], and that used the system quite well. Nvidia has provided a toolbox for NLP that can be scaled to the large machine.”
In fact, one of the biggest challenges researchers now face is adapting the software they use to the new computing power. Many of them have one or a small number of GPUs that they use on their desktop computers. However, scaling their algorithms to a system with hundreds of GPUs is a challenge.
Now Swedish researchers have the opportunity to think big.
AI researchers in Sweden have been using supercomputer resources for several years. In the early days they used systems based on CPUs. But in recent years, as GPUs moved out of the gaming industry and into supercomputing, their massively parallel structures have taken number crunching to a new level. The earlier GPUs were designed for rendering images, but they are now being tailored for other applications, such as machine learning, where they have already become indispensable tools for researchers.
“Without the availability of supercomputing resources for machine learning, we could not be successful in our experiments,” says Michael Felsberg, professor at the Computer Vision Laboratory at Linköping University. “The supercomputer alone does not solve our problems, but it is an essential part. Without the supercomputer, we would get nowhere. It would be like a chemist without a petri dish or a physicist without a clock.”
Felsberg was part of the group that helped define the requirements for Berzelius. He is also part of the award committee that decides which projects will be given time in this cluster, how the time will be allocated and how usage will be counted.
He insists that the research needs not just a big supercomputer, but the right supercomputer. “We have enormous amounts of data – terabytes – and we have to process them thousands of times. In all processing steps we have a very coherent computational structure, which means we can use a single instruction and process multiple pieces of data, and that’s the typical scenario in which GPUs are very powerful,” says Felsberg.
“The structure of the calculations is also more important than the number of calculations. Here, too, modern GPUs do exactly what is needed – they easily perform calculations on huge matrix products,” he says. “GPU-based systems were introduced in Sweden a few years ago, but in the beginning they were relatively small and difficult to access. Now we have what we need.”
Massive parallel processing and huge data transfers
“Our research doesn’t require just a single run that takes over a month. Instead, we could have up to 100 runs, each lasting two days. During these two days, enormous memory bandwidth is used and local file systems are essential,” says Felsberg.
“When machine learning algorithms run on modern supercomputers with GPUs, a lot of calculations are carried out. However, enormous amounts of data are also transferred. The bandwidth and throughput from the storage system to the computing node must be very high. Machine learning requires terabyte data sets, and a given data set must be read up to 1,000 times over a two-day run. So all nodes and the storage must be on the same bus.”
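To put a rough number on that requirement, the short sketch below estimates the sustained read rate implied by reading a data set 1,000 times in two days. The 1 TB data set size is an assumption chosen for illustration (the article only speaks of "terabyte data sets"), decimal units are assumed, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def required_read_bandwidth_gb_s(dataset_tb: float, passes: int, run_days: float) -> float:
    """Average read rate in GB/s needed to stream a data set `passes` times.

    Assumes decimal units (1 TB = 1e12 bytes, 1 GB = 1e9 bytes) and that
    reads are spread evenly over the whole run.
    """
    total_bytes = dataset_tb * 1e12 * passes
    return total_bytes / (run_days * SECONDS_PER_DAY) / 1e9


if __name__ == "__main__":
    # Assumed inputs: 1 TB data set, 1,000 passes, two-day run.
    rate = required_read_bandwidth_gb_s(dataset_tb=1.0, passes=1000, run_days=2.0)
    print(f"{rate:.1f} GB/s sustained")
```

With these assumed inputs the estimate comes to roughly 5.8 GB/s, sustained for two days, for a single job. Larger data sets or several jobs running at once scale that figure up proportionally.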
“Modern GPUs have thousands of cores,” adds Felsberg. “They all run in parallel on different data, but with the same instruction. So this is the single-instruction-multi-data concept. We have that on every chip. And then you have sets of chips on the same boards and you have sets of boards in the same machine, so you have enormous resources on the same bus. And we need that because we often split our machine learning into several nodes.
“We’re using a large number of GPUs at the same time and sharing the data and the learning across all of these resources. This gives you a real acceleration. Imagine doing this on a single chip – it would take over a month. But if you split it up across a massively parallel architecture – let’s say 128 chips – you get the machine learning result much, much faster, which means you can see and analyze the result. Based on the result, you carry out the next experiment,” he says.
“Another challenge is that the parameter spaces are so large that we cannot afford to cover the whole thing in our experiments. Instead, we need to employ smarter search strategies and heuristics in the parameter spaces to find what we need. This often requires that you know the result of previous runs, which makes this a chain of experiments rather than a series of experiments that you can run in parallel. It is therefore very important that each run is as short as possible in order to get as many runs as possible in a row.”
“With Berzelius in place, this is the first time in the 20 years I’ve been working on machine learning for computer vision that we in Sweden really have enough resources for our experiments,” says Felsberg. “In the past, computing power was always the bottleneck. Now the bottleneck lies elsewhere – a bug in the code, a faulty algorithm or a problem with the data set.”
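The idea of splitting one training step across many chips can be sketched in a few lines. The example below is a toy illustration of data-parallel training, not the software used on Berzelius: the one-parameter model, the function names and the sample data are all assumptions, and the "workers" run one after another here, whereas on a GPU cluster they would run at the same time.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split a batch into equally sized contiguous shards, one per worker."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w: float, batch: List[Sample], num_workers: int) -> float:
    """Each worker computes a gradient on its own shard; the results are averaged."""
    # On a real cluster these local gradients would be computed concurrently.
    local = [gradient(w, s) for s in shard(batch, num_workers)]
    # Averaging plays the role of the all-reduce step between chips.
    return sum(local) / len(local)


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    print(gradient(0.5, data), data_parallel_gradient(0.5, data, num_workers=2))
```

Because the shards are the same size, the averaged gradient equals the gradient computed on the whole batch, so the model learns the same thing while each worker handles only a fraction of the data.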
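The "chain of experiments" Felsberg describes can be illustrated with a very simple search loop. The sketch below is an assumption-laden toy, not his group's method: it tunes a single parameter with a step-halving local search, and `run_experiment` is a hypothetical stand-in for a full training run that returns a score to minimize.

```python
from typing import Callable, List, Tuple


def sequential_search(
    run_experiment: Callable[[float], float],
    start: float,
    step: float,
    num_runs: int,
) -> Tuple[float, List[Tuple[float, float]]]:
    """Local search in which every new run is chosen from earlier results."""
    best_x = start
    best_score = run_experiment(best_x)
    history = [(best_x, best_score)]
    runs = 1
    while runs < num_runs:
        improved = False
        # Try a step in each direction from the best setting found so far.
        for candidate in (best_x + step, best_x - step):
            if runs >= num_runs:
                break
            score = run_experiment(candidate)
            runs += 1
            history.append((candidate, score))
            if score < best_score:
                best_x, best_score = candidate, score
                improved = True
                break
        if not improved:
            step /= 2  # no neighbor was better, so refine the search
    return best_x, history


if __name__ == "__main__":
    # Hypothetical objective whose best setting is x = 3.
    best, log = sequential_search(lambda x: (x - 3.0) ** 2, start=0.0, step=1.0, num_runs=20)
    print(best, len(log))
```

Since each candidate depends on the best result so far, the runs cannot simply be launched side by side. The total waiting time is roughly the number of runs multiplied by the duration of one run, which is why shortening each run matters so much.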
The beginning of a new era in life science research
“We do research in structural biology,” says Björn Wallner, professor at Linköping University and head of the Bioinformatics department. “The aim is to find out how the various elements that make up a molecule are arranged in three-dimensional space. Once you understand this, you can develop drugs that target and bind to specific molecules.”
Most of the time, research is linked to a disease, because then you can solve an immediate problem. Sometimes the bioinformatics department at Linköping also does basic research in order to gain a better understanding of biological structures and their mechanisms.
The group uses AI to make predictions about specific protein structures. DeepMind, a company owned by Google, has done work that has revolutionized structural biology – and it relies on supercomputers.
DeepMind developed AlphaFold, an AI algorithm that was trained on very large data sets from biological experiments. The supervised training produced the “weights” of a neural network that can then be used to make predictions. AlphaFold is now open source and available to research organizations such as Björn Wallner’s team at Linköping University.
There is still a lot of uncharted territory in structural biology. AlphaFold offers a new way to find the 3-D structure of proteins, but it’s only the tip of the iceberg – and digging deeper also requires supercomputing. It is one thing to understand a protein in isolation or a protein in a static state. But figuring out how different proteins interact and what happens when they move is a whole different thing.
Every human cell contains around 20,000 proteins – and they interact. They are also flexible. Pushing out a molecule and binding a protein to something else are all actions that regulate the machinery of the cell. Proteins are also made in cells. Understanding this basic machinery is important and can lead to breakthroughs.
“Now we can achieve significantly more throughput with Berzelius and break new ground in our research,” says Wallner. “The new supercomputer even gives us the opportunity to retrain the AlphaFold algorithm. Google has lots of resources and can do lots of great things, but maybe now we can keep up a bit.
“We have only just started using the new supercomputer and need to adapt our algorithms to this huge machine in order to get the most out of it. We have to develop new methods, new software, new libraries, new training data so that we can actually get the most out of the machine,” he says.
“Researchers will expand what DeepMind has done and train new models to make predictions. We can deal with protein interactions, beyond just individual proteins, and the question of how proteins interact and how they change.”