Imagine for a moment that the millions of computer chips inside the servers that power the world’s largest data centers suffered from rare, almost undetectable errors, and that the only way to find those errors was to throw the chips at giant computing problems that would have been unthinkable just a decade ago.
As the tiny switches in computer chips have shrunk to the width of a few atoms, chip reliability has become another concern for the people who operate the world’s largest networks. Amazon, Facebook, Twitter and many other companies have experienced surprise outages over the past year.
The failures have several causes, such as programming errors and network overload. But there are growing concerns that cloud computing networks, which have grown larger and more complex, still depend at the most fundamental level on computer chips that are now less reliable and, in some cases, less predictable.
Over the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, lay not in the software but somewhere in the hardware, which was made by a variety of companies. Google declined to comment on its study, while Facebook did not respond to requests for comment on its own.
“You see these silent bugs, which basically come from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. According to Dr. Mitra, there is a growing belief that manufacturing defects are tied to these so-called silent errors, which are not easy to catch.
Increasingly, researchers worry that they are uncovering these rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.
Companies running large data centers began reporting systematic problems more than a decade ago. In 2015, in the technical publication IEEE Spectrum, a group of computer scientists studying hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google’s millions of computers encountered errors that could not be detected and that caused the machines to shut down unexpectedly.
In a microprocessor with billions of transistors — or a computer memory board made up of trillions of tiny switches, each capable of storing a 1 or 0 — even the smallest error can disrupt systems that now routinely perform billions of calculations every second.
At the dawn of the semiconductor age, engineers worried that cosmic rays could occasionally flip a single transistor and change the outcome of a calculation. Now they fear that the switches themselves are becoming less and less reliable. The Facebook researchers even argue that the switches are becoming more prone to wear, and that the lifespan of computer memories and processors may be shorter than previously thought.
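To make the stakes concrete, here is a minimal illustrative sketch, not drawn from the article and using purely hypothetical values: flipping a single bit of a stored 64-bit number, as a stray cosmic ray or a failing transistor might, can change a result by more than a trillion.

```python
# Illustrative sketch: a single flipped bit in a 64-bit value can change
# the outcome of a calculation enormously. The values here are hypothetical.

def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted."""
    return value ^ (1 << position)

balance = 1_000                     # a value held in memory
corrupted = flip_bit(balance, 40)   # one "switch" flips

print(corrupted - balance)          # off by 2**40 = 1,099,511,627,776
```

The same flip applied twice restores the original value, which is why such errors can vanish before anyone can reproduce them.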
There is increasing evidence that the problem is getting worse with each new generation of chips. A 2020 report by chipmaker Advanced Micro Devices found that the most advanced computer memory chips of the time were about 5.5 times less reliable than the previous generation. AMD did not respond to requests for comment on the report.
Tracking down these bugs is challenging, said David Ditzel, a veteran hardware engineer who is chairman and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications in Mountain View, California. He said his company’s new chip, which is about to hit the market, had 1,000 processors made up of 28 billion transistors.
He compares the chip to an apartment building that would span the entire United States. Extending Mr. Ditzel’s metaphor, Dr. Mitra said that finding the new bugs is a little like finding a single faucet, in one apartment in that building, that malfunctions only when a bedroom light is on and the apartment door is open.
Until now, computer designers have dealt with hardware errors by building special circuits into their chips that automatically detect and correct faulty data. Such errors were once considered vanishingly rare. But a few years ago, Google production teams began reporting errors that were incredibly difficult to diagnose: according to their report, the calculation errors were intermittent and hard to reproduce.
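The principle behind those error-correcting circuits can be sketched in a few lines. The following is a hedged illustration using the classic Hamming(7,4) code, in which four data bits are stored alongside three parity bits so that any single flipped bit can be located and repaired; production chips use far more elaborate codes, but the idea is the same.

```python
# Sketch of the idea behind error-correcting circuits, using the classic
# Hamming(7,4) code: 4 data bits plus 3 parity bits, able to detect and
# correct any single flipped bit. This is an illustration, not chip logic.

def hamming_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(c: list[int]) -> list[int]:
    """Detect and fix a single flipped bit, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based index of the bad bit, or 0
    if syndrome:
        c = c.copy()
        c[syndrome - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]
```

The circuits the article describes do this transparently, in hardware, on every memory access; the silent errors now worrying researchers are the ones that slip past such schemes.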
A team of researchers tried to track down the problem and published their findings last year. They concluded that the company’s vast data centers, built from computer systems based on millions of processor cores, were experiencing new failures that probably stemmed from a combination of factors: smaller transistors nearing physical limits, and inadequate testing.
In their paper “Cores That Don’t Count,” the Google researchers noted that the problem was so challenging that they had already devoted the equivalent of several decades of engineering time to solving it.
Modern processor chips contain dozens of processor cores, computing engines that allow tasks to be split up and solved in parallel. The researchers found that a tiny subset of the cores produced inaccurate results on rare occasions, and only under certain conditions. They described the behavior as sporadic. In some cases, the cores produced errors only when computing speed or temperature changed.
According to Google, the increasing complexity of the processor design was a major cause of the error. But engineers also said smaller transistors, three-dimensional chips, and new designs that only cause errors in certain cases have all contributed to the problem.
In a similar paper published last year, a group of Facebook researchers found that some processors passed manufacturers’ tests, but then failed in the field.
Intel executives said they are familiar with the research reports from Google and Facebook and are working with both companies to develop new methods of detecting and correcting hardware failures.
Bryan Jorgensen, vice president of Intel’s data platforms group, said the researchers’ claims are correct and that “the challenge they are putting to the industry is in the right place.”
He said Intel recently started a project to help develop standard open source software for data center operators. The software would allow them to find and correct hardware errors that were not detected by the built-in circuitry in chips.
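The article does not describe how such detection software works, but the basic screening idea can be sketched: run the same deterministic workload many times (in practice, pinned to each core in turn) and flag any run whose result disagrees with the majority. Everything below is a hypothetical illustration, not Intel’s actual tooling.

```python
# Hedged sketch of silent-error screening: identical deterministic runs
# must agree, so any disagreement points at faulty hardware. All names and
# workloads here are hypothetical illustrations.

from collections import Counter

def checksum_workload(seed: int) -> int:
    """A deterministic arithmetic workload; any two runs must agree."""
    x = seed
    for _ in range(10_000):
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return x

def screen(results: list[int]) -> list[int]:
    """Return the indices of runs that disagree with the majority result."""
    majority, _ = Counter(results).most_common(1)[0]
    return [i for i, r in enumerate(results) if r != majority]

# On healthy hardware every run agrees and nothing is flagged:
runs = [checksum_workload(42) for _ in range(8)]
assert screen(runs) == []

# A silently corrupted result stands out immediately:
runs[3] ^= 1                 # simulate a single-bit error in one run
assert screen(runs) == [3]
```

The hard part in practice, as the Google and Facebook papers note, is that the faulty cores misbehave only sporadically, so a screen like this must be repeated across many workloads, frequencies and temperatures.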
The challenge was underscored last year when several Intel customers quietly issued warnings about undetected errors created by their systems. Lenovo, the world’s largest maker of personal computers, informed its customers that design changes in several generations of Intel’s Xeon processors meant that the chips might generate a larger number of uncorrectable errors than earlier Intel microprocessors.
Intel has not spoken publicly about the problem, but Mr. Jorgensen acknowledged it and said it had since been corrected with a design change.
Computer engineers are divided on how to respond. One common response is a call for new software that proactively watches for hardware failures and makes it possible for system operators to remove hardware when it begins to degrade. That has created an opportunity for new start-ups offering software that monitors the health of the underlying chips in data centers.
One such company is TidalScale, a Los Gatos, California company that makes specialized software for companies trying to minimize hardware failures. Its CEO, Gary Smerdon, suggested that TidalScale and others faced a formidable challenge.
“It’s going to be a bit like changing an engine while an airplane is still flying,” he said.