Matrix factorization is a family of techniques from linear algebra that decompose a matrix into a product of matrices. Despite its simplicity, matrix factorization can solve high-level problems in recommender systems. In collaborative filtering, matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices, one holding user embeddings and the other holding item embeddings.
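To make the decomposition concrete, here is a minimal NumPy sketch. The sizes, the random values, and the names `user_emb` and `item_emb` are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 4, 5, 2  # toy sizes chosen for illustration

# Hypothetical embedding tables: one row per user and one row per item.
user_emb = rng.normal(size=(num_users, dim))
item_emb = rng.normal(size=(num_items, dim))

# The predicted interaction matrix is the product of the two tables.
predicted = user_emb @ item_emb.T  # shape: (num_users, num_items)

# A single prediction is the dot product of one user row and one item row.
score_user1_item3 = user_emb[1] @ item_emb[3]
```

Because the product has rank at most `dim`, the model stores `(num_users + num_items) * dim` numbers instead of `num_users * num_items`.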
One algorithm for learning the parameters of a matrix factorization model is Alternating Least Squares (ALS). ALS is highly efficient and scales linearly in the number of rows, columns, and non-zeros, which makes it well suited to large-scale problems. Even so, a single-machine implementation may not suffice for real-world matrix factorization datasets, so a distributed implementation is needed. Due to the inherently sparse nature of the problem, most distributed implementations of matrix factorization using ALS rely on off-the-shelf CPU devices.
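The alternating structure of ALS can be sketched in a few lines of NumPy. This is a textbook version for a small dense matrix with a boolean mask of observed entries and an L2 penalty `reg`. It is an illustrative assumption, not the objective or the code used by any specific library, and the function names are hypothetical.

```python
import numpy as np

def als_half_step(ratings, mask, fixed, reg):
    """Solve for one side's embeddings while the other side is held fixed."""
    dim = fixed.shape[1]
    solved = np.zeros((ratings.shape[0], dim))
    for row in range(ratings.shape[0]):
        observed = mask[row]                    # which columns this row has data for
        f = fixed[observed]                     # embeddings of the observed columns
        gram = f.T @ f + reg * np.eye(dim)      # d x d normal-equation matrix
        rhs = f.T @ ratings[row, observed]
        solved[row] = np.linalg.solve(gram, rhs)
    return solved

def als(ratings, mask, dim=2, reg=0.1, epochs=20, seed=0):
    """Alternate between solving for user rows and item rows."""
    rng = np.random.default_rng(seed)
    users = np.zeros((ratings.shape[0], dim))
    items = rng.normal(scale=0.1, size=(ratings.shape[1], dim))
    for _ in range(epochs):
        users = als_half_step(ratings, mask, items, reg)
        items = als_half_step(ratings.T, mask.T, users, reg)
    return users, items

def regularized_loss(ratings, mask, users, items, reg):
    """Squared error on observed entries plus the L2 penalty."""
    err = (ratings - users @ items.T)[mask]
    return float(err @ err + reg * (np.sum(users**2) + np.sum(items**2)))
```

Each half step is an exact least-squares solve, so the regularized loss never increases from one epoch to the next. The per-row systems are independent of each other, which is what makes the method easy to parallelize.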
Recent breakthroughs in deep learning have sparked a new wave of research and advancement in hardware accelerators. As training datasets and model sizes grew, new strategies for distributing computation and model weights were explored to make domain-specific hardware acceleration practical for ever larger workloads. Tensor Processing Units (TPUs) are a notable example of such accelerators. A TPU v3 pod can provide more than 100 petaflops of processing power and 32 TiB of high-bandwidth memory, distributed across 2048 individual devices connected in a 2D toroidal mesh network over high-speed links.
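A quick back-of-the-envelope calculation shows what these pod-level figures mean per device. The even split across devices and the embedding dimension of 128 are assumptions made purely for illustration.

```python
# Figures quoted above for a TPU v3 pod.
POD_MEMORY_TIB = 32
POD_PETAFLOPS = 100   # quoted as "more than 100", so results are lower bounds
NUM_DEVICES = 2048

# Per-device share, assuming resources are split evenly across devices.
memory_per_device_gib = POD_MEMORY_TIB * 1024 / NUM_DEVICES
teraflops_per_device = POD_PETAFLOPS * 1000 / NUM_DEVICES

# Hypothetical workload: float32 embeddings of dimension 128 (not from the source).
EMBEDDING_DIM = 128
BYTES_PER_FLOAT32 = 4
bytes_per_row = EMBEDDING_DIM * BYTES_PER_FLOAT32
max_rows_in_pod = POD_MEMORY_TIB * 2**40 // bytes_per_row

print(f"{memory_per_device_gib:.0f} GiB per device")
print(f"{teraflops_per_device:.1f} TFLOPS per device")
print(f"{max_rows_in_pod:,} embedding rows at most")
```

Under these assumptions each device holds 16 GiB, and the row count is only an upper bound, since training data and intermediate buffers also need memory.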
While TPUs are very attractive for Stochastic Gradient Descent (SGD) based methods, it is not clear whether a high-performance implementation of ALS can be developed for a large cluster of TPU devices. TPUs offer domain-specific speedups, especially for deep learning workloads, which involve many dense matrix multiplications. Traditional data-parallel workloads also benefit from significant speedups.
The problem, then, is to create an ALS design that uses the TPU architecture efficiently and scales to very large matrix factorization problems.
Because most distributed implementations of matrix factorization run on off-the-shelf CPU devices, it remains an open question whether a high-performance implementation can be developed on a large cluster of hardware accelerators. Several properties of TPUs suggest that it can:
- A TPU pod has enough distributed memory to store very large sharded embedding tables.
- TPUs are designed for workloads that benefit from data parallelism, which is exactly what is needed to solve the large batches of systems of linear equations that arise in ALS.
- TPU chips are directly connected by dedicated high-bandwidth, low-latency links, which makes it practical to read from and write to a large distributed embedding table stored in TPU memory.
- Since any node failure can cause the training process to fail, traditional ML workloads require a highly reliable distributed setup, a requirement that a cluster of TPUs fulfills.
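The idea of a sharded embedding table can be illustrated with a small NumPy simulation. The contiguous, equally sized layout and the helper names `shard_table` and `gather_rows` are assumptions for this sketch. The source does not specify how any particular library lays out its shards.

```python
import numpy as np

def shard_table(table, num_cores):
    """Split rows into num_cores contiguous, equally sized shards (zero-padded)."""
    rows_per_core = -(-table.shape[0] // num_cores)  # ceiling division
    padded = np.zeros((rows_per_core * num_cores, table.shape[1]), dtype=table.dtype)
    padded[: table.shape[0]] = table
    return padded.reshape(num_cores, rows_per_core, table.shape[1])

def gather_rows(shards, row_ids):
    """Fetch rows by global id, locating the owning core and the local offset."""
    rows_per_core = shards.shape[1]
    core = row_ids // rows_per_core
    offset = row_ids % rows_per_core
    return shards[core, offset]

# Toy table with 10 rows of dimension 3, spread over 4 simulated cores.
table = np.arange(30, dtype=np.float64).reshape(10, 3)
shards = shard_table(table, num_cores=4)
rows = gather_rows(shards, np.array([0, 4, 9]))
```

On real hardware the first axis of `shards` would live on different devices, so a gather like this turns into communication over the interconnect. That is why the link bandwidth and latency mentioned in the list matter.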
To address this problem, researchers at Google developed a high-performance implementation of matrix factorization using ALS that is both fast and scalable, and they discuss the design choices behind its architecture. The result is ALX, an open-source library for distributed matrix factorization using Alternating Least Squares. Written in JAX, the library makes it possible to solve very large problems more efficiently and with fewer resources than before.
The proposed method uses both model and data parallelism: each TPU core stores a part of the embedding table and trains on a different part of the data in mini-batches. To further investigate large-scale matrix factorization algorithms and demonstrate the scalability of their implementation, the researchers also created and published a real-world web link prediction dataset called WebGraph. WebGraph is an extensive web link prediction dataset, and its variants make it possible to study the scaling properties of ALX as the size of real problems increases. Any future improvement in matrix factorization should also be evaluated for scalability. A scaling analysis of ALX on all variants of the WebGraph dataset demonstrates the high parallel efficiency of the proposed implementation.
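The combination of model and data parallelism can be simulated on a single machine. This is a rough NumPy sketch, not the ALX implementation: the cores are plain loop iterations, every user is assumed to have the same number of observed items, and the names `solve_batch` and `user_pass` are hypothetical.

```python
import numpy as np

def solve_batch(item_rows, ratings, reg):
    """Solve one small regularized least-squares system per user in the batch.

    item_rows: (batch, n_obs, dim) gathered item embeddings
    ratings:   (batch, n_obs) observed values
    """
    dim = item_rows.shape[-1]
    gram = np.einsum("bnd,bne->bde", item_rows, item_rows) + reg * np.eye(dim)
    rhs = np.einsum("bnd,bn->bd", item_rows, ratings)
    return np.linalg.solve(gram, rhs[..., None])[..., 0]

def user_pass(item_shards, batches, reg=0.1):
    """One pass over users with the item table split across simulated cores.

    item_shards: (num_cores, items_per_core, dim), one slice per core
    batches[c]:  (item_ids, ratings) for core c, each of shape (batch, n_obs)
    """
    num_cores, items_per_core, _ = item_shards.shape
    solved = []
    for core in range(num_cores):  # on real hardware these would run in parallel
        item_ids, ratings = batches[core]
        # Gather the needed item rows from whichever core owns them.
        gathered = item_shards[item_ids // items_per_core, item_ids % items_per_core]
        solved.append(solve_batch(gathered, ratings, reg))
    return np.stack(solved)  # (num_cores, batch, dim)

# Toy setup: 4 cores, 5 items per core, 3 users per core, 6 observations each.
rng = np.random.default_rng(0)
item_shards = rng.normal(size=(4, 5, 2))
batches = [(rng.integers(0, 20, size=(3, 6)), rng.normal(size=(3, 6))) for _ in range(4)]
user_emb = user_pass(item_shards, batches)
```

The table split is the model-parallel part and the per-core batches are the data-parallel part. As a simplification, each core keeps the user embeddings it solves. A real system would also write results back to whichever shard owns each row.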
WebGraph Dataset: https://www.tensorflow.org/datasets/catalog/web_graph