RELAX Doctoral Network

Why Numerical Accuracy Matters for Reproducibility in Deep Learning Systems

By Hue Dang

When I first started learning about machine learning and deep learning, I often followed project tutorials, coding along with the steps the author had outlined and using the same inputs and algorithms. Yet sometimes my model would perform worse than the tutorial author’s model. Maybe you’ve been in a similar situation, too. Another scenario where perfect reproducibility matters is when you share an experiment with colleagues so they can achieve the same results, yet they cannot reproduce the high performance you reported. So, what causes these differences? In this post, we will explore the factors affecting the reproducibility of machine learning and deep learning models in such situations and discuss what we can do to address them.

Numerical Errors: A Challenge to Reproducibility

You may wonder what exactly reproducibility means in deep learning. Reproducibility in machine learning means being able to repeatedly run your algorithm on a given dataset and obtain the same, or very similar, results.
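To make this concrete, here is a minimal PyTorch sketch of a reproducible computation (the function train_step is a toy illustration, not code from a real project): with the random seed fixed, two runs on the same machine and library version produce identical results.

```python
import torch

def train_step(seed: int) -> torch.Tensor:
    # Fix the random number generator so that weight initialisation
    # and input sampling are identical across runs.
    torch.manual_seed(seed)
    model = torch.nn.Linear(4, 1)
    x = torch.randn(8, 4)
    return model(x).sum()

# Same seed, same machine, same library versions: the runs match.
print(torch.equal(train_step(seed=0), train_step(seed=0)))  # True
```

As we will see below, even this level of control is not always enough: different hardware or different numerical implementations can break bit-for-bit agreement.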

Recently, applications of deep neural networks (DNNs) have grown rapidly, as they are able to solve computationally hard problems. A report by McKinsey estimates that AI-based applications have a potential market value of between $3.5 trillion and $5.8 trillion annually [1]. These technologies are increasingly used in safety-critical applications such as medical imaging [2] and self-driving cars [3], where user safety and information security are paramount. Unlike traditional software systems, which are programmed with deterministic rules (e.g., if/else logic), the deep learning (DL) models inside AI-based systems are built by stochastic training algorithms, so their behaviour may not be reproducible or trustworthy [4].

Though significant research effort has gone into verifying the behaviour of machine learning systems and developing tailored training techniques, most methods developed so far do not account for the slight behavioural differences that arise from different numerical implementations. Unfortunately, after propagating through hundreds of neurons and dozens of layers, these small differences can cause significant changes in system outputs, and they have been identified as a cause of divergent neural network behaviour. For instance, Sun et al. [5] showed that instabilities arising from floating-point arithmetic errors during training, even in the least significant digits, can be amplified dramatically, leading to variance in test accuracy comparable to that caused by stochastic gradient descent (SGD).

Sources of numerical errors

Let’s look into the primary sources of numerical instability in deep learning and why they matter for the reproducibility of deep learning systems.

Different software implementations

The first source of numerical differences between neural networks is variation across software implementations [6]. For example, the order of operations may change when an algorithm is implemented on different platforms. We might think the order doesn’t matter, because mathematically the result should be the same. But with floating-point numbers of limited precision, we do not get exactly the same results, as shown in the example below. During training, these tiny differences accumulate, and we may end up with significantly different outputs.

Figure 1: Different operation orders lead to different results.
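To see the effect Figure 1 illustrates, here is a minimal Python sketch (the values are hypothetical, chosen to make the rounding visible) showing that floating-point addition is not associative:

```python
import numpy as np

# Three float32 values whose sum depends on the evaluation order.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.1)

# Mathematically (a + b) + c == a + (b + c), but not in float32:
print((a + b) + c)  # 0.1 -- a and b cancel first, so c survives
print(a + (b + c))  # 0.0 -- c is absorbed by the much larger b
```

Parallel reductions, as used on GPUs, change the summation order between runs and devices, so exactly this effect shows up in practice.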

Different numerical precision

Using different representations for continuous numbers across hardware and software platforms is a common source of variance in the performance of neural networks [7]. Neural networks are often trained on servers with high computational capacity but then deployed on personal devices without the same capabilities. The gap is usually bridged by reducing numerical precision or by pruning the network; however, in most cases this loss of precision comes with no guarantees on the resulting network’s behaviour.
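As a small illustration of the cost of reduced precision, consider the same summation carried out in float32 and float16 with NumPy (the exact outputs depend on NumPy’s summation order, so treat the comments as indicative):

```python
import numpy as np

# Sum 0.1 ten thousand times; the exact mathematical answer is 1000.
x = np.full(10_000, 0.1)

print(x.astype(np.float32).sum())  # ~1000.0, with a tiny rounding residue
print(x.astype(np.float16).sum())  # visibly off: 0.1 is stored as
                                   # ~0.0999756, and every partial sum
                                   # is rounded back to float16
```

A network whose outputs sit close to a decision boundary can flip its prediction under this kind of drift, which is why precision reduction without behavioural guarantees is risky.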

Different hardware specifications

Deploying neural networks on hardware with different specifications (e.g., those with higher soft-error rates) may yield differences in the behaviour of deep learning systems [8].

Figure 2: A single bit-flip error leads the DNN to misclassify an image [8]

Guanpeng Li et al. [8] simulated the consequences of soft errors in DNN systems in the example above, running the same object-detection model in two scenarios. The left image shows the behaviour of the DNN under fault-free execution: the model classifies an upcoming object as a transport truck and applies the brakes in time to avoid a collision. The right image simulates the model deployed on hardware with a high error rate: due to a soft error in the deep learning system, the truck is misclassified as a bird, and the braking action may not be applied in time to avoid the collision, especially when the car is travelling at high speed.
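The fault-injection tooling in [8] is far more sophisticated, but the core event, a single flipped bit in a stored weight, is easy to sketch in Python (the helper flip_bit below is illustrative, not taken from the paper):

```python
import numpy as np

def flip_bit(value: float, bit: int) -> np.float32:
    """Flip one bit (0-31) of the binary representation of a float32."""
    as_int = np.float32(value).view(np.uint32)
    return np.uint32(as_int ^ np.uint32(1 << bit)).view(np.float32)

w = np.float32(0.5)      # a typical trained weight
print(flip_bit(w, 3))    # low mantissa bit: ~0.5000005, essentially harmless
print(flip_bit(w, 30))   # high exponent bit: ~1.7e38, the weight explodes
```

Whether a flip matters therefore depends heavily on which bit is hit; understanding this kind of error propagation is precisely the goal of [8].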

Next steps

These situations motivate the development of new methods that ensure the reliability and robustness of deep learning systems against numerical errors in different scenarios. This is the problem my project addresses: we aim to understand what happens in a deep learning system when the network is deployed on different hardware, uses different software implementations, or encounters perturbations in its inputs, weights, and architecture.

There are several promising directions for addressing the problem of numerical errors in deep learning systems, and I will return to them in a future post.

Conclusion

In this post, we examined the issue of reproducibility in deep learning and how it is affected by numerical accuracy problems. By addressing the sources of numerical errors, we can develop more reliable and robust deep learning models, thereby improving their trustworthiness and effectiveness across applications.

Stay tuned for the next blog post, where we will explore some of the methods that have been developed to tackle this issue. See you next time!

References
[1] McKinsey & Company. Notes from the AI Frontier: Insights from Hundreds of Use Cases. https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning (accessed August 2021).
[2] Kenji Suzuki. Overview of deep learning in medical imaging. Radiological Physics and Technology, 10(3):257–273, 2017.
[3] Qing Rao and Jelena Frtunikj. Deep learning for self-driving cars: Chances and challenges. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pages 35–38, 2018.
[4] Boyuan Chen. Towards Training Reproducible Deep Learning Models.
[5] Yuxin Sun, Dong Lao, Ganesh Sundaramoorthi, and Anthony Yezzi. Surprising instabilities in training deep networks and a theoretical analysis. Advances in Neural Information Processing Systems, 35:19567–19578, 2022.
[6] Karner, C., Kazeev, V., and Petersen, P. C. Limitations of neural network training due to numerical instability of backpropagation. Advances in Computational Mathematics, 50(1):14, 2024.
[7] Tambe, T., Yang, E.-Y., Wan, Z., Deng, Y., Reddi, V. J., Rush, A., Brooks, D., and Wei, G.-Y. AdaptivFloat: A floating-point based data type for resilient deep learning inference. arXiv preprint arXiv:1909.13271, 2019.
[8] Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, and Stephen W. Keckler. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, Denver, Colorado, November 2017. ACM.