Lipschitzness effect of a loss function on generalization performance of deep neural networks trained by Adam and AdamW optimizers

Document Type: Original Article

Authors

Department of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Iran

Abstract

The generalization performance of deep neural networks with respect to the optimization algorithm is a major concern in machine learning, and it can be affected by many factors. In this paper, we theoretically prove that the Lipschitz constant of the loss function is an important factor in reducing the generalization error of the model produced by Adam or AdamW. The result can serve as a guideline for choosing the loss function when the optimization algorithm is Adam or AdamW. In addition, to evaluate the theoretical bound in a practical setting, we consider the human age estimation problem from computer vision. To better assess generalization, the training and test datasets are drawn from different distributions. Our experimental evaluation shows that a loss function with a smaller Lipschitz constant and a smaller maximum value improves the generalization of models trained by Adam or AdamW.
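To make the quantity in the abstract concrete, the following is a minimal numerical sketch (not the paper's derivation): it estimates the Lipschitz constant and maximum value of two common regression losses with respect to the prediction over a bounded age range, illustrating why an MAE-style loss has a much smaller constant and maximum than MSE on the same range. The age range [0, 100], the fixed true label, and the choice of MAE vs. MSE are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's proof): estimate the Lipschitz
# constant and maximum of a loss over a bounded prediction range.
def lipschitz_and_max(loss, y_true=50.0, lo=0.0, hi=100.0, n=100_001):
    """Finite-difference estimate of the Lipschitz constant and max of `loss`."""
    y_hat = np.linspace(lo, hi, n)
    vals = loss(y_hat, y_true)
    slopes = np.abs(np.diff(vals) / np.diff(y_hat))  # local absolute slopes
    return slopes.max(), vals.max()

mae = lambda y_hat, y: np.abs(y_hat - y)   # |slope| = 1 everywhere
mse = lambda y_hat, y: (y_hat - y) ** 2    # |slope| = 2|y_hat - y|, grows with the range

for name, loss in [("MAE", mae), ("MSE", mse)]:
    L, M = lipschitz_and_max(loss)
    print(f"{name}: Lipschitz constant ~ {L:.2f}, max value ~ {M:.2f}")
# MAE: Lipschitz constant ~ 1.00,   max value ~ 50.00
# MSE: Lipschitz constant ~ 100.00, max value ~ 2500.00
```

Under the abstract's claim, the bounded slope and bounded maximum of the MAE-style loss are exactly the properties that tighten the generalization bound for models trained by Adam or AdamW.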

Keywords

