Inexact Generalized Gauss–Newton–CG for Binary Cross-Entropy Minimization

Mohammad Jamhuri, Silvi Puspita Sari, Siti Amiroch, Juhari Juhari, Vivi Aida Fitria

Abstract


Binary cross-entropy (BCE) minimization is a standard objective in probabilistic binary classification, yet practical training pipelines often rely on first-order methods whose performance can be sensitive to step-size choices and may require many iterations to reach low-loss solutions. This paper studies an inexact curvature-based solver that combines a (generalized) Gauss–Newton approximation with conjugate gradient (CG) inner iterations for minimizing the regularized BCE objective in full-batch logistic regression. At each outer iteration, the method computes a descent direction by approximately solving a damped Gauss–Newton system in a matrix-free manner via repeated matrix–vector products with the design matrix X and its transpose X⊤, and terminates CG according to a relative-residual inexactness rule. Numerical experiments on three benchmark datasets show that the proposed Inexact GGN–CG can substantially reduce the number of outer iterations on smaller numerical data, while remaining competitive in predictive performance, and can improve both validation and test mean BCE on larger mixed-type data after one-hot encoding. In particular, on Adult Census Income the method achieves lower test mean BCE (0.3176 ± 0.0044) and higher F1-score (0.6623 ± 0.0066) than Adam and gradient descent under the same regularization-selection protocol, at the cost of additional CG work. These results highlight how damping and inexactness jointly govern the trade-off between curvature-solve effort, wall-clock time, and achieved BCE values in deterministic logistic-regression training.
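To make the outer/inner structure described above concrete, the NumPy sketch below illustrates one damped, matrix-free GGN–CG step for regularized mean BCE in logistic regression, using only products with X and X⊤ and a relative-residual CG stopping rule. This is a minimal illustration under assumptions of our own, not the authors' implementation: the names (ggn_cg_step, mu for the damping, eta for the inexactness tolerance), the fixed outer-iteration count, the full step without a line search, and the synthetic data are all hypothetical, and the paper's regularization-selection protocol is not reproduced.

import numpy as np

def sigmoid(z):
    # Logistic link, with clipping to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(z, -35.0, 35.0)))

def bce_loss(w, X, y, lam):
    # Regularized mean binary cross-entropy (small eps guards the logs)
    p = sigmoid(X @ w)
    eps = 1e-12
    return (-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            + 0.5 * lam * (w @ w))

def ggn_cg_step(w, X, y, lam, mu, eta, max_cg=200):
    # One outer iteration: approximately solve (G + mu*I) d = -g by CG,
    # where G = (1/n) X^T D X + lam*I is the generalized Gauss-Newton matrix
    # and D = diag(p * (1 - p)) comes from the logistic link.
    n = X.shape[0]
    p = sigmoid(X @ w)
    g = X.T @ (p - y) / n + lam * w          # gradient of the regularized mean BCE
    d_diag = p * (1.0 - p)                   # GGN curvature weights

    def Gv(v):
        # Matrix-free product (G + mu*I) v using only X @ v and X.T @ (...)
        return X.T @ (d_diag * (X @ v)) / n + (lam + mu) * v

    # Conjugate gradient with a relative-residual (inexactness) stopping rule
    d = np.zeros_like(w)
    r = -g                                   # residual of (G + mu*I) d = -g at d = 0
    q = r.copy()
    tol = eta * np.linalg.norm(g)
    for _ in range(max_cg):
        if np.linalg.norm(r) <= tol:
            break
        Gq = Gv(q)
        alpha = (r @ r) / (q @ Gq)
        d += alpha * q
        r_new = r - alpha * Gq
        beta = (r_new @ r_new) / (r @ r)
        q = r_new + beta * q
        r = r_new
    return w + d                             # full step; no line search in this sketch

# Hypothetical usage on synthetic data (not one of the paper's benchmarks)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = (X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(500) > 0).astype(float)
w = np.zeros(20)
for _ in range(15):
    w = ggn_cg_step(w, X, y, lam=1e-3, mu=1e-2, eta=0.1)
print(f"final regularized mean BCE: {bce_loss(w, X, y, 1e-3):.4f}")

Tightening eta (smaller relative-residual tolerance) spends more CG products per outer step, while the damping mu keeps the inner system well conditioned; these are the two knobs whose interplay the abstract refers to.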

Keywords


generalized Gauss–Newton; conjugate gradient; inexact methods; binary cross-entropy; logistic regression; second-order optimization

DOI: https://doi.org/10.18860/jrmm.v5i2.34739
