A Damped Hessian-Free Newton--Conjugate Gradient Method for Weighted Multiclass Neural Classification

Andy Irawan; Zainal Abidin; Mohammad Jamhuri

doi:10.18860/cauchy.v11i1.40243


Abstract

This study presents a deterministic damped Hessian-free Newton--CG method for weighted multiclass neural classification. The method is built from a weighted categorical cross-entropy objective, a damped local quadratic model, and a matrix-free curvature representation through Hessian--vector products. The search direction is computed by an inexact conjugate gradient solve, while Armijo backtracking and adaptive damping are used to improve stability. The method is implemented for the classification of academic predicate categories using preprocessed student data with mixed categorical and numerical features. Its numerical behavior is compared with SGD with momentum, RMSProp, and Adam under the same loss, initialization, and network architecture. The proposed method is computationally feasible, attains the best overall weighted test-set performance among the compared methods, and exhibits a distinct optimization trajectory driven by curvature-informed updates. These results show that a damped Hessian-free formulation provides a mathematically transparent, reproducible, and practically competitive framework for second-order optimization in multiclass neural classification.
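The ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. To stay short it assumes a linear softmax model instead of a neural network, so the Hessian-vector product has a closed form. The damping update (a reduction-ratio rule with thresholds 0.25 and 0.75), the CG tolerance, the Armijo constant, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss_grad(W, X, Y, s):
    """Weighted categorical cross-entropy, its gradient, and the probabilities."""
    P = softmax(X @ W)
    loss = -(s * np.log((P * Y).sum(axis=1))).sum() / s.sum()
    grad = X.T @ (s[:, None] * (P - Y)) / s.sum()
    return loss, grad, P

def hess_vec(V, X, P, s, lam):
    """Damped curvature product (H + lam*I) V, computed without forming H."""
    dZ = X @ V
    dP = P * (dZ - (P * dZ).sum(axis=1, keepdims=True))   # softmax Jacobian applied to dZ
    return X.T @ (s[:, None] * dP) / s.sum() + lam * V

def cg(matvec, B, rel_tol=1e-2, max_iter=50):
    """Inexact conjugate gradient for matvec(D) = B, stopped at a relative residual."""
    D = np.zeros_like(B)
    R = B.copy()
    Pdir = R.copy()
    rs = np.vdot(R, R)
    stop = (rel_tol ** 2) * rs
    for _ in range(max_iter):
        if rs <= stop:
            break
        HP = matvec(Pdir)
        curv = np.vdot(Pdir, HP)
        if curv <= 0:                            # guard against non-positive curvature
            break
        alpha = rs / curv
        D += alpha * Pdir
        R -= alpha * HP
        rs_new = np.vdot(R, R)
        Pdir = R + (rs_new / rs) * Pdir
        rs = rs_new
    return D

def damped_newton_cg(X, Y, class_w, iters=30, lam=1.0, c1=1e-4):
    s = Y @ class_w                              # weight of each sample's true class
    W = np.zeros((X.shape[1], Y.shape[1]))
    history = []
    for _ in range(iters):
        loss, G, P = loss_grad(W, X, Y, s)
        history.append(loss)
        D = cg(lambda V: hess_vec(V, X, P, s, lam), -G)
        slope = np.vdot(G, D)
        t = 1.0                                  # Armijo backtracking on the true loss
        while t > 1e-10 and loss_grad(W + t * D, X, Y, s)[0] > loss + c1 * t * slope:
            t *= 0.5
        step = t * D
        new_loss = loss_grad(W + step, X, Y, s)[0]
        # Reduction predicted by the damped quadratic model (assumed damping rule).
        pred = -np.vdot(G, step) - 0.5 * np.vdot(step, hess_vec(step, X, P, s, lam))
        if pred > 0:
            rho = (loss - new_loss) / pred
            if rho > 0.75:
                lam *= 2.0 / 3.0                 # model is reliable: relax damping
            elif rho < 0.25:
                lam *= 1.5                       # model is poor: increase damping
        W = W + step
    return W, history
```

Here `X` is a feature matrix with a bias column, `Y` is a one-hot label matrix, and `class_w` holds one weight per class (for example, inverse class frequencies). The returned `history` records the weighted loss before each update, which makes it easy to inspect how the line search and damping shape the trajectory.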

Keywords

conjugate gradient; Hessian-free optimization; multiclass classification; neural networks; second-order methods





This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
