A Damped Hessian-Free Newton–Conjugate Gradient Method for Weighted Multiclass Neural Classification

Andy Irawan, Zainal Abidin, Mohammad Jamhuri

Abstract


This study presents a deterministic damped Hessian-free Newton–CG method for weighted multiclass neural classification. The method is built from a weighted categorical cross-entropy objective, a damped local quadratic model, and a matrix-free curvature representation through Hessian–vector products. The search direction is computed by an inexact conjugate gradient solve, while Armijo backtracking and adaptive damping are used to improve stability. The method is implemented for the classification of academic predicate categories using preprocessed student data with mixed categorical and numerical features. Its numerical behavior is compared with SGD with momentum, RMSProp, and Adam under the same loss, initialization, and network architecture. The proposed method is computationally feasible, attains the best overall weighted test-set performance among the compared methods, and exhibits a distinct optimization trajectory driven by curvature-informed updates. These results show that a damped Hessian-free formulation provides a mathematically transparent, reproducible, and practically competitive framework for second-order optimization in multiclass neural classification.
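
For concreteness, the sketch below illustrates the kind of damped Hessian-free Newton–CG step the abstract describes, reduced to a toy weighted softmax-regression problem in NumPy. Everything in it is an illustrative assumption rather than the paper's implementation: the data, the per-class weights, the damping schedule, the Armijo constants, and the CG budget are invented, and the Hessian–vector product uses a central finite difference of the gradient as a stand-in for the exact Pearlmutter R-operator.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 3                       # samples, features, classes
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
w_class = np.array([1.0, 2.0, 1.5])       # illustrative per-class weights

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # stabilized softmax
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def loss(theta):
    # Weighted categorical cross-entropy over a linear (softmax) model,
    # used here as a stand-in for the paper's network.
    P = softmax(X @ theta.reshape(d, k))
    return -(w_class[y] * np.log(P[np.arange(n), y])).mean()

def grad(theta):
    P = softmax(X @ theta.reshape(d, k))
    T = np.zeros((n, k))
    T[np.arange(n), y] = 1.0
    return (X.T @ (w_class[y][:, None] * (P - T)) / n).ravel()

def hvp(theta, v, eps=1e-5):
    # Matrix-free curvature: central finite difference of the gradient,
    # standing in for the exact Pearlmutter Hessian-vector product.
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)

def cg(theta, g, lam, iters=50, tol=1e-10):
    # Inexact conjugate gradient on the damped system (H + lam*I) p = -g.
    p = np.zeros_like(g)
    r = -g.copy()
    s = r.copy()
    rr = r @ r
    for _ in range(iters):
        Hs = hvp(theta, s) + lam * s
        a = rr / (s @ Hs)
        p += a * s
        r -= a * Hs
        rr_new = r @ r
        if rr_new < tol:
            break
        s = r + (rr_new / rr) * s
        rr = rr_new
    return p

theta = np.zeros(d * k)
lam = 1.0                                 # initial damping (illustrative)
for it in range(20):
    g = grad(theta)
    p = cg(theta, g, lam)
    f0, t = loss(theta), 1.0
    while loss(theta + t * p) > f0 + 1e-4 * t * (g @ p) and t > 1e-8:
        t *= 0.5                          # Armijo backtracking
    if t > 1e-8:
        theta += t * p
        lam = max(0.5 * lam, 1e-6)        # relax damping after success
    else:
        lam *= 2.0                        # strengthen damping after failure
    print(f"iter {it:2d}  loss {loss(theta):.5f}  lambda {lam:.1e}")

Halving the damping parameter after an accepted step and doubling it after a rejected one is a standard Levenberg–Marquardt-style heuristic; the paper's adaptive damping rule may differ in its details.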

Keywords


conjugate gradient; Hessian-free optimization; multiclass classification; neural networks; second-order methods

DOI: https://doi.org/10.18860/cauchy.v11i1.40243

Copyright (c) 2026 Andy Irawan, Zainal Abidin, Mohammad Jamhuri

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
