From Risk-Neutral to Risk-Sensitive Reinforcement Learning: Actor–Critic vs REINFORCE with Tail-Based Risk Measures

Aprida Siska Lestia, Adhitya Ronnie Effendie, Made Tantrawan, Muhammad Rafli Azrarsyah

Abstract


This study investigates risk-sensitive reinforcement learning (RL) for portfolio decision-making under empirically heavy-tailed return distributions. We compare two policy-gradient architectures—REINFORCE with baseline (REINFORCE-BL) and batched Advantage Actor–Critic (A2C-B)—and examine how tail-based risk measures modify learning dynamics and robustness. Quantitative diagnostics confirm substantial excess kurtosis and strong rejection of normality in daily NASDAQ returns, motivating the integration of tail-sensitive objectives.
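The heavy-tail diagnostics mentioned above can be illustrated with a minimal sketch. It assumes the diagnostics are the sample excess kurtosis and a Jarque-Bera-type normality statistic; the abstract does not name the exact tests used. The NASDAQ data are not reproduced here, so a simulated Student-t sample stands in for daily returns, and the function names are hypothetical.

```python
import numpy as np

def standardized_moments(returns):
    """Return (skewness, excess kurtosis) from biased sample moments."""
    x = np.asarray(returns, dtype=float)
    z = x - x.mean()
    m2 = np.mean(z ** 2)
    skew = np.mean(z ** 3) / m2 ** 1.5
    excess_kurt = np.mean(z ** 4) / m2 ** 2 - 3.0  # 0 for a Gaussian
    return skew, excess_kurt

def jarque_bera(returns):
    """Jarque-Bera statistic; large values argue against normality.

    Under normality it is asymptotically chi-square with 2 degrees of
    freedom, whose 5% critical value is about 5.99.
    """
    n = np.asarray(returns).size
    skew, excess_kurt = standardized_moments(returns)
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)

# Assumption: synthetic stand-ins for daily returns, not the paper's data.
rng = np.random.default_rng(0)
heavy = 0.01 * rng.standard_t(df=3, size=5000)   # heavy-tailed sample
gauss = 0.01 * rng.standard_normal(5000)         # Gaussian benchmark

for name, sample in [("student-t", heavy), ("gaussian", gauss)]:
    _, k = standardized_moments(sample)
    print(f"{name}: excess kurtosis={k:.2f}, JB={jarque_bera(sample):.1f}")
```

On real data the same two functions would be applied to the observed daily return series; a large positive excess kurtosis together with a Jarque-Bera statistic far above the critical value is the pattern the abstract describes.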
Risk sensitivity is introduced at the episodic level through penalties based on Value at Risk (VaR), Conditional Value at Risk (CVaR), and Entropic Value at Risk (EVaR) at the 95% confidence level. Experiments are conducted in a multi-asset portfolio exposure-control environment, with performance evaluated across multiple random seeds using both training dynamics and out-of-sample financial metrics (CAGR, volatility, Sharpe ratio, drawdown, and realized tail risk).
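As an illustration of the three tail measures, the sketch below estimates VaR, CVaR, and EVaR at the 95% level from a sample of per-step losses and applies one of them as an episodic penalty. Several choices are assumptions rather than the paper's specification: the empirical quantile and tail-mean estimators, the grid search used for EVaR, the penalty form (total return minus a weight times the risk measure), and the names `penalized_return` and `lam`.

```python
import numpy as np

LEVEL = 0.95  # confidence level stated in the abstract

def var(losses, level=LEVEL):
    """Empirical Value at Risk: the `level` quantile of the losses."""
    return float(np.quantile(np.asarray(losses, dtype=float), level))

def cvar(losses, level=LEVEL):
    """Empirical CVaR: mean of the losses at or beyond the VaR."""
    x = np.asarray(losses, dtype=float)
    return float(x[x >= np.quantile(x, level)].mean())

def evar(losses, level=LEVEL, n_grid=400):
    """Empirical EVaR: min over z > 0 of log(E[exp(z L)] / (1 - level)) / z.

    The minimum is taken over a finite grid, so the result is an upper
    approximation of the exact infimum.
    """
    x = np.asarray(losses, dtype=float)
    scale = x.std() or 1.0                      # guard against zero spread
    z = np.logspace(-3, 3, n_grid) / scale      # grid in units of 1 / std
    a = z[:, None] * x[None, :]
    m = a.max(axis=1, keepdims=True)            # shift for a stable log-mean-exp
    log_mgf = m[:, 0] + np.log(np.mean(np.exp(a - m), axis=1))
    return float(np.min((log_mgf - np.log(1.0 - level)) / z))

def penalized_return(step_returns, risk_fn, lam):
    """Hypothetical episodic objective: total return minus lam * risk."""
    r = np.asarray(step_returns, dtype=float)
    return float(r.sum() - lam * risk_fn(-r))   # losses are negated returns

# Assumption: a simulated heavy-tailed episode, not the paper's environment.
rng = np.random.default_rng(0)
episode = 0.01 * rng.standard_t(df=3, size=5000)
losses = -episode
print("VaR :", var(losses))
print("CVaR:", cvar(losses))
print("EVaR:", evar(losses))
print("CVaR-penalized return:", penalized_return(episode, cvar, lam=1.0))
```

For the same sample the estimates are ordered VaR ≤ CVaR ≤ EVaR, so the EVaR penalty is the most conservative of the three. Only CVaR and EVaR are coherent risk measures; VaR is not subadditive in general.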
Results show that both architectures perform comparably under the risk-neutral objective, but actor–critic learning is more stable and shows lower dispersion under coherent tail penalties. In particular, the CVaR and EVaR objectives yield smoother convergence and less instability than VaR, especially for A2C-B. Statistical tests indicate that the performance differences between the two architectures become more pronounced under the coherent tail-risk objectives.
These findings highlight the interaction between heavy-tailed environments, coherent risk measures, and algorithmic architecture, suggesting that actor–critic methods provide a more robust foundation for risk-sensitive RL in financial settings exposed to extreme events.

Keywords


Risk-sensitive Reinforcement Learning; Actor–Critic; Coherent Risk Measures; CVaR; EVaR; Heavy-tailed Returns; Portfolio Optimization.

DOI: https://doi.org/10.18860/cauchy.v11i1.40309

Copyright (c) 2026 Adhitya Ronnie Effendie

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.