From Risk-Neutral to Risk-Sensitive Reinforcement Learning: Actor–Critic vs REINFORCE with Tail-Based Risk Measures
Abstract
Risk sensitivity is introduced at the episodic level through penalties based on Value at Risk (VaR), Conditional Value at Risk (CVaR), and Entropic Value at Risk (EVaR) at the 95% confidence level. Experiments are conducted in a multi-asset portfolio exposure-control environment, with performance evaluated across multiple random seeds using both training dynamics and out-of-sample financial metrics (CAGR, volatility, Sharpe ratio, drawdown, and realized tail risk).
Results show that while both architectures (actor–critic and REINFORCE) perform comparably under the risk-neutral objective, actor–critic learning exhibits greater stability and lower dispersion under coherent tail penalties. In particular, CVaR and EVaR objectives lead to smoother convergence and reduced instability compared to VaR, especially for A2C-B. Statistical tests indicate that performance differences between the two architectures become more pronounced under coherent tail-risk objectives.
These findings highlight the interaction between heavy-tailed environments, coherent risk measures, and algorithmic architecture, suggesting that actor–critic methods provide a more robust foundation for risk-sensitive RL in financial settings exposed to extreme events.
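The three tail measures named in the abstract can be illustrated with a minimal sketch of their empirical estimators. It is an illustration under stated assumptions, not the authors' implementation: losses are taken to be negated episode returns, VaR uses a linearly interpolated sample quantile, CVaR is the mean of losses at or above that quantile, and EVaR is approximated by a grid search over the parameter z in inf_{z>0} (1/z) ln(E[exp(zL)] / (1 - alpha)). The function names, the grid range, and the penalty weight `lam` are hypothetical.

```python
import math


def var(losses, alpha=0.95):
    """Empirical Value at Risk: linearly interpolated alpha-quantile of losses."""
    xs = sorted(losses)
    pos = alpha * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def cvar(losses, alpha=0.95):
    """Empirical Conditional Value at Risk: mean of losses at or above VaR."""
    threshold = var(losses, alpha)
    tail = [x for x in losses if x >= threshold]
    return sum(tail) / len(tail)


def evar(losses, alpha=0.95, n_grid=200):
    """Approximate Entropic Value at Risk by a grid search over z > 0.

    The grid (10**-3 to 10**2, log-spaced) is an assumption of this sketch;
    the result is an upper approximation of the true infimum.
    """
    n = len(losses)
    top = max(losses)
    best = math.inf
    for k in range(n_grid):
        z = 10 ** (-3 + 5 * k / (n_grid - 1))
        # Shift by the largest loss so exp() never overflows.
        log_mgf = z * top + math.log(sum(math.exp(z * (x - top)) for x in losses) / n)
        best = min(best, (log_mgf - math.log(1.0 - alpha)) / z)
    return best


def penalized_objective(episode_returns, risk_fn, lam=1.0, alpha=0.95):
    """Hypothetical episodic objective: mean return minus lam times tail risk."""
    losses = [-r for r in episode_returns]
    mean_return = sum(episode_returns) / len(episode_returns)
    return mean_return - lam * risk_fn(losses, alpha)


if __name__ == "__main__":
    sample_losses = list(range(1, 101))  # toy data, not from the paper
    for name, fn in (("VaR", var), ("CVaR", cvar), ("EVaR", evar)):
        print(name, fn(sample_losses))
```

On any sample the three estimates are expected to satisfy VaR ≤ CVaR ≤ EVaR at the same confidence level, which is one way to see why the EVaR penalty is the most conservative of the three.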
DOI: https://doi.org/10.18860/cauchy.v11i1.40309
Copyright (c) 2026 Adhitya Ronnie Effendie

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Editorial Office
Mathematics Department,
Universitas Islam Negeri Maulana Malik Ibrahim Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
Fax (+62) 341 558933
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.







