10.30495/ijm2c.2022.1958403.1252

Risk Classification of Imbalanced Data for Car Insurance Companies: Machine Learning Approaches

  1. Insurance Research Center, Tehran, Iran
  2. Department of Accounting and Finance, Faculty of Business and Law, De Montfort University, Leicester, UK

Received: 09-05-2022

Accepted: 19-08-2022

Published in Issue 01-09-2022

How to Cite

Khamesian, F., Esna-Ashari, M., Dei Ofosu-Hene, E., & Khanizadeh, F. (2022). Risk Classification of Imbalanced Data for Car Insurance Companies: Machine Learning Approaches. International Journal of Mathematical Modelling & Computations, 12(3), 153-162. https://doi.org/10.30495/ijm2c.2022.1958403.1252

Abstract

This paper presents a mechanism that insurance companies can use to identify the most effective features for classifying customer risk in third-party liability (TPL) car insurance. Underwriting is generally carried out based on expert experience, and the industry lacks a systematic method for categorizing policyholders by risk level. We analyzed 13,388 observations from an insurance claim dataset of bodily injury reports provided by an Iranian insurance company. The main challenge is that the dataset is imbalanced. We employ logistic regression and random forest with different resampling schemes applied to the original data in order to improve model performance. Results indicate that random forest with hybrid resampling is the best classifier, and that victim age, premium, car age and insured age are the most important factors for claims prediction.
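To make the idea of hybrid resampling concrete, the following is a minimal sketch that combines random undersampling of the majority class with random oversampling of the minority class. It is only an illustration under stated assumptions: the abstract does not say which hybrid method the authors used, and the function name `hybrid_resample`, the `target_per_class` parameter and the toy data are all hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def hybrid_resample(rows, labels, target_per_class, seed=0):
    """Bring every class to `target_per_class` rows.

    Classes larger than the target are randomly undersampled (without
    replacement); classes smaller than the target are randomly
    oversampled (with replacement). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            # Undersample: keep a random subset of the larger class.
            chosen = rng.sample(members, target_per_class)
        else:
            # Oversample: keep all rows and add random duplicates.
            shortfall = target_per_class - len(members)
            chosen = members + [rng.choice(members) for _ in range(shortfall)]
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels


# Toy, made-up data: 95 "no claim" rows (label 0) and 5 "claim" rows (label 1).
rows = [(i, i % 7) for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

new_rows, new_labels = hybrid_resample(rows, labels, target_per_class=30)
counts = Counter(new_labels)
```

After resampling, both classes contain the same number of rows, so a classifier such as logistic regression or random forest is no longer dominated by the majority class during training. In practice the resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.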
