Designing NLP applications to support ICD coding: an impact analysis and guidelines to enhance baseline performance when processing patient discharge notes

Jessica Jha 1, Mario Almagro 2, Hegler Tissot 3*

  • 1. Data Science, Drexel University, Philadelphia, USA
  • 2. Computer Science, UNED, Madrid, Spain
  • 3. Information Science, Drexel University, Philadelphia, USA



  • Received: 07 August 2023
  • Revised: 09 October 2023
  • Accepted: 12 October 2023
  • Published: 30 October 2023

Keywords: Clinical coding; Natural language processing; Machine learning; Baseline models; Concept extraction; Bag of words





How to Cite

Jha, J., M. Almagro, and H. Tissot. “Designing NLP Applications to Support ICD Coding: An Impact Analysis and Guidelines to Enhance Baseline Performance When Processing Patient Discharge Notes”. Journal of Digital Health, vol. 2, no. 1, Oct. 2023, pp. 63-81, doi:10.55976/jdh.22023119463-81.
