While working with data for the National Institutes of Health (NIH) to improve visibility into medical provider services and patient outcomes, with the goal of increasing system-wide efficiency, we led an effort to apply machine learning to financial and administrative pharmaceutical data, enabling data-driven business decisions at NIH.
This project took a comprehensive approach to normalizing (standardizing) medication names within the Executive Information System (EIS) environment. The key advancement is a Python-based Drug Information Normalization System, designed to support and refine the management of pharmaceutical names. The system draws on MedlinePlus, NHS, Wikipedia, MeSH, RxNorm, and DrugBank. It extracts, normalizes, and manages drug names, handles synonyms, and supports approximate drug searches.
The methodology includes scraping and extracting data from these sources with Python scripts, using regular expressions for detailed data extraction, and applying a multi-step normalization process to medication names. The process starts with token searches against medication lists, then uses text similarity techniques to find approximate matches in RxNorm, and finally selects the most probable candidate based on contextual analysis. When neither RxNorm (the primary source) nor DrugBank (the secondary source) yields a sufficient match, the drug is marked as “UNKNOWN.”
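The matching steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the reference lists, the function names, the 0.8 similarity cutoff, and the use of `difflib` from the standard library are all assumptions, and picking the highest similarity score stands in for the contextual analysis mentioned in the text.

```python
import difflib
import re

# Hypothetical stand-ins for the RxNorm (primary) and DrugBank (secondary) name lists.
RXNORM_NAMES = ["acetaminophen", "ibuprofen", "metformin", "lisinopril"]
DRUGBANK_NAMES = ["amoxicillin", "warfarin"]


def tokenize(raw_name):
    """Lowercase the raw entry and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", raw_name.lower())


def best_match(tokens, reference, cutoff):
    """Return the most similar reference name, or None if nothing passes the cutoff."""
    # Step 1: exact token search.
    for token in tokens:
        if token in reference:
            return token
    # Step 2: approximate match by text similarity, keeping the highest score.
    scored = []
    for token in tokens:
        for candidate in difflib.get_close_matches(token, reference, n=1, cutoff=cutoff):
            score = difflib.SequenceMatcher(None, candidate, token).ratio()
            scored.append((score, candidate))
    return max(scored)[1] if scored else None


def normalize(raw_name, cutoff=0.8):
    """Try the primary source first, then the secondary; fall back to UNKNOWN."""
    tokens = tokenize(raw_name)
    for source, reference in (("RxNorm", RXNORM_NAMES), ("DrugBank", DRUGBANK_NAMES)):
        match = best_match(tokens, reference, cutoff)
        if match is not None:
            return match, source
    return "UNKNOWN", None


print(normalize("Metformine 500 mg TAB"))
print(normalize("amoxicilin 250 mg"))
print(normalize("zzqx 10 ml"))
```

Trying the sources in a fixed order keeps the primary/secondary priority explicit: a secondary-source match is considered only when the primary source produces nothing above the cutoff.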
The system also includes a comprehensive Data Dictionary for the Normalized Medication Reference Table. The dictionary catalogs fields such as drug_id, drug_id_type, and normalized_medication_name, among others, giving medication data a structured, standardized format. Finally, the EIS Team integrates this data into their warehouse and builds reporting objects on it, improving the accessibility, usability, and reliability of reporting on pharmaceutical costs.
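As an illustration of what one row of the reference table might look like, here is a small sketch. Only the three field names come from the data dictionary described above; the class name, the allowed drug_id_type values, and the example identifier are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed identifier sources; the actual drug_id_type values are not listed in the text.
ALLOWED_ID_TYPES = {"RXNORM", "DRUGBANK", "UNKNOWN"}


@dataclass(frozen=True)
class NormalizedMedication:
    """One row of a (hypothetical) Normalized Medication Reference Table."""
    drug_id: str
    drug_id_type: str
    normalized_medication_name: str

    def __post_init__(self):
        # Reject identifier types outside the assumed vocabulary.
        if self.drug_id_type not in ALLOWED_ID_TYPES:
            raise ValueError(f"unexpected drug_id_type: {self.drug_id_type!r}")


# "12345" is a placeholder, not a real RxNorm identifier.
row = NormalizedMedication("12345", "RXNORM", "metformin")
print(row)
```

Validating drug_id_type at construction time is one way to keep the table consistent before it is loaded into a warehouse.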
Before normalization, the data contained 8,251 unique medication names. After normalization, 2,915 unique names remained, a reduction of roughly 65% that removes substantial redundancy and makes medication naming clearer. The system successfully normalized 87.7% of the original names. It extracted the form of the medication (e.g., tablet, liquid) in 81.9% of entries, which clarifies how medications are administered. It identified units of measure in 76.2% of entries and extracted the accompanying numerical values at the same 76.2% rate; both matter for dosage calculations, prescription accuracy, and standardization.
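To make the form, unit, and value extraction concrete, the sketch below shows one way such attributes could be pulled from a raw entry with regular expressions. The patterns, the list of forms and units, and the function name are assumptions for illustration; the project's actual expressions are not shown in this report.

```python
import re

# Hypothetical vocabularies; a real system would need far longer lists.
FORMS = ("tablet", "capsule", "liquid", "injection", "cream")

# A number followed by a unit, e.g. "500 mg" or "2.5ml". Longer units come first
# so that "mg" is not read as "g"; the lookahead rejects words such as "grams".
STRENGTH_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mcg|mg|ml|g|%)(?![a-z])",
    re.IGNORECASE,
)
FORM_RE = re.compile(r"\b(" + "|".join(FORMS) + r")s?\b", re.IGNORECASE)


def extract_attributes(raw_name):
    """Return the form, numeric value, and unit found in a raw entry (None if absent)."""
    strength = STRENGTH_RE.search(raw_name)
    form = FORM_RE.search(raw_name)
    return {
        "form": form.group(1).lower() if form else None,
        "unit_value": float(strength.group("value")) if strength else None,
        "unit": strength.group("unit").lower() if strength else None,
    }


print(extract_attributes("Amoxicillin 250 MG Capsules"))
print(extract_attributes("Saline flush"))
```

In this sketch the value and unit are captured by a single pattern, so they are either both found or both missing for a given entry.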
Recommendations
As part of the next phase of this work, we recommended the following strategies:
- Integration of RxNorm as a Primary Key: Incorporate RxNorm into the pharmacy system as the primary identifier. This integration is essential as RxNorm provides a standardized nomenclature for medications, facilitating better analytics for pharmaceutical cost analysis. The data files generated will include various reference tables, with RxNorm serving as the central reference point. This will greatly enhance the reliability and accuracy of data for ongoing pharmaceutical analysis activities.
- Minimization of Data Entry Variability: To address data entry inconsistencies, implement a more controlled entry system. Develop a comprehensive dictionary lookup table, which will be integrated into the pharmacy entry system. This table should be designed to generate a dynamic drop-down menu, leveraging the EIS Normalized data along with the RxNorm field. This approach will significantly reduce entry variations and errors, ensuring more consistent and accurate data capture.
- Training and Documentation Enhancement: Utilize the expertise of the pharmacy department trainer to develop and document comprehensive processes for using RxNorm and adding new medications to the system. This documentation should be user-friendly and detailed, assisting users who are involved in manual entry processes. By enhancing the training materials and procedures, we can ensure a smoother transition for users and improve overall system proficiency.
- Creation of an EIS Dashboard for Novel Entries: Develop a specialized EIS dashboard designed to compile and review all new entries monthly. This dashboard will serve as a critical tool for the EIS and unit teams during their monthly integrated clinical data team leadership meetings. It will act as a feedback loop, allowing data stewards and entry specialists to pinpoint the origins of data mismatches and expedite the resolution of data entry issues.
- Automated Supply Chain Management: Implement a modern, automated method for managing the supply chain. As supplies are delivered and offloaded, they should be systematically organized and recorded into the system through advanced scanning technologies. Employ AI and barcode scanning to streamline this process. This automation will not only improve efficiency but also significantly reduce data entry errors, ensuring a more accurate and reliable supply chain management system.
These recommendations aim to streamline operations, reduce errors, and improve the overall efficiency and accuracy of the pharmaceutical data management system, ultimately leading to better decision-making and cost management in pharmaceutical services.
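The monthly review of novel entries could start from something as simple as the sketch below, which groups entries that are absent from the normalized reference table by month. The function name, the input shape (date and raw name pairs), and the sample data are hypothetical; a real dashboard would read from the EIS warehouse instead.

```python
from collections import defaultdict
from datetime import date


def novel_entries_by_month(entries, known_names):
    """Group entries whose name is missing from the reference table by (year, month)."""
    known = {name.lower() for name in known_names}
    report = defaultdict(list)
    for entered_on, raw_name in entries:
        if raw_name.strip().lower() not in known:
            report[(entered_on.year, entered_on.month)].append(raw_name)
    return dict(report)


# Made-up sample data for illustration only.
known_names = ["metformin", "ibuprofen"]
entries = [
    (date(2023, 5, 2), "Metformin"),
    (date(2023, 5, 9), "metformine xr"),
    (date(2023, 6, 1), "ibuprofn"),
]
print(novel_entries_by_month(entries, known_names))
```

Each month's list gives data stewards a short set of unmatched names to trace back to their point of entry.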
The primary normalization step is currently 87.7% complete. The remaining activities needed to complete the project will continue while NIH unit leaders engage with the pharmacy department. To improve future activities, we recommend adding RxNorm to the pharmacy system as a primary key, updating the pharmacy supply system to align with reference dictionaries, updating training to make use of RxNorm search, creating a monthly report to identify newly added pharmaceuticals, and automating the entry process to reduce or eliminate manual entry.
References
- DrugBank. (2023). View drug alternatives for a product concept. Retrieved from https://docs.drugbank.com/v1/#view-drug-alternatives-for-a-product-concept
- Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
- McKinney, W., & others. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51–56).
- MedlinePlus. (2023). Drug Information. Retrieved from https://medlineplus.gov/druginformation.html
- National Center for Biotechnology Information. (2023). MeSH Database. Retrieved from https://www.ncbi.nlm.nih.gov/mesh/
- National Library of Medicine. (n.d.). RxNav APIs: RxNorm APIs. Retrieved from https://lhncbc.nlm.nih.gov/RxNav/APIs/RxNormAPIs.html
- NHS. (2023). Medicines. Retrieved from https://www.nhs.uk/medicines/
- Wikimedia Meta-Wiki. (2023). Data dump torrents: English Wikipedia. Retrieved from https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
