Natural language processing is a branch of artificial intelligence that aims to enable computers to process, understand, interpret and manipulate human language. After decades of exploration, current state-of-the-art NLP solutions are based on deep learning models implemented using various types of neural networks.
“Among the deep learning models, the transformer-based models implemented using a self-attention mechanism, such as BERT and GPT, are the current state-of-the-art solutions,” explained Yonghui Wu, director of NLP at the Clinical and Translational Science Institute at Gainesville-based University of Florida Health and an assistant professor in the department of health outcomes and biomedical informatics at the University of Florida.
“The transformer-based NLP models split the training procedure into two stages: pretraining and fine-tuning,” he continued. “In pretraining, the transformer models adopt unsupervised learning strategies to train language models on large unlabeled corpora (for example, Wikipedia, PubMed articles and clinical notes).”
In fine-tuning, transformer models fine-tune the pretrained models for specific downstream tasks using supervised learning.
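The pretraining stage Wu describes can be illustrated with its data preparation alone. The sketch below is a minimal, framework-free illustration of how a BERT-style masked-language-model example might be built: a fraction of tokens is hidden and the originals become prediction targets. The function name and masking rate are illustrative assumptions, not GatorTron's actual code.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Build one masked-language-model training example: hide a fraction
    of tokens and keep the originals as the labels the model must recover."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)    # scored position: model must predict this
        else:
            masked.append(tok)
            labels.append(None)   # unmasked position: not scored
    return masked, labels
```

In real pretraining, a transformer predicts each masked token from its full context; fine-tuning then swaps this self-supervised objective for a supervised task head trained on labeled examples.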
“The key step is pretraining, where transformer-based models learn task-independent linguistic knowledge from massive text data, which can be applied to solve many downstream NLP tasks,” Wu said. “However, to make the transformer-based models effective, they are usually huge, with billions of parameters, which cannot fit into a single GPU's memory, cannot be trained on a single compute node, and cannot rely on traditional training strategies.
“Training these large models requires massive computing power, efficient memory management and advanced distributed training techniques such as data and/or model parallelism to reduce training time,” he added. “Therefore, even though there are big transformer models in the general English domain, there are no comparable transformer models in the medical domain.”
For example, if an organization trained a BERT model with 345 million parameters on a single GPU, it would take months to complete.
“Models like GPT-2 with billions of parameters are not even able to fit into a single GPU's memory for training,” Wu said. “Thus, for a long time, we could not take advantage of big transformer models even though we have massive clinical text data at UF Health.”
For software, AI computing vendor Nvidia developed the Megatron-LM package, which adopted an efficient intra-layer model parallel approach that can significantly reduce distributed training communication time while keeping the GPUs compute-bound, said Jiang Bian, associate director of the biomedical informatics program at the Clinical and Translational Science Institute at UF Health and an associate professor in the department of health outcomes and biomedical informatics at the University of Florida.
“This model parallel technique is orthogonal to data parallelism, which enables us to benefit from both model-parallel and data-parallel distributed training,” Bian explained. “Further, Nvidia also developed and provided a conversational AI toolkit, NeMo, for using these large language models for downstream tasks. These software packages greatly simplified the steps in building and using large transformer-based models like our GatorTron.
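Bian's point that model parallelism is orthogonal to data parallelism comes down to rank-grouping arithmetic: each model-parallel group of GPUs jointly holds one copy of the model, while each data-parallel group holds replicas of the same model shard and averages gradients. The sketch below is a hypothetical illustration of one such partitioning convention, not Megatron-LM's actual initialization code:

```python
def parallel_groups(world_size, model_parallel_size):
    """Partition GPU ranks 0..world_size-1 into:
    - model-parallel groups: consecutive ranks that together hold one model copy
    - data-parallel groups: ranks holding the same model shard, which
      synchronize (average) gradients with each other."""
    assert world_size % model_parallel_size == 0
    data_parallel_size = world_size // model_parallel_size
    model_groups = [
        list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
        for i in range(data_parallel_size)
    ]
    data_groups = [
        list(range(i, world_size, model_parallel_size))
        for i in range(model_parallel_size)
    ]
    return model_groups, data_groups
```

Under this convention a 560-GPU pool could, for example, be split into 8-way model parallelism with 70-way data parallelism (a hypothetical configuration chosen only to make the arithmetic concrete).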
“For hardware, Nvidia provided the HiPerGator AI Nvidia DGX SuperPOD cluster, recently deployed at the University of Florida, featuring 140 Nvidia DGX A100 nodes with 1,120 Nvidia Ampere A100 GPUs,” he continued. “The software solved the bottleneck in distributed training algorithms and the hardware solved the bottleneck in computing power.”
MEETING THE CHALLENGE
The team at UF Health developed GatorTron, the world’s largest transformer-based NLP model in the medical domain, with around 9 billion parameters, and trained it on more than 197 million notes containing more than three billion sentences and more than 82 billion words of clinical text from UF Health.
“GatorTron adopted the architecture of Megatron-LM – the software provided by Nvidia,” Wu said. “We trained GatorTron using the HiPerGator AI Nvidia DGX SuperPOD cluster, recently deployed at the University of Florida, featuring 140 Nvidia DGX A100 nodes with 1,120 Nvidia Ampere A100 GPUs. With the HiPerGator AI cluster, the computing resource is no longer a bottleneck.
“We trained GatorTron using 70 HiPerGator nodes with 560 GPUs, with a combined data- and model-parallel training strategy,” he added. “Without Nvidia’s Megatron-LM, we would not have been able to train such a large transformer model in the clinical domain. We also leveraged Nvidia’s NeMo toolkit, which provides the flexibility to fine-tune GatorTron for various downstream NLP tasks using easy-to-use application programming interfaces.”
GatorTron currently is being evaluated for downstream tasks such as named-entity recognition, relation extraction, semantic textual similarity and question answering with electronic health record data in a research setting. The team is working to apply GatorTron to real-world healthcare applications such as patient cohort identification, text de-identification and information extraction.
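As a concrete example of one of these downstream tasks, named-entity recognition models typically emit a per-token BIO tag sequence (B- beginning an entity, I- inside it, O outside) that must be decoded into entity spans. The decoder below is a generic sketch of that post-processing step, with invented clinical labels; it is not taken from the GatorTron pipeline:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (entity_type, start, end) spans,
    with end exclusive. An I- tag only continues a span if it matches
    the open entity type; otherwise the open span is closed."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:           # close any open span
                spans.append((etype, start, i))
            start, etype = i, tag[2:]       # open a new span
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                        # extend the open span
        else:                               # O tag or stray I- tag
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:                   # span running to end of sequence
        spans.append((etype, start, len(tags)))
    return spans
```

For a note tokenized as "pt denies chest pain or sob" tagged O O B-Problem I-Problem O B-Problem, this yields two Problem spans: tokens 2-3 ("chest pain") and token 5 ("sob").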
UF Health evaluated the GatorTron model on four important NLP tasks: clinical concept extraction, clinical relation extraction, medical natural language inference, and medical question answering.
“For clinical concept extraction, the GatorTron model achieved state-of-the-art performance on all three benchmarks: the publicly available 2010 i2b2, 2012 i2b2 and 2018 n2c2 datasets,” Bian noted. “For relation extraction, GatorTron significantly outperformed other BERT models pretrained in the clinical or biomedical domain, such as ClinicalBERT, BioBERT and BioMegatron.
“For medical natural language inference and question answering, GatorTron achieved new state-of-the-art performance on both benchmark datasets – MedNLI and emrQA,” he added.
ADVICE FOR OTHERS
There is increasing interest in applying NLP models to extract patient information from clinical narratives, where state-of-the-art pretrained language models are key components.
“A well-trained large language model could improve many downstream NLP tasks through fine-tuning, such as medical chatbots, automated summarization, medical question answering, and clinical decision support systems,” Wu advised. “When developing large transformer-based NLP models, it’s recommended that organizations explore various model sizes – numbers of parameters – based on their local clinical data.
“When applying these large transformer-based NLP models, healthcare providers have to think about the real-world configurations,” he concluded. “For example, these large transformer-based NLP models are very powerful solutions for high-performance servers, but are not feasible to deploy on personal computers.”