BiomedGPT: Applications of Multimodal Large Language Models in Various Biomedical Tasks
- Wang Suhong
- Dec 27, 2024
- 3 min read
With the rapid growth of AI, the medical field increasingly relies on multimodal large language models that integrate visual and linguistic data to support precise diagnosis, treatment, and patient care. Traditional models often lack flexibility across diverse tasks; BiomedGPT is designed to overcome this limitation. This lightweight, open-source vision-language model excels in adaptability and performance, achieving strong results across biomedical applications through its architecture, pretraining strategies, and fine-tuning methods.

Distinct Advantages of BiomedGPT: Architecture and Features of Multimodal Large Language Models
1. Unified Multimodal Representation for Biomedical Tasks
BiomedGPT is built on a Transformer-based encoder-decoder architecture designed for multimodal biomedical tasks. It processes text, images, and mixed inputs through a single unified input-output representation (a minimal sketch follows the list below). Key features include:
Text: Processed using BPE (Byte Pair Encoding) tokenization.
Images: Encoded with a pretrained VQ-GAN discretization mechanism.
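The snippet below is a minimal sketch of that shared representation. The vocabulary and codebook sizes, function names, and hashing-based "tokenizer" are illustrative assumptions, not BiomedGPT's actual implementation; the point is simply that VQ-GAN image codes are offset past the text vocabulary so the model consumes one token sequence.

```python
# Minimal sketch of a unified token sequence, assuming a BPE text vocabulary
# of size TEXT_VOCAB and a VQ-GAN codebook of size IMG_CODEBOOK.
# All names and sizes here are illustrative, not BiomedGPT's real code.

TEXT_VOCAB = 50_000      # hypothetical BPE vocabulary size
IMG_CODEBOOK = 8_192     # hypothetical VQ-GAN codebook size

def encode_text(text: str) -> list[int]:
    """Stand-in for BPE tokenization: map each word to a fake token id."""
    return [hash(w) % TEXT_VOCAB for w in text.lower().split()]

def encode_image(vq_codes: list[int]) -> list[int]:
    """Shift VQ-GAN code indices past the text vocabulary so text and
    image tokens share one embedding table."""
    return [TEXT_VOCAB + c for c in vq_codes]

def build_sequence(question: str, vq_codes: list[int]) -> list[int]:
    """Concatenate image tokens and text tokens into a single input sequence."""
    return encode_image(vq_codes) + encode_text(question)

# Example: a quantized chest X-ray (fake codes) plus a VQA question.
tokens = build_sequence("is there a pleural effusion?", [12, 407, 3991, 55])
print(tokens[:6], "... total length:", len(tokens))
```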
2. Comprehensive Multimodal Task Support
BiomedGPT supports a wide range of biomedical tasks:
Vision tasks: Image classification, masked image modeling (MIM), object detection.
Text tasks: Masked language modeling (MLM), summarization, natural language inference.
Multimodal tasks: Biomedical visual question answering (VQA), image captioning.
3. Lightweight Design for Multimodal Large Language Models
BiomedGPT offers three versions (small, medium, and base) with 33 million, 93 million, and 182 million parameters, respectively, catering to diverse computational needs. Despite having significantly fewer parameters than the commercial Med-PaLM M model (12 billion parameters), it excels in a variety of biomedical tasks.
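As a rough illustration of why the smaller footprint matters, the sketch below estimates the weight memory of each version; the two-bytes-per-parameter (fp16) assumption is ours, not from the paper.

```python
# Back-of-the-envelope memory estimate for the three BiomedGPT sizes.
# Parameter counts come from the text above; the precision is an assumption.

sizes = {"small": 33e6, "medium": 93e6, "base": 182e6}
bytes_per_param = 2  # assuming fp16 weights

for name, n_params in sizes.items():
    print(f"{name:>6}: ~{n_params * bytes_per_param / 1e6:.0f} MB of weights")
```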
BiomedGPT Methodology: Core Technologies of Multimodal Large Language Models in Biomedical Tasks
1. Architecture Design: Encoder-Decoder Model
BiomedGPT pairs a BERT-style encoder with a GPT-style decoder and augments attention with relative positional biases to improve sequence understanding.
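A minimal PyTorch sketch of such an encoder-decoder follows. All dimensions and the vocabulary size are illustrative, and the stock nn.Transformer layers do not implement relative positional biases, so that part of BiomedGPT's design is only noted in a comment rather than reproduced.

```python
import torch
import torch.nn as nn

# Minimal sketch of a BERT-style encoder paired with a GPT-style
# (autoregressive) decoder. Sizes are illustrative. Note: BiomedGPT also adds
# relative positional biases to attention scores, which the stock
# nn.Transformer layers below do not implement.
class TinyEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=58_192, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared text+image token table
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src, tgt = self.embed(src_ids), self.embed(tgt_ids)
        # The causal mask is what makes the decoder autoregressive (GPT-style).
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(out)

model = TinyEncoderDecoder()
logits = model(torch.randint(0, 58_192, (1, 64)), torch.randint(0, 58_192, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 58192])
```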
2. Pretraining Strategies: Large-Scale Biomedical Multimodal Data
BiomedGPT is pretrained on 14 public datasets, encompassing 592,567 images, 183 million text sentences, and 271,804 image-text pairs. Key pretraining tasks include (a masking sketch follows the list):
Masked image modeling (MIM).
Masked language modeling (MLM).
Image captioning.
Biomedical visual question answering (VQA).
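The masked-modeling objectives can be illustrated with a small corruption routine. The masking ratio, mask id, and -100 ignore-index below are common conventions assumed for this sketch, not BiomedGPT's exact recipe; masked image modeling works analogously on the discrete VQ-GAN image tokens.

```python
import random

# Illustrative masking routine for the MLM objective. Ratios and the mask-id
# convention are assumptions for this sketch.

MASK_ID = -1          # placeholder mask token id (assumed)
MASK_RATIO = 0.15     # commonly used masking ratio (assumed)

def mask_tokens(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (corrupted inputs, labels). Labels are -100 where not masked,
    mirroring the usual ignore-index convention for cross-entropy."""
    inputs, labels = [], []
    for t in token_ids:
        if random.random() < MASK_RATIO:
            inputs.append(MASK_ID)
            labels.append(t)        # model must reconstruct the original token
        else:
            inputs.append(t)
            labels.append(-100)     # ignored by the loss
    return inputs, labels

random.seed(0)
print(mask_tokens([101, 2057, 4321, 786, 9001, 15]))
```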
3. Fine-Tuning and Task Adaptation
BiomedGPT adapts well to tasks such as medical image classification, report generation, and natural language inference by fine-tuning its existing weights, without adding task-specific heads or other extra components.
4. Instruction Tuning: Enhancing Multimodal Task Performance
Given natural language instructions, BiomedGPT generates accurate answers in zero-shot scenarios. For example (a prompt-building sketch follows the examples):
Image description: "What does the image describe?"
Text summarization: "What is the summary of the text '{Text}'?"
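The sketch below turns such instructions into model inputs. The first two template strings come from the article; the dictionary structure, task names, and helper function are assumptions made for illustration.

```python
# Illustrative prompt templates for instruction-style inputs. The task keys
# and build_prompt helper are hypothetical; only the template wording for
# image description and summarization comes from the article.

TEMPLATES = {
    "image_caption": "What does the image describe?",
    "summarization": "What is the summary of the text '{text}'?",
    "vqa": "{question}",
}

def build_prompt(task: str, **fields) -> str:
    """Fill the chosen task template with the provided fields."""
    return TEMPLATES[task].format(**fields)

print(build_prompt("summarization", text="Chest X-ray shows no acute findings."))
print(build_prompt("vqa", question="Is there cardiomegaly?"))
```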
5. Exceptional Zero-Shot Prediction Capability
BiomedGPT exhibits robust zero-shot reasoning capabilities in biomedical tasks, achieving a 54.7% accuracy rate on the VQA-RAD dataset, surpassing GPT-4V's 53.0%.
Performance Advantages and Application Scenarios of BiomedGPT in Biomedical Tasks
1. Biomedical Visual Question Answering (VQA): Task-Specific Capabilities of Multimodal Large Language Models
BiomedGPT achieved an 86.1% accuracy rate on the SLAKE dataset, setting a new record (previously 85.4%). This highlights its strong ability to interpret medical images and answer related questions, facilitating the rapid extraction of key information.
2. Medical Image Description: Language Generation in Multimodal Tasks
BiomedGPT improved the ROUGE-L score by 8.1% on the Peir Gross dataset and achieved a METEOR score of 15.9% on the MIMIC-CXR dataset. This makes it a powerful tool for radiologists in describing medical images.
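For readers who want to run this kind of evaluation themselves, the snippet below scores a generated caption against a reference with the rouge-score package. The example sentences are made up, and the package choice is ours rather than the paper's evaluation code.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Toy reference/candidate pair; a real evaluation would loop over a test set.
reference = "the chest x-ray shows mild cardiomegaly with no pleural effusion"
candidate = "chest x-ray demonstrates mild cardiomegaly and no effusion"

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```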
3. Medical Image Classification: Enhancing Accuracy in Dermatology and Other Tasks
In seven classification tasks on the MedMNIST-Raw dataset, BiomedGPT outperformed other models in five tasks. Notably, it achieved a 14% higher accuracy rate than baseline models on the dermoscopy dataset, demonstrating its potential in detecting dermatological conditions.
4. Zero-Shot Disease Diagnosis: Generalization of Multimodal Large Language Models
In zero-shot settings, BiomedGPT successfully handled complex disease diagnosis tasks, performing on par with Med-PaLM M. Its capabilities are particularly valuable for diagnosing rare diseases and emerging conditions.
5. In-Hospital Mortality Prediction: Accurate Evaluation in Medical Tasks
BiomedGPT outperformed other models in predicting in-hospital mortality using the MIMIC-III database, aiding in identifying high-risk patients and optimizing resource allocation.
6. Medical Report Generation and Summarization: Document Creation in Multimodal Tasks
On the MIMIC-CXR dataset, BiomedGPT-generated reports were favorably rated by medical experts, achieving a preference score comparable to expert-authored reports (48% vs. 52%). This reduces the documentation burden in high-intensity scenarios like radiology.
AIExPro: Multimodal Large Language Model Solutions for Biomedical Tasks
AIExPro focuses on the innovative application of multimodal large language models in various biomedical tasks, providing precise and efficient solutions for the healthcare industry. By integrating medical imaging and text data, AIExPro demonstrates exceptional vision-language interaction capabilities, driving breakthroughs in fields such as medical imaging diagnostics, clinical report generation, drug discovery, and medical text analysis. Its unified input-output representation ensures task flexibility, while the lightweight design lowers deployment barriers, enabling healthcare institutions to implement high-performance models without relying on extensive computational resources. AIExPro offers intelligent support for medical teams, advancing healthcare services toward precision and intelligence.