Label extraction from PET/CT reports using Large Language Models
Bracci J, Capobianco N, Shah V, Spottiswoode B, Giobergia F (2024)
Publication Language: English
Publication Type: Conference Contribution
Publication year: 2024
Book Volume: 52
Pages Range: 193-193
Conference Proceedings Title: EANM’24 Congress Abstract Book
Event location: Hamburg, Germany
DOI: 10.1007/s00259-024-06939-9
Aim/Introduction: PET/CT clinical reports are typically written in free-text format. Consequently, while they contain a wealth of valuable and reliable information, it is challenging to store and structure this information in a way that can be searched, modeled, and analyzed in a consistent and efficient manner. This study aims to assess the feasibility of extracting structured data from textual reports by employing the latest Large Language Models (LLMs).

Materials and Methods: We reviewed PET/CT reports from 31 patients with confirmed lung cancer. A human reader synthesized these reports into a structured format, categorizing each anatomical location (AL) as either “Presence” or “Absence” of a lesion. We then prompted publicly available LLMs to structure given reports, using Few-Shot (FS) learning [1] with up to 4 shots (reports and expected output fed to the model as examples) to assess the impact on performance. The evaluation was conducted on 27 held-out reports, using the F1-score metric. The dataset’s AL distribution included an average of 23.4 ALs for “Absence” (80 unique ALs), the 5 most frequent being [bladder: 31, spleen: 31, liver: 29, kidneys: 24, aorta: 22], and 6.9 ALs for “Presence” (143 unique ALs), the 5 most frequent being [right-upper-lobe: 4, right-lung-parenchyma: 4, right-lower-lobe: 3, right-middle-lobe: 3, left-upper-lobe: 3]. We evaluated six LLMs: GPT-4-Turbo, Llama3-70B, Llama3-8B, Mistral-7B, Mixtral-8x22B, and Mixtral-8x7B.

Results: For the “Absence” class, Llama3-70B performed best, with an average F1-score of 93.02% when prompted with 4 shots. All models demonstrated a significant increase in performance from 0 shots (i.e., no examples provided) to 1 shot, with an average F1-score increase of 30.12% (from 35.20% to 65.32%). Performance plateaued at 2 or more shots, averaging 80.24% across all models. For the “Presence” class, Mistral-7B had the highest average F1-score, 69.00% with 3 shots. A similar trend was observed from 0 to 1 shot, with an average F1-score increase of 20.53% (from 35.77% to 56.30%). Performance was likewise stable at 2 or more shots, with an average F1-score of 57.25% across all models.

Conclusion: The study indicates that LLMs can structure PET/CT clinical report data effectively, with few-shot learning being a critical factor in achieving high accuracy. Future research should explore additional LLMs for optimal performance and the refinement of the shots used to enhance model performance.

References: [1] Brown, T. B., et al. (2020). Language models are few-shot learners (arXiv:2005.14165).
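The few-shot setup described in the abstract can be sketched as follows. This is a minimal illustration in Python assuming an OpenAI-style chat-message format; the example report, the instruction wording, and the label schema are hypothetical stand-ins, since the study's actual prompts and data are not given in the abstract.

```python
import json

# Hypothetical few-shot examples (report text -> expected structured output).
# The study's actual reports and label schema are not public; these are
# illustrative stand-ins only.
SHOTS = [
    (
        "Physiologic tracer uptake in the liver and spleen. "
        "Hypermetabolic nodule in the right upper lobe.",
        {"liver": "Absence", "spleen": "Absence", "right-upper-lobe": "Presence"},
    ),
    # ... up to 4 shots, as in the study
]

INSTRUCTION = (
    "For every anatomical location mentioned in the PET/CT report, label it "
    "'Presence' or 'Absence' of a lesion. Answer with JSON only."
)

def build_messages(report: str) -> list[dict]:
    """Assemble a few-shot chat prompt: instruction, worked examples, then the query."""
    messages = [{"role": "system", "content": INSTRUCTION}]
    for shot_report, shot_labels in SHOTS:
        messages.append({"role": "user", "content": shot_report})
        messages.append({"role": "assistant", "content": json.dumps(shot_labels)})
    messages.append({"role": "user", "content": report})
    return messages

# The resulting message list can be sent to any chat-completion endpoint
# (e.g., GPT-4-Turbo, or Llama3/Mistral/Mixtral behind an OpenAI-compatible API).
```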
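The reported scores can be read as a standard per-class F1 over anatomical locations. The sketch below assumes exact set overlap of AL names per report as the matching scheme, which the abstract does not spell out.

```python
def f1_for_class(predicted: set[str], reference: set[str]) -> float:
    """F1-score over anatomical locations for one class (e.g. 'Presence')."""
    if not predicted and not reference:
        return 1.0  # vacuously perfect when neither side lists any AL
    tp = len(predicted & reference)  # ALs found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model recovers 3 of 4 reference "Presence" ALs plus one spurious AL.
pred = {"right-upper-lobe", "right-lower-lobe", "left-upper-lobe", "aorta"}
ref = {"right-upper-lobe", "right-lower-lobe", "left-upper-lobe", "right-middle-lobe"}
print(f"F1 = {f1_for_class(pred, ref):.2%}")  # -> F1 = 75.00%
```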
APA:
Bracci, J., Capobianco, N., Shah, V., Spottiswoode, B., & Giobergia, F. (2024). Label extraction from PET/CT reports using Large Language Models. In EANM’24 Congress Abstract Book (p. 193). Hamburg, Germany. https://doi.org/10.1007/s00259-024-06939-9
MLA:
Bracci, Jacopo, et al. "Label extraction from PET/CT reports using Large Language Models." Proceedings of the Annual Congress of the European Association of Nuclear Medicine, Hamburg, Germany, 2024, p. 193.