XU, Binbin; GIL-JARDINE, Cedric; THIESSARD, Frantz; TELLIER, Eric; AVALOS FERNANDEZ, Marta; LAGARDE, Emmanuel

dc.rights.license	open	en_US
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	XU, Binbin
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	GIL-JARDINE, Cedric ORCID: 0000-0001-5329-6405 IDREF: 159039223
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	THIESSARD, Frantz
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	TELLIER, Eric
hal.structure.identifier	Statistics In System biology and Translational Medicine [SISTM]
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	AVALOS FERNANDEZ, Marta
hal.structure.identifier	Bordeaux population health [BPH]
dc.contributor.author	LAGARDE, Emmanuel
dc.date.accessioned	2021-04-30T10:46:44Z
dc.date.available	2021-04-30T10:46:44Z
dc.identifier.uri	https://oskar-bordeaux.fr/handle/20.500.12278/27139
dc.description.abstractEn	To build a national injury surveillance system based on emergency room (ER) visits, we are developing a coding system that classifies the causes of visits from free-text clinical notes. Supervised learning techniques have shown good results in this area but require large annotated datasets. Neural language models (NLM) based on the Transformer architecture and incorporating an unsupervised generative pre-training step have recently reached new levels of performance. Our hypothesis is that methods involving a generative self-supervised pre-training step can significantly reduce the number of annotated samples required for supervised fine-tuning. In this case study, we assessed whether we could predict from free-text clinical notes whether a visit was the consequence of a traumatic or non-traumatic event. Using fully re-trained GPT-2 models (without the OpenAI pre-trained weights), we compared two scenarios. Scenario A (26 study cases with different training data sizes) consisted of training GPT-2 on clinical notes labeled as trauma or non-trauma (up to 161 930 notes). Scenario B (19 study cases) consisted of a first step of self-supervised pre-training on unlabeled notes (up to 151 930), followed by a second step of supervised fine-tuning on labeled notes (up to 10 000). Results showed that Scenario A needed to process more than 6 000 notes to achieve good performance (AUC>0.95), whereas Scenario B needed only 600 notes, a gain of a factor of 10. In the final case of each scenario, with 16 times more labeled data (161 930 vs. 10 000), Scenario A improved on Scenario B by only 0.89% in AUC and 2.12% in F1 score. In conclusion, a multi-purpose NLM such as GPT-2 can be adapted into a powerful tool for classifying free-text notes with only a very small number of labeled samples.
dc.language.iso	EN	en_US
dc.subject.en	Neural Language Model
dc.subject.en	Pre-training
dc.subject.en	Transformer
dc.subject.en	GPT-2
dc.title.en	Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room. The Significant Benefit of Unsupervised Generative Pre-training
dc.type	Working paper - Preprint	en_US
dc.subject.hal	Statistics [stat]/Machine Learning [stat.ML]	en_US
dc.subject.hal	Statistics [stat]/Methodology [stat.ME]	en_US
dc.subject.hal	Statistics [stat]/Computation [stat.CO]	en_US
dc.subject.hal	Statistics [stat]/Applications [stat.AP]	en_US
dc.subject.hal	Computer Science [cs]/Machine Learning [cs.LG]	en_US
dc.subject.hal	Life Sciences [q-bio]/Public Health and Epidemiology	en_US
dc.identifier.arxiv	1909.01136	en_US
bordeaux.hal.laboratories	Bordeaux Population Health Research Center (BPH) - UMR 1219	en_US
bordeaux.institution	Université de Bordeaux	en_US
bordeaux.institution	INSERM	en_US
bordeaux.team	SISTM	en_US
bordeaux.team	ERIAS	en_US
bordeaux.team	SISTM_BPH
bordeaux.team	IETO	en_US
bordeaux.import.source	hal
hal.identifier	hal-02425097
hal.version	1
hal.export	false
workflow.import.source	hal

File(s) comprising this document

Files	Size	Format	View
There are no files associated with this document.

This document appears in the following collection(s)

Bordeaux Population Health Research Center (BPH) - UMR 1219
