Running in circles: practical limitations for real-life application of data fission and data thinning in post-clustering differential analysis
HIVERT, Benjamin
Statistics In System biology and Translational Medicine [SISTM]
Bordeaux population health [BPH]
THIEBAUT, Rodolphe
Statistics In System biology and Translational Medicine [SISTM]
Bordeaux population health [BPH]
HEJBLUM, Boris
Statistics In System biology and Translational Medicine [SISTM]
Bordeaux population health [BPH]
Language
EN
Working paper - Preprint
English Abstract
Post-clustering inference in scRNA-seq analysis presents significant challenges in controlling the Type I error rate during Differential Expression Analysis. Data fission, a promising approach, aims to split the data into two new independent parts, but it relies on strong parametric assumptions of non-mixture distributions, which are violated in clustered data. We show that applying data fission to these mixtures requires knowledge of the clustering structure to accurately estimate component-specific scale parameters. These estimates are critical for ensuring decomposition and independence. We theoretically quantify the direct impact of the bias in estimating these scale parameters on the inflation of the Type I error rate, which is caused by a deviation from independence. Since component structures are unknown in practice, we propose a heteroscedastic model with non-parametric estimators for individual scale parameters. This model uses proximity between observations to capture the effect of the underlying mixture on data dispersion. While this approach works well when clusters are well-separated, it introduces bias when separation is weak, highlighting the difficulty of applying data fission in real-world scenarios with unknown degrees of separation.
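The dependence on the scale parameter can be illustrated with a minimal sketch, which is not the authors' code. It assumes the standard Gaussian fission decomposition, X1 = X + tau*Z and X2 = X - Z/tau with Z ~ N(0, sigma^2), applied to a simulated two-component mixture. The function names, the simulation settings, and the choice tau = 1 are hypothetical and chosen only for illustration. Within a component, Cov(X1, X2) equals the true variance minus the variance used to draw Z, so an oracle scale gives roughly zero covariance, while a scale estimated from the pooled mixture (inflated by the between-cluster separation) leaves the two parts dependent.

```python
import numpy as np


def gaussian_fission(x, sigma, tau=1.0, rng=None):
    """Split x into two parts using Gaussian noise with scale sigma.

    sigma may be a scalar or an array of per-observation scales.
    The parts are independent only if sigma matches the true scale of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, 1.0, size=x.shape) * sigma
    return x + tau * z, x - z / tau


def within_component_cov(a, b, labels, k=0):
    """Empirical covariance between the two parts inside component k."""
    mask = labels == k
    return np.cov(a[mask], b[mask])[0, 1]


# Illustrative simulation (assumed settings): two components at -3 and +3,
# each with true standard deviation 1.
rng = np.random.default_rng(0)
n = 100_000
labels = rng.integers(0, 2, size=n)
means = np.where(labels == 0, -3.0, 3.0)
true_sigma = 1.0
x = rng.normal(means, true_sigma)

# Oracle: the component-specific scale is known.
x1_oracle, x2_oracle = gaussian_fission(x, true_sigma, rng=rng)

# Naive: the scale is estimated on the pooled data, ignoring the mixture.
pooled_sigma = x.std()
x1_naive, x2_naive = gaussian_fission(x, pooled_sigma, rng=rng)

cov_oracle = within_component_cov(x1_oracle, x2_oracle, labels)
cov_naive = within_component_cov(x1_naive, x2_naive, labels)

print(f"within-component covariance, oracle scale: {cov_oracle:.3f}")
print(f"within-component covariance, pooled scale: {cov_naive:.3f}")
```

In this setting the pooled variance is about 1 + 3^2 = 10, so the within-component covariance under the naive scale is close to 1 - 10 = -9, whereas the oracle scale gives a value near zero.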
English Keywords
Unsupervised learning
Mixture Model
Post-clustering inference
Type I error
Non-parametric estimation
Local variance
European Project
European HIV Vaccine Alliance (EHVA): a EU platform for the discovery and evaluation of novel prophylactic and therapeutic vaccine candidates
ANR Project
MultiScale AI for SingleCell-Based Precision Medicine - ANR-22-PESN-0002
University of Bordeaux Graduate School in Digital Public Health - ANR-17-EURE-0019