LAmbDA: Label Ambiguous Domain Adaption Dataset Integration Reduces Batch Effects and Improves Subtype Detection
Abstract

Motivation: Rapid advances in single-cell RNA sequencing have produced more granular subtypes of cells in multiple tissues from different species. There is a need for rigorous methods that can i) model multiple datasets with ambiguous labels across species and studies and ii) remove systematic biases across datasets and species.

Results: We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets and applied it to scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy: pancreas 91%, brain 78%) using LAmbDA Random Forest. LAmbDA Feedforward 1 Layer Neural Network achieved higher weighted accuracy in labeling cellular subtypes in brain than CaSTLe or MetaNeighbor (48%, 32%, and 20%, respectively). Furthermore, of these three methods, LAmbDA Feedforward 1 Layer Neural Network was the only one to correctly predict ambiguous cellular subtype labels in both pancreas and brain. LAmbDA is model- and dataset-independent and generalizable to diverse data types, representing an advance in biocomputing.

Availability: github.com/tsteelejohnson91/LAmbDA

Contact: [email protected], [email protected]