Yu S. Huang

Ph.D. with 15+ years of leadership in computational biology, AI drug discovery, genomic/protein AI modeling, and high-performance computing (HPC) infrastructure. Expert in generative AI, protein language models, multimodal fusion, and structure-based drug design. Proven track record in defining technical strategy, building end-to-end AI platforms from scratch, leading cross-functional teams, managing academic-industry collaborations, and driving publications and IP strategy. Led the development and industrial deployment of AI models for early cancer detection, virtual screening, and molecular generation.

Since 2022, I have served as Senior Director of Bioinformatics at 臻和 Genecast Corp Ltd. I define and execute the long-term technical strategy for AI-driven precision oncology, aligned with corporate product pipelines and business goals. I lead the development of multimodal AI platforms that integrate sequence, structure, and epigenomic data for non-invasive cancer detection and protein computing. I built enterprise-grade AI computing infrastructure (K8s, PyTorch, distributed storage, high-speed interconnect) to support large-scale computing. I lead and mentor a high-performance, cross-disciplinary team of algorithm scientists, bioinformaticians, and software engineers that delivers end-to-end solutions from in silico modeling to experimental validation, and I promote tight integration between computational models and experimental biology. I am also responsible for external scientific engagement, conference presentations, high-impact publications, and IP strategy, and I drive research-to-product translation.

From 2015 to 2021, I was Professor and Principal Investigator at the Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences (中科院). I led the establishment of AI-driven computational biology and drug discovery there and built a mature structure-based drug design and virtual screening system. I developed Fergie (VAE-based small molecule generation) and Deffini (a structure-based virtual screening DNN) to enable structure-guided drug design at scale, as well as core algorithms for genomic variant calling, copy number analysis, and methylation sequencing to support early-stage drug R&D. I directed national and provincial research projects, built academic-industry partnerships, and delivered high-impact publications. During this time, I also taught courses on Artificial Intelligence and Machine Learning, Julia (a scientific computing language), Matrix Computations, and Optimization.

My career also includes experience as a Bioinformatics Scientist at Illumina Inc. (San Diego) and as a Postdoctoral Researcher in Human Genetics at UCLA. I earned my Ph.D. in Computational Biology and Bioinformatics from USC, specializing in machine learning, statistical learning, graph theory, and population genetics. In this Ph.D. program, founded by the mathematician M.S. Waterman, a founding father of computational biology, I gained a comprehensive understanding of machine learning, statistics, probability, and stochastic processes.

I received my B.S. in Biology from Fudan University in July 2003. During my time there, I became proficient in C/C++, PostgreSQL, Java, Python, and Linux systems. My fascination with computers began with writing my first BASIC program on an Intel 8088 PC in the 8th grade.

In my spare time, I enjoy reading, surfing, skateboarding, and snowboarding.

EMAIL
GitHub repository: github.com/polyactis
ORCID 0000-0001-5967-4948, ResearchGate
Xiaohongshu (小红书): polyactis
Expertise: Machine/Deep/Statistical Learning, Bioinformatics, AI Drug Design, Optimization, Distributed Computing

latest posts

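Aug 06, 2025	AI empowers tumor genetic diagnostics! Dr. Yu Huang's in-depth analysis: how technological innovation can solve clinical pain points - China Anti-Cancer Association TBM medicine-research-industry symposium
Jun 12, 2025	Bayesian p-value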
Sep 02, 2024	How to calculate the derivative of a matrix multiplication such as WHW
Apr 18, 2024	Bayesian statistics: lecture videos on Statistical Rethinking v2 in Julia
Mar 15, 2024	7 Julia correctness issues are now fixed.

selected publications

Multi-omics analysis identifies different molecular subtypes with unique outcomes in early-stage poorly differentiated lung adenocarcinoma

B Liu, W Tao, X Zhou, and 14 more authors

Molecular Cancer, 2025

Introduction: Early-stage poorly differentiated lung adenocarcinoma (LUAD) is plagued by a high risk of postoperative recurrence, and its prognostic heterogeneity complicates treatment and surveillance planning. We conducted this integrative multi-omics study to identify those patients with a truly high risk of adverse outcomes. Methods: Whole-exome, RNA and whole methylome sequencing were carried out on 101 treatment-naïve early-stage poorly differentiated LUADs. Integrated analyses were conducted to disclose molecular characteristics and explore molecular subtyping. Functional validation of key molecules was carried out through in vitro and in vivo experiments. Results: Recurrent tumors exhibited significantly higher ploidy (p = 0.024), the fraction of the genome altered (FGA, p = 0.042), and aneuploidy (p < 0.05) compared to non-recurrent tumors, as well as a higher frequency of CNVs. Additionally, recurrent tumors showed hypomethylation at both the global level and in CpG island regions. Integrative transcriptomic and methylation analyses identified three molecular subtypes (C1, C2, and C3), with the C1 subtype presenting the worst prognosis (p = 0.024). Although frequently mutated genes showed similar mutation frequencies across the three subtypes, the C1 subtype exhibited the highest tumor mutation burden (TMB), mutant-allele tumor heterogeneity (MATH), aneuploidy, and HLA loss of heterozygosity (HLA-LOH), along with relatively lower immune cell infiltration. Furthermore, GINS1 and CPT1C were found to promote LUAD progression, and their high expression correlated with a poor prognosis. Conclusions: This multi-omics study identified three integrative subtypes with distinct prognostic implications, paving the way for more precise management and postoperative monitoring of early-stage poorly differentiated LUAD.
@article{Liu2025MultiomicsStrati, title = {Multi-omics analysis identifies different molecular subtypes with unique outcomes in early-stage poorly differentiated lung adenocarcinoma}, author = {Liu, B and Tao, W and Zhou, X and Xu, LD and Luo, Y and Yang, X and Min, Q and Huang, M and Zhu, Y and Cui, X and Wang, Y and Gong, T and Zhang, E and Huang, YS and Chen, W and Yan, S and Wu, N}, journal = {Molecular Cancer}, year = {2025}, doi = {10.1186/s12943-025-02333-7}, url = {https://doi.org/10.1186/s12943-025-02333-7}, dimensions = {true}, }
MinervaDelta
Quantitative and dynamic ctDNA as a biomarker of response and survival in patients with advanced lung squamous cell carcinoma receiving immunochemotherapy or chemotherapy alone

F Zhou, J Zhang, S Ren, and 20 more authors

Journal of Thoracic Oncology, 2025

Introduction: Although circulating tumor DNA (ctDNA) dynamics have been widely explored for therapeutic response assessment, standardized methodologies remain elusive. Here, we developed MinerVa-Delta, a novel approach to quantify ctDNA dynamics by calculating weighted mutation changes in samples with multiple tracked variants. Methods: MinerVa-Delta was developed and analytically validated using serially diluted reference samples. The optimal cutoff was determined in a discovery cohort of 227 patients with advanced lung squamous cell carcinoma (LUSC) receiving programmed cell death protein 1 blockade plus chemotherapy or chemotherapy alone and further validated in an independent cohort of 97 patients with LUSC treated with chemotherapy alone. Variants were de novo called in pretreatment samples using a 769-gene next-generation sequencing panel, serving as a basis for personalized variant tracking in posttreatment plasma after two cycles of treatment. We applied MinerVa-Delta to evaluate prognosis and therapeutic response in advanced LUSC. Results: Patients classified as molecular responders (MinerVa-Delta <30%) exhibited significantly improved outcomes compared with nonresponders (MinerVa-Delta ≥30%), with superior progression-free survival (hazard ratio = 0.19, p < 0.001) and overall survival (hazard ratio = 0.24, p < 0.001). MinerVa-Delta displayed consistent prediction performance in the validation cohort. Furthermore, MinerVa-Delta accurately identified radiologic stable disease patients, a clinically heterogeneous population, who could benefit from initial treatment. Conclusions: Our findings suggest MinerVa-Delta is feasible for evaluating treatment response in patients with advanced LUSC. Integrating ctDNA profiling with conventional imaging could enhance response assessment, particularly in radiologic stable disease patients, enabling more precise therapeutic decision-making.
@article{Zhou2025MinervaDelta, title = {Quantitative and dynamic ctDNA as a biomarker of response and survival in patients with advanced lung squamous cell carcinoma receiving immunochemotherapy or chemotherapy alone}, author = {Zhou, F and Zhang, J and Ren, S and Chen, J and Li, F and Ma, T and Fang, Y and Wang, Q and Yao, W and Guo, R and Lv, D and Cang, S and Dong, X and Wang, H and Yang, N and Fan, Y and Cui, J and Wang, Z and He, J and Huang, YS and Chen, W and Xu, LD and Zhou, C}, journal = {Journal of Thoracic Oncology}, year = {2025}, doi = {10.1016/j.jtho.2025.05.021}, url = {https://doi.org/10.1016/j.jtho.2025.05.021}, dimensions = {true}, }
plasmaMSI
A Systematic Method to Detect Next-Generation Sequencing–Based Microsatellite Instability in Plasma Cell-Free DNA: plasmaMSI

F Huang, L Zhao, H Xie, and 12 more authors

Journal of Molecular Diagnostics, 2025

Microsatellite instability (MSI) detection using tumor tissue is a well-established prognostic and predictive biomarker for certain types of cancers. However, tumor tissue samples are less convenient to obtain than blood plasma samples. The main challenge facing next-generation sequencing–based MSI detection in blood plasma samples is the ultralow signal/noise ratio in plasma cell-free DNA (cfDNA). To address the challenge, plasmaMSI, a highly accurate cfDNA MSI detection method, is introduced with three novel performance-improving features: i) a set of stringent locus selection criteria to select loci with high robustness and compatibility across sequencing platforms; ii) a new deduplication strategy that greatly improves the signal/noise ratio for MSI detection; and iii) an MSI calling algorithm that customizes the baseline for each test sample based on its duplication rate. Through analytical validation in diluted cell line samples, the limit of detection of plasmaMSI was determined to be 0.15%. Furthermore, in analyzing 95 evaluable cfDNA samples from patients with gastrointestinal cancers, plasmaMSI exhibited a positive percentage agreement of 92.9% (39/42) and a negative percentage agreement of 100% (53/53) with tissue MSI-PCR. plasmaMSI provides novel solutions to key challenges in cfDNA MSI detection that have not been addressed by existing methods. It has also been systematically validated and is already used in clinical testing for patients with cancer.
@article{Huang2025plasmaMSI, title = {A Systematic Method to Detect Next-Generation Sequencing–Based Microsatellite Instability in Plasma Cell-Free DNA: plasmaMSI}, author = {Huang, F and Zhao, L and Xie, H and Han, T and Huang, J and Wang, X and Yang, J and Hong, Y and Shu, J and Yu, J and Li, Q and He, J and Chen, W and Huang, YS and Li, W}, journal = {Journal of Molecular Diagnostics}, year = {2025}, doi = {10.1016/j.jmoldx.2024.10.002}, url = {https://doi.org/10.1016/j.jmoldx.2024.10.002}, dimensions = {true}, }
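
The agreement figures quoted in the plasmaMSI abstract follow directly from the reported counts. The sketch below only reproduces that arithmetic; the helper name `percent_agreement` is an illustrative assumption and is not part of plasmaMSI itself.

```python
def percent_agreement(concordant: int, total: int) -> float:
    """Return concordant/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * concordant / total, 1)


# Counts as quoted in the abstract, with tissue MSI-PCR as the comparator.
# PPA: tissue-positive samples that were also called positive in plasma.
ppa = percent_agreement(39, 42)
# NPA: tissue-negative samples that were also called negative in plasma.
npa = percent_agreement(53, 53)

print(f"PPA = {ppa}%, NPA = {npa}%")
```

Positive and negative percentage agreement are used here rather than sensitivity and specificity because tissue MSI-PCR is a comparator method, not a perfect ground truth.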
TOTEM
TOTEM: a multi-cancer detection and localization approach using circulating tumor DNA methylation markers

D Xiong, T Han, Y Li, and 7 more authors

BMC Cancer, 2024

Background: Detection of cancer and identification of tumor origin at an early stage improve the survival and prognosis of patients. Herein, we proposed a plasma cfDNA-based approach called TOTEM to detect and trace the cancer signal origin (CSO) through methylation markers. Methods: We performed enzymatic conversion-based targeted methylation sequencing on plasma cfDNA samples collected from a clinical cohort of 500 healthy controls and 733 cancer patients with seven types of cancer (breast, colorectum, esophagus, stomach, liver, lung, and pancreas) and randomly divided these samples into a training cohort and a testing cohort. An independent validation cohort of 143 healthy controls, 79 liver cancer patients and 100 stomach cancer patients were recruited to validate the generalizability of our approach. Results: A total of 57 multi-cancer diagnostic markers and 873 CSO markers were selected for model development. The binary diagnostic model achieved an area under the curve (AUC) of 0.907, 0.908 and 0.868 in the training, testing and independent validation cohorts, respectively. With a training specificity of 98%, the specificities in the testing and independent validation cohorts were 100% and 98.6%, respectively. Overall sensitivity across all cancer stages was 65.5%, 67.3% and 55.9% in the training, testing and independent validation cohorts, respectively. Early-stage (I and II) sensitivity was 50.3% and 45.7% in the training and testing cohorts, respectively. For cancer patients correctly identified by the binary classifier, the top 1 and top 2 CSO accuracies were 77.7% and 86.5% in the testing cohort (n=148) and 76.0% and 84.0% in the independent validation cohort (n=100). Notably, performance was maintained with only 21 diagnostic and 214 CSO markers, achieving a training AUC of 0.865, a testing AUC of 0.866, and an integrated top 2 accuracy of 83.1% in the testing cohort. 
Conclusions: TOTEM demonstrates promising potential for accurate multi-cancer detection and localization by profiling plasma methylation markers. The real-world clinical performance of our approach needs to be investigated in a much larger prospective cohort.
@article{XiongHan2024TOTEM, title = {TOTEM: a multi-cancer detection and localization approach using circulating tumor DNA methylation markers}, author = {Xiong, D and Han, T and Li, Y and Hong, Y and Li, S and Li, X and Tao, W and Huang, YS and Chen, W and Li, C}, journal = {BMC Cancer}, year = {2024}, doi = {10.1186/s12885-024-12626-7}, url = {https://doi.org/10.1186/s12885-024-12626-7}, dimensions = {true}, }
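
The TOTEM abstract reports top 1 and top 2 accuracies for the cancer signal origin (CSO). For readers unfamiliar with the metric, here is a minimal sketch of how top-k accuracy is usually computed. The function name, scores, and labels are made up for illustration and are not taken from the TOTEM model or its data.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical per-class scores for four samples over three tissue classes.
toy_scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.3, 0.5, 0.2],  # true class 0: ranked second
    [0.1, 0.2, 0.7],  # true class 2: ranked first
    [0.5, 0.4, 0.1],  # true class 2: ranked third
]
toy_labels = [0, 0, 2, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1 = {top1:.2f}, top-2 = {top2:.2f}")
```

Top 2 accuracy can never be lower than top 1 accuracy, which matches the pattern in the reported numbers.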
Multimodal analysis of cell-free DNA whole-methylome sequencing for cancer detection and localization

F Bie, Z Wang, Y Li, and 22 more authors

Nature Communications, 2023

Multimodal epigenetic characterization of cell-free DNA (cfDNA) could improve the performance of blood-based early cancer detection. However, integrative profiling of cfDNA methylome and fragmentome has been technologically challenging. Here, we adapt an enzyme-mediated methylation sequencing method for comprehensive analysis of genome-wide cfDNA methylation, fragmentation, and copy number alteration (CNA) characteristics for enhanced cancer detection. We apply this method to plasma samples of 497 healthy controls and 780 patients of seven cancer types and develop an ensemble classifier by incorporating methylation, fragmentation, and CNA features. In the test cohort, our approach achieves an area under the curve value of 0.966 for overall cancer detection. Detection sensitivity for early-stage patients achieves 73% at 99% specificity. Finally, we demonstrate the feasibility to accurately localize the origin of cancer signals with combined methylation and fragmentation profiling of tissue-specific accessible chromatin regions. Overall, this proof-of-concept study provides a technical platform to utilize multimodal cfDNA features for improved cancer detection.
@article{Bie2023THEMIS, title = {Multimodal analysis of cell-free DNA whole-methylome sequencing for cancer detection and localization}, author = {Bie, F and Wang, Z and Li, Y. and Guo, W. and Hong, Y. and Han, T. and Lv, F. and Yang, S. and Li, S. and Li, X. and Nie, P. and Xu, S. and Zang, R. and Zhang, M. and Song, P. and Feng, F. and Duan, J. and Bai, G. and Li, Y. and Huai, Q. and Zhou, B. and Huang, YS and Chen, W. and Tan, F. and Gao, S.}, journal = {Nature Communications}, year = {2023}, doi = {10.1038/s41467-023-41774-w}, url = {https://doi.org/10.1038/s41467-023-41774-w}, dimensions = {true}, }
eGADA: enhanced Genomic Alteration Detection Algorithm, a fast Sparse-Bayesian-Learning based genomic segmentation algorithm

YS Huang

bioRxiv, 2023

eGADA is an enhanced version of GADA, which is a fast segmentation algorithm utilizing the Sparse Bayesian Learning (or Relevance Vector Machine) technique from Tipping 2001. It can be applied to array intensity data, NGS sequencing coverage data, or any sequential data that displays characteristics of stepwise functions. Improvements by eGADA over GADA include: a) a customized Red-Black tree to significantly expedite the final backward elimination step of GADA; b) code in C++, which is safer and better structured than C; c) use Boost libraries extensively to provide user-friendly help and commandline argument processing; d) user-friendly input and output formats; e) export a dynamic library eGADA.so (packaged via Boost.Python) that offers API to Python; f) some bug fixes/optimization. The code is published at https://github.com/polyactis/eGADA.
@article{Huang2023eGADA, title = {eGADA: enhanced Genomic Alteration Detection Algorithm, a fast Sparse-Bayesian-Learning based genomic segmentation algorithm}, author = {Huang, YS}, journal = {bioRxiv}, year = {2023}, doi = {10.1101/2023.08.20.553622}, url = {https://doi.org/10.1101/2023.08.20.553622}, dimensions = {true}, }
Deffini
Deffini: A family-specific deep neural network model for structure-based virtual screening

D Zhou, F Liu, Y Zheng, and 3 more authors

Computers in Biology and Medicine, 2022

Deep learning-based virtual screening methods have been shown to significantly improve the accuracy of traditional docking-based virtual screening methods. In this paper, we developed Deffini, a structure-based virtual screening neural network model. During training, Deffini learns protein-ligand docking poses to distinguish actives and decoys and then to predict whether a new ligand will bind to the protein target. Deffini outperformed Smina with an average AUC ROC of 0.92 and AUC PRC of 0.44 in 3-fold cross-validation on the benchmark dataset DUD-E. However, when tested on the maximum unbiased validation (MUV) dataset, Deffini achieved poor results with an average AUC ROC of 0.517. We used the family-specific training approach to train the model to improve the model performance and concluded that family-specific models performed better than the pan-family models. To explore the limits of the predictive power of the family-specific models, we constructed Kernie, a new protein kinase dataset consisting of 358 kinases. Deffini trained with the Kernie dataset outperformed all recent benchmarks on the MUV kinases, with an average AUC ROC of 0.745, which highlights the importance of quality datasets in improving the performance of deep neural network models and the importance of using family-specific models.
@article{Zhou2022Deffini, title = {Deffini: A family-specific deep neural network model for structure-based virtual screening}, author = {Zhou, D and Liu, F and Zheng, Y and Hu, L and Huang, T and Huang, YS}, journal = {Computers in Biology and Medicine}, year = {2022}, doi = {10.1016/j.compbiomed.2022.106323}, url = {https://doi.org/10.1016/j.compbiomed.2022.106323}, dimensions = {true}, }
Accucopy
Accucopy: Accurate and Fast Inference of Allele-specific Copy Number Alterations from Low-coverage Low-purity Tumor Sequencing Data

X Fan, G Luo, and YS Huang

BMC Bioinformatics, 2021

BACKGROUND: Copy number alterations (CNAs), due to their large impact on the genome, have been an important contributing factor to oncogenesis and metastasis. Detecting genomic alterations from the shallow-sequencing data of a low-purity tumor sample remains a challenging task. RESULTS: We introduce Accucopy, a method to infer total copy numbers (TCNs) and allele-specific copy numbers (ASCNs) from challenging low-purity and low-coverage tumor samples. Accucopy adopts many robust statistical techniques such as kernel smoothing of coverage differentiation information to discern signals from noise and combines ideas from time-series analysis and the signal-processing field to derive a range of estimates for the period in a histogram of coverage differentiation information. Statistical learning models such as the tiered Gaussian mixture model, the expectation–maximization algorithm, and sparse Bayesian learning were customized and built into the model. Accucopy is implemented in C++ /Rust, packaged in a docker image, and supports non-human samples, more at http://www.yfish.org/software/. CONCLUSIONS: We describe Accucopy, a method that can predict both TCNs and ASCNs from low-coverage low-purity tumor sequencing data. Through comparative analyses in both simulated and real-sequencing samples, we demonstrate that Accucopy is more accurate than Sclust, ABSOLUTE, and Sequenza.
@article{Fan2021Accucopy, title = {Accucopy: Accurate and Fast Inference of Allele-specific Copy Number Alterations from Low-coverage Low-purity Tumor Sequencing Data}, author = {Fan, X and Luo, G and Huang, YS}, journal = {BMC Bioinformatics}, doi = {10.1186/s12859-020-03924-5}, url = {https://doi.org/10.1186/s12859-020-03924-5}, year = {2021}, publisher = {BMC}, dimensions = {true}, }