Kaira Cristina Peralis Tomaz¹, Felipe Ciamponi², Rafaela Pacheco¹, Jennifer Santos¹, Mariana Cavalheiro³, Fabio Patroni¹,⁴, Julio Vancini Bernardi¹,⁵, Murilo Meneghetti¹, Lorena⁶, Alexandre Rossi Paschoal⁶, Marcelo Mendes Brandão¹*
1 – Integrative and System Biology Laboratory (LaBIS), Universidade Estadual de Campinas (Unicamp), Campinas, Brazil; 2 – Suzano S.A. (FuturaGene – Biotech Division), Itapetininga, Brazil; 3 – Genomics for Climate Change Research Center, Universidade Estadual de Campinas, Campinas, SP, Brazil; 4 – Brazilian Centre for Research in Energy and Materials (CNPEM), Ministry for Science, Technology, and Innovations (MCTI), Campinas, Brazil; 5 – Laboratory of Enzymology and Molecular Biology of Microorganisms (LEBIMO), Universidade Estadual de Campinas, Campinas, SP, Brazil; 6 – Department of Computer Science, The Federal University of Technology – Paraná (UTFPR), Cornélio Procópio, Brazil
* Corresponding author:
Background:
Polygenic diseases, such as Alzheimer's disease (AD) and type 2 diabetes mellitus (T2D), arise from the cumulative effect of numerous genetic variants and complex gene-environment interactions, presenting significant challenges for genetic analysis and clinical management. Both AD and T2D are increasing in prevalence globally and are hypothesized to share genetic risk factors, yet the underlying mechanisms remain poorly understood.
Methods:
This study developed an open-source bioinformatics pipeline to investigate shared genetic architecture (pleiotropy) between AD and T2D using whole-exome sequencing data from the dbGaP repository. Three alignment tools (Bowtie2, BWA-MEM, BWA-MEM2) and four variant callers (GATK, BCFtools, DeepVariant, FreeBayes) were benchmarked for computational efficiency and variant detection. Genome-wide association studies (GWAS) were performed for both diseases, followed by intersection analysis to identify pleiotropic single nucleotide polymorphisms (SNPs). Variants were annotated and filtered for clinical relevance and evolutionary conservation. Functional enrichment analyses were conducted using Gene Ontology (GO) clustering via REVIGO.
Results:
BWA-MEM2 and DeepVariant demonstrated optimal computational performance for alignment and variant calling, respectively. GWAS identified 1,264 nominally significant pleiotropic SNPs (p < 0.05 in both diseases), which were filtered to 89 high-confidence variants enriched for missense and intronic effects. Functional annotation revealed shared pathways involving carbohydrate metabolism, extracellular matrix organization, and steroid metabolism, supporting the hypothesis of common metabolic and neurovascular mechanisms underlying AD-T2D comorbidity. All scripts, datasets, and results are openly available to promote reproducibility and collaborative research.
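The intersection step behind the pleiotropic SNP set can be illustrated with a minimal sketch. It is not taken from the pipeline repository: the function name `pleiotropic_snps`, the dictionary-based input format, and the toy rsIDs and p-values are all assumptions made for illustration. Only the nominal threshold (p < 0.05 in both diseases) comes from the text.

```python
def pleiotropic_snps(gwas_a, gwas_b, alpha=0.05):
    """Return the sorted SNP IDs with p < alpha in BOTH result sets.

    gwas_a, gwas_b: dicts mapping SNP ID -> association p-value
    (a hypothetical input format, not the pipeline's actual one).
    """
    # Keep only the SNPs that pass the nominal threshold in each study.
    sig_a = {snp for snp, p in gwas_a.items() if p < alpha}
    sig_b = {snp for snp, p in gwas_b.items() if p < alpha}
    # The set intersection gives the candidate pleiotropic SNPs.
    return sorted(sig_a & sig_b)


# Toy p-values, invented purely for illustration.
ad_results = {"rs0001": 0.010, "rs0002": 0.200, "rs0003": 0.040}
t2d_results = {"rs0001": 0.030, "rs0002": 0.001, "rs0003": 0.049, "rs0004": 0.002}

shared = pleiotropic_snps(ad_results, t2d_results)
print(shared)
```

In this toy input, `rs0002` is dropped because it passes the threshold only for T2D, and `rs0004` is dropped because it is absent from the AD results. The subsequent filtering for clinical relevance and evolutionary conservation is not modeled here.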
Conclusions:
The open-source pipeline enables scalable, reproducible analysis of polygenic disease genetics and highlights shared biological processes between AD and T2D. This approach demonstrates the value of open science frameworks in accelerating discoveries and facilitating the development of targeted interventions for complex diseases. Limitations include potential ethnic bias and computational resource requirements, which future work may address through federated learning and expanded population datasets.
Supplementary material
All supplementary material is available at https://doi.org/10.25824/redu/5MNXME
Scripts availability
The pipeline scripts are available at https://github.com/labis-unicamp/pipeline1
Permanent link
https://url.bioinfoguy.net/pipeline1paper