The Open Science Revolution in Complex Disease Genetics: An Integrated Pipeline from FASTQ to GWAS and Functional Pleiotropy

Kaira Cristina Peralis Tomaz¹, Felipe Ciamponi², Rafaela Pacheco¹, Jennifer Santos¹, Mariana Cavalheiro³, Fabio Patroni¹,⁴, Julio Vancini Bernardi¹,⁵, Murilo Meneghetti¹, Lorena⁶, Alexandre Rossi Paschoal⁶, Marcelo Mendes Brandão¹*.

1 – Integrative and System Biology Laboratory (LaBIS); Universidade Estadual de Campinas (Unicamp), Campinas, Brazil; 2 – Suzano S.A. (FuturaGene—Biotech Division), Itapetininga, Brazil; 3 – Genomics for Climate Change Research Center, Universidade Estadual de Campinas, Campinas, SP, Brazil; 4 – Brazilian Centre for Research in Energy and Materials (CNPEM), Ministry for Science, Technology, and Innovations (MCTI), Campinas, Brazil; 5 – Laboratory of Enzymology and Molecular Biology of Microorganisms (LEBIMO), Universidade Estadual de Campinas, Campinas, SP, Brazil; 6 – Department of Computer Science, The Federal University of Technology – Paraná (UTFPR), Cornélio Procópio, Brazil

* Corresponding author

Polygenic diseases, such as Alzheimer's disease (AD) and type 2 diabetes (T2D), present complex challenges in medical genetics because of their non-Mendelian inheritance patterns, which involve multiple alleles and environmental factors. The global incidence of AD is increasing, and diabetes contributes to a growing healthcare burden. Cognitive dysfunction has recently been recognised as a significant comorbidity of diabetes, which points to a potential link between AD and diabetes and to possible shared genetic factors. Advances in identifying specific genetic variants and understanding their interactions are paving the way for personalised medicine, thereby enhancing treatment effectiveness.
Bioinformatics analysis of genomic data offers valuable insights into the genetic foundations of AD and diabetes, facilitating the development of targeted interventions. The genome-wide association study (GWAS) is an essential and well-established bioinformatics approach for analysing genetic associations with phenotypes. However, despite the abundance of publicly available data, bioinformatics analysis remains a bottleneck in biological studies.
Bioinformatics workflows also remain fragmented, which limits translational insight. Here, we present a fully open-source pipeline that streamlines polygenic disease analysis from raw sequencing data (FASTQ) to functional annotation. The pipeline addresses key challenges in reproducibility and scalability by combining validated methods for variant calling, GWAS, pleiotropy analysis, and regulatory annotation.
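As a rough illustration of what the pleiotropy step looks for, the sketch below flags variants that pass the conventional genome-wide significance threshold (p < 5e-8) in two traits. It is not the pipeline's own method: the function name `shared_hits`, the variant IDs, and the p-values are hypothetical, and a real analysis would rely on dedicated statistical methods rather than a simple intersection of significant hits.

```python
# Illustrative sketch only: variant IDs and p-values are made up,
# and shared_hits is a hypothetical helper, not part of the pipeline.

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def shared_hits(trait_a, trait_b, threshold=GENOME_WIDE_P):
    """Return variant IDs significant in both traits, sorted by ID.

    trait_a and trait_b map variant ID -> GWAS p-value.
    A variant missing from trait_b is treated as not significant.
    """
    return sorted(
        variant
        for variant, p_a in trait_a.items()
        if p_a < threshold and trait_b.get(variant, 1.0) < threshold
    )


# Hypothetical summary statistics for the two traits.
ad_pvalues = {"var1": 1e-9, "var2": 3e-4, "var3": 2e-10}
t2d_pvalues = {"var1": 4e-8, "var3": 0.2, "var4": 1e-12}

print(shared_hits(ad_pvalues, t2d_pvalues))
```

Variants returned by such a filter are only candidates for pleiotropy; shared significance alone does not show that the same causal variant drives both traits.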
We also demonstrate how standardised, ethically curated datasets enable robust analyses of shared genetic mechanisms by leveraging the NIH’s Database of Genotypes and Phenotypes (dbGaP), a foundational open-science repository for genotype-phenotype studies. dbGaP’s dual-access model (open metadata vs. controlled individual-level data) allowed us to harmonise diverse cohorts while adhering to ethical guidelines, exemplifying how open data infrastructures can accelerate discoveries in comorbidities such as AD and T2D.
Our pipeline's modular design enables researchers to bypass costly data generation phases and focus on hypothesis-driven exploration, democratising access to high-impact genomics. By aligning with open-science principles, this work mirrors transformative initiatives like the Human Genome Project, where shared data and tools spurred global collaboration. The integration of dbGaP datasets highlights the untapped potential of public repositories to fuel large-scale, reproducible studies, particularly in under-resourced settings. This work advances genetic research and underscores the need for open-source, community-focused science in genomics, encouraging collaboration across fields and revealing shared genetic connections between complex diseases.

Supplementary material

All supplementary material will be available at http://redu.unicamp.br