Overview
DeNSAS (De Novo Sequence Annotation System) is a comprehensive, reference-free pipeline designed for functional annotation of transcripts, proteins, and genes in genome assemblies that lack reference annotation. Developed in-house, this automated system integrates multiple public databases and computational tools to provide accurate functional characterization of sequences without requiring a reference genome.
Core Functionality
DeNSAS employs a 6-step process using custom Perl scripts and locally built databases to provide:
- Sequence Description Assignment: Integrating results from multiple database searches to assign comprehensive functional descriptions to sequences
- Gene Ontology (GO) Annotation: Assignment of GO terms for biological process, molecular function, and cellular component
- Protein Domain Identification: Recognition of conserved protein domains using PFAM
- RNA Family Classification: Identification of RNA families using RFAM
- Protease Classification: Specialized annotation of peptidases using MEROPS
- Gene Family Assignment: Characterization of gene families based on sequence similarity
Key Features
- Reference-Free Annotation: Functions independently without requiring a reference genome or annotation
- Integrative Approach: Combines data from multiple databases including PFAM, RFAM, UniProt, InterPro, MEROPS, and RefSeq
- Novel Gene Identification: Capable of characterizing potentially novel genes and duplication events
- Comprehensive Output: Provides detailed annotation including:
- Full gene names and unique IDs (when close matches exist)
- Protein domains and motifs
- Gene family assignments
- GO term classifications
- RNA family patterns
Advantages Over Existing Tools
- Unlike reference-based tools (e.g., OrthoFinder, CRB-Blast), DeNSAS does not require an existing reference strain to guide annotation
- Compared to other reference-free annotation pipelines (e.g., PANNZER2), DeNSAS offers improved accuracy through its integrative nature and multi-database approach
- Custom filtering thresholds ensure high-quality annotations with minimal false positives
Technical Implementation
DeNSAS employs stringent similarity thresholds:
- BLAST searches use an e-value cut-off of 10e⁻⁵ and HSP similarity threshold of 80%
- Sequence descriptions employ more restrictive thresholds (e-value 1e-10 and 90% similarity)
- RNA and protein family identification filtered by Expectation Value (1e-10)
- Transparent labeling of sequences with insufficient database matches as "UNKNOWN DESCRIPTION"
Applications
- Annotation of non-model organisms
- Characterization of newly sequenced genomes
- Identification of novel genes and gene families
- Comparative genomic analyses
- Functional studies of transcriptomes
Releases
- V2.3-beta.4 - Almost good (Pre-release)