Searching for genes in bacterial genomes

Do you have a gene (or several) that you want to get out of a genome (or several)? This guide will explain some common ways you can search genomes for genes.

Preflight checks

Installing Mamba

Mamba is a fast, drop-in replacement for conda that significantly speeds up package installation and dependency resolution.

Type which mamba to check if you have mamba installed. If not, check if you have conda with which conda

Install Mamba into existing conda

If you already have conda installed:

conda install -n base -c conda-forge mamba

If no conda - Install Miniforge for scratch

Miniforge comes with mamba pre-installed:

# Download and install Miniforge
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

After installing Miniforge, you should type conda init and then log out and back into the shell.

Installing Required Bioinformatics Tools

Once you are sure you have mamba installed (which mamba), install the necessary tools for this searching.

# Install all tools in one command
mamba create --name=gene_search -c conda-forge -c bioconda mmseqs2 hmmer seqkit mafft fasttree blast

Verify Installation

conda activate gene_search
mmseqs version
hmmbuild -h
seqkit version
mafft --version
fasttree -help
rpsblast -version

Preparing Your Reference Database

What genomes are you searching? Before searching for genes, you need to prepare a combined reference database from your genome files. If you are searching just a single genome, you can proceed to the next section.

Combine Genome Protein Files

# Combine all .faa files from your genome database into a single file
cat input/genome_database/*/*.faa > genome_database/concatenated_orfs.faa

# Verify the combined file
grep -c "^>" genome_database/concatenated_orfs.faa  # Count number of sequences

Alternative: Combine with Custom Glob Pattern

# If your files are in a different location or pattern
cat path/to/genomes/**/*.faa > concatenated_orfs.faa

# Or combine specific genome files
cat genome1.faa genome2.faa genome3.faa > concatenated_orfs.faa

Note: Ensure sequence headers are unique across all input files to avoid ambiguity in search results.

Getting Your Query Gene

Step 1: Find Your Taxon and Annotation

Go to one of these resources to find your taxon of interest: - https://www.ebi.ac.uk/interpro/taxonomy/uniprot/
- https://www.kegg.jp/kegg/tables/br08606.html

Then search within that taxon for your annotation of interest.

Step 2: Decide on Search Strategy

Question: Am I searching broadly for this protein coding gene or specifically?

Specifically: Search the exact sequence - best for closely related genomes with high sequence similarity
By orthologous group: Use pre-built PSSM profiles from NCBI’s COG database - best when searching for functional orthologs across different species
Broadly: Get the UniRef50 cluster for that gene and generate an alignment and HMM of all sequences in the cluster - best for capturing diverse homologs while maintaining some specificity

Specific Search Approach (Sequence-based)

1. Get the UniRef90 Representative Sequence

You can search for the exact sequence you are interested in. Or, it could be more reproducible to search the UniRef90 representative sequence of your gene of interest:

If going the Uniref90 route, you can search for your sequence at https://www.uniprot.org/blast and get the uniref90 cluster for that sequence.

Or if you got your sequence from kegg.jp, you can click on the uniprot cross reference, then on “similar proteins” then on the Uniref_90 cluster.

Then, download the uniref_90 representative sequence.

# Replace UniRef90_D5H0P4 with your cluster ID from Uniprot
curl "https://rest.uniprot.org/uniref/UniRef90_D5H0P4.fasta" -o UniRef90_D5H0P4.fasta

2. Search Against Genome Database Using MMseqs2

mmseqs easy-search \
  input/UniRef90_D5H0P4.faa \
  input/genome_database/concatenated_orfs.faa \
  UniRef90_D5H0P4_hits.tsv \
  -e 1e-20 \
  tmp \
  --format-output "query,target,pident,alnlen,evalue" \
  --format-mode 4

Domain-based Search Approach (Using NCBI COG PSSMs)

This approach uses pre-built Position-Specific Scoring Matrices (PSSMs) from NCBI’s COG (Clusters of Orthologous Genes) database.

Note: COGs can be very broad. Make sure the COG you are looking for is specific enough to the function you are interested in. For reference, there are fewer than 5,000 COGs in the 2024 database.

1. Download the COG Database

# Download the COG database
update_blastdb.pl --decompress Cog

2. Identify Your COG of Interest

Find the Uniprot protein ID and search for it on http://eggnog5.embl.de

Note the COG ID (e.g., COG0001, COG4716) for reference.

3. Search Against Genome Database Using RPS-BLAST

# Search proteins against the COG database
rpsblast \
  -query genome_database/concatenated_orfs.faa \
  -db Cog \
  -out COG_hits.txt \
  -evalue 1e-5 \
  -outfmt "6 qseqid sseqid pident length evalue bitscore stitle"

4. Filter for Your Specific COG

# Extract hits for a specific COG (e.g., COG0001)
grep "COG0001" COG_hits.txt > COG0001_hits.txt

# Extract sequence IDs
cut -f1 COG0001_hits.txt | sort -u > cog_hit_ids.txt

Broad Search Approach (HMM-based)

1. Get Query Sequences from UniRef50 Cluster

# Replace UniRef50_C2EQ21 with your cluster ID
curl "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=uniref_cluster_50:UniRef50_C2EQ21" \
  > UniRef50_C2EQ21_all_members.fasta

2. Align the Sequences

mafft UniRef50_C2EQ21_all_members.fasta > UniRef50_C2EQ21_aligned.fasta

3. Build the HMM Profile

hmmbuild UniRef50_C2EQ21.hmm UniRef50_C2EQ21_aligned.fasta

4. Search Against Genome Database Using hmmsearch

hmmsearch \
  --tblout UniRef50_C2EQ21_hits.tbl \
  -E 1e-10 \
  UniRef50_C2EQ21.hmm \
  genome_database/concatenated_orfs.faa

Example Downstream Analysis

Once you have your table of gene hits, you might do a lot of different things with them.

One example would be to align the gene hits together and make a tree.

To do this we’ll parse the gene hit table to get the hit ids, then use seqkit to get the sequences from the reference database we searched. After we have the sequences we’ll align with mafft, and then make a tree with fasttree which can be viewed on ITOL.

1. Extract Hit IDs

For MMseqs2 results:

cut -f2 UniRef90_D5H0P4_hits.tsv | sort -u > hit_ids.txt

For RPS-BLAST results:

# Already extracted above as cog_hit_ids.txt
# Or extract from the full results:
cut -f1 COG_hits.txt | sort -u > hit_ids.txt

For HMM search results:

# Skip header lines and extract target names
grep -v '^#' UniRef50_C2EQ21_hits.tbl | awk '{print $1}' | sort -u > hit_ids.txt

2. Retrieve Hit Sequences

To add the filename (strain ID) to each sequence header:

for file in genome_database/*/*.faa; do
  seqkit grep -f hit_ids.txt "$file" | \
    seqkit replace -p '^' -r "$(basename ${file%.faa})_"
done > hit_sequences.faa

Or without filename prefix:

seqkit grep -f hit_ids.txt genome_database/*/*.faa > hit_sequences.faa

3. Align Hit Sequencess

Optional: Consider adding reference sequences from UniRef90 before alignment

mafft hit_sequences.faa > aligned_hit_sequences.faa

4. Build Phylogenetic Tree

fasttree aligned_hit_sequences.faa > hit_sequences.tree

Notes

E-value Thresholds

MMseqs2: -e 1e-20 is stringent for specific searches
hmmsearch: -E 1e-10 is appropriate for HMM profiles
RPS-BLAST: -evalue 1e-5 is standard for domain searches
Adjust thresholds based on your specific use case and desired sensitivity/specificity

Performance Tips

The tmp directory in MMseqs2 commands is used for temporary files and will be created automatically
For large datasets, use MAFFT’s --thread option to specify multiple CPU cores
The COG database requires ~500MB of disk space

COG Database Updates

The COG database was updated in 2024, increasing coverage from 1,309 to 2,296 species
Use update_blastdb.pl --showall to see all available databases

Phylogenetic Analysis

FastTree produces an approximately-maximum-likelihood tree quickly
For more accurate phylogenies, consider using RAxML or IQ-TREE