Benchmarking Combinations of different Polishing Tools to Optimise Hybrid Genome Assembly Quality

Sunday. March 31, 2024 - 13 mins

Bangalore, India 2024

I was tasked with improving the quality and accuracy of our hybrid assemblies. To achieve this, I studied how different polishing models integrated into the pipeline worked and developed a suite of scripts to automate the evaluation of different combinations. These scripts compare assemblies based on key metrics, allowing us to optimize our workflow and ensure the reliability of our genomic data.

Objective The purpose of this analysis is to systematically evaluate the impact of different polishing model combinations on the quality and completeness of hybrid genome assemblies constructed from both Illumina and Nanopore reads, generated by the Dragonflye pipeline. By employing various combinations of polishing models such as Racon, Medaka, Pilon, and Polypolish, we aim to identify the most effective strategy for enhancing assembly accuracy, contiguity, and overall quality. This analysis is crucial for optimizing the workflow, and ensuring the robustness of genome assemblies.

Polishing models

Racon: Corrects base errors and improves consensus accuracy in reads obtained from initial draft assemblies without a consensus step. (Canu v2.2 Manual: https://github.com/marbl/canu)
Medaka: Improves base calling accuracy and reduces insertion/deletions (indels) in Nanopore reads by using Neural networks. (Oxford Nanopore Technologies: https://github.com/nanoporetech/medaka)
Pilon: Corrects indels and improves assembly continuity by aligning Illumina reads to the draft assembly for further error correction. (Walker et al. (2014). Pilon: An integrative tool for reference genome improvement. PLoS ONE 9(11): e112963. (DOI: 10.1371/journal.pone.0112963))
Polypolish: Polishes contigs from hybrid assemblies by using evidence from both Illumina and Nanopore reads to improve base quality and potentially resolve complex regions. (Wick et al. (2017) [DOI: 10.1093/bioinformatics/btx516])

Equipment and Software

Dragonflye pipeline (version 1.1.1) Racon (version 1.5.0) Medaka (version 1.8.0) Pilon (version 1.24) Polypolish (version v0.5.0) QUAST (version 5.0.2) BUSCO (version 5.5.0)

Assembly and Polishing Procedure with Dragonflye, Metrics Collection, and Analysis

1. Sample Preparation:

Ensure a consistent set of FastQ files for both Illumina and Nanopore reads for all polishing model combinations.

2. Assembly Generation:

Installing Dragonflye from Bioconda:

# With conda
conda create -n dragonflye -c conda-forge -c bioconda dragonflye

# With Mamba (faster)
mamba create -n dragonflye -c conda-forge -c bioconda dragonflye

3. Running the Dragonflye pipeline:

Use the following script to run Dragonflye with multiple polishing combinations:

#!/bin/bash

# Script for Running Dragonflye Assembly Pipeline

# Define variables
IDS_FILE="ids"
BASE_DIR="/mnt/d/assembly_evaluation_cgps/fastqs/KPN"
G_SIZE="5.5M"
CPUS="16"
RAM="15"
MODEL="r941_min_sup_g507"

# Function to execute Dragonflye command
run_dragonflye() {
   local command_num=$1
   local out_dir="$BASE_DIR/dragonflye_out_$command_num"

   echo "Executing Command $command_num"

   for f in $(cat $IDS_FILE); do
       dragonflye --trim --gsize $G_SIZE --reads ${f}*-ont.fastq.gz \
           --outdir "$out_dir/$f" --prefix $f --R1 "$f"*-R1.fastq.gz \
           --R2 "$f"*-R2.fastq.gz --cpus $CPUS --ram $RAM $2
   done

   # Check the exit status of the Dragonflye command
   if [ $? -eq 0 ]; then
       echo "Command $command_num succeeded"
   else
       echo "Command $command_num failed, skipping remaining commands"
       exit 1
   fi
}

# Command 1
run_dragonflye 1 "--racon 1 --medaka 1 --pilon 1 --polypolish 1"

# Command 2
run_dragonflye 2 "--racon 1 --polypolish 1"

# Command 3
run_dragonflye 3 "--racon 1"

# Command 4
run_dragonflye 4 "--racon 1 --medaka 1 --model $MODEL --pilon 1"

# Command 5
run_dragonflye 5 "--medaka 1 --model $MODEL --pilon 1 --polypolish 1"

echo "All Dragonflye commands executed successfully"

3. Polishing:

Apply Racon, Medaka, Pilon, and Polypolish rounds according to your analysis needs.

4. Metrics Collection:

For each assembly, report the following metrics:

Complete and single-copy BUSCOs (%)
GC content (%)
N50
Depth
Total number of base pairs
Average contig length
Number of contigs
Number of circular contigs
Average depth of contigs

Data Analysis: The following scripts can be used to automate the capturing of these metrics from the resulting assemblies obtained by different polishing model combinations.

1. Complete and single-copy BUSCOs (%)

Install BUSCO using Conda or Docker

# With Mamba
mamba install busco
# or update with
mamba update busco

Alternatively, use the docker container:

  docker pull quay.io/biocontainers/busco:<tag>

(see `busco/tags`_ for valid values for ``<tag>``)

Running BUSCO in a loop on multiple FASTA files:

# Example:
for f in */*.fa; do busco -i $f -o $f --out_path /path/to/busco_output -m geno -c 45 -l /path/to/lineages/pseudomonadales_odb10/; done

Summarizing BUSCO results

Parse multiple BUSCO output text files and append the results to a combined form.
Execute the script: busco_summarise_results.py
This script captures required metrics from BUSCO summary files generated by the BUSCO tool. It parses the summary files and outputs the results to a CSV file.

import sys
import os
import re

def parse_busco_summary(file_path):
   with open(file_path, 'r') as file:
       lines = file.readlines()
   species_name = os.path.basename(file_path).replace('short_summary.specific.enterobacterales_odb10.', '').replace('.txt', '')
   summary_line = next(line for line in lines if 'C:' in line)
   complete_percentage = float(re.search(r'C:(\d+\.\d+)%', summary_line).group(1))
   single_percentage = float(re.search(r'S:(\d+\.\d+)%', summary_line).group(1))
   return [species_name, complete_percentage, single_percentage]

def main():
   if len(sys.argv) != 2:
       print("Usage: python busco_summarise_results.py /path/to/busco_summaries/")
       sys.exit(1)
   busco_directory = sys.argv[1]
   output_csv = os.path.join(busco_directory, "output_new.csv")
   busco_files = [os.path.join(busco_directory, file) for file in os.listdir(busco_directory) if file.endswith(".txt")]
   data = [parse_busco_summary(file) for file in busco_files]

   with open(output_csv, 'w') as csv_file:
       csv_file.write("Species,Complete_Percentage,Single_Percentage\n")
       for species, c_percentage, s_percentage in data:
           csv_file.write(f"{species},{c_percentage},{s_percentage}\n")

if __name__ == "__main__":
   main()

Compare and analyze the metrics for each assembly to identify trends and variations in performance.
Utilize graphical representations for a comprehensive analysis by generating plots to summarise the results https://gitlab.com/ezlab/busco/-/blob/master/scripts/generate_plot.py?ref_type=heads

2. GC content (%) & N50

Install QUAST
Given that you already have a conda environment in which you want to have this package, install with:
```
mamba install quast
```

and update with::

   mamba update quast

To create a new environment, run:
```
mamba create --name myenvname quast
```
Alternatively, use the docker container:
```
docker pull quay.io/biocontainers/quast:<tag>
```
(see quast/tags_ for valid values for <tag>)

Example of running QUAST on your assembled fasta files

 quast */*.fa --threads n -o /mnt/d/assembly_evaluation_cgps/fastqs/ABA/dragonflye_out_1/QUAST

3. Estimated Sequencing Depth:

extract_depth_from_logs.py
This script extracts estimated sequencing depth and total number of base pairs from Dragonflye log files. Usage Replace ‘your_parent_directory’ with the actual path to the parent directory containing subdirectories. Replace ‘depth_out.csv’ with the desired output CSV file name.
Execute the script: python extract_depth_from_logs.py ```python import os import csv import re import glob

def extract_values_from_log(log_file_path): with open(log_file_path, ‘r’) as log_file: content = log_file.read()

   # Extracting values using regular expressions
   total_bp_match = re.search(r'\[dragonflye\] Read stats: total_bp = (\d+)', content)
   depth_match = re.search(r'\[dragonflye\] Estimated sequencing depth: (\d+)x', content)
  
   if total_bp_match and depth_match:
       total_bp = total_bp_match.group(1)
       sequencing_depth = depth_match.group(1)
      
       return total_bp, sequencing_depth
   else:
       return None, None

def process_directory(directory_path, output_csv_path): with open(output_csv_path, ‘w’, newline=’’) as csvfile: csv_writer = csv.writer(csvfile)

   # Write header to the CSV file
   csv_writer.writerow(['Directory', 'Total_BP', 'Sequencing_Depth'])
  
   # Iterate through directories
   for root, dirs, files in os.walk(directory_path):
       for d in dirs:
           log_file_path = os.path.join(root, d, 'dragonflye_out_1', d, '*.log')
           log_file_matches = glob.glob(log_file_path)
          
           if log_file_matches:
               log_file_path = log_file_matches[0]  # Assuming there's only one matching log file
              
               total_bp, sequencing_depth = extract_values_from_log(log_file_path)
              
               if total_bp is not None and sequencing_depth is not None:
                   csv_writer.writerow([d, total_bp, sequencing_depth])

Replace ‘your_directory_path’ with the actual path to the parent directory containing subdirectories

directory_path = ‘/path/to/your/parent_directory’

Replace ‘output.csv’ with the desired output CSV file name

output_csv_path = ‘/path/to/your/output_directory/depth_out.csv’

process_directory(directory_path, output_csv_path)

**4. Average contig length**
- calculate_contig_avg_length.sh
- This script extracts contig length from an assembled FASTA file and calculates the average length for all contigs. Usage ./calculate_contig_avg_length.sh <input_fasta_file> Replace <input_fasta_file> with the path to your assembled FASTA file.
```bash
#!/bin/bash

# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
   echo "Usage: $0 <input_fasta_file>"
   exit 1
fi

# Input FASTA file
fasta_file="$1"

# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
   echo "Error: File $fasta_file not found."
   exit 1
fi

# Extract file name without extension
file_prefix=$(basename "$fasta_file" .fasta)

# Create a directory named "length" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/length"
mkdir -p "$output_dir"

# Output CSV file for mean len in the "length" directory
csv_len_file="$output_dir/${file_prefix}_mean_len.csv"

# Extract len= values and calculate mean
awk '/^>/ { len_sum += gensub(/^.*len=([0-9]+).*$/, "\\1", "g", $0); len_count++; next } END { if (len_count > 0) print len_sum / len_count }' "$fasta_file" > "$csv_len_file"

echo "Mean of len values calculated. Result saved in $csv_len_file"

5. Average contig depth

calculate_contig_avg_depth.sh
This script extracts contig depth from an assembled FASTA file and calculates the average depth for all contigs. Usage ./calculate_contig_avg_depth.sh Replace with your assembled FASTA file path. ```bash #!/bin/bash

Check for the correct number of command-line arguments

if [ “$#” -ne 1 ]; then echo “Usage: $0 " exit 1 fi

Input FASTA file

fasta_file=”$1”

Check if the input file exists

if [ ! -f “$fasta_file” ]; then echo “Error: File $fasta_file not found.” exit 1 fi

Extract file name without extension

file_prefix=$(basename “$fasta_file” .fasta)

Output CSV file in the same directory as the input file

csv_file=”$(dirname “$fasta_file”)/${file_prefix}_mean_cov.csv”

Extract cov= values and calculate mean

awk ‘/^>/ { sum += gensub(/^.cov=([0-9.]+).$/, “\1”, “g”, $0); count++; next } END { if (count > 0) print sum / count }’ “$fasta_file” > “$csv_file”

echo “Mean of cov values calculated. Result saved in $csv_file”

**6. Number of contigs and circular contigs**
- extract_contig_stats.sh 
- This script extracts the number of contigs and the number of circular contigs from a FASTA file. Usage ./extract_contig_stats.sh <input_fasta_file>
```bash
#!/bin/bash

# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
   echo "Usage: $0 <input_fasta_file>"
   exit 1
fi

# Input FASTA file
fasta_file="$1"

# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
   echo "Error: File $fasta_file not found."
   exit 1
fi

# Extract file name without extension
file_name=$(basename "$fasta_file" .fasta)

# Create a directory named "contig_info" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/contig_info"
mkdir -p "$output_dir"

# Output CSV file in the "contig_info" directory
csv_file="$output_dir/${file_name}_contig_stats.csv"

# Count the number of contigs and circular contigs
num_contigs=$(grep -c "^>" "$fasta_file")
num_circular_contigs=$(grep -c "^>.*circular=Y" "$fasta_file")

# Write the result to the CSV file without column headers
echo "$file_name,$num_contigs,$num_circular_contigs" >> "$csv_file"

echo "Contig statistics calculated. Result saved in $csv_file"

6. Reporting

Prepare a comprehensive summary report highlighting key findings, comparisons between polishing model combinations, and any notable observations.

Sample results sheet

ID	Complete and single-copy BUSCOs (S) %	GC %	N50	Depth	Total no of bps	Avg contig length	No of contigs	No of circular contigs	Avg depth of contigs
nigeria-acinetobacter_bauminnii-01	99.5	39.1	339622	1523	1614554	2162	0	3725	0
nigeria-acinetobacter_bauminnii-02	99.6	39.0	139727	2119	461479	7274	2	0	443
nigeria-acinetobacter_bauminnii-03	99.5	39.2	140085	3615	562649	9990	9	8	1064
nigeria-acinetobacter_bauminnii-04	99.2	39.1	240791	3955	161453	3671	3	6	877
nigeria-acinetobacter_bauminnii-05	99.6	39.0	39271	1918	461459	1980	8	1	902
nigeria-escherichia_coli-01.fa	99.5	50.3	94945	3264	218971	5773	7	7	695
nigeria-escherichia_coli-02.fa	99.5	50.3	94944	9334	420268	8043	9	0	608
nigeria-escherichia_coli-03.fa	99.3	50.3	34899	8466	428765	5073	3	3	959

Observation of Results

General Trends

BUSCO Completeness: All combinations generally exhibit high BUSCO completeness, indicating good genome assembly quality.
GC Content: The GC content varies across the different samples, but remains relatively consistent within each combination.
Contig Length and Number: The average contig length and number of contigs appear to be influenced by the specific combination of polishing tools used.

Specific Observations:

Combination 1 (Racon+Medaka+Pilon+Polypolish):

This combination consistently achieves high BUSCO completeness and GC content across all samples.
The average contig length and number of contigs vary, suggesting that the combination’s effectiveness might be influenced by the specific characteristics of each sample. Combination 2 (Racon+Polypolish):
This combination also performs well in terms of BUSCO completeness and GC content.
The average contig length and number of contigs may differ slightly compared to Combination 1, indicating that the inclusion of Medaka and Pilon in Combination 1 might have a minor impact. Combination 3 (Racon): While Racon alone can produce decent results, the addition of other polishing tools in Combinations 1 and 2 generally leads to improvements in certain metrics.

Srikanth Srinivas

Bioinformatics graduate student at University of Bristol | Former Bioinformatics Intern at Global Health Research Unit, CRL, KIMS