Benchmarking Combinations of Different Polishing Tools to Optimise Hybrid Genome Assembly Quality

- 13 mins

Bangalore, India 2024


I was tasked with improving the quality and accuracy of our hybrid assemblies. To achieve this, I studied how different polishing models integrated into the pipeline worked and developed a suite of scripts to automate the evaluation of different combinations. These scripts compare assemblies based on key metrics, allowing us to optimize our workflow and ensure the reliability of our genomic data.

Objective

The purpose of this analysis is to systematically evaluate how different combinations of polishing models affect the quality and completeness of hybrid genome assemblies built from Illumina and Nanopore reads with the Dragonflye pipeline. By testing combinations of polishing models such as Racon, Medaka, Pilon, and Polypolish, we aim to identify the most effective strategy for improving assembly accuracy, contiguity, and overall quality. This analysis is crucial for optimising the workflow and ensuring the robustness of our genome assemblies.

Polishing models

- Racon: consensus correction of the draft assembly using the long reads themselves.
- Medaka: neural-network-based polishing of Nanopore assemblies.
- Pilon: short-read polishing that corrects SNPs, small indels, and local misassemblies using Illumina alignments.
- Polypolish: short-read polisher that uses all read alignments, including multi-mapped reads, to correct errors in repeat regions.

Equipment and Software

- Dragonflye pipeline (version 1.1.1)
- Racon (version 1.5.0)
- Medaka (version 1.8.0)
- Pilon (version 1.24)
- Polypolish (version 0.5.0)
- QUAST (version 5.0.2)
- BUSCO (version 5.5.0)

Assembly and Polishing Procedure with Dragonflye, Metrics Collection, and Analysis

1. Sample Preparation:

Ensure a consistent set of FastQ files for both Illumina and Nanopore reads for all polishing model combinations.
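The pipeline script in step 3 reads sample IDs from a plain-text `ids` file (one ID per line) and is run from the directory holding the FastQ files, which must share the ID as a filename prefix. A sketch of the expected layout, derived from the globs used in that script (sample names here are hypothetical):

```bash
cat ids
# sample-01
# sample-02

ls
# sample-01-ont.fastq.gz  sample-01-R1.fastq.gz  sample-01-R2.fastq.gz
# sample-02-ont.fastq.gz  sample-02-R1.fastq.gz  sample-02-R2.fastq.gz
```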

2. Assembly Generation:

Installing Dragonflye from Bioconda:

```bash
# With conda
conda create -n dragonflye -c conda-forge -c bioconda dragonflye

# With Mamba (faster)
mamba create -n dragonflye -c conda-forge -c bioconda dragonflye
```
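Activate the environment and verify the install before launching assemblies (a quick sanity check; `--check` asks Dragonflye to confirm its dependencies are available):

```bash
conda activate dragonflye
dragonflye --version
dragonflye --check   # confirms the pipeline's dependencies are installed
```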

3. Running the Dragonflye pipeline:

Use the following script to run Dragonflye with multiple polishing combinations:

```bash
#!/bin/bash

# Script for Running Dragonflye Assembly Pipeline

# Define variables
IDS_FILE="ids"
BASE_DIR="/mnt/d/assembly_evaluation_cgps/fastqs/KPN"
G_SIZE="5.5M"
CPUS="16"
RAM="15"
MODEL="r941_min_sup_g507"

# Function to execute Dragonflye with a given set of polishing flags.
# $2 is left unquoted below so the flag string splits into separate arguments.
run_dragonflye() {
    local command_num=$1
    local out_dir="$BASE_DIR/dragonflye_out_$command_num"

    echo "Executing Command $command_num"

    for f in $(cat "$IDS_FILE"); do
        # Check the exit status of each Dragonflye run directly; testing $?
        # after the loop would only reflect the last iteration.
        if ! dragonflye --trim --gsize "$G_SIZE" --reads ${f}*-ont.fastq.gz \
                --outdir "$out_dir/$f" --prefix "$f" --R1 "$f"*-R1.fastq.gz \
                --R2 "$f"*-R2.fastq.gz --cpus "$CPUS" --ram "$RAM" $2; then
            echo "Command $command_num failed for sample $f, skipping remaining commands"
            exit 1
        fi
    done

    echo "Command $command_num succeeded"
}

# Command 1
run_dragonflye 1 "--racon 1 --medaka 1 --pilon 1 --polypolish 1"

# Command 2
run_dragonflye 2 "--racon 1 --polypolish 1"

# Command 3
run_dragonflye 3 "--racon 1"

# Command 4
run_dragonflye 4 "--racon 1 --medaka 1 --model $MODEL --pilon 1"

# Command 5
run_dragonflye 5 "--medaka 1 --model $MODEL --pilon 1 --polypolish 1"

echo "All Dragonflye commands executed successfully"
```

4. Polishing:

Dragonflye applies the polishers requested on the command line, and the integer after each flag (e.g. `--racon 1`) sets the number of polishing rounds for that tool. Apply Racon, Medaka, Pilon, and Polypolish rounds according to your analysis needs.

5. Metrics Collection:

For each assembly, report the following metrics. The scripts below automate capturing these metrics from the assemblies produced by the different polishing model combinations.

1. Complete and single-copy BUSCOs (%)

Install BUSCO using Conda/Mamba or Docker:

```bash
# With Mamba
mamba install busco
# or update with
mamba update busco
```

Alternatively, use the Docker container:

```bash
docker pull quay.io/biocontainers/busco:<tag>
```

(see the `busco` tags page on quay.io for valid values of `<tag>`)

Running BUSCO in a loop on multiple FASTA files:

```bash
# Note: -o expects a run name rather than a path, so use the basename of each file
for f in */*.fa; do
    busco -i "$f" -o "$(basename "$f" .fa)" --out_path /path/to/busco_output \
        -m geno -c 45 -l /path/to/lineages/pseudomonadales_odb10/
done
```

Summarizing BUSCO results:

```python
import sys
import os
import re

def parse_busco_summary(file_path):
    """Pull the complete (C) and single-copy (S) percentages from a BUSCO short summary."""
    with open(file_path, 'r') as file:
        lines = file.readlines()
    # Derive the sample name from the summary file name
    species_name = os.path.basename(file_path).replace('short_summary.specific.enterobacterales_odb10.', '').replace('.txt', '')
    summary_line = next(line for line in lines if 'C:' in line)
    complete_percentage = float(re.search(r'C:(\d+\.\d+)%', summary_line).group(1))
    single_percentage = float(re.search(r'S:(\d+\.\d+)%', summary_line).group(1))
    return [species_name, complete_percentage, single_percentage]

def main():
    if len(sys.argv) != 2:
        print("Usage: python busco_summarise_results.py /path/to/busco_summaries/")
        sys.exit(1)
    busco_directory = sys.argv[1]
    output_csv = os.path.join(busco_directory, "output_new.csv")
    busco_files = [os.path.join(busco_directory, file) for file in os.listdir(busco_directory) if file.endswith(".txt")]
    data = [parse_busco_summary(file) for file in busco_files]

    with open(output_csv, 'w') as csv_file:
        csv_file.write("Species,Complete_Percentage,Single_Percentage\n")
        for species, c_percentage, s_percentage in data:
            csv_file.write(f"{species},{c_percentage},{s_percentage}\n")

if __name__ == "__main__":
    main()
```
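A usage sketch (paths are placeholders): gather the `short_summary.specific.*.txt` files from the BUSCO runs into one directory and point the script at it.

```bash
mkdir -p busco_summaries
cp /path/to/busco_output/*/short_summary.specific.*.txt busco_summaries/
python busco_summarise_results.py busco_summaries/
# writes busco_summaries/output_new.csv
```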

2. GC content (%) & N50

Both metrics are reported by QUAST. Install it with Mamba:

```bash
mamba install quast
# and update with
mamba update quast
```
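A minimal QUAST run over the polished assemblies (a sketch assuming the `dragonflye_out_<n>/<id>/<id>.fa` layout produced by the pipeline script above; GC (%) and N50 appear in each run's `report.tsv`):

```bash
for f in dragonflye_out_*/*/*.fa; do
    quast.py "$f" -o "$(dirname "$f")/quast_out" -t 8
done
```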

3. Estimated Sequencing Depth:

This script walks the output directories, extracts the total base count and estimated sequencing depth from each Dragonflye log, and writes them to a CSV.

```python
import csv
import glob
import os
import re

def extract_values_from_log(log_file_path):
    with open(log_file_path, 'r') as log_file:
        content = log_file.read()

    # Extracting values using regular expressions
    total_bp_match = re.search(r'\[dragonflye\] Read stats: total_bp = (\d+)', content)
    depth_match = re.search(r'\[dragonflye\] Estimated sequencing depth: (\d+)x', content)

    if total_bp_match and depth_match:
        total_bp = total_bp_match.group(1)
        sequencing_depth = depth_match.group(1)
        return total_bp, sequencing_depth
    else:
        return None, None

def process_directory(directory_path, output_csv_path):
    with open(output_csv_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)

        # Write header to the CSV file
        csv_writer.writerow(['Directory', 'Total_BP', 'Sequencing_Depth'])

        # Iterate through directories
        for root, dirs, files in os.walk(directory_path):
            for d in dirs:
                log_file_pattern = os.path.join(root, d, 'dragonflye_out_1', d, '*.log')
                log_file_matches = glob.glob(log_file_pattern)

                if log_file_matches:
                    log_file_path = log_file_matches[0]  # Assuming there's only one matching log file
                    total_bp, sequencing_depth = extract_values_from_log(log_file_path)

                    if total_bp is not None and sequencing_depth is not None:
                        csv_writer.writerow([d, total_bp, sequencing_depth])

# Replace with the actual path to the parent directory containing subdirectories
directory_path = '/path/to/your/parent_directory'

# Replace with the desired output CSV file path
output_csv_path = '/path/to/your/output_directory/depth_out.csv'

process_directory(directory_path, output_csv_path)
```
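For reference, the two regular expressions above match Dragonflye log lines of this shape (the numbers below are invented for illustration):

```
[dragonflye] Read stats: total_bp = 550000000
[dragonflye] Estimated sequencing depth: 100x
```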


4. Average contig length

calculate_contig_avg_length.sh extracts the len= value from each contig header in an assembled FASTA file and calculates the average length across all contigs.

Usage: ./calculate_contig_avg_length.sh <input_fasta_file>, replacing <input_fasta_file> with the path to your assembled FASTA file.
```bash
#!/bin/bash

# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
   echo "Usage: $0 <input_fasta_file>"
   exit 1
fi

# Input FASTA file
fasta_file="$1"

# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
   echo "Error: File $fasta_file not found."
   exit 1
fi

# Extract file name without extension
file_prefix=$(basename "$fasta_file" .fasta)

# Create a directory named "length" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/length"
mkdir -p "$output_dir"

# Output CSV file for mean len in the "length" directory
csv_len_file="$output_dir/${file_prefix}_mean_len.csv"

# Extract len= values and calculate mean
awk '/^>/ { len_sum += gensub(/^.*len=([0-9]+).*$/, "\\1", "g", $0); len_count++; next } END { if (len_count > 0) print len_sum / len_count }' "$fasta_file" > "$csv_len_file"

echo "Mean of len values calculated. Result saved in $csv_len_file"

5. Average contig depth

This script mirrors the previous one, extracting the cov= value from each contig header and calculating the mean depth across all contigs. Usage is the same: pass the path to an assembled FASTA file.

```bash
#!/bin/bash

# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <input_fasta_file>"
    exit 1
fi

# Input FASTA file
fasta_file="$1"

# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
    echo "Error: File $fasta_file not found."
    exit 1
fi

# Extract file name without extension
file_prefix=$(basename "$fasta_file" .fasta)

# Output CSV file in the same directory as the input file
csv_file="$(dirname "$fasta_file")/${file_prefix}_mean_cov.csv"

# Extract cov= values and calculate mean
awk '/^>/ { sum += gensub(/^.*cov=([0-9.]+).*$/, "\\1", "g", $0); count++; next } END { if (count > 0) print sum / count }' "$fasta_file" > "$csv_file"

echo "Mean of cov values calculated. Result saved in $csv_file"
```


6. Number of contigs and circular contigs

extract_contig_stats.sh extracts the number of contigs and the number of circular contigs from a FASTA file.

Usage: ./extract_contig_stats.sh <input_fasta_file>
```bash
#!/bin/bash

# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
   echo "Usage: $0 <input_fasta_file>"
   exit 1
fi

# Input FASTA file
fasta_file="$1"

# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
   echo "Error: File $fasta_file not found."
   exit 1
fi

# Extract file name without extension
file_name=$(basename "$fasta_file" .fasta)

# Create a directory named "contig_info" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/contig_info"
mkdir -p "$output_dir"

# Output CSV file in the "contig_info" directory
csv_file="$output_dir/${file_name}_contig_stats.csv"

# Count the number of contigs and circular contigs
num_contigs=$(grep -c "^>" "$fasta_file")
num_circular_contigs=$(grep -c "^>.*circular=Y" "$fasta_file")

# Write the result to the CSV file without column headers
echo "$file_name,$num_contigs,$num_circular_contigs" >> "$csv_file"

echo "Contig statistics calculated. Result saved in $csv_file"

6. Reporting

Sample results sheet:

| ID | Complete and single-copy BUSCOs (S) % | GC % | N50 | Depth | Total no. of bp | Avg contig length | No. of contigs | No. of circular contigs | Avg depth of contigs |
|---|---|---|---|---|---|---|---|---|---|
| nigeria-acinetobacter_bauminnii-01 | 99.5 | 39.1 | 339622 | 1523 | 1614554 | 2162 | 0 | 3725 | 0 |
| nigeria-acinetobacter_bauminnii-02 | 99.6 | 39.0 | 139727 | 2119 | 461479 | 7274 | 2 | 0 | 443 |
| nigeria-acinetobacter_bauminnii-03 | 99.5 | 39.2 | 140085 | 3615 | 562649 | 9990 | 9 | 8 | 1064 |
| nigeria-acinetobacter_bauminnii-04 | 99.2 | 39.1 | 240791 | 3955 | 161453 | 3671 | 3 | 6 | 877 |
| nigeria-acinetobacter_bauminnii-05 | 99.6 | 39.0 | 39271 | 1918 | 461459 | 1980 | 8 | 1 | 902 |
| nigeria-escherichia_coli-01.fa | 99.5 | 50.3 | 94945 | 3264 | 218971 | 5773 | 7 | 7 | 695 |
| nigeria-escherichia_coli-02.fa | 99.5 | 50.3 | 94944 | 9334 | 420268 | 8043 | 9 | 0 | 608 |
| nigeria-escherichia_coli-03.fa | 99.3 | 50.3 | 34899 | 8466 | 428765 | 5073 | 3 | 3 | 959 |

Observations of Results

General Trends

Specific Observations:

Combination 1 (Racon+Medaka+Pilon+Polypolish):

Srikanth Srinivas

Bioinformatics graduate student at University of Bristol | Former Bioinformatics Intern at Global Health Research Unit, CRL, KIMS