Benchmarking Combinations of different Polishing Tools to Optimise Hybrid Genome Assembly Quality
- 13 minsBangalore, India 2024
I was tasked with improving the quality and accuracy of our hybrid assemblies. To achieve this, I studied how different polishing models integrated into the pipeline worked and developed a suite of scripts to automate the evaluation of different combinations. These scripts compare assemblies based on key metrics, allowing us to optimize our workflow and ensure the reliability of our genomic data.
Objective The purpose of this analysis is to systematically evaluate the impact of different polishing model combinations on the quality and completeness of hybrid genome assemblies constructed from both Illumina and Nanopore reads, generated by the Dragonflye pipeline. By employing various combinations of polishing models such as Racon, Medaka, Pilon, and Polypolish, we aim to identify the most effective strategy for enhancing assembly accuracy, contiguity, and overall quality. This analysis is crucial for optimizing the workflow, and ensuring the robustness of genome assemblies.
Polishing models
-
Racon: Corrects base errors and improves consensus accuracy in reads obtained from initial draft assemblies without a consensus step. (Canu v2.2 Manual: https://github.com/marbl/canu)
-
Medaka: Improves base calling accuracy and reduces insertion/deletions (indels) in Nanopore reads by using Neural networks. (Oxford Nanopore Technologies: https://github.com/nanoporetech/medaka)
-
Pilon: Corrects indels and improves assembly continuity by aligning Illumina reads to the draft assembly for further error correction. (Walker et al. (2014). Pilon: An integrative tool for reference genome improvement. PLoS ONE 9(11): e112963. (DOI: 10.1371/journal.pone.0112963))
-
Polypolish: Polishes contigs from hybrid assemblies by using evidence from both Illumina and Nanopore reads to improve base quality and potentially resolve complex regions. (Wick et al. (2017) [DOI: 10.1093/bioinformatics/btx516])
Equipment and Software
Dragonflye pipeline (version 1.1.1) Racon (version 1.5.0) Medaka (version 1.8.0) Pilon (version 1.24) Polypolish (version v0.5.0) QUAST (version 5.0.2) BUSCO (version 5.5.0)
Assembly and Polishing Procedure with Dragonflye, Metrics Collection, and Analysis
1. Sample Preparation:
Ensure a consistent set of FastQ files for both Illumina and Nanopore reads for all polishing model combinations.
2. Assembly Generation:
Installing Dragonflye from Bioconda:
# With conda
conda create -n dragonflye -c conda-forge -c bioconda dragonflye
# With Mamba (faster)
mamba create -n dragonflye -c conda-forge -c bioconda dragonflye
3. Running the Dragonflye pipeline:
Use the following script to run Dragonflye with multiple polishing combinations:
#!/bin/bash
# Script for Running Dragonflye Assembly Pipeline
# Define variables
IDS_FILE="ids"
BASE_DIR="/mnt/d/assembly_evaluation_cgps/fastqs/KPN"
G_SIZE="5.5M"
CPUS="16"
RAM="15"
MODEL="r941_min_sup_g507"
# Function to execute Dragonflye command
run_dragonflye() {
local command_num=$1
local out_dir="$BASE_DIR/dragonflye_out_$command_num"
echo "Executing Command $command_num"
for f in $(cat $IDS_FILE); do
dragonflye --trim --gsize $G_SIZE --reads ${f}*-ont.fastq.gz \
--outdir "$out_dir/$f" --prefix $f --R1 "$f"*-R1.fastq.gz \
--R2 "$f"*-R2.fastq.gz --cpus $CPUS --ram $RAM $2
done
# Check the exit status of the Dragonflye command
if [ $? -eq 0 ]; then
echo "Command $command_num succeeded"
else
echo "Command $command_num failed, skipping remaining commands"
exit 1
fi
}
# Command 1
run_dragonflye 1 "--racon 1 --medaka 1 --pilon 1 --polypolish 1"
# Command 2
run_dragonflye 2 "--racon 1 --polypolish 1"
# Command 3
run_dragonflye 3 "--racon 1"
# Command 4
run_dragonflye 4 "--racon 1 --medaka 1 --model $MODEL --pilon 1"
# Command 5
run_dragonflye 5 "--medaka 1 --model $MODEL --pilon 1 --polypolish 1"
echo "All Dragonflye commands executed successfully"
3. Polishing:
Apply Racon, Medaka, Pilon, and Polypolish rounds according to your analysis needs.
4. Metrics Collection:
For each assembly, report the following metrics:
- Complete and single-copy BUSCOs (%)
- GC content (%)
- N50
- Depth
- Total number of base pairs
- Average contig length
- Number of contigs
- Number of circular contigs
- Average depth of contigs
- Data Analysis: The following scripts can be used to automate the capturing of these metrics from the resulting assemblies obtained by different polishing model combinations.
1. Complete and single-copy BUSCOs (%)
Install BUSCO using Conda or Docker
# With Mamba
mamba install busco
# or update with
mamba update busco
Alternatively, use the docker container:
docker pull quay.io/biocontainers/busco:<tag>
(see `busco/tags`_ for valid values for ``<tag>``)
Running BUSCO in a loop on multiple FASTA files:
# Example:
for f in */*.fa; do busco -i $f -o $f --out_path /path/to/busco_output -m geno -c 45 -l /path/to/lineages/pseudomonadales_odb10/; done
Summarizing BUSCO results
- Parse multiple BUSCO output text files and append the results to a combined form.
- Execute the script: busco_summarise_results.py
- This script captures required metrics from BUSCO summary files generated by the BUSCO tool. It parses the summary files and outputs the results to a CSV file.
import sys
import os
import re
def parse_busco_summary(file_path):
with open(file_path, 'r') as file:
lines = file.readlines()
species_name = os.path.basename(file_path).replace('short_summary.specific.enterobacterales_odb10.', '').replace('.txt', '')
summary_line = next(line for line in lines if 'C:' in line)
complete_percentage = float(re.search(r'C:(\d+\.\d+)%', summary_line).group(1))
single_percentage = float(re.search(r'S:(\d+\.\d+)%', summary_line).group(1))
return [species_name, complete_percentage, single_percentage]
def main():
if len(sys.argv) != 2:
print("Usage: python busco_summarise_results.py /path/to/busco_summaries/")
sys.exit(1)
busco_directory = sys.argv[1]
output_csv = os.path.join(busco_directory, "output_new.csv")
busco_files = [os.path.join(busco_directory, file) for file in os.listdir(busco_directory) if file.endswith(".txt")]
data = [parse_busco_summary(file) for file in busco_files]
with open(output_csv, 'w') as csv_file:
csv_file.write("Species,Complete_Percentage,Single_Percentage\n")
for species, c_percentage, s_percentage in data:
csv_file.write(f"{species},{c_percentage},{s_percentage}\n")
if __name__ == "__main__":
main()
-
Compare and analyze the metrics for each assembly to identify trends and variations in performance.
-
Utilize graphical representations for a comprehensive analysis by generating plots to summarise the results https://gitlab.com/ezlab/busco/-/blob/master/scripts/generate_plot.py?ref_type=heads
2. GC content (%) & N50
- Install QUAST
- Given that you already have a conda environment in which you want to have this package, install with:
mamba install quast
and update with::
mamba update quast
- To create a new environment, run:
mamba create --name myenvname quast
- Alternatively, use the docker container:
docker pull quay.io/biocontainers/quast:<tag>
(see
quast/tags
_ for valid values for<tag>
) - Example of running QUAST on your assembled fasta files
quast */*.fa --threads n -o /mnt/d/assembly_evaluation_cgps/fastqs/ABA/dragonflye_out_1/QUAST
3. Estimated Sequencing Depth:
- extract_depth_from_logs.py
- This script extracts estimated sequencing depth and total number of base pairs from Dragonflye log files. Usage Replace ‘your_parent_directory’ with the actual path to the parent directory containing subdirectories. Replace ‘depth_out.csv’ with the desired output CSV file name.
- Execute the script: python extract_depth_from_logs.py ```python import os import csv import re import glob
def extract_values_from_log(log_file_path): with open(log_file_path, ‘r’) as log_file: content = log_file.read()
# Extracting values using regular expressions
total_bp_match = re.search(r'\[dragonflye\] Read stats: total_bp = (\d+)', content)
depth_match = re.search(r'\[dragonflye\] Estimated sequencing depth: (\d+)x', content)
if total_bp_match and depth_match:
total_bp = total_bp_match.group(1)
sequencing_depth = depth_match.group(1)
return total_bp, sequencing_depth
else:
return None, None
def process_directory(directory_path, output_csv_path): with open(output_csv_path, ‘w’, newline=’’) as csvfile: csv_writer = csv.writer(csvfile)
# Write header to the CSV file
csv_writer.writerow(['Directory', 'Total_BP', 'Sequencing_Depth'])
# Iterate through directories
for root, dirs, files in os.walk(directory_path):
for d in dirs:
log_file_path = os.path.join(root, d, 'dragonflye_out_1', d, '*.log')
log_file_matches = glob.glob(log_file_path)
if log_file_matches:
log_file_path = log_file_matches[0] # Assuming there's only one matching log file
total_bp, sequencing_depth = extract_values_from_log(log_file_path)
if total_bp is not None and sequencing_depth is not None:
csv_writer.writerow([d, total_bp, sequencing_depth])
Replace ‘your_directory_path’ with the actual path to the parent directory containing subdirectories
directory_path = ‘/path/to/your/parent_directory’
Replace ‘output.csv’ with the desired output CSV file name
output_csv_path = ‘/path/to/your/output_directory/depth_out.csv’
process_directory(directory_path, output_csv_path)
**4. Average contig length**
- calculate_contig_avg_length.sh
- This script extracts contig length from an assembled FASTA file and calculates the average length for all contigs. Usage ./calculate_contig_avg_length.sh <input_fasta_file> Replace <input_fasta_file> with the path to your assembled FASTA file.
```bash
#!/bin/bash
# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
echo "Usage: $0 <input_fasta_file>"
exit 1
fi
# Input FASTA file
fasta_file="$1"
# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
echo "Error: File $fasta_file not found."
exit 1
fi
# Extract file name without extension
file_prefix=$(basename "$fasta_file" .fasta)
# Create a directory named "length" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/length"
mkdir -p "$output_dir"
# Output CSV file for mean len in the "length" directory
csv_len_file="$output_dir/${file_prefix}_mean_len.csv"
# Extract len= values and calculate mean
awk '/^>/ { len_sum += gensub(/^.*len=([0-9]+).*$/, "\\1", "g", $0); len_count++; next } END { if (len_count > 0) print len_sum / len_count }' "$fasta_file" > "$csv_len_file"
echo "Mean of len values calculated. Result saved in $csv_len_file"
5. Average contig depth
- calculate_contig_avg_depth.sh
- This script extracts contig depth from an assembled FASTA file and calculates the average depth for all contigs. Usage ./calculate_contig_avg_depth.sh
Replace with your assembled FASTA file path. ```bash #!/bin/bash
Check for the correct number of command-line arguments
if [ “$#” -ne 1 ]; then echo “Usage: $0
Input FASTA file
fasta_file=”$1”
Check if the input file exists
if [ ! -f “$fasta_file” ]; then echo “Error: File $fasta_file not found.” exit 1 fi
Extract file name without extension
file_prefix=$(basename “$fasta_file” .fasta)
Output CSV file in the same directory as the input file
csv_file=”$(dirname “$fasta_file”)/${file_prefix}_mean_cov.csv”
Extract cov= values and calculate mean
awk ‘/^>/ { sum += gensub(/^.cov=([0-9.]+).$/, “\1”, “g”, $0); count++; next } END { if (count > 0) print sum / count }’ “$fasta_file” > “$csv_file”
echo “Mean of cov values calculated. Result saved in $csv_file”
**6. Number of contigs and circular contigs**
- extract_contig_stats.sh
- This script extracts the number of contigs and the number of circular contigs from a FASTA file. Usage ./extract_contig_stats.sh <input_fasta_file>
```bash
#!/bin/bash
# Check for the correct number of command-line arguments
if [ "$#" -ne 1 ]; then
echo "Usage: $0 <input_fasta_file>"
exit 1
fi
# Input FASTA file
fasta_file="$1"
# Check if the input file exists
if [ ! -f "$fasta_file" ]; then
echo "Error: File $fasta_file not found."
exit 1
fi
# Extract file name without extension
file_name=$(basename "$fasta_file" .fasta)
# Create a directory named "contig_info" inside the directory of the input FASTA file
output_dir="$(dirname "$fasta_file")/contig_info"
mkdir -p "$output_dir"
# Output CSV file in the "contig_info" directory
csv_file="$output_dir/${file_name}_contig_stats.csv"
# Count the number of contigs and circular contigs
num_contigs=$(grep -c "^>" "$fasta_file")
num_circular_contigs=$(grep -c "^>.*circular=Y" "$fasta_file")
# Write the result to the CSV file without column headers
echo "$file_name,$num_contigs,$num_circular_contigs" >> "$csv_file"
echo "Contig statistics calculated. Result saved in $csv_file"
6. Reporting
- Prepare a comprehensive summary report highlighting key findings, comparisons between polishing model combinations, and any notable observations.
Sample results sheet
ID | Complete and single-copy BUSCOs (S) % | GC % | N50 | Depth | Total no of bps | Avg contig length | No of contigs | No of circular contigs | Avg depth of contigs |
---|---|---|---|---|---|---|---|---|---|
nigeria-acinetobacter_bauminnii-01 | 99.5 | 39.1 | 339622 | 1523 | 1614554 | 2162 | 0 | 3725 | 0 |
nigeria-acinetobacter_bauminnii-02 | 99.6 | 39.0 | 139727 | 2119 | 461479 | 7274 | 2 | 0 | 443 |
nigeria-acinetobacter_bauminnii-03 | 99.5 | 39.2 | 140085 | 3615 | 562649 | 9990 | 9 | 8 | 1064 |
nigeria-acinetobacter_bauminnii-04 | 99.2 | 39.1 | 240791 | 3955 | 161453 | 3671 | 3 | 6 | 877 |
nigeria-acinetobacter_bauminnii-05 | 99.6 | 39.0 | 39271 | 1918 | 461459 | 1980 | 8 | 1 | 902 |
nigeria-escherichia_coli-01.fa | 99.5 | 50.3 | 94945 | 3264 | 218971 | 5773 | 7 | 7 | 695 |
nigeria-escherichia_coli-02.fa | 99.5 | 50.3 | 94944 | 9334 | 420268 | 8043 | 9 | 0 | 608 |
nigeria-escherichia_coli-03.fa | 99.3 | 50.3 | 34899 | 8466 | 428765 | 5073 | 3 | 3 | 959 |
Observation of Results
General Trends
- BUSCO Completeness: All combinations generally exhibit high BUSCO completeness, indicating good genome assembly quality.
- GC Content: The GC content varies across the different samples, but remains relatively consistent within each combination.
- Contig Length and Number: The average contig length and number of contigs appear to be influenced by the specific combination of polishing tools used.
Specific Observations:
Combination 1 (Racon+Medaka+Pilon+Polypolish):
- This combination consistently achieves high BUSCO completeness and GC content across all samples.
- The average contig length and number of contigs vary, suggesting that the combination’s effectiveness might be influenced by the specific characteristics of each sample. Combination 2 (Racon+Polypolish):
- This combination also performs well in terms of BUSCO completeness and GC content.
- The average contig length and number of contigs may differ slightly compared to Combination 1, indicating that the inclusion of Medaka and Pilon in Combination 1 might have a minor impact. Combination 3 (Racon): While Racon alone can produce decent results, the addition of other polishing tools in Combinations 1 and 2 generally leads to improvements in certain metrics.