This is the documentation of the TRITEX sequence assembly pipeline for Triticeae genomes developed in the research group Domestication Genomics at the Leibniz Institute of Plant Genetics and Crop Research (IPK) Gatersleben.
1. Introduction
The purpose of the TRITEX pipeline is to construct chromosome-scale sequence assemblies for Triticeae genomes from different Illumina sequencing dataset within the time frame of several weeks. Possible applications are pan-genomic studies in barley and wheat, assistance in gene isolation projects, or genomic research in crop-wild relatives.
The pipeline is open-source and hosted in a public Bitbucket repository.
A paper describing the pipeline was published in Genome Biology.
TRITEX has been run on highly inbred genotypes of barley (Hordeum vulgare), tetraploid wheat (Triticum turgidum) and hexaploid wheat (T. aestivum) with reasonable results: super-scaffold N50 values in the range of dozens of Mb and pseudomolecules with better gene space representation than a BAC-by-BAC assembly. It has never been tested and is not expected to work on heterozygous or autopolyploid genomes.
The TRITEX pipeline only works with Illumina sequencing data of sufficient coverage for a certain set of libaries. A description of paired-end and mate-pair datasets suitable for use with the pipeline can be found in the supplements of Avni et al. 2017, section 1.1 and Table S12, or in the supplements of the International Wheat Genome Sequencing Consortium (IWGSC) 2018 section 1.1 and Table S1. A protocol for generating chromosome-conformation capture sequencing (Hi-C) data suitable for use with the pipeline is described in Himmelbach et al. 2018. Refer to the technical notes of 10X Genomics on how to generate Chromium data.
In case of questions on the pipeline, contact Martin Mascher.
2. Installing the required genome assembly and sequence analysis tools
-
The TRITEX pipeline needs to be run on a rather powerful Unix server. It has been tested on servers running recent CentOS Linux version. The commands were started in interactive terminal sessions opened by direct SSH to the servers. Modifications to the commands will be necessary if an intervening scheduling system is used. In case of direct SSH access, the use of a terminal multiplexer such as tmux is recommended. Jobs may run several days. If Kerberos authentification is used, ensure that tickets are renewed.
-
Install the following software. The pipeline has been successfully run with the version indicated. Using the most recent versions should be fine as well.
-
Z shell, 5.0.2
-
GNU Parallel, 20150222, paper
-
KmerStream, 1.1, paper
-
Minia, version 3, git commit 68b0c75, paper on an older version
Follow the manual, section "5 Input", paragraph "Larger k-mer lengths" to enable k-mer sizes up to 500.
-
SOAPDenovo, 2.04-r241, paper
-
Novosort. This is commercial software. A single-threaded version is available for free.
-
Seqtk, 1.0-r76
-
pigz, 2.3.4, parallel gzip
The pipeline assumes that the GNU (as opposed to BSD) versions of common UNIX tools (
coreutils
,sed
,awk
) are used. -
-
The following R packages should be installed and accessible in
.libPaths()
. The versions specified are known to work with TRITEX. Newer versions will likely work as well. -
If you want to use the R Shiny app Hi-C map inspector, you need to set up a web server running R Shiny. Refer to the R Shiny manual. Note that setting up a Shiny server is a non-trivial task of Unix systems administration and requires super-user rights.
-
Here are some thoughts on hardware requirements:
-
Scaffolding a bread wheat genome assembly with SOAPDenovo2 will require more than 500 GB of RAM. Also the pseudomolecule construction has not been run on machines with less than 1 TB of RAM and is written with speed and ease of implementation in mind, rather than with goal of being RAM efficient. Similarly, being able to sort large alignment in SAM format in memory (several hundred GB) instead of having to write temporary files on disk will accelerate the proc
-
Minia3 writes temporary files sizes on the order of hundreds of GB (which are deleted automatically when it finishes). The raw data volume for bread wheat also exceeds 1 TB.
-
Multi-threading is not strictly necessary. To finish the whole pipeline in a reasonable timeframe, 32 cores should be sufficient.
-
Thus, a machine with 32 cores, 1 TB and 3 TB of free disk space should be sufficient the run the TRITEX pipeline for hexaploid wheat.
-
In addition, you need to ensure secure long-term storage of the raw data (e.g. on redundant magnetic tape drives).
-
-
Clone the Bitbucket repository of the IPK assembly pipeline. The code has been released under the MIT license. The path to the local copy of the repository will be referenced by
$bitbucket
in the code listings. -
The scripts used below have parameters such as
--soap
or--minia
(indicated in the code listings) to specify the absolute paths to the required executables. For better readibility, there are omitted from the commands.Instead of specifying the paths each time you call the scripts, you may also edit the scripts and modify the default paths. We chose not to use executables in
$PATH
to enforce proper documentation of software versions.
3. Setting up the folder structure
-
A folder for each assembly project should be created. We refer to this directory as
$projectdir
or${projectdir}
in the code listings below. -
The project folder should have the following subdirectory structure. At the start of the pipeline, all folders except "raw_reads" should be empty.
- raw_reads
-
This folder contains files (or symbolic links refering) to the raw reads. In the first step of the pipeline, the merged PE450 reads will be written into raw_reads/PE450 folder. The following subdirectories subdirectories should be created to accomodate all input datasets:
-
PE450
-
PE800
-
MP3
-
MP6
-
MP9
-
10x
-
hic
Note that the PE800, MP3 and MP6 libraries can be omitted without greatly affecting assembly quality in barley and bread wheat.
-
- corrected_reads
-
This folder will hold the error-corrected FASTQ files for the PE and MP datasets.
- ntcard
-
This folder will hold the genome size estimates produced by ntcard from the error-corrected PE450 reads.
- minia
-
This folder will contain the contig assemblies produced by Minia3. Subfolders for each k will be created.
- soap
-
This folder will hold the scaffolding results as well as calibration runs to find optimal link count thresholds.
- gapcloser
-
This folder will contain the GapCloser results.
- 10x
-
This directory will contain the scaffolds after gapclosing prepared for pseudomolecule construction as well as the 10X read mapping results.
- hic
-
This folder will contain the alignments of Hi-C reads to the scaffolds.
- pseudomolecules
-
This folder will contain the assembly R objects, plots for quality assessment and final FASTA files.
4. Merging and error correction of PE450 reads
-
In this step, we will use BBMerge to merge the two overlapping 250 bp reads of the PE450 library. Then, BFC will be applied to correct errors in the merged reads.
-
Set up a list of pairs of FASTQ files. The following commands will create a three-column table (separated by
:
) that holds the names of matching read 1 and reads 2 in the first two columns and the prefix for the merged FASTQ files in the third column.list='PE450_files.txt' paste \ <(find $projectdir/raw_reads/PE450 | sort | grep _R1_) \(1) <(find $projectdir/raw_reads/PE450 | sort | grep _R2_) \ <(find $projectidr/raw_reads/PE450 | sort | grep _R1_ | sed -e 's/_R1_/_/' -e 's/.fastq.gz$//') \ | tr '\t' ':' > $list (2)
1 The backslash \
will be used throughout the tutorial to break long commands into several lines. When pasting the commands or editing it, make sure that no white space follows the backslash. Otherwise, the shell will interpret the lines as belonging to different commands.2 Adjustments will be required if FASTQ files for read 1/2 are not identifiable by the substrings R1
andR2
. For example, some sequencing providers use the suffixes1.fq.gz
and2.fq.gz
. The list of FASTQ files may also be created and edited in Excel if this is more convenient than using UNIX tools. The columns should be separated by a colon (:
). -
Start BBMerge. The following command will run BBMerge on all pairs of FASTQ files for the PE450 library to merge overlapping reads and create a histogram of read lengths after merging. The script will take a few hours to complete.
list='PE450_files.txt' (1) strictness='maxloose' (2) while read i; do $bitbucket/shell/run_bbmerge.zsh --threads 10 --strictness $strictness $i done < $list (3) (4)
1 This is the list created in the previous step. 2 This yields results cloesest to the output of PEAR. If coverage is very high, more stringent criteria ( verystrict
) can be used.3 You need to specify the path to the BBMerge executable ( --bbmerge
). -
Check the BBMerge output files. A good PE450 libraries has 70 - 95 % of assembled reads. If the proportion of merged reads is lower, inspect the error profile of the reads (e.g. using FASTQC).
Also inspect the distribtion of read lengths after merging. The peak should be between 400 and 470 bp. You may use R or Excel to plot read length distributions in the
*ihist.txt
files created by BBMerge. -
Run BFC for error correction. This will correct and trim the PE450 reads and create a hash table with k-mer counts to be used for correcting the paired-end and mate-pair reads used for scaffolding in a later step.
count="$projectdir/corrected_reads/PE450_bfc.hash" outdir="$projectdir/corrected_reads" input="$projectdir/raw_reads/PE450" genomesize='5G' (1) $bitbucket/shell/run_bfc_PE450.zsh --threads 30 --genomesize $genomesize \ --input $input --outdir $outdir (2) (3)
1 BFC chooses k-mer size based on the specified genome size. A rough estimate is sufficient. The genome size of BFC ( -g
) influences the choice of k-mer length (-k
) and the size of the Bloom filter (-b
).2 This step will run for about 1 day on 30 cores for hexaploid wheat. 3 You need to specify the path to the executable of BFC ( --bfc
). -
Check the BFC log files for any error messages. Also check read counts before and after correction as well as after trimming.
find $projectdir/raw_reads/PE450 | grep ihist | xargs cat | awk '!/^#/ {s+=$2} END {print s}' awk '{s+=$2} END {print s}' $projectdir/corrected_reads/PE450_bfc_correct_length_dist.tsv awk '{s+=$2} END {print s}' $projectdir/corrected_reads/PE450_bfc_correct_trim_length_dist.tsv awk '{s+=$1*$2} END {print s/GENOMESIZE}' $projectdir/corrected_reads/PE450_bfc_correct_trim_length_dist.tsv (1)
1 Replace GENOMESIZE in the AWK command with the correct values (e.g. "5e9" for barley). 40x coverage or more should yield good assemblies. -
The read counts before and after correction should be identical. The read counts after trimming should not be less than 80 % the count of raw reads.
If everything is OK, you can delete some files to save disk space: (i) the uncorrected merged FASTQ files in the input directory (
$projectdir/raw_reads/PE450
) and (ii) the corrected, untrimmed FASTQ file ($projectdir/corrected_reads/PE450_bfc_correct.fq.gz
)
5. Estimating assembly size with ntCard and KmerStream
-
Run
ntCard
(paper, Github) for different k-mer lengths followed by KmerStream to get a rough estimate which assembly size to expect. The results may also serve as genome size estimates, but interpretation should proceed with caution.$bitbucket/shell/run_ntcard.zsh --reads $projectdir/corrected_reads/PE450_bfc_correct_trim.fq.gz --threads 20 \ --outdir $projectdir/ntcard (1) (2)
1 The script will create a file named ntcard_summary.tsv
in the output directory. The "size" column reports the estimated assembly size.2 You need to specify the paths to the executable of ntCard ( --ntcard
) and toKmerStreamEstimate.py
(--estimate
).python
should be in$PATH
.$PYTHONPATH
may need to be modified to makeKmerStreamEstimate.py
work. -
As the ntCard script runs, you can begin with the next stage: contig assembly.
6. Contig assembly with Minia3
-
The contig assembly can be started as soon the error correction of the PE450 reads has finished.
-
The following command will start Minia3 iteratively for increasing k-mer sizes (100, 200, 300, 350, 400, 450, 500). Each step uses the error-corrected PE450 reads as well as the assembly of the preceding step as its input.
$bitbucket/shell/run_minia_unitig.zsh --outdir $projectdir/minia --threads 20 \ --reads $projectdir/corrected_reads/PE450_bfc_correct_trim.fq.gz \ > minia_unitig.out 2> minia_unitig.err (1) (2)
1 Modify the number of threads and the names of the log files. Log files for each step are written to the respective directories. 2 You need to specify the paths to the following executables: Minia3 ( --minia
), SAMtools (--samtools
) and the scriptn50
from$bitbucket/shell
(--n50
).A Minia3 run for a single k-mer size takes about 1 day for barley and 3 days for hexaploid wheat. So the above command will take 1 week for barley and 3 weeks for wheat to complete.
-
Should the script terminate permaturely, it can be resumed. Assuming you have successful Minia3 runs up to a k-mer size of 300 and the server was rebooted during the k=350 run, you can resume the assembly process like this:
$bitbucket/shell/run_minia_unitig.zsh --restart 1 --kmers 300,350,400,450,500 \ --outdir $projectdir/minia_unitig --threads 20 \ --reads $projectdir/PE450_bfc_correct_trim.fq.gz > minia_unitig_resume.out 2> minia_unitig_resume.err (1)
1 Make sure to delete the output folders of the incomplete Minia runs before restarting the assembly loop.
7. Error correction of PE800 and MP reads
-
While the assembly runs, the error correction for PE800 and mate pair libraries can be done.
-
Below is an example how to call the scripts for error correction. There are two scripts, one for the PE800 (calling cutadapt and BFC) and one for the MPs (calling NxTrim an BFC). The jobs for the PE800, MP3, MP6 and MP9 are independent of each other and can be started separately.
count="$projectdir/corrected_reads/PE450_bfc.hash" base="$projectdir" outdir="$projectdir/corrected_reads" genomesize='5G' (1) $bitbucket/shell/run_bfc_PE800.zsh --genomesize $genomesize --threads 20 \ --outdir $outdir --hash $count --base $base (2) for lib in MP3 MP6 MP9; do $bitbucket/shell/run_bfc_MP.zsh --genomesize $genomesize --threads 20 \ --outdir $outdir --hash $count --base $base --lib $lib (3) done
1 Change the genome size according to your species and the number of threads according to your resources. 2 You need to specify the paths to the executables of Cutadapt ( --cutdapt
) and BFC (--bfc
).3 You need to specify the paths to the executables of NxTrim ( --nxtrim
) and BFC (--bfc
).
8. Scaffolding with SOAPDenovo2
-
Set up a configuration file for SOAPDenovo2. Below is an example:
max_rd_len=500 [LIB] avg_ins=800 (1) asm_flags=2 rank=1 (2) pair_num_cutoff=5 (3) map_len=35 q1=$projectdir/corrected_reads/PE800.bfc_R1.fq.gz (4) q2=$projectdir/corrected_reads/PE800.bfc_R2.fq.gz [LIB] avg_ins=3000 asm_flags=2 rank=2 pair_num_cutoff=5 map_len=35 q1=$projectdir/corrected_reads/MP3.mp_unknown_bfc_R1.fq.gz q2=$projectdir/corrected_reads/MP3.mp_unknown_bfc_R2.fq.gz [LIB] avg_ins=6000 asm_flags=2 rank=3 pair_num_cutoff=5 map_len=35 q1=$projectdir/corrected_reads/MP6.mp_unknown_bfc_R1.fq.gz q2=$projectdir/corrected_reads/MP6.mp_unknown_bfc_R2.fq.gz [LIB] avg_ins=9000 asm_flags=2 rank=4 pair_num_cutoff=5 map_len=35 q1=$projectdir/corrected_reads/MP9.mp_unknown_bfc_R1.fq.gz q2=$projectdir/corrected_reads/MP9.mp_unknown_bfc_R2.fq.gz (5)
1 The insert sizes need match to the files specified below. 2 Rank specifies in which order the libraries are used. It should always increase from smallest to largest fragment size. 3 The value of pair_num_cutoff
influences the assembly results substantially, but it is not necessary to make a good choice here. The best values for each libraries will be determined in the next step by trying out different values.4 All input files are the error corrected files in the $projectdir/corrected_reads
folder.5 Double check that all files exist and R1 and R2 are correctly used. -
Map the reads to the Minia3 contigs.
$bitbucket/shell/run_soap_mapping.zsh --threads 20 --config '$projectdir/soap.cfg' \ --input '$projectdir/minia/minia_k500/minia_k500.unitigs.fa' --outdir '$projectdir/soap' (1) (2) (3) (4)
1 The config file has been created in the previous step. 2 The input file is the assembly file created by Minia3. 3 The output will be written to the folder minia_k500.unitigs
in the directory$projectdir/soap
. You can modify the folder by using the--suffix
parameter. The output folder will be used in later steps and be referred to as$assembly
or${assembly}
.4 You need to specify the paths to the executables of SOAPDenovo2 ( --soap
) and SOAPDenovo-fusion (--fusion
). -
Now determine the best value for
pair_num_cutoff
for each library by running the scaffolding many times for different values. Proceed from library with the smallest to that with the largest insert size.input="$projectdir/soap/$assembly" (1) lib='PE800' (2) $bitbucket/shell/run_soap_loop.zsh --threads 31 --name $lib --size 800 --interval 5:50 --input $input \(3) > ${input}_${lib}_thresholds.tsv 2> ${input}_${lib}_thresholds.err \(4) lib='MP3' $bitbucket/shell/run_soap_loop.zsh --threads 11 --name $lib --size 3000 --interval 5:15 --input $input \ --pegrads "$projectdir/soap/${assembly}_PE800_l45/${assembly}.peGrads" \(5) (6) > ${input}_${lib}_thresholds.tsv 2> ${input}_${lib}_thresholds.err lib='MP6' $bitbucket/shell/run_soap_loop.zsh --threads 11 --name $lib --size 6000 --interval 5:15 --input $input \ --pegrads "$projectdir/soap/${assembly}_MP3_l6/${assembly}.peGrads" \(7) > ${input}_${lib}_thresholds.tsv 2> ${input}_${lib}_thresholds.err (6) lib='MP9' $bitbucket/shell/run_soap_loop.zsh --threads 11 --name $lib --size 9000 --interval 5:15 --input $input \ --pegrads "$projectdir/soap/${assembly}_MP6_l9/${assembly}.peGrads" \(7) (8) > ${input}_${lib}_thresholds.tsv 2> ${input}_${lib}_thresholds.err (6)
1 This is the folder containing the SOAP mapping results from the previous step. 2 $lib` is only a name, but the --size
parameter below must match the insert size used in the configuration file.3 Try out values from 5 to 50 ( --interval 5:50
) for thepair_num_cutoff
of the PE800 library.4 Inspect the output files. The error log file should be empty. In the threshold file, there are five columns: (1) pair_num_cutoff
; (2) total assembly size; (3) cumulative size of scaffolds whose length is larger or equal to 1 Mb; (4) N50; (5) N90; (5) non-N bases. Select thepair_num_cutoff
value with the highest N50 and/or size of scaffold above 1 Mb. A directory for each value ofpair_num_cutoff
was created containing the files with assembly statistics ($assembly.scafSeq.n50
and$assembly.scafStatistics
).5 Specify here the path to the *.peGrads
file for the PE800 scaffolding result.6 Repeat step 4 for the respective library. 7 Repeat step 5, i.e. select the best result from the preceding step. 8 You need to specify the paths to the following executables: SOAPDenovo2 ( --soap
), GNU Parallel (--parallel
), SAMtools (--samtools
), and then50
script from$bitbucket/shell
(--n50
). -
To speed up the analysis, gap filling was disabled in the SOAPDenovo2 iterations for finding the best link count thresholds. Now that we have determined the best values for
pair_num_cutoff
, add the gap filling step.$bitbucket/shell/run_soap_scaffolding.zsh --threads 10 \ --suffix 'ALL_LIB' --pegrads "$projectdir/soap/${assembly}_MP9_l6/${assembly}.peGrads" \ --input "$projectdir/soap/$assembly" --outdir "$projectdir/soap" (1) (2) (3)
1 --gapdiff
is the maximum allowed difference between statistically estimated gap sizes and gap sizes determined by local k-mer assembly around gaps.2 --suffix
is a name that can be anything, but should be something. The default is "SCAFF".3 You need to specify the path to the SOAPDenovo2 executable ( --soap
).
9. Filling internal gaps
-
We will now use the PE450 reads and GapCloser to fill internal gaps (stretches of N characters) in the scaffolds assembled by SOAPDenovo2. This ususally reduced the proportion of unfilled internal gaps by 2 to 3 percentage points. First, we need to create again a configuration file similar to the one used for scaffolding.
max_rd_len=500 [LIB] avg_ins=450 (1) asm_flags=2 rank=1 pair_num_cutoff=5 map_len=35 q=$projectdir/corrected_reads/PE450_correct_trim.fq.gz (2)
1 Put the size of the PE450 library here. 2 Insert the path to the error corrected PE450 reads, which were also used in the Minia3 assembly. -
Now run GapCloser.
$bitbucket/shell/run_gapcloser.zsh --threads 30 \ --outdir "$projectdir/gapcloser" --config "$projectdir/gapcloser.cfg" \ --input "$projectdir/soap/${assembly}/${assembly}.scafSeq" \ --name $assembly (1) (2) (3) (4)
1 Choose the number of CPU cores ( --threads
) carefully. GapCloser will use all the cores you give to it and run for several days (in case of polyploid wheat). You can track the progress by looking at the size of the output file.2 The config file has been created in the previous step. 3 The input file is the FASTA file with the scaffolds of the final SOAP run with optimized pair_num_cutoff
values and gap filling enabled.4 You need to specify the paths to executables of GapCloser ( --gapcloser
), SAMtools (--samtools
) and Seqtk (--seqtk
).
10. Mapping 10X and Hi-C data
-
First, create a FASTA file with new scaffold names following the scheme
scaffold_1
,scaffold_2
, …wd="$projectdir" scaf="$projectdir/gapcloser/$assembly/$assembly.fasta" (1) base=${scaf:t:r} (2) ref=$wd'/10x/'${base}'/scaffolds.fasta' (3) dirtenex=$ref:h dirhic=$wd'/hic/'${base} dirpseudomolecules=$wd'/pseudomolecules/'${base} mkdir $dirtenex $dirhic $dirpseudomolecules (4) awk '/^>/ {print "\n>scaffold_"++n; next} {printf $0}' $scaf | awk NF > $ref && { samtools faidx $ref && $bitbucket/shell/n50 $ref.fai > $ref:r.n50 } (5) (6)
1 Set the path the GapCloser output. 2 This is syntax specific to the Z shell. It evaluates to minia_k500_contigs
if the value of$scaf
issome_directory/minia_k500_contigs.fasta
, i.e. it strips the directory name and the file extension from the path.3 Path to the renamed FASTA file 4 Create the directories for the 10X mapping, the Hi-C mapping and the pseudomolecule construction for your assembly. 5 Write the renamed FASTA file. 6 This assumes that the SAMtools executable is in $PATH
. -
Digest the assembly for Hi-C mapping.
ref="$projectdir/10x/$assembly/scaffolds.fasta" $bitbucket/shell/digest_emboss.zsh --ref $ref --enzyme 'HindIII' --sitelen 6 --minlen 100 (1) (2) (3) (4)
1 $ref
is the renamed FASTA file created in step 1.2 Another common choice is --enzyme DpnII --sitelen 4
. Make sure that the enzyme matches the one used for making your Hi-C libraries.3 While this runs, you can proceed with the next steps. 4 You need to specify the paths to the restrict
executable of EMBOSS (--restrict
) and to BEDTools (--bedtools
). Furthermore, the path to the REBASE file ("link_emboss_e.txt") needs to be given (--rebase
). -
Copy the directories with 10X reads (one for each 10X library) to the 10X analysis folder. Assuming that the sample folders contain only symbolic links with absolute paths you can use the following Z shell command to copy the folder and the symbolic links.
cp -r $projectdir/raw_reads/10x/*(/) $projectdir/10x/$assembly
-
Now the 10X mapping can be started.
find $projectdir/10x/$assembly -type d | tail -n +2 | while read sample; do (1) $bitbucket/shell/run_10x_mapping.zsh --threads 20 --mem '200G' \ --ref $projectdir/10x/$assembly/scaffolds.fasta --tmp $TMPDIR $sample (2) (3) (4) done
1 You can also call run_hic_mapping.zsh
for individual samples.2 The temporary directory needs to be large enough to hold huge BAM files (dozens of GB). 3 --mem
specifies the RAM allocated for Novosort. Two Novosort processes are running in one pipe (sorting by postion with concomitant duplicate removal, and then sorting by read name again). So twice the value of--mem
is needed to sort everthing in memory (which is faster than using temporary files).4 You need to specify the paths to the following executables: Cutadapt ( --cutadapt
), Seqtk (--seqtk
), Minimap2 (--mimimap
), Novosort (--novosort
), SAMtools (--samtools
), and BEDTools (--bedtools
). -
While the 10X mapping is running, minor mapping jobs can be started in the mean time. For barley, we align the 2012 WGS assemby of Morex to our new assembly. This will be used later to lift POPSEQ genetic markers to the new assembly.
qry='assembly_WGSMorex.fasta' (1) size='6G' mem='5G' threads=10 ref="$projectdir/10x/$assembly/scaffolds.fasta" prefix=${ref:r}_morexWGS { minimap2 -t $threads -2 -I $size -K$mem -x asm5 $ref $qry | gzip > $prefix.paf.gz } 2> $prefix.err (2)
1 This file can be downloaded from here. 2 This assumes that the Minimap2 executable is in $PATH
. -
We align our assembly to the 2017 BAC-by-BAC assembly of Morex. This alignment will be used later to create colinearity plots between both assemblies. This code is rather generic and can be used in modified form for assembly-to-assembly alignments.
qry='barley_morex_pseudomolecules.fasta' (1) size='6G' mem='5G' threads=10 ref="$projectdir/10x/$assembly/scaffolds.fasta" prefix=${ref:r}_morexPSMOL { minimap2 -t $threads -2 -I $size -K$mem -x asm5 $ref $qry | gzip > $prefix.paf.gz } 2> $prefix.err (2) (3) (4)
1 This file can be downloaded from here: https://doi.org/10.5447/ipk/2016/34. 2 This assumes that the Minimap2 executable is in $PATH
.3 The value of -I
should exceed the assembly size.4 Use sort | bgzip
instead if you want retrieve region with Tabix later. The sort keys need to be chosen depending on whether you want to index by reference or by query. -
Once the 10X mapping is done, or if free resources are available, proceed with the Hi-C alignment. First create the directories of each Hi-C samples in the Hi-C analysis folder (
$projectdir/hic/$assembly
) and copy/symlink the read files there. -
Now start the Hi-C mapping.
ref="$projectdir/hic/$assembly/scaffolds.fasta" bed=${ref:r}_HindIII_fragments_100bp.bed (1) linker='AAGCTAGCTT' (2) find $projectdir/hic/$assembly -type d | tail -n +2 | while read sample; do $bitbucket/shell/run_hic_mapping.zsh --threads 30 --mem '100G' --ref $ref \ --linker $linker --bed $bed --tmp $TMPDIR $sample (3) (4) (5) done
1 Use the BED file with the in silico digest for the correct restriction enzyme. Using the wrong enzyme will give weird results. 2 Specify correct linker sequences that is created by ligation of fill-in overhangs during the Hi-C protocol. If HindIII was used, the correct linker is AAGCTAGCTT; if DpnII was used, the linker is GATCGATC. 3 You can also call run_hic_mapping.zsh
for individual samples.4 The same consideration for temporary directories and memory requirements as for the 10X mapping apply. 5 You need to specify the paths to the following executables: Cutadapt ( --cutadapt
), Minimap2 (--mimimap
), Novosort (--novosort
), SAMtools (--samtools)
, and BEDTools (--bedtools)
, and bgzip (--bgzip
).
11. Super-scaffolding with 10X Chromium data
-
All subsequent steps should be carried out in the working directory
$projectdir/pseudomolecules/$assembly
. -
Most steps will be done in R from now on. If you are unsure whether a command should be pasted to the R or to the shell prompt, hover with the mouse over the code listings. SH or R should appear in the top-right corner.
-
First load all the necessary datasets into R.
# Load the R functions for pseudomolecule construction. source("$bitbucket/R/pseudomolecule_construction.R") (1) # Import a genetic map of sufficient density. In barley, we use the POPSEQ map. readRDS('Morex_IBSC2012_WGS_assembly_POPSEQ_info.Rds') -> popseq (2) setnames(popseq, sub("morex", "css", names(popseq))) # Import the table of scaffold lengths. f <- '$projectdir/10x/$assembly/scaffolds.fasta.fai' (1) (3) fread(f, head=F, select=1:2, col.names=c("scaffold", "length")) -> fai # Alignment of genetic markers used for the genetic map. In this example, the Morex WGS assembly by IBSC (2012). f <- '$projectdir/10x/$assembly/scaffolds_morexWGS.paf.gz' (4) read_morexaln_minimap(paf=f, popseq=popseq, minqual=30, minlen=500, prefix=T)->morexaln # Read the list of 10X molecules. dirtenex <- '$projectdir/10x/$assembly' (5) fread(cmd=paste("find", dirtenex, "-type f | grep 'molecules.tsv.gz$'"), head=F)$V1->f names(f) <- c("S1", "S2") (6) read_10x_molecules(files=f, ncores=length(f)) -> molecules # Read the list of Hi-C links. dirhic <- '$projectdir/hic/$assembly' (7) fread(cmd=paste('find', dirhic, '| grep "_fragment_pairs.tsv.gz$" | xargs zcat'), header=F, col.names=c("scaffold1", "pos1", "scaffold2", "pos2"))->fpairs
1 Note that $bitbucket$, `$projectdir
and$assembly
will not evaluated as variable by R here and in all other instances below. Replace them always with a correct path by modifying the command with a text editor before running it.2 This file can be downloaded from here. Use it as a template to create analogous for other species. 3 This file was created here. 4 This file was created here. 5 This file is the 10X analysis directory defined here. 6 Adjust if you have more than or less than two 10X samples. 7 This file is the Hi-C analysis directory defined here. -
Now initialize the assembly object, assign POPSEQ coordinates to scaffolds and compute the physical coverage with 10X molecules and Hi-C links. Then save the object.
init_assembly(fai=fai, cssaln=morexaln, molecules=molecules, fpairs=fpairs) -> assembly anchor_scaffolds(assembly = assembly, popseq=popseq, species="barley") -> assembly (1) add_molecule_cov(assembly = assembly, cores=30) -> assembly add_hic_cov(assembly, cores=30)->assembly saveRDS(assembly, file="assembly_v1.Rds") (2)
1 Possible values for the species
parameter are:barley
,wheat
(works for A genome species, T. turgidum and T. aestivum) andrye
. More species can be accomodated by re-defining the functionchrNames()
. TypechrNames
to see its source code. The only use ofspecies
is to map numeric chromosome IDs to proper names, e.g. 1 to 1H in barley (see the note by Linde-Laursen to see why this makes a difference) or 21 to 7D.2 It is recommend to prefix the names of the RDS files created in this and later steps with the current date (e.g 181129) to distinguish different versions. -
Find potential chimeras and generate diagnostic plots for them.
find_10x_breaks(assembly=assembly, interval=5e4, minNbin=20, dist=1e4, ratio=-3) -> breaks (1) breaks[d >= 1e4][order(-d)] -> b plot_chimeras(assembly=assembly, scaffolds=b, breaks=b, species="barley", file="assembly_v1_putative_chimeras.pdf", cores=10) (2)
1 This will out put 200 bp bins having eight times fewer 10X molecules spanning them than the average across the genome. 2 Create diagnostic plots for those without breakpoint more than 10 kb away from the scaffold ends. -
Break the scaffolds and save the corrected assembly object.
break_10x(assembly, prefix="scaffold_corrected", ratio=-3, interval=5e4, minNbin=20, dist=2e3, slop=2e2, species="barley", intermediate=F, ncores=30)->a a$assembly -> assembly_v2 saveRDS(assembly_v2, file="assembly_v2.Rds") saveRDS(a$breaks, file="assembly_v1_breaks.Rds")
-
Run the 10X scaffolding with different parameters for the number of read pairs required to assign molecules to scaffolds (
npairs
), the number of molecules to link adjacent scaffolds (nmol
) and the number of distinct 10X libraries (nlib
) that satisfy thenpairs
andnmol
thresholds. Save the results.setnames(data.table(expand.grid(2:8, 2:8, 2:2)), c("npairs", "nmol", "nlib"))->grid mclapply(mc.cores=10, 1:nrow(grid), function(i){ grid[i, npairs]->n grid[i, nmol]->m grid[i, nlib]->s scaffold_10x(assembly=assembly_v2, prefix="scaffold_10x", min_npairs=n, max_dist=1e5, min_nmol=m, popseq_dist=10, max_dist_orientation=5, min_nsample=s, ncores=1)->z })->res saveRDS(res, file="assembly_v2_10x_trials.Rds")
-
Sort scaffolding results by N50 and pick the best one. Then save the result.
data.table(grid, index=1:length(res), max=unlist(lapply(res, function(i) max(i$info$length))), n50=unlist(lapply(res, function(i) n50(i$info$length))))[order(-n50)] (1) best_idx <- 7 (1) res[[best_idx]] -> assembly_v2_10x saveRDS(assembly_v2_10x, file="assembly_v2_10x.Rds")
1 Set the value of index in the first row (i.e. assembly with the highest N50) as the best_idx
. If the N50 or the maximum scaffold are unreasonably (N50 > 100 Mb or largest scaffold > 400 Mb), select a parameter set that is within these bounds.
12. Pseudomolecule construction with Hi-C data
-
Lift the Hi-C link and POPSEQ information from the corrected scaffolds to the super-scaffolds and save the resulting assembly object.
init_10x_assembly(assembly=assembly_v2, map_10x=assembly_v2_10x, molecules=F) -> assembly_v3 anchor_scaffolds(assembly = assembly_v3, popseq=popseq, species="barley") -> assembly_v3 add_hic_cov(assembly_v3, cores=10)->assembly_v3 saveRDS(assembly_v3, file="assembly_v3.Rds")
-
Plot potential chimeras evident from the Hi-C data. If there are chimeras, see below how to correct them.
assembly_v3$info[mri < -4 & length >= 5e5, scaffold] -> bad assembly_v3$cov[scaffold %in% bad][order(r)][!duplicated(scaffold)][order(-length)][, .(scaffold, br=bin)] -> breaks plot_chimeras(scaffolds=breaks, species="barley", assembly_v3, file="assembly_v3_putative_chimeras.pdf", cores=10)
-
Read the positions of restriction fragments on the original scaffolds and transpose them to 10X super-scaffold coordinates.
f <- '$projectdir/10x/$assembly/scaffolds_HindIII_fragments_100bp.bed' (1) read_fragdata(info=assembly_v2$info, file=f, map_10x=assembly_v2_10x, assembly_10x=assembly_v3) -> frag_data
1 Change the enzyme name as needed. -
Define the set of scaffolds for Hi-C map construction. Only scaffolds with chromosome assignments based on Hi-C links are used. Excluding small scaffolds (e.g. less than 300 kb) is sensible as well.
frag_data$info[!is.na(hic_chr) & length >= 3e5, .(scaffold, nfrag, chr=hic_chr, cM=popseq_cM)] -> hic_info (1)
1 If the contact matrix of the map (see below) doesn’t look nice, try increasing the minimum scaffold length (300 kb in this example). -
Run the Hi-C map construction.
hic_map(info=hic_info, assembly=assembly_v3, frags=frag_data$bed, species="barley", ncores=7, min_nfrag_scaffold=30, max_cM_dist = 30, binsize=1e5, min_nfrag_bin=20, gap_size=100) -> hic_map_v1 (1) (2) (3)
1 If the contact matrix of the map (see below) doesn’t look nice, try increasing the minimum number of restriction fragments per scaffold. 2 If parts of chromosomes are missing, increase max_cM_dist
(30 cM in this example).3 A good value for binsize
is one third of the minimum scaffold length used to define thehic_info
object. -
Lift the Hi-C links to pseudomolecule coordinates. Bin and normalize Hi-C links in 1 Mb bins along the Hi-C-based pseudomolecules. Save the Hi-C map object.
f <- '$projectdir/10x/$assembly/scaffolds_HindIII_fragments_100bp_split.nuc.txt' add_psmol_fpairs(assembly=assembly_v2, hic_map=hic_map_v1, map_10x=assembly_v2_10x, assembly_10x=assembly_v3, nucfile=f) -> hic_map_v1 bin_hic_step(hic=hic_map_v1$links, frags=hic_map_v1$frags, binsize=1e6, chrlen=hic_map_v1$chrlen, chrs=1:7, cores=7)->hic_map_v1$hic_1Mb normalize_cis(hic_map_v1$hic_1Mb, ncores=7, percentile=0) -> hic_map_v1$hic_1Mb$norm saveRDS(hic_map_v1, "hic_map_v1.Rds")
-
Collect basic summary statistics.
# Lengths and scaffold numbers per chromosome hic_map_v1$agp[gap == F, .(n=.N, s=sum(scaffold_length/1e6)), key=agp_chr] # Total size of scaffolds in pseudomolecules hic_map_v1$agp[gap == F & agp_chr != "chrUn", sum(scaffold_length)/1e9] # Total size of unanchored scaffolds. hic_map_v1$agp[gap == F & agp_chr == "chrUn", sum(scaffold_length)/1e9]
-
Plot the contact matrices for intra- and interchromosomal contacts.
contact_matrix(hic_map=hic_map_v1, links=hic_map_v1$hic_1Mb$norm, boundaries=T, species="barley", file="hic_map_v1_contact_matrix_1Mb.pdf") interchromosomal_matrix_plot(hic_map=hic_map_v1, file="hic_map_v1_interchromosomal.png", species="barley")
-
Find potential inversions, create diagnostic plots and export the Hi-C map for use in Hi-C map inspector. If there are obvious errors (remaining chimeras, misoriented scaffolds), see below how to correct them.
find_inversions(hic_map=hic_map_v1, links=hic_map_v1$hic_1Mb$norm, species="barley", cores=10)->inv plot_inversions(inversions=inv, hic_map=hic_map_v1, species="barley", file="hic_map_v1_inversion_statistic.pdf") inspector_export(hic_map=hic_map_v1, assembly=assembly_v3, inv=inv, species="barley", file="hic_map_v1_inspector_export.Rds")
-
The source code required for deploying Hi-C map inspector on an R Shiny server is contained in
$bitbucket/R/map_inspector.Rmd
. You need to modify the paths to the data directory and the default map object. -
A useful method of assembly QC is the alignment to a closely related species. In this example, we plot the collinearity between the assembly we made (y-axis) the Morex IBSC 2017 pseudomolecules (x-axis).
readRDS(file='Morex_IBSC2017_AGP.Rds') -> morex_agp (1) orig_scaffold_to_agp(assembly=assembly_v2, map_10x=assembly_v2_10x, hic_map = hic_map_v1) -> a (2) f <- '$projectdir/10x/$assembly/scaffolds_morexPSMOL.paf.gz' (3) read_paf(save=T, f)->paf_morex lift_morex_aln_agp(paf=paf_morex_masked, morex_agp=morex_agp, scaffold_to_agp = a) -> p (4) p[alnlen >= 2e3 & mapq >= 30 & agp_chr == morex_chr & agp_chr != "chrUn"] -> pp (5) hic_map_v1$agp[, .(chrlen=max(agp_end)), key=agp_chr] -> chrlen (6) morex_agp[, .(chrlen=max(agp_end)), key=agp_chr] -> chrlen_morex (6) pdf("hic_map_v1_vs_IBSC2017.pdf") (7) par(mar=c(5,5,3,1)) lapply(sort(unique(pp$agp_chr)), function(cc){ pp[cc, on="agp_chr"] -> q q[, idx := 1:.N] plot(0, type='n', las=1, bty='l', xlab="position in MorexV1 (IBSC, 2017) pseudomolecule (Mb)", ylab"position in pseudomolecule (Mb)", main=cc, xlim=c(0, chrlen_morex[cc, on="agp_chr"]$chrlen/1e6), ylim=c(0, chrlen[cc, on="agp_chr"]$chrlen/1e6)) q[aln_orientation == 1, lines(c(morex_agp_start/1e6, morex_agp_end/1e6), c(agp_start/1e6, agp_end/1e6)), by=idx] q[aln_orientation == -1, lines(c(morex_agp_start/1e6, morex_agp_end/1e6), c(agp_end/1e6, agp_start/1e6)), by=idx] }) dev.off()
1 Load the AGP of the Morex IBSC 2017 assembly. This file can be downloaded from here. 2 Create a mapping table of original (uncorrected) scaffolds to pseuomolecule positions. 3 Read the alignment of the IBSC 2017 BAC sequences to the new assembly. This file was created here. 4 Tranpose mappings to pseudomolecule coordinates in both assemblies. 5 Filter alignment for plotting (minimum alignment and map scores, excluding chrUn) 6 Get tables with chromosome lengths for both assemblies. 7 Create the PDF files with the plots, one page per chromosome.
13. Manual correction of misassemblies
-
This is an example of breaking chimeras in the 10X super-scaffolds that were detected by inspecting of diagnostic plots.
bad = c("scaffold_10x_42", "scaffold_10x_815") (1) assembly_v3$cov[scaffold %in% bad][order(r)][!duplicated(scaffold)][order(-length)][, .(scaffold, br=bin)]->breaks (2) plot_chimeras(scaffolds=breaks, species="barley", assembly_v3, file=paste0("assembly_chimera_candidates.pdf"), cores=10) breaks[c(2, 3, 4)] ->br (3) rbind(br, data.table(scaffold=rep("scaffold_10x_237", 2), br=c(3.2e6, 3.5e6))) -> br plot_chimeras(scaffolds=br, species="barley", assembly_v3, breaks=br, file=paste0("assembly_v3_putative_chimeras_v2.pdf"), cores=10) break_scaffolds(br, assembly_v3, prefix="scaffold_10x_v2_", slop=1e4, cores=10, species="barley") -> assembly_v4 (4) saveRDS(assembly_v4, file="assembly_v4.Rds") f <- '$projectdir/10x/$assembly/scaffolds_HindIII_fragments_100bp.bed' read_fragdata(info=assembly_v2$info, file=f, map_10x=assembly_v2_10x, assembly_10x=assembly_v3)->frag_data (5)
1 Get a list of potential chimeras, e.g. from Hi-C map inspector. 2 Extract the bins with lowest physical Hi-C coverage. 3 If need be, refine the list of breakpoints manually. 4 Break the scaffolds and save the new assembly object. 5 Create a new frag_data
table as for the uncorrected 10X super-scaffolds and proceeding with Hi-C map construction. -
Flip misoriented scaffolds, either a single scaffold or a group of adjacent scaffolds.
# Flip single scaffolds. l <- c('scaffold_10x_815', 'scaffold_10x_42') correct_inversions(hic_map=hic_map_v1, scaffolds=l, species="barley")->hic_map_v2 # Flip an interval, i.e. a set of a adjacent scaffold in the Hi-C map ranges <- data.table(agp_chr="chr2H", start=100, end=104) correct_multi_inversions(hic_map_v1, ranges, species="barley") -> hic_map_v2 (1)
1 This will invert the order of scaffolds in bins 100 to 104, and flip the orientation of each scaffold in this range. -
After correcting inversions, proceed with binning binning, calculating summary statistics and the validation plots.
14. Compiling pseudomolecules
-
We proceed in two steps. First, we create a FASTA file for the 10X super-scaffolds and then a FASTA file for the pseudomolecules.
-
We start with writing the required BED files in R.
# Load the RDS file of the corrected assembly and the 10X superscaffolds. readRDS("assembly_v2_10x.Rds")->assembly_v2_10x (1) readRDS("assembly_v2.Rds")->assembly_v2 (1) # Write a BED file mapping the corrected to original scaffolds. assembly_v2$info[, .(orig_scaffold, orig_start = orig_start - 1, orig_end, scaffold, length)] -> oldnew write.table(sep="\t", col.names=F, row.names=F, quote=F, oldnew[, .(orig_scaffold, orig_start, orig_end, scaffold)], file="assembly_v1_assembly_v2.bed") # Write a BED file for the 10X superscaffolds. bed <- assembly_v2_10x$agp[, .(scaffold, start=0, end=length, super, super_start, orientation=ifelse(orientation == 1, "+", "-"))] bed[is.na(orientation), orientation := "+"] (2) options(scipen=20) (3) write.table(bed, col.names=F, row.names=F, quote=F, sep="\t", file="assembly_v2_10x.bed") # Write the 10X AGP file. agpbed<- assembly_v2_10x$agp[, .(super, super_start = super_start-1, super_end, scaffold, index, orientation=ifelse(orientation == 1, "+", "-"))] agpbed[is.na(orientation), orientation := "+"] options(scipen=20) write.table(agpbed, col.names=F, quote=F, row.names=F, sep="\t", file="assembly_v2_10x_AGP.bed")
1 These are the results of 10X scaffolding and the assembly object with the 10X super-scaffolds. Do not use a manually corrected 10X super-scaffold object here. 2 Scaffolds without orientation are placed in +
orientation into the pseduomolecules.3 This is required to avoid writing large numbers in scientific notation (e.g. 4e+05
). -
If you did manual breaking of the initial 10X super-scaffolds (
assembly_v3
), you need to run the following code snippet to create BED files specifying the boundaries of the corrected scaffolds in the corrected assembly.readRDS("assembly_v4.Rds") -> assembly_v4 (1) assembly_v4$info[, .(orig_scaffold, orig_start = orig_start - 1, orig_end, scaffold, length)] -> oldnew oldnew[orig_end - orig_start != length] sum(as.numeric(oldnew$length)) write.table(sep="\t", col.names=F, row.names=F, quote=F, oldnew[, .(orig_scaffold, orig_start, orig_end, scaffold)], file="assembly_v3_assembly_v4.bed")
1 This is the final 10X super-scaffold assembly object used for Hi-C mapping. Change this name and the file names below as needed. -
Now we use shell commands to write the FASTA files with 10X super-scaffolds.
bedtools='/opt/Bio/bedtools/2.26.0/bin/bedtools' (1) samtools='/opt/Bio/samtools/1.9/bin/samtools' (1) # Specify paths and file names. input='$projectdir/10x/$assembly/scaffolds.fasta' bed='assembly_v1_assembly_v2.bed' output='assembly_v2.fasta' # Write the FASTA file for the corrected assembly. $bedtools getfasta -name -fi $input -bed $bed -fo /dev/stdout | sed 's/::.*$/ /' > $output $samtools faidx $output # Add a dummy sequence composed of Ns. assembly='assembly_v2.fasta' tenexinput='assembly_v2+gap.fasta' { repeat 1000 echo -n N } | awk 'BEGIN {print ">gap"} NF' \ | cat $assembly - > $tenexinput $samtools faidx $tenexinput # Write the FASTA file for the 10X superscaffolds. bed='assembly_v2_10x.bed' tenexinput='assembly_v2+gap.fasta' tenex='assembly_v3.fasta' $bedtools getfasta -s -name -fi $tenexinput -bed $bed -fo /dev/stdout \(2) | sed 's/::/ /' \ | awk '/^>/ && !h[$1]++ {print "\n" $0; next} !/^>/ {printf $0}' \ | awk NF | fold -w 60 > $tenex $samtools faidx $tenex # Check that the superscaffolds contain all the input sequence. $bitbucket/shell/check_psmol.zsh --assembly 'assembly_v2.fasta' \ --agp 'assembly_v2_10x_AGP.bed' --psmol $tenex > ${tenex:r}_check.out 2> ${tenex:r}_check.err (3) (4) (5)
1 Change the paths to samtools
andbedtools
. Please use BEDtools version 2.26.0 as newer versions may modify the format of the header line.2 Note the use of strand information ( -s
).3 The output files lists the differences between the input sequences and the sequences retrieved from the pseudomolecule using the AGP BED file. It should be empty. 4 You need to specify the paths to the following executables: BEDtools ( --bedtools
), SAMtools (--samtools
), Tabix (--tabix
) and bgzip (--bgzip
).5 The syntax ${tenex:r}
to strip the file extension is specific to the Z shell. -
If you made manual changes of the 10X super-scaffolds, run the next commands to write FASTA files for the corrected assembly.
bedtools='/opt/Bio/bedtools/2.26.0/bin/bedtools' (1) samtools='/opt/Bio/samtools/1.9/bin/samtools' (1) # Write a FASTA file for the manually curated 10X super-scaffolds. input='$projectdir/pseudomolecules/$assembly/assembly_v3.fasta' (1) bed='assembly_v3_assembly_v4.bed' output='assembly_v4.fasta' $bedtools getfasta -name -fi $input -bed $bed -fo /dev/stdout \(2) | sed 's/::/ /' > $output $samtools faidx $output
1 This is the FASTA file with the 10X super-scaffolds created in the previous step. 2 The parameter -s
is not required. Breaking chimeric sequences does not change orientation. -
Now write analogous BED files for the chromosomal pseudomolecules, starting from the 10X super-scaffolds.
readRDS("hic_map_v3.Rds") -> hic_map_v3 (1) bed <- hic_map_v3$agp[, .(scaffold, start=0, end=scaffold_length, agp_chr, agp_start, orientation=ifelse(orientation == 1, "+", "-"))] bed[is.na(orientation), orientation := "+"] options(scipen=20) write.table(bed, col.names=F, row.names=F, quote=F, sep="\t", file="assembly_v3_to_psmol.bed") agpbed<-hic_map_v3$agp[, .(agp_chr, agp_start = agp_start-1, agp_end, scaffold, index, orientation=ifelse(orientation == 1, "+", "-"))] agpbed[is.na(orientation), orientation := "+"] options(scipen=20) write.table(agpbed, col.names=F, quote=F, row.names=F, sep="\t", file="pseudomolecule_v3_AGP.bed")
1 This is the final Hi-C map object. -
Write the FASTA files of the pseudomolecules.
samtools='/opt/Bio/samtools/1.9/bin/samtools' (1) # Add a dummy sequence composed of Ns. assembly='assembly_v3.fasta' (2) psmolinput='assembly_v3+gap.fasta' { repeat 1000 echo -n N } | awk 'BEGIN {print ">gap"} NF' \ | cat $assembly - > $psmolinput $samtools faidx $psmolinput # Write the FASTA file for the pseudomolecules. psmol='pseudomolecule_v3.fasta' $bitbucket/shell/write_psmol.zsh --bed 'assembly_v3_to_psmol.bed' \ --assembly 'assembly_v3+gap.fasta' --cores 8 > $psmol 2> $psmol:r.err (3) (4) $samtools faidx $psmol (2) # Check that the pseudomolecules contain all the input sequence. $bitbucket/shell/check_psmol.zsh --assembly 'assembly_v3.fasta' \ --agp 'pseudomolecule_v3_AGP.bed' --psmol $psmol > ${psmol:r}_check.out 2> ${psmol:r}_check.err (5) (6)
1 This is the FASTA file created in step 3 (if you didn’t do any manual breaking), or in step 4 (if you did manual breaking; in this case, use, e.g., assembly_v4
here).2 Change the path to samtools
.3 The number of cores should equal the number of pseudomolecules in your assembly. 4 You need to specify the paths to the following executables: BEDtools ( --bedtools
), GNU Parallel (--parallel
), Tabix (--tabix
) and bgzip (--bgzip
).5 The output file lists the differences between the input sequences and the sequences retrieved from the pseudomolecule using the AGP BED file. It should be empty. 6 You need to specify the paths to the following executables: BEDtools ( --bedtools
), SAMtools (--samtools
), Tabix (--tabix
) and bgzip (--bgzip
).