Assembly statistics and evaluation¶
So, we now have an assembly of our reads in rna-assembly.fa
. Let’s
take a look at this file –
head rna-assembly.fa
This is a FASTA file with complex (and not, on the face of it, very informative!) sequence headers, and a bunch of sequences. There are three things you might want to do with this assembly - check its quality, annotate it, and search it. Below we’re going to check its quality; other workshops do (will) cover annotation and search.
Applying transrate¶
transrate is a program for assessing RNAseq assemblies that will give you a bunch of assembly statistics, along with a few other outputs.
First, let’s download and install it:
cd
curl -O -L https://bintray.com/artifact/download/blahah/generic/transrate-1.0.2-linux-x86_64.tar.gz
tar xzf transrate-1.0.2-linux-x86_64.tar.gz
export PATH=~/transrate-1.0.2-linux-x86_64:$PATH
Now run transrate on the assembly to get some preliminary stats:
transrate --assembly=/mnt/work/rna-assembly.fa --output=/mnt/work/stats
This should give you output like this:
[ INFO] 2016-03-30 13:38:11 : n seqs 62
[ INFO] 2016-03-30 13:38:11 : smallest 206
[ INFO] 2016-03-30 13:38:11 : largest 4441
[ INFO] 2016-03-30 13:38:11 : n bases 81132
[ INFO] 2016-03-30 13:38:11 : mean len 1308.58
[ INFO] 2016-03-30 13:38:11 : n under 200 0
[ INFO] 2016-03-30 13:38:11 : n over 1k 37
[ INFO] 2016-03-30 13:38:11 : n over 10k 0
[ INFO] 2016-03-30 13:38:11 : n with orf 41
[ INFO] 2016-03-30 13:38:11 : mean orf percent 64.76
[ INFO] 2016-03-30 13:38:11 : n90 841
[ INFO] 2016-03-30 13:38:11 : n70 1563
[ INFO] 2016-03-30 13:38:11 : n50 1606
[ INFO] 2016-03-30 13:38:11 : n30 2070
[ INFO] 2016-03-30 13:38:11 : n10 3417
[ INFO] 2016-03-30 13:38:11 : gc 0.46
[ INFO] 2016-03-30 13:38:11 : gc skew -0.04
[ INFO] 2016-03-30 13:38:11 : at skew -0.05
[ INFO] 2016-03-30 13:38:11 : cpg ratio 1.88
[ INFO] 2016-03-30 13:38:11 : bases n 0
[ INFO] 2016-03-30 13:38:11 : proportion n 0.0
[ INFO] 2016-03-30 13:38:11 : linguistic complexity 0.23
...which is pretty useful basic stats.
I’d suggest paying attention to:
- n seqs
- largest
- mean orf percent
and more or less ignoring the rest ;).
Note: don’t use n50 to characterize your transcriptome, as with transcripts you are not necessarily aiming for the longest contigs, and isoforms will mess up your statistics in any case.
Evaluating read mapping¶
You can also use transrate to assess read mapping; this will use read evidence to detect many kinds of errors.
To do this, you have to supply transrate with the reads:
transrate --assembly=/mnt/work/rna-assembly.fa --left=/mnt/work/left.fq --right=/mnt/work/right.fq --output=/mnt/work/transrate_reads
The relevant output is here:
[ INFO] 2016-03-30 15:38:55 : Read mapping metrics:
[ INFO] 2016-03-30 15:38:55 : -----------------------------------
[ INFO] 2016-03-30 15:38:55 : fragments 57178
[ INFO] 2016-03-30 15:38:55 : fragments mapped 50925
[ INFO] 2016-03-30 15:38:55 : p fragments mapped 0.89
[ INFO] 2016-03-30 15:38:55 : good mappings 8297
[ INFO] 2016-03-30 15:38:55 : p good mapping 0.15
[ INFO] 2016-03-30 15:38:55 : bad mappings 42628
[ INFO] 2016-03-30 15:38:55 : potential bridges 22
[ INFO] 2016-03-30 15:38:55 : bases uncovered 11877
[ INFO] 2016-03-30 15:38:55 : p bases uncovered 0.15
[ INFO] 2016-03-30 15:38:55 : contigs uncovbase 48
[ INFO] 2016-03-30 15:38:55 : p contigs uncovbase 0.77
[ INFO] 2016-03-30 15:38:55 : contigs uncovered 13
[ INFO] 2016-03-30 15:38:55 : p contigs uncovered 0.21
[ INFO] 2016-03-30 15:38:55 : contigs lowcovered 25
[ INFO] 2016-03-30 15:38:55 : p contigs lowcovered 0.4
[ INFO] 2016-03-30 15:38:55 : contigs segmented 14
[ INFO] 2016-03-30 15:38:55 : p contigs segmented 0.23
[ INFO] 2016-03-30 15:38:55 : Read metrics done in 7 seconds
[ INFO] 2016-03-30 15:38:55 : No reference provided, skipping comparative diagnostics
[ INFO] 2016-03-30 13:43:11 : -----------------------------------
[ INFO] 2016-03-30 13:43:11 : TRANSRATE OPTIMAL SCORE 0.036
[ INFO] 2016-03-30 13:43:11 : TRANSRATE OPTIMAL CUTOFF 0.1589
[ INFO] 2016-03-30 13:43:11 : good contigs 27
[ INFO] 2016-03-30 13:43:11 : p good contigs 0.44
Using transrate to compare two transcriptomes¶
transrate can also compare an assembly to a “reference”. One nice thing about this is that you can compare two assemblies...
First, install the necessary software:
transrate --install-deps ref
Second, download a different assembly – this is done with the same starting reads, but without using digital normalization:
curl -O -L https://github.com/ngs-docs/2016-mar-nonmodel/raw/master/files/rna-assembly-nodn.fa.gz
gunzip rna-assembly-nodn.fa.gz
Compare in both directions:
transrate --assembly=/mnt/work/rna-assembly.fa --reference=/mnt/work/rna-assembly-nodn.fa --output=/mnt/work/assembly-compare1
and
transrate --reference=/mnt/work/rna-assembly.fa --assembly=/mnt/work/rna-assembly-nodn.fa --output=/mnt/work/assembly-compare2
In this case you can see that our assembly “covers” more of the other assembly than the other assembly does ours.
Merging two (or more) assemblies¶
Finally, you can also use transrate to merge contigs from multiple assemblies, if you’ve used read mapping –
transrate --assembly=/mnt/work/rna-assembly.fa \
--merge-assemblies=/mnt/work/rna-assembly-nodn.fa \
--left=/mnt/work/left.fq --right=/mnt/work/right.fq \
--output=/mnt/work/transrate-merge
and at the end you’ll see you have more “good” contigs –:
[ INFO] 2016-03-30 15:50:54 : p good contigs 0.52
Back to index: 2016 / Mar / mRNAseq on non-model organisms
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.