You will need some 3rd party packages for this lesson. 3rd party packages are developed by another developer or group because this the quality of those packages could be questionable. So always check all the 3rd party packages before use.
3rd party packages aren’t installed with R. You have to install it with the following command:
install.packages("<package_name>")
All the 3rd party package collected in a central repository called CRAN. So you just need the name of the package to install it.
You will need two packages to install RColorBrewer and gplots
install.packages("RColorBrewer")
install.packages("gplots")
Bioconductor is a specialized repository for bioinformatics tools. To be able to use Bioconductor packages you have to install and activate Bioconductor itself as well.
source("http://bioconductor.org/biocLite.R")
biocLite()
You’ll need to install and activate the following packages:
biocLite("Rsubread")
biocLite("Rsamtools")
biocLite("edgeR")
Sample: small sample of reads (56,168 read-pairs) from a single replicate Sequencing: strand specific, paired-end, Illumina, 100-100nts Reference: small chromosome fragment (1-839990 nt part of the 3L chromosome of Drosophila simulans contains 96 genes)
Call Rsubread library
library(Rsubread)
buildindex(basename="Genome_sequence_ToyExample_index",
reference="Genome_sequence_ToyExample.fa")
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.20.6
##
## //=========================== indexBuilder setting ===========================\\
## || ||
## || Index name : Genome_sequence_ToyExample_index ||
## || Index space : base-space ||
## || Memory : 8000 Mbytes ||
## || Repeat threshold : 100 repeats ||
## || Distance to next subread : 3 ||
## || ||
## || Input files : 1 file in total ||
## || o Genome_sequence_ToyExample.fa ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================================= Running ==================================\\
## || ||
## || Check the integrity of provided reference sequences ... ||
## || No format issues were found ||
## || Scan uninformative subreads in reference sequences ... ||
## || 8%, 0 mins elapsed, rate=358.3k bps/s, total=0m ||
## || 16%, 0 mins elapsed, rate=499.0k bps/s, total=0m ||
## || 24%, 0 mins elapsed, rate=593.7k bps/s, total=0m ||
## || 33%, 0 mins elapsed, rate=667.0k bps/s, total=0m ||
## || 41%, 0 mins elapsed, rate=729.3k bps/s, total=0m ||
## || 49%, 0 mins elapsed, rate=786.5k bps/s, total=0m ||
## || 58%, 0 mins elapsed, rate=844.6k bps/s, total=0m ||
## || 66%, 0 mins elapsed, rate=894.2k bps/s, total=0m ||
## || 74%, 0 mins elapsed, rate=941.3k bps/s, total=0m ||
## || 83%, 0 mins elapsed, rate=985.4k bps/s, total=0m ||
## || 91%, 0 mins elapsed, rate=1032.9k bps/s, total=0m ||
## || 99%, 0 mins elapsed, rate=1073.4k bps/s, total=0m ||
## || 1 uninformative subreads were found. ||
## || These subreads were excluded from index building. ||
## || Build the index... ||
## || 8%, 0 mins elapsed, rate=399.2k bps/s, total=0m ||
## || 16%, 0 mins elapsed, rate=429.9k bps/s, total=0m ||
## || 24%, 0 mins elapsed, rate=445.0k bps/s, total=0m ||
## || 33%, 0 mins elapsed, rate=458.9k bps/s, total=0m ||
## || 41%, 0 mins elapsed, rate=472.0k bps/s, total=0m ||
## || 49%, 0 mins elapsed, rate=485.2k bps/s, total=0m ||
## || 58%, 0 mins elapsed, rate=498.5k bps/s, total=0m ||
## || 66%, 0 mins elapsed, rate=511.8k bps/s, total=0m ||
## || 74%, 0 mins elapsed, rate=524.4k bps/s, total=0m ||
## || 83%, 0 mins elapsed, rate=537.4k bps/s, total=0m ||
## || 91%, 0 mins elapsed, rate=550.5k bps/s, total=0m ||
## || 99%, 0 mins elapsed, rate=563.4k bps/s, total=0m ||
## || Save current index block... ||
## || [ 0.0% finished ] ||
## || [ 10.0% finished ] ||
## || [ 20.0% finished ] ||
## || [ 30.0% finished ] ||
## || [ 40.0% finished ] ||
## || [ 50.0% finished ] ||
## || [ 60.0% finished ] ||
## || [ 70.0% finished ] ||
## || [ 80.0% finished ] ||
## || [ 90.0% finished ] ||
## || [ 100.0% finished ] ||
## || ||
## || Total running time: 0.0 minutes. ||
## || Index Genome_sequence_ToyExample_index was successfully built! ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
An index needs to be built before read mapping can be performed. This function creates a hash table for the reference genome, which can then be used by Subread and Subjunc aligners for read alignment. Highly repetitive subreads (or uninformative subreads) are excluded from the hash table so as to reduce mapping ambiguity. (page 7 at Rsubread.pdf)
Created files: *_index.00.b.array *_index.00.b.tab *_index.files *_index.log *_index.reads
subjunc(index="Genome_sequence_ToyExample_index",
readfile1="Trimmed_reads_ToyExample_1.fq", readfile2="Trimmed_reads_ToyExample_2.fq",
input_format="FASTQ", output_format="BAM", output_file="Mapped_reads_ToyExample.BAM",
nthreads=1, phredOffset=64, unique=TRUE, minFragLength=50, maxFragLength=10000,
PE_orientation="fr")
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.20.6
##
## //============================= subjunc setting ==============================\\
## || ||
## || Function : Read alignment + Junction detection (RNA-Seq) ||
## || Input file 1 : Trimmed_reads_ToyExample_1.fq ||
## || Input file 2 : Trimmed_reads_ToyExample_2.fq ||
## || Output file : Mapped_reads_ToyExample.BAM (BAM) ||
## || Index name : Genome_sequence_ToyExample_index ||
## || ||
## || Threads : 1 ||
## || Phred offset : 64 ||
## || # of extracted subreads : 14 ||
## || Min read1 vote : 1 ||
## || Min read2 vote : 1 ||
## || Max fragment size : 10000 ||
## || Min fragment size : 50 ||
## || Maximum allowed mismatches : 3 ||
## || Maximum allowed indel bases : 5 ||
## || # of best alignments reported : 1 ||
## || Unique mapping : yes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //====================== Running (03-May-2018 17:36:59) ======================\\
## || ||
## || The input file contains base space reads. ||
## || The range of Phred scores observed in the data is [5,41] ||
## || Load the 1-th index block... ||
## || 11% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 18% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 25% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 31% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 38% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 45% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 51% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 58% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 65% completed, 0.1 mins elapsed, rate=12.0k fragments per second ||
## || 69% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 73% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 76% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 79% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 83% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 86% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 89% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 93% completed, 0.1 mins elapsed, rate=5.8k fragments per second ||
## || 96% completed, 0.2 mins elapsed, rate=5.8k fragments per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\============================================================================//
##
## //================================= Summary ==================================\\
## || ||
## || Processed : 56,168 fragments ||
## || Mapped : 54,470 fragments (97.0%) ||
## || Correctly paired : 51,659 fragments ||
## || Junctions : 334 ||
## || Indels : 327 ||
## || ||
## || Running time : 0.2 minutes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
Subjunc perform global alignments. The seed-and-vote paradigm enables efficient and accurate alignments to be carried out. phredOffse
: sanger 33
; illumina 64
PE_orientation
: character string giving the orientation of the two reads from the same pair. It has three possible values including fr
, ff
and rf
. Letter f
denotes the forward strand and letter r
the reverse strand. fr by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand). (page 2 at Rsubread.pdf)
For IGV the mapped reads needs to be sorted - by coordinates - and indexed.
Indexing BAM file with Rsamtools
Sorting the BAM file by coordinates.
library(Rsamtools, quietly=T)
sortBam(file="Mapped_reads_ToyExample.BAM", destination="Sorted_Mapped_reads_ToyExample")
## [1] "Sorted_Mapped_reads_ToyExample.bam"
Creating an index file for the sorted BAM file.
indexBam("Sorted_Mapped_reads_ToyExample.bam")
## Sorted_Mapped_reads_ToyExample.bam
## "Sorted_Mapped_reads_ToyExample.bam.bai"
FASTA (genome sequence) Genomes -> Load Genome from File: Genome_sequence_ToyExample.fa
GTF (genome annotation) Files -> Load from File: Genome_annotation_ToyExample.gtf
Sorted, indexed BAM (mapped read-pairs) Files -> Load from File: Sorted_Mapped_reads_ToyExample.bam
Creating an index file for the sorted BAM file.
Counts_ToyExample=featureCounts(files="Mapped_reads_ToyExample.BAM",
annot.ext="Genome_annotation_ToyExample.gtf",
isGTFAnnotationFile=T, GTF.featureType="exon",
GTF.attrType="gene_id", useMetaFeatures=T,
isPairedEnd=T, requireBothEndsMapped=T,
checkFragLength=F, nthreads=1, strandSpecific=2,
reportReads=T)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.20.6
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 1 BAM file ||
## || P Mapped_reads_ToyExample.BAM ||
## || ||
## || Output file : ./.Rsubread_featureCounts_pid25920 ||
## || Summary : ./.Rsubread_featureCounts_pid25920.summary ||
## || Annotation : Genome_annotation_ToyExample.gtf (GTF) ||
## || Assignment details : <input_file>.featureCounts ||
## || ||
## || Threads : 1 ||
## || Level : meta-feature level ||
## || Paired-end : yes ||
## || Strand specific : reversely stranded ||
## || Multimapping reads : not counted ||
## || Multi-overlapping reads : not counted ||
## || ||
## || Chimeric reads : counted ||
## || Both ends mapped : required ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file Genome_annotation_ToyExample.gtf ... ||
## || Features : 1446 ||
## || Meta-features : 96 ||
## || Chromosomes/contigs : 1 ||
## || ||
## || Process BAM file Mapped_reads_ToyExample.BAM... ||
## || Paired-end reads are included. ||
## || Assign fragments (read pairs) to features... ||
## || Total fragments : 56168 ||
## || Successfully assigned fragments : 47888 (85.3%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Read assignment finished. ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
This function assigns mapped sequencing reads to genomic features.
GTF.featureType
: a character string giving the feature type used to select rows in the GTF annotation which will be used for read summarization. exon by default.
GTF.attrType
: a character string giving the attribute type in the GTF annotation which will be used to group features (eg. exons) into meta-features (eg. genes). gene_id by default.
useMetaFeatures
: logical indicating whether the read summarization should be performed at the eature level (eg. exons) or meta-feature level (eg genes). If TRUE
, features in the annotation (each row is a feature) will be grouped into meta-features using their values using the “gene_id” attribute in the GTF-format annotation file, and reads will assiged to the meta-features instead of the features.
requireBothEndsMapped
: logical indicating if both ends from the same fragment are required to be successfully aligned before the fragment can be assigned to a feature or metafeature.
checkFragLength
: logical indicating if the two ends from the same fragment are required to satisify the fragment length criteria before the fragment can be assigned to a feature or meta-feature. The fragment length criteria are specified via minFragLength
and maxFragLength
.
strandSpecific
integer indicating if strand-specific read counting should be performed. It has three possible values: 0
(unstranded), 1
(stranded) and 2
(reversely stranded).
reportReads: logical indicating if read counting result for each read/fragment is saved to a file. If TRUE
, read counting results for reads/fragments will be saved to a tab-delimited file that contains four columns including name of read/fragment, status (assigned or the reason if not assigned), name of target feature/meta-feature and number of hits if the read/fragment is counted multiple times. Name of the file is the same as name of the input read file except a suffix .featureCounts
is added. Multiple files will be generated if there is more than one input read file. (page 12 at Rsubread.pdf)
featureCounts
variableDescription: page 16 at Rsubread.pdf
Names of objects of the featureCounts
variable.
names(Counts_ToyExample)
## [1] "counts" "annotation" "targets" "stat"
Counts of the first genes.
head(Counts_ToyExample$counts)
## Mapped_reads_ToyExample.BAM
## FBgn0264707 60
## FBgn0001316 1150
## FBgn0035104 11
## FBgn0035132 27
## FBgn0262679 40
## FBgn0035112 234
Some data about the counts.
Counts_ToyExample$stat
## Status Mapped_reads_ToyExample.BAM
## 1 Assigned 47888
## 2 Unassigned_Ambiguity 114
## 3 Unassigned_MultiMapping 0
## 4 Unassigned_NoFeatures 3736
## 5 Unassigned_Unmapped 4430
## 6 Unassigned_MappingQuality 0
## 7 Unassigned_FragmentLength 0
## 8 Unassigned_Chimera 0
## 9 Unassigned_Secondary 0
## 10 Unassigned_Nonjunction 0
## 11 Unassigned_Duplicate 0
stat
: a data frame giving numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library.
Summary statistics of the counted reads.
summary(Counts_ToyExample$counts)
## Mapped_reads_ToyExample.BAM
## Min. : 0.00
## 1st Qu.: 25.75
## Median : 97.50
## Mean : 498.83
## 3rd Qu.: 486.50
## Max. :5665.00
Histogram of the counts.
hist(Counts_ToyExample$counts)
hist(log2(Counts_ToyExample$counts))
Write the count table to a file.
write.table(Counts_ToyExample$counts, "Counts_ToyExample.tsv", quote=F, sep="\t")