######### LOAD PACKAGES ##########
library(immuneSIM)
library(DT)
library(Biostrings)
library(ggplot2)
library(ggpubr)
library(patchwork)
library(Biostrings)
library(msa)
library(dplyr)
library(webr)
library(plyr)
library(ggtree)
library(stringr)
library(bsselectR)
library(WeightedCluster)
library(apcluster)
library(reshape2)
immuneSIM
R packageimmuneSIM
enables in silico generation of single and paired chain human and mouse B- and T-cell repertoires with user-defined tunable properties to provide the user with experimental-like (or aberrant) data to benchmark their repertoire analysis methods.
number_of_seqs
: Integer defining the number of sequences that should be simulatedvdj_list
: List containing germline genes and their frequenciesspecies
: String defining species for which repertoire should be simulated ("mm": mouse, "hs": human. Default: "mm").receptor
: String defining receptor type ("ig" or "tr". Default: "ig")chain
: String defining chain (for ig: "h","k","l", for tr: "b" or "a". Default: "h")insertions_and_deletion_lengths
: Data.frame containing np1, np2 sequences as well as deletion lengths. (Pooled from murine repertoire data, Greiff,2017) Note: This is a subset of 500000 observations of the dataframe used in the paper. The full dataframe which can be introduced here can be found on: (Git-Link)user_defined_alpha
: Numeric. Scaling parameter used for the simulation of powerlaw distribution (recommended range 2-5. Default: 2, https://en.wikipedia.org/wiki/Power_law)name_repertoire
: String defining chosen repertoire name recorded in the name_repertoire column of the output for identification.length_distribution_rand
: Vector containing lengths of immune receptor sequences based on immune repertoire data (Greiff, 2017).random
: Boolean. If TRUE repertoire will consist of fully random sequences, independent of germline genes.shm.mode
: String defining mode of somatic hypermutation simulation based on AbSim (options: 'none', 'data','poisson', 'naive', 'motif', 'wrc'. Default: 'none'). See AbSim documentation.shm.prob
: Numeric defining probability of a SHM (somatic hypermutation) occurring at each position.vdj_noise
: Numeric between 0,1, setting noise level to be introduced in provided V,D,J germline frequencies. 0 denotes no noise. (Default: 0)vdj_dropout
: Named vector containing entries V,D,J setting the number of germline genes to be dropped out. (Default: c("V"=0,"D"=0,"J"=0))ins_del_dropout
: String determining whether insertions and deletions should occur. Options: "", "no_insertions", "no_insertions_n1", "no_insertions_n2", "no_deletions_v", "no_deletions_d_5", "no_deletions_d_3", "no_deletions_j", "no_deletions_vd", "no_deletions". Default: "")equal_cc
: Boolean that if set TRUE will override user_defined_alpha and generate a clone count distribution that is equal for all sequences. Default: FALSE.freq_update_time
: Numeric determining whether simulated VDJ frequencies agree with input after set amount of sequences to correct for VDJ bias. Default: Update after 50 percent of sequences.max_cdr3_length
: Numeric defining maximal length of cdr3. (Default: 100)min_cdr3_length
: Numeric defining minimal length of cdr3. (Default: 6)verbose
: Boolean toggling printing of progress on and off (Default: FALSE)airr_compliant
: Boolean determining whether output repertoire should be named in an AIRR compliant manner (Default: TRUE). (http://docs.airr-community.org/en/latest/)Following the quickstart guide, we can easily generate a TCRβ repertoire with 1000 different sequences:
# sim_repertoire <- immuneSIM(
# number_of_seqs = 1000,
# species = "hs",
# receptor = "tr",
# chain = "b",
# verbose= TRUE)
#
# save(sim_repertoire, file="data/simulated_repertoires/00_first_test")
#
# plot_report_repertoire(sim_repertoire, output_dir = "data/simulated_repertoires/")
load("data/simulated_repertoires/00_first_test")
The immuneSIM function outputs an R dataframe containing 20 columns and rows equal to the number of sequences simulated. Per sequence immuneSIM provides the following information:
datatable(sim_repertoire,
class = 'nowrap compact order-column',
extensions = c('Buttons', 'FixedColumns'),
options = list(pageLength = 10,
lengthMenu = c(10, 20, 50, 100),
dom = 'Blfrtip',
buttons = c('csv', 'excel', 'colvis'),
scrollX = TRUE,
fixedColumns = TRUE,
paging = TRUE))
It also generates three plots with the following information about the repertoire composition:
CuReSim
CuReSim (Customized Read Simulator) is a customized tool which generates synthetic NGS reads, supporting read simulation for major letter-base sequencing platforms.
The main features of CuReSim include:
For CuReSim input a .fastq file with the simulated TCR sequences (multiplied by their respective counts) is generated. Specifically, reverse complement sequences are required to mimic our experimental data.
######## Generate .fastq file for CuReSim #######
seqs <- as.character(sim_repertoire$sequence)
counts <- sim_repertoire$counts
reads <- DNAStringSet(rep(seqs, counts))
names(reads) <- seq(1, length(reads))
reads_rc <- reverseComplement(reads)
# writeXStringSet(reads_rc, "software/CuReSim/input/00_first_test.fastq", format = "fastq")
print(paste0("Number of reads: ", length(reads_rc)))
## [1] "Number of reads: 1643918"
Heads up! immuneSIM generates full-length VDJ sequences (~350 bp), while our reads contain also part of the first exon of segment C, along with the barcode and adapters sequences (~450 bp). We could introduce manually those extra nucleotides to make the simulated data even more similar to the experimental sequences.
Options are given with default values in brackets.
java -jar CuReSim.jar [options] -f <input_file> [options]
f
(String) genome fasta file or reads fastq file. It is the only MANDATORY fieldo
(String) name of output fastq file [output.fastq]n
(integer) number of reads to generate [50000]m
(integer) read mean size [200 bases]sd
(float) standard deviation for read size [20.0]r
(integer) number of random reads [0]d
(float) deletion rate [0.01]i
(float) insertion rate [0.005]s
(float) substitution rate [0.005]ui
when option -ui is added uniform distribution is drawn for indels [homopolymers]us
when option -us is added uniform distribution is drawn for substitutions [exponential]q
(character) quality encoding character in fastq file [5]v
verbose mode generating diagrams. R software is required [false]skip
skip the correction step [false]h
print this helpDefault values for deletion (1%), insertion (0.5%) and substitution (0.5%) rates are those typical of IonTorrent technology (source).
rl_before <- ggplot() +
aes(width(reads_rc)) +
geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
scale_x_continuous("Read length", expand = c(0.01,0)) +
ggtitle("Before") +
theme_pubr()
### nohup java -jar CuReSim.jar -f ../input/00_first_test.fastq -d 0.01 -i 0.005 -s 0.005 -o ../output/00_first_test.fastq -v &
curesim_fq <- ShortRead::readFastq("software/CuReSim/output/00_first_test.fastq") # Read .fq file and store in a ShortReadQ class object
reads_sim <- ShortRead::sread(curesim_fq) # DNAStringSet with the reads
quality_sim <- quality(curesim_fq) # BStringSet with the sequencing qualities
new.quality.class <- switch(class(quality_sim), # Convert from BStringSet to XStringQuality
SFastqQuality="SolexaQuality",
FastqQuality="PhredQuality",
"XStringQuality")
quality_sim <- as(quality_sim, new.quality.class)
sample_sim <- QualityScaledDNAStringSet(reads_sim, quality_sim) # Get QualityScaledDNAStringSet to use as a pairwiseAlignment input
names(sample_sim) <- as.character(curesim_fq@id) # Character vector with the reads IDs
sample_sim_rc <- reverseComplement(sample_sim)
# Readlength histogram
rl_after <- ggplot() +
aes(width(sample_sim_rc)) +
geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
scale_x_continuous("Read length", expand = c(0.01,0)) +
ggtitle("After") +
theme_pubr()
###
rm(reads_sim, quality_sim, sample_sim, curesim_fq)
rl_before + rl_after
######################################################
#################### VJ alignment ####################
######################################################
### VJ_alignment.R script executed via terminal (nohup)
print("Execution time: 14.5789240725173 hours")
## [1] "Execution time: 14.5789240725173 hours"
vj_alg_sim <- readRDS("data/simulated_repertoires/VJ_alignment_first_sim_rep.rds") # Data too big for datatable (RStudio crashes)
Heads up! Sample 111L had ~200,000 reads (~2 hours for VJ alignment), while our simulated repertoire of 1000 different HVR has a total count of ~1,600,000 reads. immuneSIM does not have a specific parameter for total counts or sequencing coverage.
###################################################### (4)
################# VJ alignment results ###############
###################################################### (4)
alg_rep_data <- select(vj_alg_sim,
V_score,
J_score,
Cys_check,
Phe_check,
HVR_frame)
alg_rep_data$HVR_status <- ifelse((alg_rep_data$V_score < 0 | alg_rep_data$J_score < 0 | !alg_rep_data$Cys_check | !alg_rep_data$Phe_check),
"HVR not found",
"HVR found")
alg_rep_data$label <- NA
alg_rep_data[alg_rep_data$V_score < 0, "label"] <- "No V"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score < 0), "label"] <- "No J"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- "No Cys"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Cys no Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- paste("Frame", alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "HVR_frame"])
PieDonut(alg_rep_data, aes(HVR_status,label),
showPieName=FALSE,
labelposition = 1,
showRatioThreshold = 0.009,
color = "white",
explode = 2,
explodeDonut = TRUE,
explodePie = TRUE,
r0 = 0,
r1 = 0.9,
start = pi/2,
maxx = 1.6,
explodePos = 0,
donutLabelSize = 3.5,
showRatioDonut = FALSE,
title = "VJ alignment (simulated repertoire)")
df1 <- as.data.frame(table(alg_rep_data$HVR_status))
colnames(df1) <- c("Status", "# Reads")
df2 <- as.data.frame(table(alg_rep_data$label))
colnames(df2) <- c("Status", "# Reads")
knitr::kable(df1)
Status | # Reads |
---|---|
HVR found | 1555379 |
HVR not found | 88539 |
knitr::kable(df2)
Status | # Reads |
---|---|
Frame 0 | 492201 |
Frame 1 | 514612 |
Frame 2 | 548566 |
No Cys | 14071 |
No Cys no Phe | 681 |
No Phe | 73787 |
HVR can be delimited if V and J segments and their conserved Cys and Phe codons are identified in the read. This is the case for most of the reads (94.6%). However, the proportion of the three possible HVR reading frames does not resemble that of the real repertoires analyzed. Both samples 111H and 111L had a large proportion of HVRs in frame (frame 0), while the simulated repertoire reading frames have the same frequency.
Heads up! Sequencing error rates need to be tuned. We could try and estimate them from the VJ alignments of real repertoires and/or try a different Ion Torrent sequencing simulator.
Reads with delimited HVR (1,555,379 reads) can be grouped into unique clonotypes (same V and J match and HVR sequence). This results in 84,412 unique reads:
###################################################### (5)
################## Unique reads table ################
###################################################### (5)
vj_alg_good <- vj_alg_sim[!(vj_alg_sim$V_score < 0 | vj_alg_sim$J_score < 0 | !vj_alg_sim$Cys_check | !vj_alg_sim$Phe_check),]
vj_alg_good_splt <- split(vj_alg_good, paste0(vj_alg_good$V_match, " / ",
vj_alg_good$HVR_sequence, " / ",
vj_alg_good$J_match))
vj_alg_unique <- dplyr::bind_rows(lapply(vj_alg_good_splt, function(x){
qual_mat <- do.call(rbind, lapply(x[,"HVR_quality"], function(y){strtoi(charToRaw(y),16L)-33}))
qual_med <- round(robustbase::colMedians(qual_mat))
qual_ascii <- rawToChar(as.raw(qual_med+33))
df_row <- x[1,]
df_row <- subset(df_row, select=-HVR_quality)
df_row$HVR_quality_median <- qual_ascii
df_row$Counts <- nrow(x)
return(df_row)
}))
vj_alg_unique <- dplyr::select(vj_alg_unique,
c(V_match,
J_match,
HVR_sequence,
HVR_quality_median,
HVR_frame,
Counts))
# DT::datatable(vj_alg_unique, # Data too big for datatables, although it can be generated
# class = 'nowrap compact order-column',
# extensions = c('Buttons', 'FixedColumns'),
# options = list(pageLength = 10,
# lengthMenu = c(10, 20, 50, 100),
# dom = 'Blfrtip',
# buttons = c('csv', 'excel', 'colvis'),
# scrollX = TRUE,
# fixedColumns = TRUE,
# paging = TRUE))
There are 371 VJ families in the sample. A heatmap with the number of unique HVR sequences per VJ family is shown here:
###################################################### (6)
######### Heatmap unique reads per VJ family #########
###################################################### (6)
vj_mat <- as.data.frame(table(vj_alg_unique[,c("V_match", "J_match")]))
vj_mat[vj_mat == 0] <- NA
vj_mat$V_match <- factor(vj_mat$V_match, levels = rev(str_sort(unique(vj_mat$V_match), numeric = TRUE)))
vj_mat$J_match <- factor(vj_mat$J_match, levels = str_sort(unique(vj_mat$J_match), numeric = TRUE))
vj_heatmap <- ggplot(vj_mat, aes(J_match, V_match, fill = Freq)) +
geom_tile(color = "white") +
coord_equal() +
geom_text(aes(label = Freq), size = 3) +
scale_x_discrete(expand = c(0,0), name = "J match") +
scale_y_discrete(expand = c(0,0), name = "V match") +
scale_fill_gradient(low = "aquamarine3", high = "red", name = "Unique HVR count", na.value = "gray90") +
theme_pubr() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
text = element_text(size = 10))
vj_heatmap
53 VJ families with only one HVR sequence are excluded from the clustering step. These families are shown here:
data_ec <- vj_alg_good
### Unique reads
data_ec_uniq <- plyr::ddply(data_ec,.(V_match, J_match, HVR_sequence, HVR_frame), nrow)
data_ec_uniq <- data_ec_uniq %>% dplyr::rename(Counts = V1)
vj_fam <- split(data_ec_uniq, paste0(data_ec_uniq$V_match, " / ", data_ec_uniq$J_match)) # Split data frame by VJ match into list of data frames
### Single-member VJ families
single_idx <- unlist(lapply(vj_fam, function(x){nrow(x) == 1})) # Logic vector for one-membered families
vj_sgl_member <- bind_rows(vj_fam[single_idx]) # Data frame with one member clonotypes
DT::datatable(vj_sgl_member,
class = 'nowrap compact order-column',
extensions = c('Buttons', 'FixedColumns'),
options = list(pageLength = 10,
lengthMenu = c(10, 20, 50, 100),
dom = 'Blfrtip',
buttons = c('csv', 'excel', 'colvis'),
scrollX = TRUE,
fixedColumns = TRUE,
paging = TRUE))
Therefore, 318 VJ families would undergo the clustering step.
Dendrograms of 82/318 VJ families with more than one distinct HVR sequence are shown here (also available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/dendrograms_sim_rep_weighted_hclust_wardD
). The method used was a weighted hierarchical clustering with ward.D linkage, as it showed to be the best linkage method for sample 111H.0
Note: only 82 dendrograms are available because execution halts trying to generate plots for very large (>1000 HVRs) VJ families. Will not fix this since those dendrograms would be unintelligible anyway. 82 dendrograms should be a sufficient sample to compare in silico VJ families to real ones.
######################################################
################## Plot dendrograms ##################
######################################################
# dendro_weighted_clust <- function(data_ec, msa_dir, plot_dir, linkage){
#
# #########################################
# ############# Prepare data ##############
# #########################################
#
# ### Unique reads
# data_ec_uniq <- plyr::ddply(data_ec,.(V_match, J_match, HVR_sequence, HVR_frame), nrow)
# data_ec_uniq <- data_ec_uniq %>% dplyr::rename(Counts = V1)
#
# vj_fam <- split(data_ec_uniq, paste0(data_ec_uniq$V_match, " / ", data_ec_uniq$J_match)) # Split data frame by VJ match into list of data frames
#
# ### Remove single-member VJ families
# single_idx <- unlist(lapply(vj_fam, function(x){nrow(x) == 1})) # Logic vector for one-membered families
# vj_sgl_member <- bind_rows(vj_fam[single_idx]) # Data frame with one member clonotypes
#
# ### List of data frames with VJ families with more than one clonotype
# vj_groups <- vj_fam[!single_idx]
# vj_groups <- lapply(vj_groups, function(x){
# df <- x
# rownames(df) <- 1:nrow(df)
# return(df)
# })
#
# ### Create DNAStringSet
# vj_groups_dss <- lapply(vj_groups, function(x){ # List of QualityScaledDNAStringSet with VJ families with more than one clonotype
# dss <- DNAStringSet(x[,"HVR_sequence"]) # Sequences
# ids <- rownames(x)
# names(dss) <- ids
# return(dss)
# })
#
#
# ########################################
# ############ Generate plots ############
# ########################################
#
# lapply(1:length(vj_groups), function(idx){
#
# v_name <- gsub("_.*", "", vj_groups[[idx]][1,"V_match"])
# j_name <- gsub("_.*", "", vj_groups[[idx]][1,"J_match"])
#
# ### MSA ###
#
# hvr_msa <- invisible(DNAStringSet(msaClustalW(vj_groups_dss[idx][[1]],
# gapOpening = 1)))
#
# msa_path <- paste0(msa_dir, "hvr_msa_", v_name, "_", j_name, ".fasta")
# writeXStringSet(hvr_msa, msa_path)
#
# ### LEVENSHTEIN DISTANCE ###
#
# df <- vj_groups_dss[idx][[1]]
# ld <- stringDist(df, method = "levenshtein", upper = TRUE, diag = TRUE)
#
# ### HIERARCHICAL CLUSTERING ###
# ld_hc <- hclust(ld, method = linkage, members = vj_groups[[idx]][,"Counts"])
#
# ### GGTREE ###
#
# p <- ggtree(ld_hc)
#
# d <- data.frame(label = rownames(vj_groups[[idx]]),
# HVR_frame = vj_groups[[idx]][, "HVR_frame"],
# Counts = vj_groups[[idx]][, "Counts"])
#
# mycols <- c("mediumseagreen", "orange2", "firebrick2")
# names(mycols) <- c("0", "1", "2")
# p2 <- p %<+% d +
# geom_tiplab() +
# geom_point(aes(color=factor(HVR_frame, levels = c(0, 1, 2)), size = Counts, x=x+(x*0.18))) +
# scale_color_manual(values = mycols, name = "HVR frame")
#
#
# m <- msaplot(p2, fasta = msa_path,
# offset = max(ld_hc$height) * 0.125,
# width = 2)
#
# nuc_colors <- c(NA, "#a2fa8c", "#ffd18c", "#f38d8a", "#8ab8f5")
# names(nuc_colors) <- c("-", "a", "c", "g", "t")
# m <- m +
# scale_fill_manual(values = nuc_colors, name = "Nucleotide") +
# ggtitle(paste(v_name, "/", j_name, "family"))
#
# ggsave(filename = paste0(plot_dir, v_name, "_", j_name, ".png"),
# plot = m,
# width = 10,
# height = 5 + nrow(vj_groups[[idx]])*0.06,
# limitsize = FALSE)
# })
# }
############ ERROR: stops at 82 dendrograms. Probably could not generate png in large VJ family (>1000 HVRs) as it could not allocate the file in memory.
# dendro_weighted_clust(vj_alg_good,
# msa_dir = "plots/dendrograms_sim_rep_weighted_hclust_wardD/MSA/",
# plot_dir = "plots/dendrograms_sim_rep_weighted_hclust_wardD/",
# linkage = "ward.D")
dendro_plots <- paste0(list.files("plots/dendrograms_sim_rep_weighted_hclust_wardD", full.names = TRUE, pattern = "*.png"))
names(dendro_plots) <- str_replace_all(dendro_plots,
c("\\.png" = "",
"plots/dendrograms_sim_rep_weighted_hclust_wardD/" = ""))
bsselect(dendro_plots, type = "img", selected = NULL,
show_tick = TRUE, frame_height = "100%",
dropup_auto = FALSE, size = 10,
height = 50)
Default CuReSim
error rates for Ion Torrent technology are 1% deletion, 0.5% insertion and 0.5% substitution, but these values are extracted from 2012 publications (source 1, 2, 3) that worked with old chips of the Ion Torrent PGM platform. Our TCR sequences were obtained with the newer Ion Torrent S5 platform (chip 530), which generates more high-quality reads than the PGM (source). However, S5 error rates could not be found in the literature.
Ideas for error rate values testing:
CuReSim
publication "precision and recall values were closer to the values obtained for the real dataset values when reads were generated with 0.5% deletions, 0.25% insertions, and 0.25% substitutions".CuReSim
default).CuReSim
default).We tried this last configuration:
CuReSim
results
### nohup java -jar CuReSim.jar -f ../input/00_first_test.fastq -d 0.0013 -i 0.0027 -s 0.0008 -o ../output/01_HCV_error_rates_d013_i027_s008.fastq -v &
curesim_fq <- ShortRead::readFastq("software/CuReSim/output/01_HCV_error_rates_d013_i027_s008.fastq") # Read .fq file and store in a ShortReadQ class object
reads_sim <- ShortRead::sread(curesim_fq) # DNAStringSet with the reads
quality_sim <- quality(curesim_fq) # BStringSet with the sequencing qualities
new.quality.class <- switch(class(quality_sim), # Convert from BStringSet to XStringQuality
SFastqQuality="SolexaQuality",
FastqQuality="PhredQuality",
"XStringQuality")
quality_sim <- as(quality_sim, new.quality.class)
sample_sim <- QualityScaledDNAStringSet(reads_sim, quality_sim) # Get QualityScaledDNAStringSet to use as a pairwiseAlignment input
names(sample_sim) <- as.character(curesim_fq@id) # Character vector with the reads IDs
sample_sim_rc <- reverseComplement(sample_sim)
# Readlength histogram
rl_after <- ggplot() +
aes(width(sample_sim_rc)) +
geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
scale_x_continuous("Read length", expand = c(0.01,0)) +
ggtitle("After") +
theme_pubr()
###
rm(reads_sim, quality_sim, sample_sim, curesim_fq)
Readlength before and after CuReSim
:
rl_before + rl_after
######################################################
#################### VJ alignment ####################
######################################################
### VJ_alignment.R script executed via terminal (nohup)
print("Execution time: 14.691883502139 hours")
## [1] "Execution time: 14.691883502139 hours"
vj_alg_sim_01 <- readRDS("data/simulated_repertoires/01_VJ_alignment_HCV_error_rates.rds") # Data too big for datatable (RStudio crashes)
###################################################### (4)
################# VJ alignment results ###############
###################################################### (4)
alg_rep_data <- select(vj_alg_sim_01,
V_score,
J_score,
Cys_check,
Phe_check,
HVR_frame)
alg_rep_data$HVR_status <- ifelse((alg_rep_data$V_score < 0 | alg_rep_data$J_score < 0 | !alg_rep_data$Cys_check | !alg_rep_data$Phe_check),
"HVR not found",
"HVR found")
alg_rep_data$label <- NA
alg_rep_data[alg_rep_data$V_score < 0, "label"] <- "No V"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score < 0), "label"] <- "No J"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- "No Cys"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Cys no Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- paste("Frame", alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "HVR_frame"])
PieDonut(alg_rep_data, aes(HVR_status,label),
showPieName=FALSE,
labelposition = 1,
showRatioThreshold = 0.009,
color = "white",
explode = 2,
explodeDonut = TRUE,
explodePie = TRUE,
r0 = 0,
r1 = 0.9,
start = pi/2,
maxx = 1.6,
explodePos = 0,
donutLabelSize = 3.5,
showRatioDonut = FALSE,
title = "VJ alignment (simulated repertoire, HCV error rates)")
df1 <- as.data.frame(table(alg_rep_data$HVR_status))
colnames(df1) <- c("Status", "# Reads")
df2 <- as.data.frame(table(alg_rep_data$label))
colnames(df2) <- c("Status", "# Reads")
knitr::kable(df1)
Status | # Reads |
---|---|
HVR found | 1595806 |
HVR not found | 48112 |
knitr::kable(df2)
Status | # Reads |
---|---|
Frame 0 | 628448 |
Frame 1 | 545218 |
Frame 2 | 422140 |
No Phe | 48112 |
Despite having considerably reduced error rates, out-of-frame HVRs still have a higher frequency than expected.
Let's compare the VJ alignment score and identity distributions among a real repertoire (111L, ~200k reads) and the two simulated ones (~1.5M reads each):
vj_alg_111L <- readRDS("data/vj_alignment_sample_111L.rds")
sc_id_comp <- do.call(rbind, list("111L" = select(vj_alg_111L, c("V_score", "V_identity", "J_score", "J_identity")),
"sim_00" = select(vj_alg_sim, c("V_score", "V_identity", "J_score", "J_identity")),
"sim_01" = select(vj_alg_sim_01, c("V_score", "V_identity", "J_score", "J_identity"))))
sc_id_comp$Sample <- as.factor(sub("\\..*", "", rownames(sc_id_comp)))
sc_id_comp.m <- melt(sc_id_comp, id.vars = "Sample")
calc_stat <- function(x) {
coef <- 1.5
n <- sum(!is.na(x))
# calculate quantiles
stats <- quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
return(stats)
}
sc_id_comp_plt <- ggplot(sc_id_comp.m, aes(x = Sample, y = value, color = Sample)) +
stat_summary(fun.data = calc_stat, geom="boxplot") +
# geom_boxplot(outlier.shape = NA) +
# scale_y_continuous() +
# scale_x_continuous(name = "Bin", breaks = seq(-10, -1)) +
# scale_color_manual(name = "Reading frame", values = c("mediumseagreen", "orange", "orangered")) +
facet_wrap(. ~ variable, scales = "free") +
theme_pubr()
sc_id_comp_plt
## Warning: Removed 122 rows containing non-finite values (stat_summary).
In our reads, sequencing goes from C segment to V segment (C-J-HVR-V
). Simulated reads were reverse complemented to mimic this situation (J-HVR-V
). However, simulated reads lack the portion of segment C and therefore are ~100 bp shorter. Conclusions that could be extracted from this:
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] reshape2_1.4.4 apcluster_1.4.8 WeightedCluster_1.4-1
## [4] cluster_2.1.0 TraMineR_2.2-0.1 bsselectR_0.1.0
## [7] stringr_1.4.0 ggtree_2.0.4 plyr_1.8.6
## [10] webr_0.1.5 dplyr_1.0.2 msa_1.16.0
## [13] patchwork_1.0.1 ggpubr_0.2.1 magrittr_1.5
## [16] ggplot2_3.3.2 Biostrings_2.52.0 XVector_0.24.0
## [19] IRanges_2.18.3 S4Vectors_0.22.1 BiocGenerics_0.30.0
## [22] DT_0.16 immuneSIM_0.8.7
##
## loaded via a namespace (and not attached):
## [1] uuid_0.1-4 backports_1.2.0
## [3] Hmisc_4.2-0 systemfonts_0.3.2
## [5] igraph_1.2.6 lazyeval_0.2.2
## [7] splines_3.6.1 BiocParallel_1.18.1
## [9] crosstalk_1.1.0.1 GenomeInfoDb_1.20.0
## [11] digest_0.6.27 htmltools_0.5.0
## [13] checkmate_2.0.0 Metrics_0.1.4
## [15] readr_1.4.0 matrixStats_0.57.0
## [17] R.utils_2.10.1 officer_0.3.15
## [19] jpeg_0.1-8.1 colorspace_1.4-1
## [21] xfun_0.19 RCurl_1.98-1.2
## [23] crayon_1.3.4 jsonlite_1.7.1
## [25] rrtable_0.2.1 survival_3.2-7
## [27] zoo_1.8-8 ape_5.4-1
## [29] glue_1.4.2 polyclip_1.10-0
## [31] rvg_0.2.5 gtable_0.3.0
## [33] zlibbioc_1.30.0 DelayedArray_0.10.0
## [35] sjmisc_2.8.5 R.cache_0.14.0
## [37] DEoptimR_1.0-8 scales_1.1.1
## [39] ggthemes_4.2.0 miniUI_0.1.1.1
## [41] Rcpp_1.0.5 xtable_1.8-4
## [43] htmlTable_2.1.0 tmvnsim_1.0-2
## [45] tidytree_0.3.3 foreign_0.8-72
## [47] Formula_1.2-4 vcd_1.4-8
## [49] htmlwidgets_1.5.2 httr_1.4.2
## [51] RColorBrewer_1.1-2 acepack_1.4.1
## [53] ellipsis_0.3.1 pkgconfig_2.0.3
## [55] R.methodsS3_1.8.1 farver_2.0.3
## [57] nnet_7.3-12 labeling_0.4.2
## [59] tidyselect_1.1.0 rlang_0.4.8
## [61] later_1.1.0.1 munsell_0.5.0
## [63] tools_3.6.1 generics_0.1.0
## [65] devEMF_4.0-2 sjlabelled_1.1.7
## [67] moonBook_0.2.3 evaluate_0.14
## [69] fastmap_1.0.1 yaml_2.2.1
## [71] knitr_1.30 robustbase_0.93-6
## [73] zip_2.1.1 purrr_0.3.4
## [75] nlme_3.1-140 mime_0.9
## [77] R.oo_1.24.0 poweRlaw_0.70.6
## [79] pracma_2.2.9 xml2_1.3.2
## [81] compiler_3.6.1 rstudioapi_0.11
## [83] png_0.1-7 ggsignif_0.5.0
## [85] treeio_1.8.2 tibble_3.0.4
## [87] tweenr_1.0.1 stringi_1.5.3
## [89] highr_0.8 gdtools_0.2.2
## [91] lattice_0.20-38 Matrix_1.2-17
## [93] psych_2.0.9 vctrs_0.3.4
## [95] stringdist_0.9.6.3 pillar_1.4.6
## [97] lifecycle_0.2.0 BiocManager_1.30.10
## [99] editData_0.1.2 lmtest_0.9-38
## [101] bitops_1.0-6 data.table_1.13.2
## [103] insight_0.10.0 flextable_0.5.11
## [105] ztable_0.2.2 GenomicRanges_1.36.1
## [107] httpuv_1.5.4 hwriter_1.3.2
## [109] R6_2.5.0 latticeExtra_0.6-29
## [111] ShortRead_1.42.0 promises_1.1.1
## [113] gridExtra_2.3 boot_1.3-23
## [115] MASS_7.3-51.1 SummarizedExperiment_1.14.1
## [117] shinyWidgets_0.5.4 withr_2.3.0
## [119] Rsamtools_2.0.3 GenomicAlignments_1.20.1
## [121] mnormt_2.0.2 GenomeInfoDbData_1.2.1
## [123] hms_0.5.3 repmis_0.5
## [125] grid_3.6.1 rpart_4.1-15
## [127] tidyr_1.1.2 rmarkdown_2.5
## [129] rvcheck_0.1.8 ggforce_0.3.1
## [131] Biobase_2.44.0 shiny_1.5.0
## [133] base64enc_0.1-3