What's new?
  • Ran sequencing simulator with new error rates from the literature (lower than the first test): still have a lot of out-of-frame HVRs :(
  • Compared V and J alignment score and identity distributions among sample 111L and the two simulated repertoires.

######### LOAD PACKAGES ##########
library(immuneSIM)
library(DT)
library(Biostrings)
library(ggplot2)
library(ggpubr)
library(patchwork)
library(Biostrings)
library(msa)
library(dplyr)
library(webr)
library(plyr)
library(ggtree)
library(stringr)
library(bsselectR)
library(WeightedCluster)
library(apcluster)
library(reshape2)

1 Simulating TCR repertoires

immuneSIM R package

immuneSIM enables in silico generation of single and paired chain human and mouse B- and T-cell repertoires with user-defined tunable properties to provide the user with experimental-like (or aberrant) data to benchmark their repertoire analysis methods.

Figure 1. Package functionality overview.

Figure 1. Package functionality overview.

Methods overview

Figure 2. immuneSIM flowchart.

Figure 2. immuneSIM flowchart.

Parameters

  • number_of_seqs: Integer defining the number of sequences that should be simulated
  • vdj_list: List containing germline genes and their frequencies
  • species: String defining species for which repertoire should be simulated ("mm": mouse, "hs": human. Default: "mm").
  • receptor: String defining receptor type ("ig" or "tr". Default: "ig")
  • chain: String defining chain (for ig: "h","k","l", for tr: "b" or "a". Default: "h")
  • insertions_and_deletion_lengths: Data.frame containing np1, np2 sequences as well as deletion lengths. (Pooled from murine repertoire data, Greiff,2017) Note: This is a subset of 500000 observations of the dataframe used in the paper. The full dataframe which can be introduced here can be found on: (Git-Link)
  • user_defined_alpha: Numeric. Scaling parameter used for the simulation of powerlaw distribution (recommended range 2-5. Default: 2, https://en.wikipedia.org/wiki/Power_law)
  • name_repertoire: String defining chosen repertoire name recorded in the name_repertoire column of the output for identification.
  • length_distribution_rand: Vector containing lengths of immune receptor sequences based on immune repertoire data (Greiff, 2017).
  • random: Boolean. If TRUE repertoire will consist of fully random sequences, independent of germline genes.
  • shm.mode: String defining mode of somatic hypermutation simulation based on AbSim (options: 'none', 'data','poisson', 'naive', 'motif', 'wrc'. Default: 'none'). See AbSim documentation.
  • shm.prob: Numeric defining probability of a SHM (somatic hypermutation) occurring at each position.
  • vdj_noise: Numeric between 0,1, setting noise level to be introduced in provided V,D,J germline frequencies. 0 denotes no noise. (Default: 0)
  • vdj_dropout: Named vector containing entries V,D,J setting the number of germline genes to be dropped out. (Default: c("V"=0,"D"=0,"J"=0))
  • ins_del_dropout: String determining whether insertions and deletions should occur. Options: "", "no_insertions", "no_insertions_n1", "no_insertions_n2", "no_deletions_v", "no_deletions_d_5", "no_deletions_d_3", "no_deletions_j", "no_deletions_vd", "no_deletions". Default: "")
  • equal_cc: Boolean that if set TRUE will override user_defined_alpha and generate a clone count distribution that is equal for all sequences. Default: FALSE.
  • freq_update_time: Numeric determining whether simulated VDJ frequencies agree with input after set amount of sequences to correct for VDJ bias. Default: Update after 50 percent of sequences.
  • max_cdr3_length: Numeric defining maximal length of cdr3. (Default: 100)
  • min_cdr3_length: Numeric defining minimal length of cdr3. (Default: 6)
  • verbose: Boolean toggling printing of progress on and off (Default: FALSE)
  • airr_compliant: Boolean determining whether output repertoire should be named in an AIRR compliant manner (Default: TRUE). (http://docs.airr-community.org/en/latest/)

Results

Following the quickstart guide, we can easily generate a TCRβ repertoire with 1000 different sequences:

# sim_repertoire <- immuneSIM(
#         number_of_seqs = 1000,
#         species = "hs", 
#         receptor = "tr",
#         chain = "b",
#         verbose= TRUE)
# 
# save(sim_repertoire, file="data/simulated_repertoires/00_first_test")
#
# plot_report_repertoire(sim_repertoire, output_dir = "data/simulated_repertoires/")

load("data/simulated_repertoires/00_first_test")

The immuneSIM function outputs an R dataframe containing 20 columns and rows equal to the number of sequences simulated. Per sequence immuneSIM provides the following information:

  • Full VDJ sequence (nucleotide and amino acid): sequence, sequence_aa
  • CDR3 junctional sequence (nt and aa): junction, junction_aa
  • VDJ genes used in the recombination event: v_call, d_call, j_call
  • Nucleotide insertions VD and DJ: np1, np2
  • Length of deletion in V, D and J genes: del_v, del_d_5, del_d_3, del_j
  • CDR3 subsequences from V,D and J genes: v_sequence_alignment, d_sequence_alignment, j_sequence_alignment
  • Clonal frequency/count information: freqs, counts
  • Summary of somatic hypermutation (SHM) event simulated (only for BCR): shm_events
  • Given name of repertoire: name_repertoire
datatable(sim_repertoire,
              class = 'nowrap compact order-column',
              extensions = c('Buttons',  'FixedColumns'),
              options = list(pageLength = 10,
                             lengthMenu = c(10, 20, 50, 100),
                             dom = 'Blfrtip',
                             buttons = c('csv', 'excel', 'colvis'),
                             scrollX = TRUE,
                             fixedColumns = TRUE,
                             paging = TRUE))

It also generates three plots with the following information about the repertoire composition:

  • Aminoacid frequency per HVR position
  • VDJ length distribution
  • V, D and J segments usage

2 Ion Torrent sequencing simulation

CuReSim

CuReSim (Customized Read Simulator) is a customized tool which generates synthetic NGS reads, supporting read simulation for major letter-base sequencing platforms.

The main features of CuReSim include:

  • Genome (fasta) or read (fastq) as input file.
  • Choice between several error distributions.
  • A particular attention has been paid to special cases for which several introduced errors in the same read can result in less number of errors as expected due to compensatory changes.
  • Generation of diagrams to know exactly the simulated error model (R is required).

For CuReSim input a .fastq file with the simulated TCR sequences (multiplied by their respective counts) is generated. Specifically, reverse complement sequences are required to mimic our experimental data.

######## Generate .fastq file for CuReSim #######
seqs <- as.character(sim_repertoire$sequence)
counts <- sim_repertoire$counts

reads <- DNAStringSet(rep(seqs, counts))
names(reads) <- seq(1, length(reads))

reads_rc <- reverseComplement(reads)

# writeXStringSet(reads_rc, "software/CuReSim/input/00_first_test.fastq", format = "fastq")

print(paste0("Number of reads: ", length(reads_rc)))
## [1] "Number of reads: 1643918"

Heads up! immuneSIM generates full-length VDJ sequences (~350 bp), while our reads contain also part of the first exon of segment C, along with the barcode and adapters sequences (~450 bp). We could introduce manually those extra nucleotides to make the simulated data even more similar to the experimental sequences.

Methods overview

  1. Input file pre-processing: not required in our case since the input is not a genome, but directly the reads without errors.
  2. Indel generation: user-defined rates. By default an iterative algorithm mostly introduces indels in the longer homopolymers.
  3. Substitution generation: user-defined rate. By default substitution probability increases at the end of the read.
  4. Correction step: to avoid compensatory changes that reduce the number of expected errors (i.e. deletion of an inserted base or substitution of an insertion).

Parameters

Options are given with default values in brackets.

java -jar CuReSim.jar [options] -f <input_file> [options]

  • f (String) genome fasta file or reads fastq file. It is the only MANDATORY field
  • o (String) name of output fastq file [output.fastq]
  • n (integer) number of reads to generate [50000]
  • m (integer) read mean size [200 bases]
  • sd (float) standard deviation for read size [20.0]
  • r (integer) number of random reads [0]
  • d (float) deletion rate [0.01]
  • i (float) insertion rate [0.005]
  • s (float) substitution rate [0.005]
  • ui when option -ui is added uniform distribution is drawn for indels [homopolymers]
  • us when option -us is added uniform distribution is drawn for substitutions [exponential]
  • q (character) quality encoding character in fastq file [5]
  • v verbose mode generating diagrams. R software is required [false]
  • skip skip the correction step [false]
  • h print this help

Default values for deletion (1%), insertion (0.5%) and substitution (0.5%) rates are those typical of IonTorrent technology (source).

Results

Readlength distribution before and after sequencing simulation

rl_before <- ggplot() +
  aes(width(reads_rc)) +
  geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
  scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
  scale_x_continuous("Read length", expand = c(0.01,0)) +
  ggtitle("Before") +
  theme_pubr()
### nohup java -jar CuReSim.jar -f ../input/00_first_test.fastq -d 0.01 -i 0.005 -s 0.005 -o ../output/00_first_test.fastq -v &

curesim_fq <- ShortRead::readFastq("software/CuReSim/output/00_first_test.fastq") # Read .fq file and store in a ShortReadQ class object
reads_sim <- ShortRead::sread(curesim_fq) # DNAStringSet with the reads
quality_sim <- quality(curesim_fq) # BStringSet with the sequencing qualities

new.quality.class <- switch(class(quality_sim), # Convert from BStringSet to XStringQuality
                                      SFastqQuality="SolexaQuality",
                                      FastqQuality="PhredQuality",
                                      "XStringQuality")
quality_sim <- as(quality_sim, new.quality.class)

sample_sim <- QualityScaledDNAStringSet(reads_sim, quality_sim) # Get QualityScaledDNAStringSet to use as a pairwiseAlignment input
names(sample_sim) <- as.character(curesim_fq@id) # Character vector with the reads IDs

sample_sim_rc <- reverseComplement(sample_sim)

# Readlength histogram
rl_after <- ggplot() +
  aes(width(sample_sim_rc)) +
  geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
  scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
  scale_x_continuous("Read length", expand = c(0.01,0)) +
  ggtitle("After") +
  theme_pubr()

###
rm(reads_sim, quality_sim, sample_sim, curesim_fq)
rl_before + rl_after

3 Applying our pipeline

VJ alignment

######################################################
#################### VJ alignment ####################
######################################################

### VJ_alignment.R script executed via terminal (nohup)

print("Execution time: 14.5789240725173 hours")
## [1] "Execution time: 14.5789240725173 hours"
vj_alg_sim <- readRDS("data/simulated_repertoires/VJ_alignment_first_sim_rep.rds") # Data too big for datatable (RStudio crashes)

Heads up! Sample 111L had ~200,000 reads (~2 hours for VJ alignment), while our simulated repertoire of 1000 different HVR has a total count of ~1,600,000 reads. immuneSIM does not have a specific parameter for total counts or sequencing coverage.

Alignment summary

###################################################### (4)
################# VJ alignment results ###############
###################################################### (4)

alg_rep_data <- select(vj_alg_sim,
                       V_score,
                       J_score,
                       Cys_check,
                       Phe_check,
                       HVR_frame)

alg_rep_data$HVR_status <- ifelse((alg_rep_data$V_score < 0 | alg_rep_data$J_score < 0 | !alg_rep_data$Cys_check | !alg_rep_data$Phe_check), 
                                  "HVR not found", 
                                  "HVR found")
alg_rep_data$label <- NA
alg_rep_data[alg_rep_data$V_score < 0, "label"] <- "No V"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score < 0), "label"] <- "No J"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- "No Cys"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Cys no Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- paste("Frame",  alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "HVR_frame"])

PieDonut(alg_rep_data, aes(HVR_status,label),
         showPieName=FALSE,
         labelposition = 1,
         showRatioThreshold = 0.009,
         color = "white",
         explode = 2,
         explodeDonut = TRUE,
         explodePie = TRUE,
         r0 = 0,
         r1 = 0.9,
         start = pi/2,
         maxx = 1.6,
         explodePos = 0,
         donutLabelSize = 3.5,
         showRatioDonut = FALSE,
         title = "VJ alignment (simulated repertoire)")

df1 <- as.data.frame(table(alg_rep_data$HVR_status))
colnames(df1) <- c("Status", "# Reads")
df2 <- as.data.frame(table(alg_rep_data$label))
colnames(df2) <- c("Status", "# Reads")
knitr::kable(df1)
Status # Reads
HVR found 1555379
HVR not found 88539
knitr::kable(df2)
Status # Reads
Frame 0 492201
Frame 1 514612
Frame 2 548566
No Cys 14071
No Cys no Phe 681
No Phe 73787

HVR can be delimited if V and J segments and their conserved Cys and Phe codons are identified in the read. This is the case for most of the reads (94.6%). However, the proportion of the three possible HVR reading frames does not resemble that of the real repertoires analyzed. Both samples 111H and 111L had a large proportion of HVRs in frame (frame 0), while the simulated repertoire reading frames have the same frequency.

Heads up! Sequencing error rates need to be tuned. We could try and estimate them from the VJ alignments of real repertoires and/or try a different Ion Torrent sequencing simulator.

Unique reads

Reads with delimited HVR (1,555,379 reads) can be grouped into unique clonotypes (same V and J match and HVR sequence). This results in 84,412 unique reads:

###################################################### (5)
################## Unique reads table ################
###################################################### (5)

vj_alg_good <- vj_alg_sim[!(vj_alg_sim$V_score < 0 | vj_alg_sim$J_score < 0 | !vj_alg_sim$Cys_check | !vj_alg_sim$Phe_check),]

vj_alg_good_splt <- split(vj_alg_good, paste0(vj_alg_good$V_match, " / ", 
                                   vj_alg_good$HVR_sequence, " / ", 
                                   vj_alg_good$J_match))

vj_alg_unique <- dplyr::bind_rows(lapply(vj_alg_good_splt, function(x){
  
  qual_mat <- do.call(rbind, lapply(x[,"HVR_quality"], function(y){strtoi(charToRaw(y),16L)-33}))
  qual_med <- round(robustbase::colMedians(qual_mat))
  qual_ascii <- rawToChar(as.raw(qual_med+33))
  df_row <- x[1,]
  df_row <- subset(df_row, select=-HVR_quality)
  df_row$HVR_quality_median <- qual_ascii
  df_row$Counts <- nrow(x)
  
  return(df_row)
  
}))


vj_alg_unique <- dplyr::select(vj_alg_unique,
                               c(V_match,
                                 J_match,
                                 HVR_sequence,
                                 HVR_quality_median,
                                 HVR_frame,
                                 Counts))


# DT::datatable(vj_alg_unique, # Data too big for datatables, although it can be generated
#               class = 'nowrap compact order-column',
#               extensions = c('Buttons',  'FixedColumns'),
#               options = list(pageLength = 10,
#                              lengthMenu = c(10, 20, 50, 100),
#                              dom = 'Blfrtip',
#                              buttons = c('csv', 'excel', 'colvis'),
#                              scrollX = TRUE,
#                              fixedColumns = TRUE,
#                              paging = TRUE))

VJ families

There are 371 VJ families in the sample. A heatmap with the number of unique HVR sequences per VJ family is shown here:

###################################################### (6)
######### Heatmap unique reads per VJ family #########
###################################################### (6)

vj_mat <- as.data.frame(table(vj_alg_unique[,c("V_match", "J_match")]))
vj_mat[vj_mat == 0] <- NA

vj_mat$V_match <- factor(vj_mat$V_match, levels = rev(str_sort(unique(vj_mat$V_match), numeric = TRUE)))
vj_mat$J_match <- factor(vj_mat$J_match, levels = str_sort(unique(vj_mat$J_match), numeric = TRUE))


vj_heatmap <- ggplot(vj_mat, aes(J_match, V_match, fill = Freq)) +
  geom_tile(color = "white") +
  coord_equal() +
  geom_text(aes(label = Freq), size = 3) +
  scale_x_discrete(expand = c(0,0), name = "J match") +
  scale_y_discrete(expand = c(0,0), name = "V match") +
  scale_fill_gradient(low = "aquamarine3", high = "red", name = "Unique HVR count", na.value = "gray90") +
  theme_pubr() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        text = element_text(size = 10))

vj_heatmap

One-membered families

53 VJ families with only one HVR sequence are excluded from the clustering step. These families are shown here:

data_ec <- vj_alg_good
### Unique reads
data_ec_uniq <- plyr::ddply(data_ec,.(V_match, J_match, HVR_sequence, HVR_frame), nrow)
data_ec_uniq <- data_ec_uniq %>% dplyr::rename(Counts = V1)

vj_fam <- split(data_ec_uniq, paste0(data_ec_uniq$V_match, " / ", data_ec_uniq$J_match)) # Split data frame by VJ match into list of data frames

### Single-member VJ families
single_idx <- unlist(lapply(vj_fam, function(x){nrow(x) == 1})) # Logic vector for one-membered families
vj_sgl_member <- bind_rows(vj_fam[single_idx]) # Data frame with one member clonotypes

DT::datatable(vj_sgl_member,
              class = 'nowrap compact order-column',
              extensions = c('Buttons',  'FixedColumns'),
              options = list(pageLength = 10,
                             lengthMenu = c(10, 20, 50, 100),
                             dom = 'Blfrtip',
                             buttons = c('csv', 'excel', 'colvis'),
                             scrollX = TRUE,
                             fixedColumns = TRUE,
                             paging = TRUE))

Therefore, 318 VJ families would undergo the clustering step.

Dendrograms

Dendrograms of 82/318 VJ families with more than one distinct HVR sequence are shown here (also available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/dendrograms_sim_rep_weighted_hclust_wardD). The method used was a weighted hierarchical clustering with ward.D linkage, as it showed to be the best linkage method for sample 111H.0

Note: only 82 dendrograms are available because execution halts trying to generate plots for very large (>1000 HVRs) VJ families. Will not fix this since those dendrograms would be unintelligible anyway. 82 dendrograms should be a sufficient sample to compare in silico VJ families to real ones.

######################################################
################## Plot dendrograms ##################
###################################################### 

# dendro_weighted_clust <- function(data_ec, msa_dir, plot_dir, linkage){
#   
#   #########################################
#   ############# Prepare data ##############
#   #########################################
#   
#   ### Unique reads
#   data_ec_uniq <- plyr::ddply(data_ec,.(V_match, J_match, HVR_sequence, HVR_frame), nrow)
#   data_ec_uniq <- data_ec_uniq %>% dplyr::rename(Counts = V1)
#   
#   vj_fam <- split(data_ec_uniq, paste0(data_ec_uniq$V_match, " / ", data_ec_uniq$J_match)) # Split data frame by VJ match into list of data frames
# 
#   ### Remove single-member VJ families
#   single_idx <- unlist(lapply(vj_fam, function(x){nrow(x) == 1})) # Logic vector for one-membered families
#   vj_sgl_member <- bind_rows(vj_fam[single_idx]) # Data frame with one member clonotypes
#   
#   ### List of data frames with VJ families with more than one clonotype
#   vj_groups <- vj_fam[!single_idx] 
#   vj_groups <- lapply(vj_groups, function(x){
#     df <- x
#     rownames(df) <- 1:nrow(df)
#     return(df)
#   })
#   
#   ### Create DNAStringSet
#   vj_groups_dss <- lapply(vj_groups, function(x){ # List of QualityScaledDNAStringSet with VJ families with more than one clonotype
#   dss <- DNAStringSet(x[,"HVR_sequence"]) # Sequences
#   ids <- rownames(x)
#   names(dss) <- ids
#   return(dss)
#   })
#   
# 
#   ########################################
#   ############ Generate plots ############
#   ########################################
# 
#   lapply(1:length(vj_groups), function(idx){
# 
#     v_name <- gsub("_.*", "", vj_groups[[idx]][1,"V_match"])
#     j_name <- gsub("_.*", "", vj_groups[[idx]][1,"J_match"])
# 
#     ### MSA ###
# 
#     hvr_msa <- invisible(DNAStringSet(msaClustalW(vj_groups_dss[idx][[1]],
#                                         gapOpening = 1)))
# 
#     msa_path <- paste0(msa_dir, "hvr_msa_", v_name, "_", j_name, ".fasta")
#     writeXStringSet(hvr_msa, msa_path)
# 
#     ### LEVENSHTEIN DISTANCE ###
# 
#     df <- vj_groups_dss[idx][[1]]
#     ld <- stringDist(df, method = "levenshtein", upper = TRUE, diag = TRUE)
# 
#     ### HIERARCHICAL CLUSTERING ###
#     ld_hc <- hclust(ld, method = linkage, members = vj_groups[[idx]][,"Counts"])
# 
#     ### GGTREE ###
# 
#     p <- ggtree(ld_hc)
# 
#     d <- data.frame(label = rownames(vj_groups[[idx]]),
#                     HVR_frame = vj_groups[[idx]][, "HVR_frame"],
#                     Counts = vj_groups[[idx]][, "Counts"])
# 
#     mycols <- c("mediumseagreen", "orange2", "firebrick2")
#     names(mycols) <- c("0", "1", "2")
#     p2 <- p %<+% d +
#       geom_tiplab() +
#       geom_point(aes(color=factor(HVR_frame, levels = c(0, 1, 2)), size = Counts, x=x+(x*0.18))) +
#       scale_color_manual(values = mycols, name = "HVR frame")
#     
# 
#     m <- msaplot(p2, fasta = msa_path,
#             offset = max(ld_hc$height) * 0.125,
#             width = 2)
# 
#     nuc_colors <- c(NA, "#a2fa8c", "#ffd18c", "#f38d8a", "#8ab8f5")
#     names(nuc_colors) <- c("-", "a", "c", "g", "t")
#     m <- m +
#       scale_fill_manual(values = nuc_colors, name = "Nucleotide") +
#       ggtitle(paste(v_name, "/", j_name, "family"))
# 
#     ggsave(filename = paste0(plot_dir, v_name, "_", j_name, ".png"),
#            plot = m,
#            width = 10,
#            height = 5 + nrow(vj_groups[[idx]])*0.06,
#            limitsize = FALSE)
#     })
# }
############ ERROR: stops at 82 dendrograms. Probably could not generate png in large VJ family (>1000 HVRs) as it could not allocate the file in memory. 

# dendro_weighted_clust(vj_alg_good,
#                       msa_dir = "plots/dendrograms_sim_rep_weighted_hclust_wardD/MSA/",
#                       plot_dir = "plots/dendrograms_sim_rep_weighted_hclust_wardD/",
#                       linkage = "ward.D")
dendro_plots <- paste0(list.files("plots/dendrograms_sim_rep_weighted_hclust_wardD", full.names = TRUE, pattern = "*.png"))
names(dendro_plots) <- str_replace_all(dendro_plots, 
                                      c("\\.png" = "", 
                                        "plots/dendrograms_sim_rep_weighted_hclust_wardD/" = ""))

bsselect(dendro_plots, type = "img", selected = NULL,
         show_tick = TRUE, frame_height = "100%",
         dropup_auto = FALSE, size = 10,
         height = 50)

4 Tunning simulated sequencing error rates New

Default CuReSim error rates for Ion Torrent technology are 1% deletion, 0.5% insertion and 0.5% substitution, but these values are extracted from 2012 publications (source 1, 2, 3) that worked with old chips of the Ion Torrent PGM platform. Our TCR sequences were obtained with the newer Ion Torrent S5 platform (chip 530), which generates more high-quality reads than the PGM (source). However, S5 error rates could not be found in the literature.

Ideas for error rate values testing:

  • According to CuReSim publication "precision and recall values were closer to the values obtained for the real dataset values when reads were generated with 0.5% deletions, 0.25% insertions, and 0.25% substitutions".
  • In this publication three Ion Torrent PGM sequencing kits were compared in terms of error rates (deletion / insertion / substitution):
  • 100 bp One Touch: 0.8% / 0.84% / 0.04%
  • ~200 bp Manual: 1.98% / 2.69% / 0.17 %~ (discarded, higher indel rates than CuReSim default).
  • ~200 bp One Touch: 1.07% / 1.76% / 0.07%~ (discarded, higher indel rates than CuReSim default).
  • In this publication the error rates are estimated with hepatitis B virus genome: 0.13% / 0.27% / ~0.08%.

We tried this last configuration:

  • Deletion: 0.13%
  • Insertion: 0.27%
  • Substitution: 0.08%

CuReSim results

### nohup java -jar CuReSim.jar -f ../input/00_first_test.fastq -d 0.0013 -i 0.0027 -s 0.0008 -o ../output/01_HCV_error_rates_d013_i027_s008.fastq -v &

curesim_fq <- ShortRead::readFastq("software/CuReSim/output/01_HCV_error_rates_d013_i027_s008.fastq") # Read .fq file and store in a ShortReadQ class object
reads_sim <- ShortRead::sread(curesim_fq) # DNAStringSet with the reads
quality_sim <- quality(curesim_fq) # BStringSet with the sequencing qualities

new.quality.class <- switch(class(quality_sim), # Convert from BStringSet to XStringQuality
                                      SFastqQuality="SolexaQuality",
                                      FastqQuality="PhredQuality",
                                      "XStringQuality")
quality_sim <- as(quality_sim, new.quality.class)

sample_sim <- QualityScaledDNAStringSet(reads_sim, quality_sim) # Get QualityScaledDNAStringSet to use as a pairwiseAlignment input
names(sample_sim) <- as.character(curesim_fq@id) # Character vector with the reads IDs

sample_sim_rc <- reverseComplement(sample_sim)

# Readlength histogram
rl_after <- ggplot() +
  aes(width(sample_sim_rc)) +
  geom_histogram(color="dodgerblue", fill="dodgerblue", binwidth = 1) +
  scale_y_continuous(name = "Counts", expand = c(0.01,0)) +
  scale_x_continuous("Read length", expand = c(0.01,0)) +
  ggtitle("After") +
  theme_pubr()

###
rm(reads_sim, quality_sim, sample_sim, curesim_fq)

Readlength before and after CuReSim:

rl_before + rl_after

VJ alignment

######################################################
#################### VJ alignment ####################
######################################################
 
### VJ_alignment.R script executed via terminal (nohup)

print("Execution time: 14.691883502139 hours")
## [1] "Execution time: 14.691883502139 hours"
vj_alg_sim_01 <- readRDS("data/simulated_repertoires/01_VJ_alignment_HCV_error_rates.rds") # Data too big for datatable (RStudio crashes)

Alignment summary

###################################################### (4)
################# VJ alignment results ###############
###################################################### (4)

alg_rep_data <- select(vj_alg_sim_01,
                       V_score,
                       J_score,
                       Cys_check,
                       Phe_check,
                       HVR_frame)

alg_rep_data$HVR_status <- ifelse((alg_rep_data$V_score < 0 | alg_rep_data$J_score < 0 | !alg_rep_data$Cys_check | !alg_rep_data$Phe_check), 
                                  "HVR not found", 
                                  "HVR found")
alg_rep_data$label <- NA
alg_rep_data[alg_rep_data$V_score < 0, "label"] <- "No V"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score < 0), "label"] <- "No J"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- "No Cys"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & !alg_rep_data$Cys_check & !alg_rep_data$Phe_check), "label"] <- "No Cys no Phe"
alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "label"] <- paste("Frame",  alg_rep_data[(alg_rep_data$V_score > 0 & alg_rep_data$J_score > 0 & alg_rep_data$Cys_check & alg_rep_data$Phe_check), "HVR_frame"])

PieDonut(alg_rep_data, aes(HVR_status,label),
         showPieName=FALSE,
         labelposition = 1,
         showRatioThreshold = 0.009,
         color = "white",
         explode = 2,
         explodeDonut = TRUE,
         explodePie = TRUE,
         r0 = 0,
         r1 = 0.9,
         start = pi/2,
         maxx = 1.6,
         explodePos = 0,
         donutLabelSize = 3.5,
         showRatioDonut = FALSE,
         title = "VJ alignment (simulated repertoire, HCV error rates)")

df1 <- as.data.frame(table(alg_rep_data$HVR_status))
colnames(df1) <- c("Status", "# Reads")
df2 <- as.data.frame(table(alg_rep_data$label))
colnames(df2) <- c("Status", "# Reads")
knitr::kable(df1)
Status # Reads
HVR found 1595806
HVR not found 48112
knitr::kable(df2)
Status # Reads
Frame 0 628448
Frame 1 545218
Frame 2 422140
No Phe 48112

Despite having considerably reduced error rates, out-of-frame HVRs still have a higher frequency than expected.

Let's compare the VJ alignment score and identity distributions among a real repertoire (111L, ~200k reads) and the two simulated ones (~1.5M reads each):

vj_alg_111L <- readRDS("data/vj_alignment_sample_111L.rds")

sc_id_comp <- do.call(rbind, list("111L" = select(vj_alg_111L, c("V_score", "V_identity", "J_score", "J_identity")),
                    "sim_00" = select(vj_alg_sim, c("V_score", "V_identity", "J_score", "J_identity")),
                    "sim_01" = select(vj_alg_sim_01, c("V_score", "V_identity", "J_score", "J_identity"))))

sc_id_comp$Sample <- as.factor(sub("\\..*", "", rownames(sc_id_comp)))

sc_id_comp.m <- melt(sc_id_comp, id.vars = "Sample")


calc_stat <- function(x) {
  coef <- 1.5
  n <- sum(!is.na(x))
  # calculate quantiles
  stats <- quantile(x, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  return(stats)
}

sc_id_comp_plt <- ggplot(sc_id_comp.m, aes(x = Sample, y = value, color = Sample)) +
  stat_summary(fun.data = calc_stat, geom="boxplot") + 
  # geom_boxplot(outlier.shape = NA) +
  # scale_y_continuous() +
  # scale_x_continuous(name = "Bin", breaks = seq(-10, -1)) +
  # scale_color_manual(name = "Reading frame", values = c("mediumseagreen", "orange", "orangered")) +
  facet_wrap(. ~ variable, scales = "free") +
  theme_pubr()
sc_id_comp_plt
## Warning: Removed 122 rows containing non-finite values (stat_summary).
Outliers have been removed from these plots.

Outliers have been removed from these plots.

In our reads, sequencing goes from C segment to V segment (C-J-HVR-V). Simulated reads were reverse complemented to mimic this situation (J-HVR-V). However, simulated reads lack the portion of segment C and therefore are ~100 bp shorter. Conclusions that could be extracted from this:

  1. V score and identity is lower in 111H because reads are larger and V is located at the end.
  2. J score and identity is higher in 111H maybe because error rates are still too high in the simulated repertoire (?). Since simulated samples have shorter reads, J is located at the beginning and its score and identity should be higher than that of real repertoires, not lower.

Session info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4        apcluster_1.4.8       WeightedCluster_1.4-1
##  [4] cluster_2.1.0         TraMineR_2.2-0.1      bsselectR_0.1.0      
##  [7] stringr_1.4.0         ggtree_2.0.4          plyr_1.8.6           
## [10] webr_0.1.5            dplyr_1.0.2           msa_1.16.0           
## [13] patchwork_1.0.1       ggpubr_0.2.1          magrittr_1.5         
## [16] ggplot2_3.3.2         Biostrings_2.52.0     XVector_0.24.0       
## [19] IRanges_2.18.3        S4Vectors_0.22.1      BiocGenerics_0.30.0  
## [22] DT_0.16               immuneSIM_0.8.7      
## 
## loaded via a namespace (and not attached):
##   [1] uuid_0.1-4                  backports_1.2.0            
##   [3] Hmisc_4.2-0                 systemfonts_0.3.2          
##   [5] igraph_1.2.6                lazyeval_0.2.2             
##   [7] splines_3.6.1               BiocParallel_1.18.1        
##   [9] crosstalk_1.1.0.1           GenomeInfoDb_1.20.0        
##  [11] digest_0.6.27               htmltools_0.5.0            
##  [13] checkmate_2.0.0             Metrics_0.1.4              
##  [15] readr_1.4.0                 matrixStats_0.57.0         
##  [17] R.utils_2.10.1              officer_0.3.15             
##  [19] jpeg_0.1-8.1                colorspace_1.4-1           
##  [21] xfun_0.19                   RCurl_1.98-1.2             
##  [23] crayon_1.3.4                jsonlite_1.7.1             
##  [25] rrtable_0.2.1               survival_3.2-7             
##  [27] zoo_1.8-8                   ape_5.4-1                  
##  [29] glue_1.4.2                  polyclip_1.10-0            
##  [31] rvg_0.2.5                   gtable_0.3.0               
##  [33] zlibbioc_1.30.0             DelayedArray_0.10.0        
##  [35] sjmisc_2.8.5                R.cache_0.14.0             
##  [37] DEoptimR_1.0-8              scales_1.1.1               
##  [39] ggthemes_4.2.0              miniUI_0.1.1.1             
##  [41] Rcpp_1.0.5                  xtable_1.8-4               
##  [43] htmlTable_2.1.0             tmvnsim_1.0-2              
##  [45] tidytree_0.3.3              foreign_0.8-72             
##  [47] Formula_1.2-4               vcd_1.4-8                  
##  [49] htmlwidgets_1.5.2           httr_1.4.2                 
##  [51] RColorBrewer_1.1-2          acepack_1.4.1              
##  [53] ellipsis_0.3.1              pkgconfig_2.0.3            
##  [55] R.methodsS3_1.8.1           farver_2.0.3               
##  [57] nnet_7.3-12                 labeling_0.4.2             
##  [59] tidyselect_1.1.0            rlang_0.4.8                
##  [61] later_1.1.0.1               munsell_0.5.0              
##  [63] tools_3.6.1                 generics_0.1.0             
##  [65] devEMF_4.0-2                sjlabelled_1.1.7           
##  [67] moonBook_0.2.3              evaluate_0.14              
##  [69] fastmap_1.0.1               yaml_2.2.1                 
##  [71] knitr_1.30                  robustbase_0.93-6          
##  [73] zip_2.1.1                   purrr_0.3.4                
##  [75] nlme_3.1-140                mime_0.9                   
##  [77] R.oo_1.24.0                 poweRlaw_0.70.6            
##  [79] pracma_2.2.9                xml2_1.3.2                 
##  [81] compiler_3.6.1              rstudioapi_0.11            
##  [83] png_0.1-7                   ggsignif_0.5.0             
##  [85] treeio_1.8.2                tibble_3.0.4               
##  [87] tweenr_1.0.1                stringi_1.5.3              
##  [89] highr_0.8                   gdtools_0.2.2              
##  [91] lattice_0.20-38             Matrix_1.2-17              
##  [93] psych_2.0.9                 vctrs_0.3.4                
##  [95] stringdist_0.9.6.3          pillar_1.4.6               
##  [97] lifecycle_0.2.0             BiocManager_1.30.10        
##  [99] editData_0.1.2              lmtest_0.9-38              
## [101] bitops_1.0-6                data.table_1.13.2          
## [103] insight_0.10.0              flextable_0.5.11           
## [105] ztable_0.2.2                GenomicRanges_1.36.1       
## [107] httpuv_1.5.4                hwriter_1.3.2              
## [109] R6_2.5.0                    latticeExtra_0.6-29        
## [111] ShortRead_1.42.0            promises_1.1.1             
## [113] gridExtra_2.3               boot_1.3-23                
## [115] MASS_7.3-51.1               SummarizedExperiment_1.14.1
## [117] shinyWidgets_0.5.4          withr_2.3.0                
## [119] Rsamtools_2.0.3             GenomicAlignments_1.20.1   
## [121] mnormt_2.0.2                GenomeInfoDbData_1.2.1     
## [123] hms_0.5.3                   repmis_0.5                 
## [125] grid_3.6.1                  rpart_4.1-15               
## [127] tidyr_1.1.2                 rmarkdown_2.5              
## [129] rvcheck_0.1.8               ggforce_0.3.1              
## [131] Biobase_2.44.0              shiny_1.5.0                
## [133] base64enc_0.1-3