Adapt babathesis to the TFG
parent
fed667561e
commit
707b2adadc
@ -1,19 +1,13 @
#+TITLE: Machine Learning para corrección de errores en datos de secuenciación de ADN
#+SUBTITLE: Trabajo de Fin de Grado
#+AUTHOR: Amin Kasrou Aouam
#+DATE: 26-06-2021
#+PANDOC_OPTIONS: template:~/.pandoc/templates/eisvogel.latex
#+DATE: 26 de Junio de 2021
#+PANDOC_OPTIONS: template:assets/babathesis.latex
#+PANDOC_OPTIONS: toc:t
#+PANDOC_OPTIONS: bibliography:assets/bibliography.bib
#+PANDOC_OPTIONS: citeproc:t
#+PANDOC_OPTIONS: csl:assets/ieee.csl
#+PANDOC_OPTIONS: pdf-engine:xelatex
#+PANDOC_METADATA: link-citations:t
#+PANDOC_METADATA: lang=es
#+PANDOC_METADATA: titlepage:t
#+PANDOC_METADATA: toc-own-page:t
#+PANDOC_METADATA: table-use-row-colors:t
#+PANDOC_METADATA: colorlinks:t
#+PANDOC_METADATA: logo:/home/coolneng/Photos/Logos/UGR.png
* Resumen

Las nuevas técnicas de secuenciación de ADN (NGS) han revolucionado la investigación en genómica. Estas tecnologías se basan en la secuenciación de millones de fragmentos de ADN en paralelo, cuya reconstrucción depende de técnicas de bioinformática. Aunque estas técnicas se aplican de forma habitual, presentan tasas de error significativas que perjudican el análisis de regiones con alto grado de polimorfismo. En este estudio se implementa un nuevo método computacional, locimend, basado en /Deep Learning/ para la corrección de errores de secuenciación de ADN. Se aplica al análisis de la región determinante de complementariedad 3 (CDR3) del receptor de linfocitos T (TCR), generada /in silico/ y posteriormente sometida a un simulador de secuenciación con el fin de producir errores de secuenciación. Empleando estos datos, entrenamos una red neuronal convolucional (CNN) con el objetivo de generar un modelo computacional que permita la detección y corrección de los errores de secuenciación.
@ -30,8 +24,7 @@ Next generation sequencing (NGS) have revolutionised genomic research. These tec

* Introducción

** Técnicas de secuenciación de alto rendimiento
** Sistema inmunitario
En los últimos años se ha

La capacidad del sistema inmunitario adaptativo para responder a cualquiera de los numerosos antígenos extraños potenciales a los que puede estar expuesta una persona depende de los receptores altamente polimórficos expresados por las células B (inmunoglobulinas) y las células T (receptores de células T [TCR]). La especificidad de las células T viene determinada principalmente por la secuencia de aminoácidos codificada en los bucles de la tercera región determinante de la complementariedad (CDR3). cite:pmid19706884

Binary file not shown.
@ -7,6 +7,9 @@
% Use 'KOMA-Script Book' as the document class
\documentclass[toc=bibliography,toc=indentunnumbered,listof=totoc]{scrbook}

% Use Spanish as language
\usepackage[spanish]{babel}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -62,7 +65,7 @@
%\KOMAoptions{parskip=half+}

% Don't stretch the content to fill entire pages
\raggedbottom
\raggedbottom{}

% Don't break paragraphs because of a single line
\PassOptionsToPackage{defaultlines=2,all}{nowidow}
@ -111,7 +114,7 @@
UprightFont = {*-Regular},
ItalicFont = {*-Italic},
BoldFont = {*-Semibold},
BoldItalicFont = {*-SemiboldItalic},
BoldItalicFont = {*-Semibold Italic},
Numbers = {OldStyle},
PunctuationSpace = 1.125
]
@ -120,7 +123,7 @@
\setsansfont{URW Classico}%
[
UprightFont = {*-Regular},
ItalicFont = {*-Italic},
ItalicFont = {*-Italic Italic},
BoldFont = {*-Bold},
Numbers = {Proportional,Lining},
Scale = MatchUppercase
@ -221,7 +224,6 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Make sure footnote marks are separated by commas and kerned properly
\usepackage[multiple=true,mult-fn-sep=${}^{,\kern-0.07em}$]{fnpct}

% Change the font used for footnotes
%\addtokomafont{footnote}{\sffamily}
@ -265,8 +267,8 @@

% Fix kerning problems for backslashes and redefine underscores in hyperlinks
\makeatletter
\let\UrlSpecialsOld\UrlSpecials
\def\UrlSpecials{\UrlSpecialsOld\do\/{\Url@slash}\do\_{\Url@underscore}}%
\let\UrlSpecialsOld\UrlSpecials{}
\def\UrlSpecials{\UrlSpecialsOld\do/{\Url@slash}\do\_{\Url@underscore}}%
\def\Url@slash{\@ifnextchar/{\kern+0.05em\mathchar47\kern-0.10em}%
{\kern0.08em\mathchar47\penalty\UrlBigBreakPenalty}}
\def\Url@underscore{\nfss@text{\leavevmode \kern.06em\vbox{\hrule height 0.12ex width 0.4em}}}
@ -284,7 +286,7 @@
\PassOptionsToPackage{backend=biber}{biblatex}

% Bibliography style (e.g. 'phys' or 'nature')
\PassOptionsToPackage{style=bababib}{biblatex}
\PassOptionsToPackage{style=phys}{biblatex}

% Citation style (e.g. 'plain' or 'superscript')
\PassOptionsToPackage{autocite=plain}{biblatex}
@ -292,10 +294,32 @@
% Enable multiple bibliographies with separate numbering
\PassOptionsToPackage{defernumbers=true}{biblatex}

% Pandoc references
% Format for cross-references with \cref
\PassOptionsToPackage{noabbrev}{cleveref}
\newcommand{\crefrangeconjunction}{--}

\newlength{\cslhangindent}
\setlength{\cslhangindent}{1.5em}
\newlength{\csllabelwidth}
\setlength{\csllabelwidth}{3em}
\newenvironment{CSLReferences}[2] % #1 hanging-indent, #2 entry spacing
{% don't indent paragraphs
\setlength{\parindent}{0pt}
% turn on hanging indent if param 1 is 1
\ifodd #1 \everypar{\setlength{\hangindent}{\cslhangindent}}\ignorespaces\fi
% set entry spacing
\ifnum #2 > 0
\setlength{\parskip}{#2\baselineskip}
\fi
}%
{}
\usepackage{calc}
\newcommand{\CSLBlock}[1]{#1\hfill\break}
\newcommand{\CSLLeftMargin}[1]{\parbox[t]{\csllabelwidth}{#1}}
\newcommand{\CSLRightInline}[1]{\parbox[t]{\linewidth - \csllabelwidth}{#1}\break}
\newcommand{\CSLIndent}[1]{\hspace{\cslhangindent}#1}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -328,8 +352,8 @@
%\AtBeginDocument{\let\nabla=𝛁}
\AtBeginDocument%
{
\let\epsilon=\varepsilon
\let\phi=\varphi
\let\epsilon=\varepsilon{}
\let\phi=\varphi{}
}

% Change the font used for tables
@ -351,7 +375,7 @@
\makeatother

% Replace \cite with the more flexible \autocite
\let\cite=\autocite
\let\cite=\autocite{}

% Define a custom color palette
\definecolor{whiteish}{rgb}{1.000, 0.964, 0.859}
@ -403,17 +427,53 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Import bibliographies
\addbibresource{references.bib}
\addbibresource{bibliography.bib}

\begin{document}
% UGR titlepage
\begin{titlepage}
\newlength{\centeroffset}
\setlength{\centeroffset}{-0.5\oddsidemargin}
\addtolength{\centeroffset}{0.5\evensidemargin}
\thispagestyle{empty}

\noindent\hspace*{\centeroffset}
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=0.9\textwidth]{assets/logo_ugr}\\[1cm]

\textsc{ \Large TRABAJO FIN DE GRADO\\[0.2cm]}
\textsc{ GRADO DE INGENIERÍA EN INFORMÁTICA}\\[1cm]
% Upper part of the page
%
% Title
{\huge\bfseries $title$\\}
\noindent\rule[-1ex]{\textwidth}{3pt}\\[3.5ex]

{\large\bfseries }
\end{minipage}

\vspace{0.3cm}
\noindent\hspace*{\centeroffset}\begin{minipage}{\textwidth}
\centering

\textbf{Autor}\\ {$author$}\\[2.5ex]
\textbf{Directores}\\
{Carlos Cano Gutiérrez}\\
{María Soledad Benítez Cantos}\\[2cm]
\includegraphics[width=0.3\textwidth]{assets/logo-ceuta.jpg}\\[0.1cm]
\textsc{Facultad de Educación, Tecnología y Economía de Ceuta}\\
\textsc{---}\\
Granada, $date$
\end{minipage}
\end{titlepage}

\frontmatter
\include{chapters/abstract}
\include{chapters/preface}
\listoftables
\listoffigures
\tableofcontents
\mainmatter
\include{chapters/introduction}
\include{chapters/test}
\include{chapters/conclusion}
\backmatter
\printbibliography
\mainmatter{}
$body$
\backmatter{}
\printbibliography{}
\end{document}

@ -1,26 +1,261 @@
@article{10.1093/molbev/msy224,
author = {Flagel, Lex and Brandvain, Yaniv and Schrider, Daniel R},
title = "{The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference}",
journal = {Molecular Biology and Evolution},
volume = {36},
number = {2},
pages = {220-238},
year = {2018},
month = {12},
abstract = "{Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.}",
issn = {0737-4038},
doi = {10.1093/molbev/msy224},
url = {https://doi.org/10.1093/molbev/msy224},
eprint = {https://academic.oup.com/mbe/article-pdf/36/2/220/27736968/msy224.pdf},
author = {Flagel, Lex and Brandvain, Yaniv and Schrider, Daniel R},
title = "{The Unreasonable Effectiveness of Convolutional Neural
Networks in Population Genetic Inference}",
journal = {Molecular Biology and Evolution},
volume = 36,
number = 2,
pages = {220-238},
year = 2018,
month = 12,
abstract = "{Population-scale genomic data sets have given researchers
incredible amounts of information from which to infer
evolutionary histories. Concomitant with this flood of data,
theoretical and methodological advances have sought to extract
information from genomic sequences to infer demographic events
such as population size changes and gene flow among closely
related populations/species, construct recombination maps, and
uncover loci underlying recent adaptation. To date, most
methods make use of only one or a few summaries of the input
sequences and therefore ignore potentially useful information
encoded in the data. The most sophisticated of these
approaches involve likelihood calculations, which require
theoretical advances for each new problem, and often focus on
a single aspect of the data (e.g., only allele frequency
information) in the interest of mathematical and computational
tractability. Directly interrogating the entirety of the input
sequence data in a likelihood-free manner would thus offer a
fruitful alternative. Here, we accomplish this by representing
DNA sequence alignments as images and using a class of deep
learning methods called convolutional neural networks (CNNs)
to make population genetic inferences from these images. We
apply CNNs to a number of evolutionary questions and find that
they frequently match or exceed the accuracy of current
methods. Importantly, we show that CNNs perform accurate
evolutionary model selection and parameter estimation, even on
problems that have not received detailed theoretical
treatments. Thus, when applied to population genetic
alignments, CNNs are capable of outperforming expert-derived
statistical methods and offer a new path forward in cases
where no likelihood approach exists.}",
issn = {0737-4038},
doi = {10.1093/molbev/msy224},
url = {https://doi.org/10.1093/molbev/msy224},
eprint = {https://academic.oup.com/mbe/article-pdf/36/2/220/27736968/msy224.pdf},
}

@Article{pmid19706884,
Author="Robins, H. S. and Campregher, P. V. and Srivastava, S. K. and Wacher, A. and Turtle, C. J. and Kahsai, O. and Riddell, S. R. and Warren, E. H. and Carlson, C. S. ",
Title="{{C}omprehensive assessment of {T}-cell receptor beta-chain diversity in alphabeta {T} cells}",
Journal="Blood",
Year="2009",
Volume="114",
Number="19",
Pages="4099--4107",
Month="Nov"
Author = "Robins, H. S. and Campregher, P. V. and Srivastava, S. K.
and Wacher, A. and Turtle, C. J. and Kahsai, O. and Riddell,
S. R. and Warren, E. H. and Carlson, C. S. ",
Title = "{{C}omprehensive assessment of {T}-cell receptor beta-chain
diversity in alphabeta {T} cells}",
Journal = "Blood",
Year = 2009,
Volume = 114,
Number = 19,
Pages = "4099--4107",
Month = "Nov"
}

@article {Nurk2021.05.26.445798,
author = {Nurk, Sergey and Koren, Sergey and Rhie, Arang and
Rautiainen, Mikko and Bzikadze, Andrey V. and Mikheenko, Alla
and Vollger, Mitchell R. and Altemose, Nicolas and Uralsky,
Lev and Gershman, Ariel and Aganezov, Sergey and Hoyt,
Savannah J. and Diekhans, Mark and Logsdon, Glennis A. and
Alonge, Michael and Antonarakis, Stylianos E. and Borchers,
Matthew and Bouffard, Gerard G. and Brooks, Shelise Y. and
Caldas, Gina V. and Cheng, Haoyu and Chin, Chen-Shan and Chow,
William and de Lima, Leonardo G. and Dishuck, Philip C. and
Durbin, Richard and Dvorkina, Tatiana and Fiddes, Ian T. and
Formenti, Giulio and Fulton, Robert S. and Fungtammasan,
Arkarachai and Garrison, Erik and Grady, Patrick G.S. and
Graves-Lindsay, Tina A. and Hall, Ira M. and Hansen, Nancy F.
and Hartley, Gabrielle A. and Haukness, Marina and Howe,
Kerstin and Hunkapiller, Michael W. and Jain, Chirag and Jain,
Miten and Jarvis, Erich D. and Kerpedjiev, Peter and Kirsche,
Melanie and Kolmogorov, Mikhail and Korlach, Jonas and
Kremitzki, Milinn and Li, Heng and Maduro, Valerie V. and
Marschall, Tobias and McCartney, Ann M. and McDaniel, Jennifer
and Miller, Danny E. and Mullikin, James C. and Myers, Eugene
W. and Olson, Nathan D. and Paten, Benedict and Peluso, Paul
and Pevzner, Pavel A. and Porubsky, David and Potapova, Tamara
and Rogaev, Evgeny I. and Rosenfeld, Jeffrey A. and Salzberg,
Steven L. and Schneider, Valerie A. and Sedlazeck, Fritz J.
and Shafin, Kishwar and Shew, Colin J. and Shumate, Alaina and
Sims, Yumi and Smit, Arian F. A. and Soto, Daniela C. and
Sovi{\'c}, Ivan and Storer, Jessica M. and Streets, Aaron and
Sullivan, Beth A. and Thibaud-Nissen, Fran{\c c}oise and
Torrance, James and Wagner, Justin and Walenz, Brian P. and
Wenger, Aaron and Wood, Jonathan M. D. and Xiao, Chunlin and
Yan, Stephanie M. and Young, Alice C. and Zarate, Samantha and
Surti, Urvashi and McCoy, Rajiv C. and Dennis, Megan Y. and
Alexandrov, Ivan A. and Gerton, Jennifer L. and
O{\textquoteright}Neill, Rachel J. and Timp, Winston and Zook,
Justin M. and Schatz, Michael C. and Eichler, Evan E. and
Miga, Karen H. and Phillippy, Adam M.},
title = {The complete sequence of a human genome},
elocation-id = {2021.05.26.445798},
year = 2021,
doi = {10.1101/2021.05.26.445798},
publisher = {Cold Spring Harbor Laboratory},
abstract = {In 2001, Celera Genomics and the International Human Genome
Sequencing Consortium published their initial drafts of the
human genome, which revolutionized the field of genomics.
While these drafts and the updates that followed effectively
covered the euchromatic fraction of the genome, the
heterochromatin and many other complex regions were left
unfinished or erroneous. Addressing this remaining 8\% of the
genome, the Telomere-to-Telomere (T2T) Consortium has finished
the first truly complete 3.055 billion base pair (bp) sequence
of a human genome, representing the largest improvement to the
human reference genome since its initial release. The new
T2T-CHM13 reference includes gapless assemblies for all 22
autosomes plus Chromosome X, corrects numerous errors, and
introduces nearly 200 million bp of novel sequence containing
2,226 paralogous gene copies, 115 of which are predicted to be
protein coding. The newly completed regions include all
centromeric satellite arrays and the short arms of all five
acrocentric chromosomes, unlocking these complex regions of
the genome to variational and functional studies for the first
time. Competing Interest Statement: AF and CSC are employees of
DNAnexus; IS, JK, MWH, PP, and AW are employees of Pacific
Biosciences; FJS has received travel funds to speak at events
hosted by Pacific Biosciences; SK and FJS have received travel
funds to speak at events hosted by Oxford Nanopore
Technologies. WT has licensed two patents to Oxford Nanopore
Technologies (US 8748091 and 8394584).},
URL = {https://www.biorxiv.org/content/early/2021/05/27/2021.05.26.445798},
eprint = {https://www.biorxiv.org/content/early/2021/05/27/2021.05.26.445798.full.pdf},
journal = {bioRxiv}
}

@ARTICLE{10.3389/fgene.2020.00900,
AUTHOR = {Wang, Luotong and Qu, Li and Yang, Longshu and Wang, Yiying
and Zhu, Huaiqiu},
TITLE = {NanoReviser: An Error-Correction Tool for Nanopore
Sequencing Based on a Deep Learning Algorithm},
JOURNAL = {Frontiers in Genetics},
VOLUME = 11,
PAGES = 900,
YEAR = 2020,
URL = {https://www.frontiersin.org/article/10.3389/fgene.2020.00900},
DOI = {10.3389/fgene.2020.00900},
ISSN = {1664-8021},
ABSTRACT = {Nanopore sequencing is regarded as one of the most
promising third-generation sequencing (TGS) technologies.
Since 2014, Oxford Nanopore Technologies (ONT) has developed a
series of devices based on nanopore sequencing to produce very
long reads, with an expected impact on genomics. However, the
nanopore sequencing reads are susceptible to a fairly high
error rate owing to the difficulty in identifying the DNA
bases from the complex electrical signals. Although several
basecalling tools have been developed for nanopore sequencing
over the past years, it is still challenging to correct the
sequences after applying the basecalling procedure. In this
study, we developed an open-source DNA basecalling reviser,
NanoReviser, based on a deep learning algorithm to correct the
basecalling errors introduced by current basecallers provided
by default. In our module, we re-segmented the raw electrical
signals based on the basecalled sequences provided by the
default basecallers. By employing convolution neural networks
(CNNs) and bidirectional long short-term memory (Bi-LSTM)
networks, we took advantage of the information from the raw
electrical signals and the basecalled sequences from the
basecallers. Our results showed NanoReviser, as a
post-basecalling reviser, significantly improving the
basecalling quality. After being trained on standard ONT
sequencing reads from public E. coli and human NA12878
datasets, NanoReviser reduced the sequencing error rate by
over 5\% for both the E. coli dataset and the human dataset.
The performance of NanoReviser was found to be better than
those of all current basecalling tools. Furthermore, we
analyzed the modified bases of the E. coli dataset and added
the methylation information to train our module. With the
methylation annotation, NanoReviser reduced the error rate by
7\% for the E. coli dataset and specifically reduced the error
rate by over 10\% for the regions of the sequence rich in
methylated bases. To the best of our knowledge, NanoReviser is
the first post-processing tool after basecalling to accurately
correct the nanopore sequences without the time-consuming
procedure of building the consensus sequence. The NanoReviser
package is freely available at
https://github.com/pkubioinformatics/NanoReviser.}
}



@Article{Davis2021,
author = {Davis, Eric M. and Sun, Yu and Liu, Yanling and Kolekar,
Pandurang and Shao, Ying and Szlachta, Karol and Mulder,
Heather L. and Ren, Dongren and Rice, Stephen V. and Wang,
Zhaoming and Nakitandwe, Joy and Gout, Alexander M. and
Shaner, Bridget and Hall, Salina and Robison, Leslie L. and
Pounds, Stanley and Klco, Jeffery M. and Easton, John and Ma,
Xiaotu},
title = {SequencErr: measuring and suppressing sequencer errors in
next-generation sequencing data},
journal = {Genome Biology},
year = 2021,
month = {Jan},
day = 25,
volume = 22,
number = 1,
pages = 37,
abstract = {There is currently no method to precisely measure the
errors that occur in the sequencing instrument/sequencer,
which is critical for next-generation sequencing applications
aimed at discovering the genetic makeup of heterogeneous
cellular populations.},
issn = {1474-760X},
doi = {10.1186/s13059-020-02254-2},
url = {https://doi.org/10.1186/s13059-020-02254-2}
}

@article{HEATHER20161,
title = {The sequence of sequencers: The history of sequencing DNA},
journal = {Genomics},
volume = 107,
number = 1,
pages = {1-8},
year = 2016,
issn = {0888-7543},
doi = {10.1016/j.ygeno.2015.11.003},
url = {https://www.sciencedirect.com/science/article/pii/S0888754315300410},
author = {James M. Heather and Benjamin Chain},
keywords = {DNA, RNA, Sequencing, Sequencer, History},
abstract = {Determining the order of nucleic acid residues in
biological samples is an integral component of a wide variety
of research applications. Over the last fifty years large
numbers of researchers have applied themselves to the
production of techniques and technologies to facilitate this
feat, sequencing DNA and RNA molecules. This time-scale has
witnessed tremendous changes, moving from sequencing short
oligonucleotides to millions of bases, from struggling towards
the deduction of the coding sequence of a single gene to rapid
and widely available whole genome sequencing. This article
traverses those years, iterating through the different
generations of sequencing technology, highlighting some of the
key discoveries, researchers, and sequences along the way.}
}



@Article{vanDijk2014,
author = {van Dijk, Erwin L. and Auger, H{\'e}l{\`e}ne and
Jaszczyszyn, Yan and Thermes, Claude},
title = {Ten years of next-generation sequencing technology},
journal = {Trends in Genetics},
year = 2014,
month = {Sep},
day = 01,
publisher = {Elsevier},
volume = 30,
number = 9,
pages = {418-426},
issn = {0168-9525},
doi = {10.1016/j.tig.2014.07.001},
url = {https://doi.org/10.1016/j.tig.2014.07.001}
}

Binary file not shown.
After Width: | Height: | Size: 24 KiB |
File diff suppressed because one or more lines are too long