Add single cell data challenges section

2022-10-31 17:19:36 +01:00 · 2022-10-31 17:19:36 +01:00 · c7fd4536af
parent 8be295ab52
commit c7fd4536af
2 changed files with 39 additions and 1 deletions
--- a/docs/Notes.org
+++ b/docs/Notes.org
@ -41,7 +41,8 @@ The rectified linear unit function or ReLu is a non-linear function that acts li
 [[attachment:_20221011_095153screenshot.png]]

 It is the /de facto/ activation function used for Deep Learning networks, due to its many advantages. It is *computationally simple*, allows *sparse representations* (it outputs zero values), has a *linear behavior* (easier to optimize and avoids vanishing gradients).
-
+*** Softmax
+The softmax function maps a vector of $k$ real-valued inputs to a probability distribution over $k$ outcomes: every output lies between 0 and 1, and the outputs sum to 1.
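As a minimal sketch of the definition above (plain Python, not tied to any deep learning library; the function name and the example scores are made up for illustration), softmax can be written as:

```python
import math

def softmax(logits):
    """Map k real-valued scores to k probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input scores, chosen only for illustration.
probs = softmax([2.0, 1.0, 0.1])
```

The ordering of the inputs is preserved: the largest score receives the largest probability.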
 * Transformers
 Deep learning models that are designed to process a connected set of units (e.g. tokens in a sequence or pixels in an image) using only the mechanism of *self-attention*.
 They are simple models that still haven't reached their performance limit, as it is currently only being limited by computational resources. They are also *extremely generic*, they have been mostly used for NLP but they can be exploited for more tasks, which is very useful for *multi-modal learning*.
@ -163,6 +164,42 @@ They can model dependencies over the whole range of the input sequence (unlike C
 ** Modeling of graph data
 Transformers are able to interpret graph data, as they see a sentence as a *fully connected* graph of words. In the case of NLP, it is possible to use full attention as the number of nodes (and subsequently of edges) is small enough which makes it computationally tractable.
 However, full attention is not feasible for most types of graph data, such as biological networks, where the number of nodes is much larger. In that case, we need to apply sparse attention (e.g. attend only to the local neighbours of each node).
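A rough sketch of the sparse-attention idea, assuming each node attends only to itself and its direct neighbours. The function name, the toy graph and the score values are hypothetical, and the projection of node features into queries, keys and values is left out:

```python
import math

def neighbour_attention(scores, neighbours):
    """Row-wise softmax over raw attention scores, restricted to each
    node's neighbours (plus the node itself). All other weights stay 0."""
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = sorted(set(neighbours[i]) | {i})
        shift = max(scores[i][j] for j in allowed)
        exps = {j: math.exp(scores[i][j] - shift) for j in allowed}
        total = sum(exps.values())
        row = [0.0] * n
        for j, e in exps.items():
            row[j] = e / total
        weights.append(row)
    return weights

# Hypothetical 4-node path graph 0 - 1 - 2 - 3 with arbitrary scores.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = [
    [0.5, 1.0, 2.0, 0.1],
    [1.0, 0.2, 0.3, 4.0],
    [0.0, 1.5, 0.5, 0.5],
    [2.0, 2.0, 1.0, 0.0],
]
weights = neighbour_attention(scores, neighbours)
```

Only the entries that correspond to edges (and self-loops) are evaluated, so the cost grows with the number of edges rather than with the square of the number of nodes.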
+* Single cell data
+:PROPERTIES:
+:ID:       4507db70-1e9b-4b40-a3f3-febd6276a1b2
+:END:
+Since its inception, single cell data has enabled the exploration of DNA, RNA or epigenetic marks at the finest resolution possible. Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurement in a cell, which allows the characterization of different cell types and even the transition between cell states (e.g. interesting for tumor progression). These new techniques have generated a substantial amount of enthusiasm due to the new possibilities that they enable. The advances in these techniques, during the last two decades, have generated vast quantities of data, which have to be analyzed in a computationally efficient and *statistically sound manner* [cite:@Lähnemann2020].
+Nevertheless, single cell data science brings a set of unique challenges to the table:
+
+- Limited amounts of genetic material
+- Higher technical noise due to amplification
+- Higher dimensionality of data
+
+#+CAPTION: Single-cell vs bulk sequencing
+#+LABEL: single-cell-vs-bulk
+#+ATTR_HTML: :width 50%
+[[attachment:_20221020_154236screenshot.png]]
+
+The most urgent challenges that need to be addressed are:
+
+- *Quantifying the uncertainty of measurements and analyses*: less material implies that there is a higher uncertainty and thus some common tasks (e.g. SNP calling) have to be performed with methodological care.
+- *Benchmark methods that highlight relevant metrics*
+- *Scaling to higher dimensional data*: data integration methods need to scale to different types of context information.
+- *Data integration of multimodal data and samples*
+- *Varying levels of resolution*: operations with different granularity on cell types and states.
+
+** Single-cell transcriptomics
+
+Multiple challenges are present in single-cell RNA-seq such as:
+
+- *Sparsity in scRNA-seq*: zero values can be due to methodological noise or to a biologically true absence of expression. The degree of sparsity depends on the platform, the sequencing depth and the expression level of the gene, which makes it difficult to determine which category a given zero belongs to (and furthermore hinders data imputation techniques).
+- *Discovering complex patterns in differential gene expression*: currently differential expression detection relies on clustering before differential analysis, without taking into account the uncertainty of cell assignment. Furthermore, methods of comparison of cell-type specific changes across samples are emerging, and these newer methods need more flexible statistical frameworks to identify complex patterns across samples.
+- *Mapping single cells to a reference atlas*: to classify cells into cell types/states it is essential to have reliable reference systems, with a resolution down to cell state. A computationally and statistically sound method has to be developed that works at multiple resolutions, while taking into account transient cell states and quantifying the uncertainty of the mapping.
+- *Generalizing trajectory inference*: several biological processes (e.g. differentiation, cancer expansion) can be represented as dynamic changes in cell type/state. A trajectory is a potential path that a cell can follow in /pseudotime/, and trajectory inference describes these cell state dynamics. These techniques are still in their infancy and validation methods still have to be developed.
+- *Finding patterns in spatially resolved measurements*: spatially resolved single-cell transcriptomics retains the spatial coordinates of transcripts, and novel methods are being developed to extract useful information from this new type of data (e.g. spatial dependence of genes).
+- *Integration of multimodal single-cell data*: biological processes are complex and dynamic, and to analyze them multiple types of measurements need to be performed. The data from these sources has to be linked in a biologically meaningful way, while accounting for batch effects.
+- *Validating and benchmarking tools for single-cell data*: due to the rapid advances in sc-seq, systematic benchmarking and evaluation of these methods is becoming increasingly pressing.
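To make the sparsity challenge from the list above more concrete, the following sketch computes the fraction of zeros in a cells x genes count matrix, overall and per gene. The matrix is invented for illustration and the helper names are hypothetical. Note that these numbers only describe how sparse the data is; they cannot tell a technical zero from a biological one.

```python
def zero_fraction(counts):
    """Overall fraction of zero entries in a cells x genes count matrix."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total

def zero_fraction_per_gene(counts):
    """Fraction of cells with a zero count, computed separately for each gene (column)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    return [sum(1 for row in counts if row[g] == 0) / n_cells for g in range(n_genes)]

# Invented toy data: 3 cells (rows) x 4 genes (columns).
counts = [
    [0, 3, 0, 1],
    [0, 0, 0, 2],
    [5, 0, 0, 0],
]
overall = zero_fraction(counts)
per_gene = zero_fraction_per_gene(counts)
```

In this toy matrix 8 of the 12 entries are zero, and the third gene is never detected in any cell.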
+
 * Literature review
 ** CpG Transformer for imputation of single-cell methylomes
 *** DNA methylation methodologies
@ -255,4 +292,5 @@ Furthermore, its performance is hindered in areas of high sparsity as it is not
 * Glossary
 - Temperature: hyperparameter of neural networks used to control the randomness of predictions by scaling the logits prior to applying softmax: $\frac{logits}{temperature}$. The higher the temperature, the flatter the resulting distribution, which leads to more diverse but also more error-prone predictions.
 - Ablation: removal of components of the input to evaluate their significance
+- Manifold learning: class of unsupervised estimators that seeks to describe datasets as low-dimensional manifolds embedded in high-dimensional spaces.
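The temperature entry in the glossary above can be illustrated with a small sketch in plain Python. The function name, the default value and the example logits are assumptions made for illustration only:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, chosen only for illustration.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=5.0)
```

With a low temperature most of the probability mass concentrates on the largest logit, whereas a high temperature moves the distribution towards uniform, so sampling from it gives more varied outputs.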
 * References
--- a/docs/Notes.pdf
+++ b/docs/Notes.pdf