# locimend

locimend is a tool that corrects DNA sequencing errors using Deep Learning.

Given a DNA sequence that contains errors, the goal is to return the corrected sequence.

It provides both a command-line program and a REST API.

## Technologies

- TensorFlow
- Biopython
- FastAPI

## Installation

This project uses [Nix](https://nixos.org/) to ensure reproducible
builds.

1. Install Nix (compatible with macOS, Linux and
[WSL](https://docs.microsoft.com/en-us/windows/wsl/about)):

```bash
curl -L https://nixos.org/nix/install | sh
```

2. Clone the repository:

```bash
git clone https://git.coolneng.duckdns.org/coolneng/locimend
```

3. Change the working directory to the project:

```bash
cd locimend
```

4. Enter the nix-shell:

```bash
nix-shell
```

5. Install the dependencies via Poetry:

```bash
poetry install
```

After running these commands, you will find yourself in a shell that
contains all the needed dependencies.

## Usage

### Training the model

The following command trains the Deep Learning model and shows the accuracy and AUC:

```bash
poetry run python locimend/main.py train <data file> <label file>
```

- `<data file>`: FASTQ file containing the sequences with errors
- `<label file>`: FASTQ file containing the sequences without errors

Both files must list corresponding sequences in the same order: the simulated read at a given position in the data file must match the canonical sequence at the same position in the label file.

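Because training pairs records by position, it can help to confirm that both files hold the same number of records before starting a run. The following is a minimal sketch using only the Python standard library. `count_fastq_records` and `check_paired` are hypothetical helpers, not part of locimend, and the sketch assumes the common four-line FASTQ layout (header, sequence, `+`, quality) with no wrapped lines.

```python
from pathlib import Path


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    lines = [line for line in Path(path).read_text().splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError(f"{path}: line count {len(lines)} is not a multiple of 4")
    # Every record must start with an '@' header line.
    for i in range(0, len(lines), 4):
        if not lines[i].startswith("@"):
            raise ValueError(f"{path}: record {i // 4} does not start with '@'")
    return len(lines) // 4


def check_paired(data_file, label_file):
    """Return the shared record count, or raise if the files differ in length."""
    n_data = count_fastq_records(data_file)
    n_label = count_fastq_records(label_file)
    if n_data != n_label:
        raise ValueError(f"record counts differ: {n_data} (data) vs {n_label} (label)")
    return n_data
```

For example, `check_paired("data/curesim-HVR.fastq", "data/HVR.fastq")` would return the number of sequence pairs in the provided dataset. The check only compares counts; it cannot verify that each pair refers to the same underlying sequence.
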

A dataset is provided for training the model. To train on it, run the following command:

```bash
poetry run python locimend/main.py train data/curesim-HVR.fastq data/HVR.fastq
```

### Inference

A trained model is provided, which can be used to infer the correct sequences. There are two ways to interact with it:

- Command-line execution
- REST API

#### Command-line

The following command infers the corrected sequence and prints it:

```bash
poetry run python locimend/main.py infer "<DNA sequence>"
```
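
To correct several sequences from a script, the same command can be invoked through Python's `subprocess` module. This is a minimal sketch: `build_infer_command` and `infer` are hypothetical helper names, and it assumes the script runs from the repository root inside the environment set up during installation. The format of the printed output is not specified here, so the sketch returns it as raw text.

```python
import subprocess


def build_infer_command(sequence):
    """Build the argument list for the documented `infer` subcommand."""
    return ["poetry", "run", "python", "locimend/main.py", "infer", sequence]


def infer(sequence):
    """Run the CLI and return whatever it prints to standard output."""
    result = subprocess.run(
        build_infer_command(sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.strip()
```

Passing the sequence as its own list element, rather than building a shell string, avoids quoting problems.
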

#### REST API

The model can also be served via a REST API. To start the web server, run the following command:

```bash
poetry run api
```

The API is available at http://localhost:8000 and accepts either a GET or a POST request:

| Request | Endpoint | Payload |
|:-------:|:--------:|:-------:|
| GET | / | Sequence as a path parameter (in the URL) |
| POST | / | JSON |

For a POST request, the JSON must have the following structure:

```json
{"sequence": "<DNA sequence>"}
```
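
A client for both request types can be sketched with the Python standard library alone. The helper names below (`build_get_url`, `build_post_request`, `send`) are hypothetical, and the base URL is the local address the server is described as listening on. The shape of the response is not documented here, so the sketch returns the response body as raw text. The GET URL assumes the sequence is appended directly after the root path.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default address of the local server


def build_get_url(sequence, base_url=BASE_URL):
    """Build the GET URL with the sequence as a path parameter."""
    return f"{base_url}/{urllib.parse.quote(sequence, safe='')}"


def build_post_request(sequence, base_url=BASE_URL):
    """Build a POST request carrying the documented JSON payload."""
    payload = json.dumps({"sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request_or_url, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `send(build_post_request("ACGT"))` or `send(build_get_url("ACGT"))` would return the server's response as text.
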