DeconSeq. DECONtamination of SEQuence data.

FAQ

If you can't find the answer to your question, take a look at the manual or the Q&A site.

What is DeconSeq?

DeconSeq is a publicly available tool that automatically detects and efficiently removes sequence contamination from genomic and metagenomic datasets. It is easily configurable and provides a user-friendly interface. The interactive web interface offers visualizations of the results and export functionality for subsequent data processing, and is available at http://edwards.sdsu.edu/deconseq or by clicking on "Use DeconSeq" in the menu above.


How can I cite DeconSeq?

If you use DeconSeq, please cite:
Schmieder R and Edwards R: Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS ONE 2011, 6:e17288. [PMID: 21408061]

@article{schmieder_deconseq,
	title = {Fast identification and removal of sequence contamination from genomic and metagenomic datasets},
	volume = {6},
	issn = {1932-6203},
	url = {http://www.ncbi.nlm.nih.gov/pubmed/21408061},
	doi = {10.1371/journal.pone.0017288},
	number = {3},
	journal = {{PLoS} {ONE}},
	author = {Robert Schmieder and Robert Edwards},
	year = {2011},
	note = {{PMID:} 21408061},
	pages = {e17288}
}



Why should I use DeconSeq?

Sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Such sequence contamination is a serious concern for the quality of the data used in downstream analysis, and can cause misassembly of sequence contigs and erroneous conclusions. Removing sequence contaminants is therefore a necessary step for all metagenomic projects.

There are several advantages in using DeconSeq to pre-process sequence data:
   - Removal of potential sequence contamination improves the reliability of downstream data analysis
   - The web application allows users to pre-process their datasets without installing any software
   - It takes about 10-15 minutes to screen an average-size metagenome for human contamination


How is it different from other programs?

DeconSeq offers features that are unique to the program, such as the coverage-identity plots, which make it easy to define the thresholds, and the option to retain sequences that are similar to both possible contaminants and the sampled organism(s).


How does it work?

The flowchart describes the basic steps of DeconSeq.
[Figure: DeconSeq flowchart]


Is there a standalone version of the program?

Yes, there is a standalone version of DeconSeq available. The Perl code and modified BWA-SW source code (under "Downloads") can be used to run DeconSeq as a standalone version, if required.


What file formats does DeconSeq support?

You can submit files in FASTA or FASTQ format to the web version, and in FASTA format to the standalone version. Files can also be compressed in ZIP or GZIP format (web version only).


What is the maximum number of sequences that I can submit to the web version?

There is no limit on the number of sequences that you can submit. However, there is a limit on the size of the file that you can upload. The current web service accepts files of up to 600 MB. If you compress your data, you can submit around 2 GB of sequence data.


Where can I set the threshold parameters?

The DeconSeq web interface does not require threshold parameters (such as query coverage or alignment identity) to be set before the data is processed. Instead, the thresholds are set after the data is processed, which allows users to choose parameters appropriate for their dataset without having to submit and process the same data several times with modified parameters. The DeconSeq standalone version requires the thresholds as input prior to data processing.
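The idea of choosing thresholds after processing can be illustrated with a minimal Python sketch. It assumes that each query's alignment result is already available as a coverage and identity value in percent; the Hit type, the flag_contaminants function and the ">=" comparison are hypothetical choices for illustration, not the DeconSeq implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical stored alignment result for one query sequence."""
    query_id: str
    coverage: float  # query coverage in percent
    identity: float  # alignment identity in percent


def flag_contaminants(hits, min_coverage, min_identity):
    """Return the IDs of queries whose hit meets both thresholds.

    Assumption: a hit counts as contamination when coverage and identity
    are both greater than or equal to the chosen thresholds.
    """
    return {
        h.query_id
        for h in hits
        if h.coverage >= min_coverage and h.identity >= min_identity
    }


# The alignment results are computed once ...
hits = [
    Hit("read1", 99.0, 98.5),
    Hit("read2", 92.0, 96.0),
    Hit("read3", 60.0, 99.0),
]

# ... and can then be filtered with different thresholds without realigning.
strict = flag_contaminants(hits, min_coverage=95.0, min_identity=97.0)
relaxed = flag_contaminants(hits, min_coverage=90.0, min_identity=94.0)
```

With these example values, the strict thresholds flag only read1, while the relaxed thresholds flag read1 and read2.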


What threshold values should I use?

The identity threshold should be set according to your expected error rate. For example, if your data set has an average error rate of 2%, your identity threshold should be set to 97% [= 100% - (error rate + 1% margin)] or below. The base N in your query sequence always mismatches the reference sequence, and therefore sequences with Ns cannot be aligned with 100% identity (unless the Ns occur at the ends and the alignment stops before them).
The coverage value should be selected based on the quality of your data. If your sequences are likely to have many errors at the 3'-end, then the alignment might not fully cover the query sequence. A value between 90% and 95% should be selected if unsure.
The coverage vs. identity plots can be helpful for the selection of the threshold values. The bar charts at the top and right show how the sequences were aligned. The higher the bars, the more sequences were aligned with this coverage or identity value. For example, if you see high bars in the right chart for 100% to 98% and low bars for 97% and below, then you should set your identity threshold to at most 98%. If you also selected a Retain database, the second plot will help you to select appropriate thresholds while considering the sequences matching the Remove and Retain databases. For example, if you only see red dots (with red lines) in the top right corner, you may want to select thresholds that cover this whole area.
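The rule of thumb for the identity threshold, 100% - (error rate + 1% margin), can be written as a small Python helper. This is only a sketch of the arithmetic given above; the function name and the option to change the margin are assumptions for illustration.

```python
def identity_threshold(error_rate_pct, margin_pct=1.0):
    """Return the suggested upper bound for the identity threshold in percent.

    Implements 100% - (error rate + margin), using the 1% margin from the
    rule of thumb as the default.
    """
    if not 0.0 <= error_rate_pct <= 100.0:
        raise ValueError("error rate must be between 0 and 100 percent")
    return max(0.0, 100.0 - (error_rate_pct + margin_pct))


# An average error rate of 2% gives an identity threshold of 97% or below.
print(identity_threshold(2.0))  # 97.0
```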


How long do you keep the data submitted to the web version?

You can choose whether we keep the data accessible for one day (24 hours) or one week (168 hours). You can also ask us to delete the data once you are done, or to keep it for a longer time period.


How were the databases for the web version generated?

The web-based version of DeconSeq offers preprocessed reference databases for a variety of complete genomes, such as human, bacterial and viral genomes. Long stretches of Ns are randomly replaced by A, C, G or T during database indexing, which can introduce false positive matches. To reduce the number of such matches, the genome sequences were split at stretches of 200 or more Ns. The separated sequences were then filtered for duplicates to reduce redundancy in the sequence data, and for short sequences that contained more than 5% of the ambiguous base N. Because of the size restriction for databases that can be created with BWA, and to decrease the memory usage on the computing cluster, the genome data was split into smaller files that require a maximum of 1.5 GB of memory per chunk. The results for the split databases are automatically joined before generating the output.
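The splitting and filtering steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used to build the web databases: the function names are hypothetical, duplicates are taken to mean exact (case-insensitive) duplicates, and because the length cutoff for "short" sequences is not stated here, the 5% N filter is applied to every piece.

```python
import re


def split_at_n_runs(sequence, min_run=200):
    """Split a sequence at every run of at least `min_run` Ns and drop the runs."""
    pattern = re.compile(r"N{%d,}" % min_run, re.IGNORECASE)
    return [piece for piece in pattern.split(sequence) if piece]


def n_fraction(sequence):
    """Fraction of positions in `sequence` that are the ambiguous base N."""
    return sequence.upper().count("N") / len(sequence)


def filter_pieces(pieces, max_n_fraction=0.05):
    """Drop exact duplicates and pieces with more than `max_n_fraction` Ns.

    Assumption: the filter is applied to all pieces, since no length
    cutoff for short sequences is given.
    """
    seen = set()
    kept = []
    for piece in pieces:
        key = piece.upper()
        if key in seen or n_fraction(piece) > max_n_fraction:
            continue
        seen.add(key)
        kept.append(piece)
    return kept


# Toy genome: three sequence blocks separated by N runs of 200 and 250 bases.
genome = "ACGT" * 10 + "N" * 200 + "GGCC" * 10 + "N" * 250 + "ACGT" * 10
pieces = split_at_n_runs(genome)
kept = filter_pieces(pieces)
print(len(pieces), len(kept))  # 3 2
```

In this toy example the first and third pieces are identical, so only two pieces remain after filtering. Runs of fewer than 200 Ns are left inside a piece and only affect it through the N fraction filter.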


I can't find the right database for my data set. What should I do?

If you want to identify contaminants in your data set using data that is not listed as a database, you can contact us and we can add the database to the web version. Please note that we do not provide databases for the standalone version. If you want to create your own database, follow the steps described under "Manual".


Why is DeconSeq so fast?

DeconSeq is based on the BWA-SW aligner. We modified the code to fit our own needs, while leaving the core alignment functions unchanged. The BWA-SW alignment algorithm searches for Z significant matches and then stops. By default, Z=1, which means the program stops after finding the first of all the possible hits. In contrast, most other programs, such as BLAST, search for all suboptimal matches as well and are therefore much slower. However, one significant hit is enough to tell us whether the query sequence is similar to a reference sequence (the possible contamination) or not.
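The benefit of stopping after Z hits can be illustrated with a toy Python sketch. It is only an analogy for the early-stopping idea and does not reproduce the BWA-SW algorithm; the function names and the substring test used as a stand-in for a "significant match" are assumptions for illustration.

```python
from itertools import islice


def significant_hits(query, references, is_significant):
    """Lazily yield the IDs of references that match the query."""
    for ref_id, ref_seq in references:
        if is_significant(query, ref_seq):
            yield ref_id


def first_z_hits(query, references, is_significant, z=1):
    """Stop searching as soon as `z` significant hits have been found."""
    return list(islice(significant_hits(query, references, is_significant), z))


examined = []  # records which references were actually compared


def contains(query, ref_seq):
    """Toy stand-in for a significance test: exact substring match."""
    examined.append(ref_seq)
    return query in ref_seq


refs = [("refA", "TTTT"), ("refB", "ACGTACGT"), ("refC", "GGACGTGG")]
hits = first_z_hits("ACGT", refs, contains, z=1)
```

Here hits contains only refB, and refC is never examined even though it would also match. That is sufficient to decide that the query resembles a possible contaminant.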


How was the BWA-SW source code modified?

DeconSeq uses a modified version of the BWA-SW source code, which is made available as part of the DeconSeq source. The following changes were made:
   - The file bwtsw2_aux.c was modified to generate an alternative output: a lightweight tab-separated format containing only the data required by DeconSeq (query identifier, reference identifier, query coverage and alignment identity).
   - The file bwtsw2_aux.c was additionally modified to force a mismatch when aligning the ambiguous base N in query sequences, instead of randomly replacing it by A, C, G or T and possibly producing a match (the BWA-SW default).
   - The files stdaln.c, stdaln.h and bwtsw2_aux.c were modified to include "R" for replacements in an extended version of the Cigar string, instead of using "M" for both match and replacement (mismatch).
   - The files bwtsw2_main.c and bwtsw2.h were modified to fix the doubly defined parameter -s (changed to -s and -S), and to add the new parameters -A (generate alternative output), -R (output extended version of Cigar string with replacements) and -M (force Ns in the query sequence to mismatch).
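To show why separating "M" (match) from "R" (replacement) is useful, the following Python sketch counts the operations in an extended Cigar string and derives an identity value from them. The example string and the identity definition (matches divided by matches, replacements, insertions and deletions) are assumptions for illustration; the exact formula used by DeconSeq is not described here.

```python
import re
from collections import Counter

# One Cigar operation: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([A-Z])")


def count_cigar_ops(cigar):
    """Sum the lengths of each operation type in a Cigar string."""
    counts = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] += int(length)
    return counts


def identity_percent(cigar):
    """Percent identity under the assumed definition M / (M + R + I + D)."""
    counts = count_cigar_ops(cigar)
    aligned = counts["M"] + counts["R"] + counts["I"] + counts["D"]
    if aligned == 0:
        return 0.0
    return 100.0 * counts["M"] / aligned


# Hypothetical extended Cigar: 48 matches, 2 replacements, 50 matches.
print(identity_percent("48M2R50M"))  # 98.0
```

With a standard Cigar string the same alignment would read 100M, and the two mismatches could not be recovered from the string alone.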