|
@@ -4,7 +4,7 @@
|
|
|
%# BASICS
|
|
%# BASICS
|
|
|
%- \acs{DNA} STRUCTURE
|
|
%- \acs{DNA} STRUCTURE
|
|
|
%- DATA TYPES
|
|
%- DATA TYPES
|
|
|
-% - BAM/FASTQ
|
|
|
|
|
|
|
+% - BAM/\ac{FASTQ}
|
|
|
% - NON STANDARD
|
|
% - NON STANDARD
|
|
|
%- COMPRESSION APPROACHES
|
|
%- COMPRESSION APPROACHES
|
|
|
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
|
|
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
|
|
@@ -40,7 +40,7 @@ Some common file formats would be:
|
|
|
\begin{itemize}
|
|
\begin{itemize}
|
|
|
% which is relevant?
|
|
% which is relevant?
|
|
|
\item{FASTA}
|
|
\item{FASTA}
|
|
|
- \item{FASTQ}
|
|
|
|
|
|
|
+ \item{\ac{FASTQ}}
|
|
|
\item{twoBit}
|
|
\item{twoBit}
|
|
|
\item{SAM/BAM}
|
|
\item{SAM/BAM}
|
|
|
\item{VCF}
|
|
\item{VCF}
|
|
@@ -52,9 +52,9 @@ Since methods to store this kind of data are still in development, there are man
|
|
|
%rewrite:
|
|
%rewrite:
|
|
|
In order to not go beyond the scope, this paper will only focus on compression tools that use standard formats.
|
|
In order to not go beyond the scope, this paper will only focus on compression tools that use standard formats.
|
|
|
|
|
|
|
|
-\section{FASTQ}
|
|
|
|
|
|
|
+\section{\ac{FASTQ}}
|
|
|
This is a text-based format for storing sequenced data. It stores nucleotides as letters and, in addition, their quality values.
|
|
This is a text-based format for storing sequenced data. It stores nucleotides as letters and, in addition, their quality values.
|
|
|
-FASTQ files consist of records of four lines each; each record contains the information for one sequence. The exact structure of the FASTQ format is as follows:
|
|
|
|
|
|
|
+\ac{FASTQ} files consist of records of four lines each; each record contains the information for one sequence. The exact structure of the \ac{FASTQ} format is as follows:
|
|
|
\texttt{
|
|
\texttt{
|
|
|
Line 1: Sequence identifier (title), starting with an @ and followed by an optional description.\\
|
|
Line 1: Sequence identifier (title), starting with an @ and followed by an optional description.\\
|
|
|
Line 2: The sequence, consisting of nucleotides symbolized by A, T, G and C.\\
|
|
Line 2: The sequence, consisting of nucleotides symbolized by A, T, G and C.\\
|
|
@@ -65,9 +65,9 @@ The quality values have no fixed type, to name a few there is the sanger format,
|
|
|
The quality value shows the estimated probability of error in the sequencing process.
|
|
The quality value shows the estimated probability of error in the sequencing process.
|
|
|
[...]
|
|
[...]
|
|
|
|
|
|
|
|
-\section{SAM/BAM}
|
|
|
|
|
|
|
+\section{Sequence Alignment Map}
|
|
|
% src https://github.com/samtools/samtools
|
|
% src https://github.com/samtools/samtools
|
|
|
-\ac{SAM}, often seen in its compressed binary representation \ac{BAM} with the file extension \texttt{.bam}, is the format handled by the SAMtools package, a utility for processing SAM/BAM and CRAM files. SAM is a text-based format delimited by TABs. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 as defined in RFC1345. The structure is more complex than that of FASTQ and is best described with an example:
|
|
|
|
|
|
|
+\ac{SAM}, often seen in its compressed binary representation \ac{BAM} with the file extension \texttt{.bam}, is the format handled by the SAMtools package, a utility for processing SAM/BAM and CRAM files. SAM is a text-based format delimited by TABs. It uses 7-bit US-ASCII, to be precise the charset ANSI X3.4-1968 as defined in RFC1345. The structure is more complex than that of \ac{FASTQ} and is best described with an example:
|
|
|
|
|
|
|
|
\begin{figure}[ht]
|
|
\begin{figure}[ht]
|
|
|
\centering
|
|
\centering
|
|
@@ -89,7 +89,7 @@ The regular expression, shown above, filters tuples of characters from a to z i
|
|
|
%- allows viewing BAM data (locally and remotely via ftp/http)
|
|
%- allows viewing BAM data (locally and remotely via ftp/http)
|
|
|
%- file extension: <filename>.bam.bai
|
|
%- file extension: <filename>.bam.bai
|
|
|
|
|
|
|
|
-%- stores more data than FASTQ
|
|
|
|
|
|
|
+%- stores more data than \ac{FASTQ}
|
|
|
|
|
|
|
|
% src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
|
|
% src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
|
|
|
%- alignment section includes
|
|
%- alignment section includes
|
|
@@ -102,6 +102,6 @@ The regular expression, shown above, filters tuples of characters from a to z i
|
|
|
|
|
|
|
|
%- BAM index file naming scheme: <filename>.bam.bai
|
|
%- BAM index file naming scheme: <filename>.bam.bai
|
|
|
|
|
|
|
|
-\section{CRAM - Compressed Reference-oriented Alignment Map}
|
|
|
|
|
|
|
+\section{Compressed Reference-oriented Alignment Map}
|
|
|
\ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} formats. Its specification is maintained by \ac{GA4GH}. It features both lossy and lossless compression modes. Since lossy compression is not relevant to this work, it is ignored from here on. Even though \ac{CRAM} is part of the \ac{GA4GH} suite, the file format can be used independently.\\
|
|
\ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} formats. Its specification is maintained by \ac{GA4GH}. It features both lossy and lossless compression modes. Since lossy compression is not relevant to this work, it is ignored from here on. Even though \ac{CRAM} is part of the \ac{GA4GH} suite, the file format can be used independently.\\
|
|
|
The format stores data in containers, which consist of slices. Each slice is represented by a line in the file. Containers and slices each store metadata in a header. Data is stored in slices as blocks, in compressed form.
|
|
The format stores data in containers, which consist of slices. Each slice is represented by a line in the file. Containers and slices each store metadata in a header. Data is stored in slices as blocks, in compressed form.
|