3 years ago · 77f847a059
--- a/latex/tex/kapitel/k3_datatypes.tex
+++ b/latex/tex/kapitel/k3_datatypes.tex
@@ -4,7 +4,7 @@
 
															 %# BASICS
														
 
															 %- \acs{DNA} STRUCTURE
														
 
															 %- DATA TYPES
														
 
															-% - BAM/FASTQ
														
 
															+% - BAM/\ac{FASTQ}
														
 
															 % - NON STANDARD
														
 
															 %- COMPRESSION APPROACHES
														
 
															 % - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
														
@@ -40,7 +40,7 @@ Some common fileformats would be:
 
															 \begin{itemize}
														
 
															 % which is relevant? 
														
 
															   \item{FASTA}
														
 
															-  \item{FASTQ}
														
 
															+  \item{\ac{FASTQ}}
														
 
															   \item{twoBit}
														
 
															   \item{SAM/BAM}
														
 
															   \item{VCF}
														
@@ -52,9 +52,9 @@ Since methods to store this kind of Data are still in development, there are man
 
															 %rewrite:
														
 
															 To stay within scope, this paper will focus only on compression tools that use standard formats.
														
 
															-\section{FASTQ}
														
 
															+\section{\ac{FASTQ}}
														
 
															 \ac{FASTQ} is a text-based format for storing sequenced data. It stores nucleotides as letters and, in addition, the corresponding quality values.
														
 
															-FASTQ files are organized in blocks of four lines; each block of four lines contains the information for one sequence. The exact structure of the FASTQ format is as follows:
														
 
															+\ac{FASTQ} files are organized in blocks of four lines; each block of four lines contains the information for one sequence. The exact structure of the \ac{FASTQ} format is as follows:
														
 
															 \texttt{
														
 
															 Line 1: Sequence identifier (title), starting with an @ and optionally followed by a description.\\
														
 
															 Line 2: The sequence consisting of nucleotides, symbolized by A, T, G and C.\\
														
@@ -65,9 +65,9 @@ The quality values have no fixed type, to name a few there is the sanger format,
 
															 The quality value shows the estimated probability of error in the sequencing process.
														
 
															 [...]
														
 
															-\section{SAM/BAM}
														
 
															+\section{Sequence Alignment Map}
														
 
															 % src https://github.com/samtools/samtools
														
 
															-\ac{SAM}, often seen in its compressed binary representation \ac{BAM} with the file extension \texttt{.bam}, is handled by the SAMtools package, a utility for processing SAM/BAM and CRAM files. A SAM file is a text-based format delimited by TABs. It uses 7-bit US-ASCII, to be precise charset ANSI X3.4-1968 as defined in RFC1345. Its structure is more complex than that of FASTQ and is best described with an example:
														
 
															+\ac{SAM}, often seen in its compressed binary representation \ac{BAM} with the file extension \texttt{.bam}, is handled by the SAMtools package, a utility for processing SAM/BAM and CRAM files. A SAM file is a text-based format delimited by TABs. It uses 7-bit US-ASCII, to be precise charset ANSI X3.4-1968 as defined in RFC1345. Its structure is more complex than that of \ac{FASTQ} and is best described with an example:
														
 
															 \begin{figure}[ht]
														
 
															   \centering
														
@@ -89,7 +89,7 @@ The regulare expression, shown above, filters touple of characters from a to z i
 
															 %- allows viewing BAM data (locally and remotely via ftp/http)
														
 
															 %- file extension: <filename>.bam.bai
														
 
															-%- stores more data than FASTQ
														
 
															+%- stores more data than \ac{FASTQ}
														
 
															 % src: https://support.illumina.com/help/BS_App_RNASeq_Alignment_OLH_1000000006112/Content/Source/Informatics/BAM-Format.htm
														
 
															 %- alignment section includes
														
@@ -102,6 +102,6 @@ The regulare expression, shown above, filters touple of characters from a to z i
 
															 %- BAM index file naming scheme: <filename>.bam.bai 
														
 
															-\section{CRAM - Compressed Reference-oriented Alignment Map}
														
 
															+\section{Compressed Reference-oriented Alignment Map}
														
 
															 \ac{CRAM} was developed as an alternative to the \ac{SAM} and \ac{BAM} formats. Its specification is maintained by \ac{GA4GH}. It features both lossy and lossless compression modes. Since it is not relevant to this work, lossy compression is ignored from here on. Even though it is part of the \ac{GA4GH} suite, the file format can be used independently.\\
														
 
															 The format saves data in containers, which consist of slices. Each slice is represented by a line in the file. Containers and slices each store metadata in a header. Data is stored in compressed form as blocks within slices.
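The four-line record layout described in the \ac{FASTQ} section can be illustrated with a short reader. This is only a minimal sketch and not part of the commit: the names `FastqRecord`, `parse_fastq` and `error_probabilities` are hypothetical, lines 3 and 4 of a record are assumed to be the usual `+` separator and the quality string, and the quality decoding assumes the Sanger encoding (ASCII offset 33), which is just one of the quality formats the chapter mentions.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record, i.e. one block of four lines (hypothetical helper type)."""
    identifier: str  # line 1 without the leading '@', including any description
    sequence: str    # line 2, nucleotides as letters
    quality: str     # line 4, one quality character per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of text lines, four lines per record."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        try:
            sequence = next(it)
            separator = next(it)
            quality = next(it)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def error_probabilities(quality: str, offset: int = 33) -> List[float]:
    """Convert quality characters to error probabilities.

    Assumes a Phred score Q = ord(char) - offset with p = 10^(-Q/10);
    offset 33 corresponds to the Sanger encoding (an assumption here).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


if __name__ == "__main__":
    example = ["@read1 example description\n", "ACGT\n", "+\n", "IIII\n"]
    for record in parse_fastq(example):
        print(record.identifier, record.sequence, error_probabilities(record.quality))
```

Reading from a file would work the same way, for example `parse_fastq(open(path))` with a path of your choosing; other quality encodings would need a different `offset`.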