Browse Source

pushed algos, switched to datatypes for wednesday gumbel meeting

u 3 years ago
parent
commit
06f289d79a

+ 0 - 0
latex/tex/bilder/k3/01_sam-structure.png → latex/tex/bilder/k3/sam-structure.png


BIN
latex/tex/bilder/kapitel3/iso25010.pdf


BIN
latex/tex/bilder/kapitel3/modell_point_to_point.pdf


BIN
latex/tex/bilder/kapitel3/nasa_rover.jpg


BIN
latex/tex/bilder/kapitel3/ws-wsdl20-fehler.pdf


+ 16 - 11
latex/tex/kapitel/k3_datatypes.tex

@@ -17,24 +17,29 @@
 %- \acs{DNA}S STOCHASTICAL ATTRIBUTES 
 %- IMPACT ON COMPRESSION
 
+% todo: use this https://www.reddit.com/r/bioinformatics/comments/7wfdra/eli5_what_are_the_differences_between_fastq_and/
+% bigger picture - structure chapters like this:
+% what is it/how does it work
+% where are limits (e.g. BAM)
+% what is our focus (and maybe 'why')
+
 \chapter{Datatypes}
-% \section{overview}
-As described in previous chapters \ac{DNA} can be represented by a String with the buildingblocks A,T,G and C. Using a common fileformat for saving text would be impractical because the encoding defines that other possible letters require more space per letter. So storing a single \textit{A} in ASCII encoding requires 4 bit (excluding the magic bytes in the fileheader), whereas only two bits are needed to save a letter with a four letter alphabet e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. More common Text encodings like unicode require even more storage spcae per letter. So settling with ASCII has improvement capabilitie but is, on the other side, more efficient than using bulkier alternatives like unicode.
+% \section{}
+As described in previous chapters, \ac{DNA} can be represented by a string with the building blocks A, T, G and C. Using a common file format for saving text would be impractical, because the number of characters or symbols in the alphabet used defines how many bits are needed to store each symbol.
+Storing a single \textit{A} with ASCII encoding requires 8 bits (excluding magic bytes and the \ac{EOF}), since ASCII defines $2^7 = 128$ symbols, each commonly stored in one byte. Since the \ac{DNA} building blocks form an alphabet of only four letters, two bits per letter suffice, e.g.: \texttt{00 -> A, 01 -> T, 10 -> G, 11 -> C}. Common text encodings like Unicode (UTF-16) require at least 16 bits per letter. So settling on ASCII still leaves room for improvement, but is more efficient than using bulkier alternatives like Unicode.
 \\
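To make the two-bit mapping concrete, here is a minimal Python sketch of packing a base string into bytes. The bit assignments are the ones from the text; the assumption that the input contains only A, T, G and C (no N bases, no quality data) is mine.

# Minimal sketch of the two-bit encoding described above; assumes the
# input contains only the bases A, T, G, C (no 'N', no quality values).
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}  # mapping from the text

def pack(seq: str) -> bytes:
    """Pack up to four bases into each output byte, two bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    # A real format would also record len(seq) so that a trailing
    # partial byte can be decoded unambiguously.
    return bytes(out)

print(pack("ATGC").hex())  # 00 01 10 11 -> 0x1b: one byte instead of four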
-Several people and groups have developed different fileformats to store genomes. Unfortunally for this work, there is no defined standard filetype or set of filetypes therefor one has to gather information on which types exist and how they function by themself. In order to not go beyond scope, this work will focus only on fileformats that fullfill two factors://
+Several people and groups have developed different file formats to store genomes. Unfortunately for this work, there is no defined standard file type or set of file types, so one has to gather information on which types exist and how they function by oneself. In order not to go beyond scope, this work will focus only on file formats that fulfill two criteria:\\
 1. it has a reputation, either through a scientific paper that proves its superiority in comparison with other relevant tools, or through broad usage of the format.\\
 2. 
 \begin{itemize}
+% which is relevant? 
+  \item{FASTA}
+  \item{twoBit}
   \item{FASTQ}
   \item{SAM/BAM}
-  %\item{...}
+  \item{VCF}
+  \item{BED}
 \end{itemize}
-
-%BAM : Contains sequence, quality, mapping, signal, etc
-%FASTQ : Contains sequence, quality  
-%VCF : Contains only variant calls 
-%BED : Contains target region information 
-%XML : Contains spectrum used to call the genotype
 % src: http://help.oncokdm.com/en/articles/1195700-what-is-a-bam-fastq-vcf-and-bed-file
 
 Since methods to store this kind of data are still in development, there are many more file types. The few mentioned above are used by different organisations and researchers and are backed by scientific publications. % todo find sources to both points in last sentence
@@ -58,7 +63,7 @@ SAM Sequence Alignment/Map format, often just called BAM like its fileextension,
 
 \begin{figure}[ht]
   \centering
-  \includegraphics[width=15cm]{k_datatypes/01_sam-structure.png}
+  \includegraphics[width=15cm]{k3/sam-structure.png}
   \caption{SAM/BAM file structure example}
   \label{k_datatypes:bam-struct}
 \end{figure}

+ 15 - 8
latex/tex/kapitel/k4_algorithms.tex

@@ -18,10 +18,12 @@
 %- IMPACT ON COMPRESSION
 
 \chapter{Compression approaches}
+% begin with entropy encoding/shannons source coding theorem
+
 The process of compressing data serves to generate an output that is smaller than the input data. In many cases, as in genome compression, the compression is ideally lossless. This means that for every compressed file, the full information available in the original data can be recovered by decompressing it. Lossy compression, on the other hand, may discard parts of the data during the compression process in order to increase the compression ratio. The discarded parts are typically not necessary to convey the original information. This works for certain audio and image files, or for network protocols that transmit live video and audio streams.
 For \acs{DNA}, lossless compression is needed. To be precise, lossy compression is not an option, because there is no unnecessary data: every nucleotide and its position is needed for the sequenced \acs{DNA} to be complete.
 
-\subsection{Huffman encoding}
+\section{Huffman encoding}
 % list of algos and the tools that use them
 The well-known Huffman coding is used in several tools for genome compression. This section should give the reader a general impression of how the algorithm works, without going into detail. To use Huffman coding one must first define an alphabet; in our case a four-letter alphabet containing \texttt{A, C, G and T} is sufficient. The basic structure is symbolized as a tree, to which a few simple rules apply:
 % binary view for alphabet
@@ -33,14 +35,19 @@ The well known Huffman coding, is used in several Tools for genome compression.
  \item every symbol has a weight, defined by the frequency with which the symbol occurs in the input text
  \item the more weight a node has, the higher the probability that its symbol is read next in the symbol sequence
 \end{itemize}
-The process of compromising starts with the nodes with the lowest weight and buids up to the hightest. Each step adds nodes to a tree where the most left branch should be the shortest and the most right the longest. The most left branch ends with the symbol with the highest weight, therefore occours the most in the input data.
-Following one path results in the binary representation for one symbol. For an alphabet like the one described above, an binary representation could initially look like this \texttt{A -> 00, C -> 01, G -> 10, T -> 11} with a sequence that has this distribution \texttt{A -> 10, C - 8, G -> 3, T -> 2} with a corresponding tree the compromised data would look like this: \texttt{}
-
-% begriffdef. alphabet,
-% leafs
-% weights
-% paths
+The compression process starts with the nodes with the lowest weight and builds up to the highest. Each step adds nodes to a tree in which the leftmost branch should be the shortest and the rightmost the longest. The leftmost branch ends with the symbol with the highest weight, i.e. the one that occurs most often in the input data.
+Following one path from the root yields the binary representation of one symbol. For the alphabet described above, the ASCII encodings are \texttt{A -> 01000001, C -> 01000011, G -> 01000111, T -> 01010100}. Consider an imaginary sequence with this character distribution: \texttt{A -> 10, C -> 8, G -> 4, T -> 2}. From this information a weight is derived for each character, typically its occurrence count or its relative frequency. With the corresponding tree built from these weights, the binary code for each symbol changes to \texttt{A -> 0, C -> 11, T -> 100, G -> 101}. Besides the compressed data, the information contained in the tree must be saved for the decompression process.
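As a hedged illustration of the construction just described, the following Python sketch builds a code table for the example distribution A: 10, C: 8, G: 4, T: 2. The bottom-up heap merge is the standard textbook construction; with the tie-breaking used here it happens to reproduce the codes from the text, while other tie-breaking rules would yield different but equally optimal codes.

# Sketch of bottom-up Huffman construction for the distribution above.
import heapq

def huffman_codes(freqs):
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # the two subtrees with the
        w2, _, c2 = heapq.heappop(heap)   # lowest total weight...
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))  # ...are merged
        tie += 1
    return heap[0][2]

print(huffman_codes({"A": 10, "C": 8, "G": 4, "T": 2}))
# -> {'A': '0', 'T': '100', 'G': '101', 'C': '11'}, matching the text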
 
 % (genomic squeeze <- official | inofficial -> GDC, GRS). Further \ac{ANS} or rANS ... TBD.
 
+\section{Arithmetic coding}
+Arithmetic coding is an approach to reduce the memory wasted by the overhead that arises when alphabets of certain sizes are encoded in binary. Encoding a three-letter alphabet requires at least two bits per letter. Since there are four possible combinations of two bits, one combination remains unused, so the full potential is not exhausted. Looking at it from another perspective, less storage would be required if two of the letters could be encoded with one bit each and the third with a two-bit combination. With a fixed bit mapping this is not possible, because the letters would no longer be clearly distinguishable: the two-bit letter could be interpreted either as the letter it is supposed to represent or as two one-bit letters.
+% check this wording 'simulating' with sources 
+% this is called subdividing
+Arithmetic coding avoids this overhead by, in effect, allowing a fractional number of bits per letter. This is possible by projecting the input text onto a single floating-point number. Every character in the alphabet is represented by an interval between two floating-point numbers in the range 0.0 to 1.0 (exclusive), determined by the character's distribution in the input text (interval start) and the start of the next character's interval (interval end). To encode a sequence of characters, the interval of the first character is selected; this interval is then split into smaller sub-intervals with the same ratios as the initial intervals between 0.0 and 1.0, and the sub-interval of the second character is chosen. This process is repeated until an interval for the last character has been chosen.\\
+To encode the result in binary, the binary floating-point representation of a number inside the final interval is calculated, using a process similar to the one described above, called subdividing.
% the subdividing is finite because processors limit floating-point precision
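The interval narrowing described above can be sketched in a few lines of Python. The symbol probabilities below are invented for illustration, and plain floats stand in for the integer range arithmetic a real coder would use, precisely because floating-point precision is finite, as the note above says.

# Sketch of arithmetic-coding interval subdivision with made-up
# cumulative probability ranges; real coders use integer ranges
# because floating-point precision is finite.
CUM = {"A": (0.0, 0.5), "C": (0.5, 0.8), "G": (0.8, 0.9), "T": (0.9, 1.0)}

def encode_interval(text):
    """Return the final [low, high) interval that identifies `text`."""
    low, high = 0.0, 1.0
    for sym in text:
        width = high - low
        s, e = CUM[sym]                  # sub-interval assigned to sym
        low, high = low + width * s, low + width * e
    return low, high

low, high = encode_interval("ACG")
print(low, high)  # any number in [low, high) decodes back to "ACG"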
+
+
 \section{Probability approaches}
+