|
|
@@ -17,8 +17,174 @@
|
|
|
%- \acs{DNA}S STOCHASTICAL ATTRIBUTES
|
|
|
%- IMPACT ON COMPRESSION
|
|
|
|
|
|
+% Structure:
|
|
|
+% - Focus/Goal (why and what)
|
|
|
+% - Procedure (what and how)
|
|
|
+% . Specs and used tools
|
|
|
+
|
|
|
%\chapter{Analysis for Possible Compression Improvements}
|
|
|
-\chapter{Feasibillity Analysis for New Algorithm Considering Stochastic Organisation of Genomes}
|
|
|
+\chapter{Environment and Procedure to Determine the State-of-the-Art Efficiency and Compression Ratio of Relevant Tools}
|
|
|
+% goal define
|
|
|
+Since improvements must be measured, a baseline that needs to be beaten must be defined beforehand. Others have performed this task several times with common algorithms and tools and published their results. However, since the test case that needs to be built for this work is rather uncommon in its composition, the available data are not very useful. Therefore, new test data must be created.\\
|
|
|
+The goal is to determine a baseline for the efficiency and effectiveness of state-of-the-art tools used to compress \ac{DNA}. This baseline is set by two important factors:
|
|
|
+
|
|
|
+\begin{itemize}
|
|
|
+ \item Efficiency: the \textbf{duration} for which the process ran
|
|
|
+ \item Effectiveness: the difference in \textbf{size} between input and compressed data
|
|
|
+\end{itemize}
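Both factors can be captured with standard shell utilities. The following is a minimal sketch, not the measurement procedure used in this work: the function name \texttt{measure} and its argument order are hypothetical, and the timing has a resolution of one second only. For finer resolution, the shell's \texttt{time} keyword would be an alternative.

```shell
#!/bin/sh
# measure INPUT OUTPUT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, then reports the elapsed wall-clock
# time in whole seconds and the sizes of INPUT and OUTPUT in bytes.
measure() {
    input=$1
    output=$2
    shift 2

    start=$(date +%s)            # seconds since the epoch
    "$@" || return 1             # run the compression command
    end=$(date +%s)

    in_size=$(wc -c < "$input")   # size of the uncompressed file
    out_size=$(wc -c < "$output") # size of the compressed file

    echo "seconds=$((end - start)) input_bytes=$((in_size)) output_bytes=$((out_size)) saved_bytes=$((in_size - out_size))"
}

# Example call, with gzip as a stand-in compressor (an assumption):
#   measure sample.fa sample.fa.gz gzip -kf sample.fa
```

The helper expects the command to write its result to the path given as the second argument, which is why the example uses \texttt{gzip -k} to keep the input file in place.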
|
|
|
+
|
|
|
+As a third point, it should be verified that files were compressed losslessly. This is done by comparing the source file to a copy that was compressed and then decompressed again. If either of the two processes operates lossily, a difference in size between the source file and the copy should be recognizable.
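A byte-wise comparison is a stricter check than a size comparison, because it also detects altered content of equal length. The following sketch assumes \texttt{cmp} is available; the function name \texttt{roundtrip\_ok} and the use of \texttt{gzip} as a stand-in compressor are illustrative assumptions and not part of the described procedure.

```shell
#!/bin/sh
# roundtrip_ok ORIGINAL RESTORED
# Hypothetical helper: succeeds only if both files are byte-identical.
roundtrip_ok() {
    cmp -s "$1" "$2"
}

# Example round trip, with gzip as a stand-in compressor (an assumption):
#   gzip -c source.fa > copy.fa.gz
#   gzip -dc copy.fa.gz > copy.fa
#   roundtrip_ok source.fa copy.fa && echo "lossless" || echo "content differs"
```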
|
|
|
+
|
|
|
+%environment, test setup, raw results
|
|
|
+\section{Server Specifications and Test Environment}
|
|
|
+To make this setup reproducible in the future, the relevant specifications and the commands that revealed this information are listed in this section.\\
|
|
|
+
|
|
|
+Reading from /proc/cpuinfo reveals the processor specifications. Since most of the information displayed in the eight entries (one per logical processor) is redundant, only the last entry is shown. The relevant specifications are listed below:
|
|
|
+
|
|
|
+\noindent
|
|
|
+\begin{lstlisting}[language=bash]
|
|
|
+ cat /proc/cpuinfo
|
|
|
+\end{lstlisting}
|
|
|
+\begin{itemize}
|
|
|
+ \item available logical processors: 0 - 7
|
|
|
+ \item vendor: GenuineIntel
|
|
|
+ \item cpu family: 6
|
|
|
+ \item model nr, name: 58, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
|
|
|
+ \item microcode: 0x15
|
|
|
+ \item MHz: 2280.874
|
|
|
+ \item cache size: 8192 KB
|
|
|
+ \item cpu cores: 4
|
|
|
+ \item fpu and fpu exception: yes
|
|
|
+ \item address sizes: 36 bits physical, 48 bits virtual
|
|
|
+\end{itemize}
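The last entry can also be extracted directly instead of by hand. This is a minimal sketch, assuming an \texttt{awk} implementation that treats an empty \texttt{RS} as paragraph mode, so that each blank-line-separated block of /proc/cpuinfo becomes one record; the function name \texttt{last\_cpu\_entry} is hypothetical.

```shell
#!/bin/sh
# last_cpu_entry [FILE]
# Hypothetical helper: prints the last blank-line-separated entry of FILE
# (default: /proc/cpuinfo), i.e. the block of the last logical processor.
last_cpu_entry() {
    awk -v RS= '{ last = $0 } END { print last }' "${1:-/proc/cpuinfo}"
}

# Example call on a Linux system:
#   last_cpu_entry
```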
|
|
|
+
|
|
|
+The full CPU specification can be found in the appendix.%todo finish
|
|
|
+
|
|
|
+% explanation on some entry: https://linuxwiki.de/proc/cpuinfo
|
|
|
+%\begin{em}
|
|
|
+%processor : 7
|
|
|
+%vendor\_id : GenuineIntel
|
|
|
+%cpu family : 6
|
|
|
+%model : 58
|
|
|
+%model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
|
|
|
+%stepping : 9
|
|
|
+%microcode : 0x15
|
|
|
+%cpu MHz : 2280.874
|
|
|
+%cache size : 8192 KB
|
|
|
+%physical id : 0
|
|
|
+%siblings : 8
|
|
|
+%core id : 3
|
|
|
+%cpu cores : 4
|
|
|
+%apicid : 7
|
|
|
+%initial apicid : 7
|
|
|
+%fpu : yes
|
|
|
+%fpu\_exception : yes
|
|
|
+%cpuid level : 13
|
|
|
+%wp : yes
|
|
|
+%flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant\_tsc arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds\_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4\_1 sse4\_2 x2apic popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand lahf\_lm cpuid\_fault epb pti tpr\_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
|
|
|
+%vmx flags : vnmi preemption\_timer invvpid ept\_x\_only flexpriority tsc\_offset vtpr mtf vapic ept vpid unrestricted\_guest
|
|
|
+%bugs : cpu\_meltdown spectre\_v1 spectre\_v2 spec\_store\_bypass l1tf mds swapgs itlb\_multihit srbds mmio\_unknown
|
|
|
+%bogomips : 6784.88
|
|
|
+%clflush size : 64
|
|
|
+%cache\_alignment : 64
|
|
|
+%address sizes : 36 bits physical, 48 bits virtual
|
|
|
+%power management:
|
|
|
+%\end{em}
|
|
|
+
|
|
|
+The installed \ac{RAM} offered a total of 16GB, provided by four 4GB modules.
|
|
|
+The specifications relevant to this work are listed below.
|
|
|
+\noindent Command used to list them:
|
|
|
+\begin{lstlisting}[language=bash]
|
|
|
+ dmidecode --type 17
|
|
|
+\end{lstlisting}
|
|
|
+
|
|
|
+\begin{itemize}
|
|
|
+ \item{Total/Data Width: 64 bits}
|
|
|
+ \item{Size: 4GB}
|
|
|
+ \item{Type: DDR3}
|
|
|
+ \item{Type Detail: Synchronous}
|
|
|
+ \item{Speed/Configured Memory Speed: 1600 Megatransfers/s}
|
|
|
+\end{itemize}
|
|
|
+
|
|
|
+%dmidecode --type 17
|
|
|
+% ...
|
|
|
+%Handle 0x0062, DMI type 17, 34 bytes
|
|
|
+%Memory Device
|
|
|
+% Array Handle: 0x0056
|
|
|
+% Error Information Handle: Not Provided
|
|
|
+% Total Width: 64 bits
|
|
|
+% Data Width: 64 bits
|
|
|
+% Size: 4 GB
|
|
|
+% Form Factor: DIMM
|
|
|
+% Set: None
|
|
|
+% Locator: DIMM B2
|
|
|
+% Bank Locator: BANK 3
|
|
|
+% Type: DDR3
|
|
|
+% Type Detail: Synchronous
|
|
|
+% Speed: 1600 MT/s
|
|
|
+% Manufacturer: Samsung
|
|
|
+% Serial Number: 148A8133
|
|
|
+% Asset Tag: 9876543210
|
|
|
+% Part Number: M378B5273CH0-CK0
|
|
|
+% Rank: 2
|
|
|
+% Configured Memory Speed: 1600 MT/s
|
|
|
+%
|
|
|
+
|
|
|
+\section{Operating System and Additionally Installed Packages}
|
|
|
+To keep the testing environment in a consistent state, background processes that are not specific to the project should be avoided.
|
|
|
+For the following reasons, a current Linux distribution was chosen as a suitable operating system:
|
|
|
+\begin{itemize}
|
|
|
+ \item{factors that interfere with a consistent efficiency value should be avoided}
|
|
|
+ \item{packages, support and user experience should be present to a reasonable amount}
|
|
|
+\end{itemize}
|
|
|
+Some background processes will run while the compression analysis is done. This is because an increasingly complex operating system is needed to execute complex programs. Considering that different tools will be executed in this environment, minimizing the background processes would require building a custom operating system or configuring an existing one to fit this specific use case. The time limit for this work rules out both alternatives.
|
|
|
+%By comparing the values of the explained factors, a sweet spot can be determined:
|
|
|
+% todo: add preinstalled package/programm count and other specs
|
|
|
+\textbf{Debian GNU/Linux} version \textbf{11} was chosen because it features enough packages to run every tool without spending too much time on the setup.\\
|
|
|
+The graphical user interface and most other optional packages were omitted. The only additional package added during the installation process is the SSH server package. Furthermore, a list of packages required by the compression tools was installed. Finally, some additional packages were installed to simplify work processes and to increase the safety of the environment.
|
|
|
+\begin{itemize}
|
|
|
+ \item{installation process: ssh-server}
|
|
|
+ \item{tool requirements: git, libhts-dev, autoconf, automake, cmake, make, gcc, perl, zlib1g-dev, libbz2-dev, liblzma-dev, libcurl4-gnutls-dev, libssl-dev, libncurses5-dev, libomp-dev}
|
|
|
+ \item{additional packages: ufw, rsync, screen, sudo}
|
|
|
+\end{itemize}
|
|
|
+
|
|
|
+A complete list of installed packages as well as their individual versions can be found in the appendix. % todo appendix
|
|
|
+
|
|
|
+%user@debian raw$\ cat /etc/os-release
|
|
|
+%PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
|
|
|
+%NAME="Debian GNU/Linux"
|
|
|
+%VERSION_ID="11"
|
|
|
+%VERSION="11 (bullseye)"
|
|
|
+%VERSION_CODENAME=bullseye
|
|
|
+%ID=debian
|
|
|
+%HOME_URL="https://www.debian.org/"
|
|
|
+%SUPPORT_URL="https://www.debian.org/support"
|
|
|
+%BUG_REPORT_URL="https://bugs.debian.org/"
|
|
|
+
|
|
|
+\section{Selection, Retrieval, and Preparation of Test Data}
|
|
|
+The following criteria are required for test data to be appropriate:
|
|
|
+\begin{itemize}
|
|
|
+ \item{The test file is in a format that all or at least most of the tools can work with.}
|
|
|
+ \item{The file is publicly available and free to use.}
|
|
|
+\end{itemize}
|
|
|
+Since there are multiple open \ac{FTP} servers that distribute a variety of files, finding a suitable one is rather easy. The Ensembl database met the defined criteria, so the first suitable files were chosen: Homo\_sapiens.GRCh38.dna.chromosome. This sample includes over 20 chromosomes; judging by the filenames, each chromosome is contained in a single file. After retrieving and unpacking the files, write privileges on them were withdrawn, so that no tool could alter any file contents.\\
|
|
|
+\noindent The following tools and parameters were used in this process:
|
|
|
+\begin{lstlisting}[language=bash]
|
|
|
+ $ wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
|
|
|
+ $ gzip -d ./*
|
|
|
+ $ chmod -w ./*
|
|
|
+\end{lstlisting}
|
|
|
+
|
|
|
+The chosen tools are able to handle the \ac{FASTA} format. However, some, like samtools, require converting \ac{FASTA} into another format such as \ac{SAM}.\\ Simply comparing the size is not sufficient; therefore, both files are temporarily stripped of metadata and formatting so that the raw data of both files can be compared.
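The stripping step can be sketched for two \ac{FASTA} files as follows. This is a minimal sketch under the assumption that header lines start with \texttt{>}, as is usual for \ac{FASTA}; it therefore uses \texttt{grep -v '\^{}>'} to drop them. The function names \texttt{raw\_sequence} and \texttt{same\_sequence} are hypothetical, and extracting the sequence from a \ac{SAM} file is not covered here.

```shell
#!/bin/sh
# raw_sequence FILE
# Hypothetical helper: prints the sequence of a FASTA file without
# header lines (those starting with '>') and without line breaks.
raw_sequence() {
    grep -v '^>' "$1" | tr -d '\n'
}

# same_sequence FILE_A FILE_B
# Hypothetical helper: succeeds only if both files contain the same raw
# sequence. Temporary files keep large chromosomes out of shell variables.
same_sequence() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    raw_sequence "$1" > "$a"
    raw_sequence "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Example call:
#   same_sequence source.fa restored.fa && echo "raw data identical"
```

Differences in header text or line width do not affect the result, since both are removed before the comparison.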
|
|
|
+
|
|
|
+% remove metadata: grep -E 'A|C|G|N' <sourcefile> > <destfile>
|
|
|
+% remove newlines: tr -d '\n'
|
|
|
+
|
|
|
+% convert just once. test for losslessness?
|
|
|
+% get testdata: wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{2,3,4,5,6,7,8,9,10}.fa.gz
|
|
|
+% unzip it: gzip -d ./*
|
|
|
+% withdraw write priv: chmod -w ./*
|
|
|
+
|
|
|
|
|
|
% first thoughts:
|
|
|
% - just save one nuceleotide every n bits
|
|
|
@@ -31,61 +197,3 @@
|
|
|
|
|
|
% - im falle von testdata: hetzer, dedizierter hardware, auf server compilen, specs aufschreiben -> 'lscpu' || 'cat /proc/cpuinfo'
|
|
|
|
|
|
-The first attempt to determine feasability of this project consists of setting basevalues, a further improvement can be meassured by. For this to be recreateable, a few specifications must be known:\\
|
|
|
-CPU Core information `cat /proc/cpuinfo`\\
|
|
|
-
|
|
|
-Output for the last core:\\
|
|
|
-
|
|
|
-processor : 15
|
|
|
-vendor\_id : AuthenticAMD
|
|
|
-cpu family : 23
|
|
|
-model : 1
|
|
|
-model name : AMD EPYC Processor (with IBPB)
|
|
|
-stepping : 2
|
|
|
-microcode : 0x1000065
|
|
|
-cpu MHz : 2400.000
|
|
|
-cache size : 512 KB
|
|
|
-physical id : 15
|
|
|
-siblings : 1
|
|
|
-core id : 0
|
|
|
-cpu cores : 1
|
|
|
-apicid : 15
|
|
|
-initial apicid : 15
|
|
|
-fpu : yes
|
|
|
-fpu\_exception : yes
|
|
|
-cpuid level : 13
|
|
|
-wp : yes
|
|
|
-flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr\_opt pdpe1gb rdtscp lm rep\_good nopl cpuid extd\_apicid tsc\_known\_freq pni pclmulqdq ssse3 fma cx16 sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f16c rdrand hypervisor lahf\_lm cmp\_legacy cr8\_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr\_core ssbd ibpb vmmcall fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha\_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt\_ssbd arat arch\_capabilities
|
|
|
-bugs : sysret\_ss\_attrs null\_seg spectre\_v1 spectre\_v2 spec\_store\_bypass
|
|
|
-bogomips : 4800.00
|
|
|
-TLB size : 1024 4K pages
|
|
|
-clflush size : 64
|
|
|
-cache\_alignment : 64
|
|
|
-address sizes : 48 bits physical, 48 bits virtual
|
|
|
-power management:\\
|
|
|
-Memory capacity (and more todo list):
|
|
|
-`dmidecode --type memory` || `dmidecode --type 17`
|
|
|
-\section{Pool of Tools}
|
|
|
-For an initial test, a small pool of three tools was choosen.
|
|
|
-\begin{itemize}
|
|
|
- \item Samtools
|
|
|
- \item GeCo
|
|
|
- \item genie
|
|
|
-\end{itemize}
|
|
|
-Each of this tools comply with the criteria choosen in \autoref{chap:filetypes}.\\
|
|
|
-To test each tool, the same set of data were used. The genome of a homo sapien id: GRCh38 were chosen due to its size TODO: find more exact criteria for testdata.
|
|
|
-The Testdata is available via an open FTP Server, hotsed by ensembl. Source:\url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
|
|
|
-Testparameters that were focused on:
|
|
|
-\begin{itemize}
|
|
|
- \item Efficiency: \textbf{Duration} the Process had run for
|
|
|
- \item Effectivity: The difference in \textbf{Size} between input and compressed data
|
|
|
- \item todo: fehlerquote!
|
|
|
-\end{itemize}
|
|
|
-First was captured by
|
|
|
-TODO choose:
|
|
|
-- a linux tool to output the exact runtime (time <cmd>)
|
|
|
-- a alteration in the c code that outputs the time at start and end of the process runtime.
|
|
|
-\section{Installation}
|
|
|
-\section{Alteration of Code to Determine Runtime}
|
|
|
-\section{Execution}
|
|
|
-\section{Data analysis}
|