%SUMMARY
%- ABSTRACT
%- INTRODUCTION
%# BASICS
%- \acs{DNA} STRUCTURE
%- DATA TYPES
% - BAM/FASTQ
% - NON STANDARD
%- COMPRESSION APPROACHES
% - SAVING DIFFERENCES WITH GIVEN BASE \acs{DNA}
% - HUFFMAN ENCODING
% - PROBABILITY APPROACHES (WITH BASE?)
%
%# COMPARING TOOLS
%-
%# POSSIBLE IMPROVEMENT
%- \acs{DNA}S STOCHASTICAL ATTRIBUTES
%- IMPACT ON COMPRESSION
%\chapter{Analysis for Possible Compression Improvements}
\chapter{Feasibility Analysis for New Algorithm Considering Stochastic Organisation of Genomes}
% first thoughts:
% - just save one nucleotide every n bits
% - save checksum for whole genome
% - use algorithms (from new discoveries) to recreate genome
% - check checksum -> finished : retry
% - can run recursively and threaded
% - for test data: hetzer, dedicated hardware, compile on server, write down specs -> 'lscpu' || 'cat /proc/cpuinfo'
The first step in determining the feasibility of this project is to establish baseline values against which any further improvement can be measured. For these measurements to be reproducible, the specifications of the test system must be known:\\
CPU core information, obtained with \verb|cat /proc/cpuinfo|.\\
Output for the last core:\\
\begin{verbatim}
processor : 15
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC Processor (with IBPB)
stepping : 2
microcode : 0x1000065
cpu MHz : 2400.000
cache size : 512 KB
physical id : 15
siblings : 1
core id : 0
cpu cores : 1
apicid : 15
initial apicid : 15
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat arch_capabilities
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4800.00
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
\end{verbatim}
Memory capacity (TODO: extend this list):\\
\verb|dmidecode --type memory| or \verb|dmidecode --type 17|
\section{Pool of Tools}
For an initial test, a small pool of three tools was chosen.
\begin{itemize}
\item Samtools
\item GeCo
\item genie
\end{itemize}
Each of these tools complies with the criteria chosen in \autoref{chap:filetypes}.\\
To test each tool, the same data set was used: the genome of \textit{Homo sapiens} (ID: GRCh38), which was chosen for its size. TODO: find more exact criteria for test data.
The test data is available via an open FTP server hosted by Ensembl. Source: \url{http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/}\\
The tests focused on the following parameters:
\begin{itemize}
\item Efficiency: the \textbf{duration} for which the process ran
\item Effectiveness: the difference in \textbf{size} between the input and the compressed data
\item TODO: error rate!
\end{itemize}
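The size criterion above can be made concrete with a small helper. The following is only a minimal sketch in C, assuming a POSIX system; the function names \texttt{file\_size} and \texttt{space\_saving} are hypothetical and not part of any of the tested tools. It reads the size of a file in bytes and expresses the difference between input and compressed size as the fraction of the input that was saved.

```c
#include <sys/stat.h>

/* Hypothetical helper: size of `path` in bytes, or -1 if it cannot be inspected. */
long long file_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        return -1;
    }
    return (long long)st.st_size;
}

/* Hypothetical helper: fraction of the input size saved by compression.
 * A value of 0.75 means the compressed file is a quarter of the input size.
 * Returns 0.0 for an empty or invalid input size to avoid dividing by zero. */
double space_saving(long long input_bytes, long long compressed_bytes)
{
    if (input_bytes <= 0) {
        return 0.0;
    }
    return 1.0 - (double)compressed_bytes / (double)input_bytes;
}
```

A call such as \texttt{space\_saving(file\_size(in), file\_size(out))} would then yield one comparable number per tool. Reporting a relative value rather than the absolute difference in bytes is a design choice made here so that results stay comparable should the test data change.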
The first parameter was captured by\\
TODO choose:
\begin{itemize}
\item a Linux tool that outputs the exact runtime (\verb|time <cmd>|)
\item an alteration in the C code that outputs the time at the start and end of the process runtime.
\end{itemize}
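For the second option, a possible shape of such an alteration is sketched below. This is only an illustration under assumptions, not the change actually made to any tool: \texttt{time\_call} and \texttt{elapsed\_seconds} are hypothetical names, and the sketch assumes a POSIX system that provides \texttt{clock\_gettime} with \texttt{CLOCK\_MONOTONIC}. It takes one reading before and one after the work to be measured and prints the difference to standard error.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Seconds elapsed between two clock readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical wrapper: runs `work` once, prints its wall-clock
 * duration to stderr and returns it in seconds. */
double time_call(void (*work)(void))
{
    struct timespec start, end;

    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start); /* reading at start */
    work();
    clock_gettime(CLOCK_MONOTONIC, &end);   /* reading at end */

    double secs = elapsed_seconds(start, end);
    fprintf(stderr, "runtime: %.6f s\n", secs);
    return secs;
}
```

In a real tool, the two readings would presumably be placed at the beginning and end of \texttt{main} instead of around a function pointer. A monotonic clock is used in this sketch because it is not affected by adjustments of the system time during a long compression run.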
\section{Installation}
\section{Alteration of Code to Determine Runtime}
\section{Execution}
\section{Data analysis}