Skip to content

Latest commit

 

History

History
284 lines (192 loc) · 12.8 KB

File metadata and controls

284 lines (192 loc) · 12.8 KB

HashTable

Project of searching and optimizing hash functions & hash table.

Table of Contents

Installation

To start program you need to use Makefile and run program

  1. Clone repository
  2. Run Makefile (compile program), write command in main directory in repo
  3. Run program(before you need to create this file)

Configure Makefile to change hash function or data

make all will do everything, set TEST,SIZE,NAME_HASH to manipulate tests and CFLAGS, PERF, PERF_FLAGS to perf profiling

git clone https://github.com/khmelnitskiianton/HashTable.git
cd ./HashTable
make init    #create folders for temporary objects
make all     #for histograms
make test    #for perf test

Extra programs

GCC compilier, Makefile for collection, Python & Seaborn to draw histograms.

sudo apt update && sudo apt upgrade                           #update
sudo apt install build-essential                              #gcc
sudo apt install make                                         #makefile
sudo apt install python3 -y                                   #python
python -m pip install --upgrade pip                           #pip
pip install seaborn                                           #seaborn
sudo apt-get install linux-tools-common linux-tools-generic   #perf
sudo apt-get install hotspot                                  #hotspot

Description

In this project I search diffrent hash functions for uniformity and optimize hash table.

In first part I take many hash functions and search them for homogenious distribution. Data base consists of words from William Shakespeare. The Tragedy Of Hamlet, Prince Of Denmark (5k unique words).

Hash Table bases on double linked list. Element has hash, key, value.

int         HT_Ctor (HashTable_t* myHashTable, size_t Size, size_t (*HashFunction) (HT_Key_t));
int         HT_Dtor (HashTable_t* myHashTable);
int         HT_Add  (HashTable_t* myHashTable, HT_Key_t Key, HT_Value_t Value);
DLL_Node_t* HT_Find (HashTable_t* myHashTable, HT_Key_t Key, HT_Value_t Value);

In second part I optimizes speed of functions using aligning, SIMD and assembler inserts.

First Part

In first part we searching diffrent hash functions & load factors.

Functions:

  1. Hash returns 0.
  2. Hash that returns ascii code of word first letter.
  3. Hash returns length of word.
  4. Hash returns sum of ascii codes of letters.
  5. Hash returns (sum of ascii codes of letters)/(length of word).
  6. ROL Hash. hash[i] = ROL(hash[i-1]) xor str[i]
  7. ROR Hash. hash[i] = ROR(hash[i-1]) xor str[i]
  8. CRC32 Hash.
  9. ElfHash

For searching uniformity I plot histograms AmountOfCollusions(HashIndex). Diagrams show distribution of collusions in current hash function, there are spikes in some functions that affect on speed of working with hash table.

Test size of hash table for big tests was prime number!

Check theory in wiki

Histograms

  1. 0 Hash : Size: 10, Max Collusion: 4788.
  1. First Letter Hash : Size: 128, Max Collusion: 461.
  1. Length Word Hash : Size 30, Max Collustion: 929.
  1. Sum of letters Hash : Size 1500, Max Collusion: 31.
  1. (Sum of letters)/Length Hash : Size 179, Max Collusion: 542.
  1. ROR Hash : Size 6007, Max Collusion: 27.
  1. ROL Hash : Size 6007, Max Collusion: 9.
  1. CRC32 Hash : Size 6007, Max Collusion: 5.
  1. ElfHash : Size 6007, Max Collusion: 6.

Uniformity

Uniformity of the distribution of hash values can be evaluated by the chi-squared test.

Theory source: chi-squred test for hash

This test is a goodness-of-fit measure: it's the actual distribution of items in buckets versus the expected (or uniform) distribution of items.

After processing all function I plot histogram of chi-squared test:

After analysing we can see if hash function's chi tends to $\frac{m}{n} = 0.79$ is better in uniformity, its distribution is more homogeneous. Best in uniformity functions is CRC32, ElfHash and ROL Hash.

Second Part

System:

  • Linux Mint 21.3 Cinnamon
  • 12th Gen Intel Core i5-12450H x 8, CPU Temperature: 50 $^\circ C$
  • GCC x86-64 -O3 -msse4.1 -msse4.2 -mavx2 -mavx

I want to speed up my hash table. So I do stress tests: load big texts and finding some keys many times. After finding weak point I try to optimize it with help of SIMD, ASM inserts.

Results of profiling are calculated by Perf tool and vizualized by HotSpot.

I find in loop unknown 512 times and all words 512 times in StressTest().

I use Guide Perf to profile my hash table. After see console version Perf I decide to work in HotSpot where information is vizualized in application with hierarchy.

Console version:

HotSpot version:

Analysing Profilier: (Size=6007, Hash: Elf Hash)

I don't optimize functions like InsertData and Dtor/Ctor because they are single and use specific functions to work with files.

That's why most weak points are HT_Find(), DLL_Find(), DLL_Compare(), ElfHash, strcmp(),

I check time of working with hash table with __rdtsc()

First time of stress test - 3056114162 ticks

Ticks aren't fixed because cpu frequency isn't constant. That's why I have deviation $\pm 5$ %

Optimization

  1. Inline Optimization:

First I decide to make search functions inline because this functions are called everytime but its body small, thats why it waste time only on calling. So I put this functions in headers and use inline

GCC does not embed any functions when optimization is not being done, except if you specify the always_inline attribute for the function inline __attribute__((always_inline)). That's why if you work with -O0 GCC will not done insertion with only inline (I find out it in profilier).

New time of stress test - 3006040316 ticks. Inlining has a little boost in speed, because program doesn't waste time on calls.

  1. Hash Optimization:

First I optimized hash by rewriting ElfHash on asm. It gives zero effect on workload.

That's why I change Elf Hash for CRC32 Hash.

First version is dry with many cycles to process table.

Second version add const table of polynom, speed equal to Elf Hash.

Third version has SSE intrinsic CRC32 _mm_crc32_u8 (crc, char), 2727707675 ticks:

size_t crc = 0xFFFFFFFFUL;
for (size_t i = 0; i < length; i++)
    crc = _mm_crc32_u8 (crc, str[i]); 
return crc ^ 0xFFFFFFFFUL;

I try to do it on assembler with asm():

asm(
    ".intel_syntax noprefix                     \n"
    "   add     %[len], %[str]                  \n"
    "   mov     edx, 4294967295                 \n"   
    ".for_loop:                                 \n"  
    "   mov     %[crc], rdx                     \n"  
    "   add     %[str], 1                       \n" 
    "   crc32   %[crc], byte ptr [%[str]-1]     \n" 
    "   mov     rdx, %[crc]                     \n"
    "   cmp     %[len], %[str]                  \n"
    "   jne     .for_loop                       \n"        
    "   not     %[crc]                          \n" 
    ".att_syntax noprefix                       \n"
    : [crc] "=r" (crc)
    : [str] "r"  (str), [len] "r"  (length)
);

It has educational aim. Diffrence between writing with _mm_crc32_u8() or with asm() is zero.

Also I tried to write with _mm_crc32_u64():

size_t crc = 0xFFFFFFFFUL;    
crc = _mm_crc32_u64 (crc, *((uint64_t*) str));
crc = _mm_crc32_u64 (crc, *(((uint64_t*) str)+1));
return crc ^ 0xFFFFFFFFUL; 

First result is strange, time is more than 2 times longer. I search reason of this (because 2 repetitions couldn't be longer than loop).Then I found out that aligning plays an important role, because SIMD instructions depend on cash, and buffer that upload to cash depends on address position(article).

Spoiler: best speed is achieved when I choose setup _mm_crc32_u64() + aligned_alloc() then _mm_crc32_u8() + calloc()

That's why first action is align buffer that I get from file of words. I use aligned_alloc(ALIGNING, bytes) + memset() than copy my words from text buffer to new, aligned buffer and work with it.

New time of stress test - 2196783765 ticks finally result from changing Elf Hash to CRC32 and optimized it with intrinsic.

I choose optimal aligning of 16 bytes. If I use more it will be no boost. If I use 8 or less it will be delaying. So I confirmed results of article.

  1. STRCMP Optimization:

After all optimizations the most workload process is strcmp(). I use AVX instructions.

Intrinsics:

In my dictionary longest word has 14 symbols, that's why I use for 16byte words __m128

__m128i str1 = _mm_load_si128((const __m128i *) (val1.Key)); 
__m128i str2 = _mm_load_si128((const __m128i *) (val2.Key));
__m128i cmp  = _mm_cmpeq_epi8 (str1, str2);
int result   = _mm_movemask_epi8 (cmp);
return ((result == 0xFFFF) && (val1.Value == val2.Value));

_mm_load_si128() work faster than another load functions, but it needs aligning 16 bytes, without it it will be load of misaligned address aka SegFault

New time of stress test - 1894493994 ticks

Final report of Perf:

Results

After all optimizations I get boost 1.8x - 1.9x (depends on current speed of my cpu)! Its incrediable, I outrun GCC optimization -O3 almost by 1.9 times!

Optimization Ticks Boost(compare with begining)
Start with -O3 3606891139 1.00x
Inlining 3006040316 1.20x
Switch to Crc32 2727707675 1.32x
Aligning 2636887006 1.36x
Vectorization Crc32 2196783765 1.64x
Vectorization strcmp() 1894493994 1.90x

In this project I search hash functions, use profilier (Perf & HotSpot) to see weak points in my program, then I use inlining, SIMD instructions, aligning and ASM inserts to assess the impact on time of working of hash table.

$DedInsideCoeff = \frac{boost}{amount \space asm \space strings} \cdot 100 = \frac{190}{12} = 15.8$!