#5525 closed change (rejected)

[emscripten] Consider using Robin Hood hashing

Reported by:	trev	Assignee:
Priority:	P3	Milestone:
Module:	Core	Keywords:	closed-in-favor-of-gitlab
Cc:	sergz	Blocked By:
Blocking:		Platform:	Unknown / Cross platform
Ready:	yes	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):

Description (last modified by trev)

Background

Our current hashing approach tends to produce lengthy lookup chains, around 70 elements in extreme cases. This means suboptimal lookup performance.

There are some articles on Robin Hood hashing which might help here:

These explain a number of possible approaches to improve performance:

actual Robin Hood hashing (when inserting, moving existing elements to ensure a "fair" distribution)
linear probing (we are currently using the more complicated quadratic probing)
backward shift deletion instead of "tombstone" entries
upper limit on the probe count, which allows removing boundary checks on lookup
prime numbers rather than powers of two for the number of slots

What to change

We need to test which of the changes are actually beneficial in our case. Expected effects:

Robin Hood hashing requires storing an additional value for each element, either the key hash or the element's distance from its ideal position. Both require 4 additional bytes per hash bucket. On the other hand, higher load factors should be possible so we might still see a win in terms of memory used. Also, storing the key hash, while requiring to recalculate the element's distance, can speed up lookups (we can compare hashes before doing slow string comparison on the key).
Insertion performance decreases slightly, which will affect startup times. It is to be measured whether the difference is relevant.
Linear probing is more efficient than quadratic probing, both in terms of calculation complexity and memory access locality. It tends to create longer lookup chains however, something that should be alleviated by Robin Hood hashing.
Deletions are probably not common enough in our case to justify optimizing.
Removing boundary checks on lookup sounds like a nice optimization, so overallocating slightly might actually be worth it.
It's doubtful that the Emscripten compiler will optimize the modulo operation for prime numbers. Then again, so far we didn't even bother optimizing this operation for powers of two. Powers of two have the additional advantage of reducing memory fragmentation however, which should be more important than the performance benefits in our case.

Summary:

Description:

You may use WikiFormatting here.

=== Background ===
Our current hashing approach tends to produce lengthy lookup chains, around 70 elements in extreme cases. This means suboptimal lookup performance.

There are some articles on Robin Hood hashing which might help here:

* https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
* http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
* https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/

These explain a number of possible approaches to improve performance:

* actual Robin Hood hashing (when inserting, moving existing elements to ensure a "fair" distribution)
* linear probing (we are currently using the more complicated quadratic probing)
* backward shift deletion instead of "tombstone" entries
* upper limit on the probe count, which allows removing boundary checks on lookup
* prime numbers rather than powers of two for the number of slots

=== What to change ===
We need to test which of the changes are actually beneficial in our case. Expected effects:

* Robin Hood hashing requires storing an additional value for each element, either the key hash or the element's distance from its ideal position. Both require 4 additional bytes per hash bucket. On the other hand, higher load factors should be possible so we might still see a win in terms of memory used. Also, storing the key hash, while requiring to recalculate the element's distance, can speed up lookups (we can compare hashes before doing slow string comparison on the key).
* Insertion performance decreases slightly, which will affect startup times. It is to be measured whether the difference is relevant.
* Linear probing is more efficient than quadratic probing, both in terms of calculation complexity and memory access locality. It tends to create longer lookup chains however, something that should be alleviated by Robin Hood hashing.
* Deletions are probably not common enough in our case to justify optimizing.
* Removing boundary checks on lookup sounds like a nice optimization, so overallocating slightly might actually be worth it.
* It's doubtful that the Emscripten compiler will optimize the modulo operation for prime numbers. Then again, so far we didn't even bother optimizing this operation for powers of two. Powers of two have the additional advantage of reducing memory fragmentation however, which should be more important than the performance benefits in our case.

Type:

Priority:

Milestone:

Module:

Keywords:

Cc:

Blocked By:

Blocking:

Platform:

Ready:

Confidential:

Tester:

Verified working:

Review URL(s):

Context Navigation

#5525 closed change (rejected)

[emscripten] Consider using Robin Hood hashing

Description (last modified by trev)

Background

What to change

Attachments (0)

Change History (2)

comment:1 Changed on 08/17/2017 at 11:40:12 AM by trev

comment:2 Changed on 08/29/2019 at 05:43:52 PM by sebastian

Add Comment

Modify Ticket

Changed by kzar

Download in other formats:

Summary:
Description:	You may use WikiFormatting here. === Background === Our current hashing approach tends to produce lengthy lookup chains, around 70 elements in extreme cases. This means suboptimal lookup performance. There are some articles on Robin Hood hashing which might help here: * https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/ * http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/ * https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ These explain a number of possible approaches to improve performance: * actual Robin Hood hashing (when inserting, moving existing elements to ensure a "fair" distribution) * linear probing (we are currently using the more complicated quadratic probing) * backward shift deletion instead of "tombstone" entries * upper limit on the probe count, which allows removing boundary checks on lookup * prime numbers rather than powers of two for the number of slots === What to change === We need to test which of the changes are actually beneficial in our case. Expected effects: * Robin Hood hashing requires storing an additional value for each element, either the key hash or the element's distance from its ideal position. Both require 4 additional bytes per hash bucket. On the other hand, higher load factors should be possible so we might still see a win in terms of memory used. Also, storing the key hash, while requiring to recalculate the element's distance, can speed up lookups (we can compare hashes before doing slow string comparison on the key). * Insertion performance decreases slightly, which will affect startup times. It is to be measured whether the difference is relevant. * Linear probing is more efficient than quadratic probing, both in terms of calculation complexity and memory access locality. It tends to create longer lookup chains however, something that should be alleviated by Robin Hood hashing. * Deletions are probably not common enough in our case to justify optimizing. * Removing boundary checks on lookup sounds like a nice optimization, so overallocating slightly might actually be worth it. * It's doubtful that the Emscripten compiler will optimize the modulo operation for prime numbers. Then again, so far we didn't even bother optimizing this operation for powers of two. Powers of two have the additional advantage of reducing memory fragmentation however, which should be more important than the performance benefits in our case.
Type:		Priority:
Milestone:		Module:
Keywords:		Cc:
Blocked By:		Blocking:
Platform:		Ready:
Confidential:		Tester:
Verified working:
Review URL(s):