Opened 21 months ago

Last modified 21 months ago

#5525 new change

[emscripten] Consider using Robin Hood hashing

Reported by: trev Assignee:
Priority: P3 Milestone:
Module: Core Keywords:
Cc: sergz Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by trev)


Our current hashing approach tends to produce lengthy lookup chains, around 70 elements in extreme cases. This means suboptimal lookup performance.

There are some articles on Robin Hood hashing which might help here:

These explain a number of possible approaches to improve performance:

  • actual Robin Hood hashing (when inserting, moving existing elements to ensure a "fair" distribution)
  • linear probing (we are currently using the more complicated quadratic probing)
  • backward shift deletion instead of "tombstone" entries
  • upper limit on the probe count, which allows removing boundary checks on lookup
  • prime numbers rather than powers of two for the number of slots

What to change

We need to test which of the changes are actually beneficial in our case. Expected effects:

  • Robin Hood hashing requires storing an additional value for each element, either the key hash or the element's distance from its ideal position. Both require 4 additional bytes per hash bucket. On the other hand, higher load factors should be possible so we might still see a win in terms of memory used. Also, storing the key hash, while requiring to recalculate the element's distance, can speed up lookups (we can compare hashes before doing slow string comparison on the key).
  • Insertion performance decreases slightly, which will affect startup times. It is to be measured whether the difference is relevant.
  • Linear probing is more efficient than quadratic probing, both in terms of calculation complexity and memory access locality. It tends to create longer lookup chains however, something that should be alleviated by Robin Hood hashing.
  • Deletions are probably not common enough in our case to justify optimizing.
  • Removing boundary checks on lookup sounds like a nice optimization, so overallocating slightly might actually be worth it.
  • It's doubtful that the Emscripten compiler will optimize the modulo operation for prime numbers. Then again, so far we didn't even bother optimizing this operation for powers of two. Powers of two have the additional advantage of reducing memory fragmentation however, which should be more important than the performance benefits in our case.

Change History (1)

comment:1 Changed 21 months ago by trev

  • Description modified (diff)
Note: See TracTickets for help on using tickets.