Opened 2 years ago

Closed 3 months ago

#5525 closed change (rejected)

[emscripten] Consider using Robin Hood hashing

Reported by: trev Assignee:
Priority: P3 Milestone:
Module: Core Keywords: closed-in-favor-of-gitlab
Cc: sergz Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by trev)

Background

Our current hashing approach tends to produce lengthy lookup chains, around 70 elements in extreme cases. This means suboptimal lookup performance.

There are some articles on Robin Hood hashing which might help here:

These explain a number of possible approaches to improve performance:

  • actual Robin Hood hashing (when inserting, moving existing elements to ensure a "fair" distribution)
  • linear probing (we are currently using the more complicated quadratic probing)
  • backward shift deletion instead of "tombstone" entries
  • upper limit on the probe count, which allows removing boundary checks on lookup
  • prime numbers rather than powers of two for the number of slots

What to change

We need to test which of the changes are actually beneficial in our case. Expected effects:

  • Robin Hood hashing requires storing an additional value for each element, either the key hash or the element's distance from its ideal position. Both require 4 additional bytes per hash bucket. On the other hand, higher load factors should be possible so we might still see a win in terms of memory used. Also, storing the key hash, while requiring to recalculate the element's distance, can speed up lookups (we can compare hashes before doing slow string comparison on the key).
  • Insertion performance decreases slightly, which will affect startup times. It is to be measured whether the difference is relevant.
  • Linear probing is more efficient than quadratic probing, both in terms of calculation complexity and memory access locality. It tends to create longer lookup chains however, something that should be alleviated by Robin Hood hashing.
  • Deletions are probably not common enough in our case to justify optimizing.
  • Removing boundary checks on lookup sounds like a nice optimization, so overallocating slightly might actually be worth it.
  • It's doubtful that the Emscripten compiler will optimize the modulo operation for prime numbers. Then again, so far we didn't even bother optimizing this operation for powers of two. Powers of two have the additional advantage of reducing memory fragmentation however, which should be more important than the performance benefits in our case.

Change History (2)

comment:1 Changed 2 years ago by trev

  • Description modified (diff)

comment:2 Changed 3 months ago by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Note: See TracTickets for help on using tickets.