Opened 2 years ago

Last modified 7 months ago

#6180 closed change

[emscripten] consider using UTF-8 internally — at Initial Version

Reported by: hfiguiere
Assignee:
Priority: Unknown
Milestone:
Module: Core
Keywords: closed-in-favor-of-gitlab
Cc: sergz, rjeschke
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: no
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29721753/
https://codereview.adblockplus.org/29723641/

Description

Background

Currently we store strings in UCS-2 because that's how JavaScript represents them. This is inefficient, and unnecessary, for several reasons:

  1. We download the filter list and convert it to UCS-2.
  2. We then convert the filter list from UCS-2 back to UTF-8 to calculate the MD5 checksum.
  3. Statistics show that filters, even in locales like Chinese, are mostly ASCII, which means we use twice as much memory as necessary.
  4. We don't perform any operations that would be slow with a multi-byte encoding, so switching to UTF-8 would not cost us performance there.
  5. UCS-2 cannot represent characters outside the Basic Multilingual Plane, so it doesn't even provide the character space needed for proper internationalization. We'd have to use UCS-4 (4 bytes per character) or UTF-16.
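
To illustrate the memory argument in point 3, here is a minimal sketch that compares the size of a string held as 16-bit code units with the size of the same text encoded as UTF-8. The function names Utf16ByteSize and Utf8ByteSize are hypothetical and not part of the existing code base; std::u16string merely stands in for the current String class.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes used when the text is held as 16-bit code units.
std::size_t Utf16ByteSize(const std::u16string& s)
{
  return s.size() * sizeof(char16_t);
}

// Hypothetical helper: bytes the same text would need once encoded as UTF-8.
// An unpaired surrogate is counted as 3 bytes, as if replaced by U+FFFD.
std::size_t Utf8ByteSize(const std::u16string& s)
{
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    char16_t c = s[i];
    if (c < 0x80)
      bytes += 1;  // ASCII
    else if (c < 0x800)
      bytes += 2;
    else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size() &&
             s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
    {
      bytes += 4;  // surrogate pair: one code point outside the BMP
      ++i;
    }
    else
      bytes += 3;  // rest of the BMP
  }
  return bytes;
}
```

For a pure ASCII filter such as ||example.com^$third-party, Utf8ByteSize returns exactly half of Utf16ByteSize. A CJK character goes the other way (3 bytes instead of 2), so the saving depends on filters being mostly ASCII, as the statistics suggest.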

What to change

  • Change the String classes to use UTF-8.
  • Change all other code that handles strings to use UTF-8.
  • Change the bindings to convert to and from JavaScript strings where needed. The more code we move to C++, the less conversion will happen.
  • Change the downloader to retrieve the raw bytes rather than text.
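
The conversion the bindings would need when a JavaScript string enters C++ could look roughly like the sketch below. It assumes the JavaScript side hands over 16-bit code units. The name Utf16ToUtf8 and the use of std::u16string and std::string are assumptions for illustration, not the existing String API. Unpaired surrogates are replaced with U+FFFD, which is one possible policy among several.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode 16-bit code units (as JavaScript stores them)
// as UTF-8. Unpaired surrogates become U+FFFD (an assumed policy).
std::string Utf16ToUtf8(const std::u16string& in)
{
  std::string out;
  out.reserve(in.size());  // exact for ASCII, a lower bound otherwise
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    std::uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF)
    {
      // High surrogate: combine with the following low surrogate if present.
      if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
      {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        ++i;
      }
      else
      {
        cp = 0xFFFD;
      }
    }
    else if (cp >= 0xDC00 && cp <= 0xDFFF)
    {
      cp = 0xFFFD;  // low surrogate without a preceding high surrogate
    }

    if (cp < 0x80)
    {
      out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

If the downloader delivers raw bytes, a filter list that is already UTF-8 would never pass through this function. Only strings that originate in JavaScript would need converting.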

