Opened 3 years ago

Closed 12 months ago

#6180 closed change (rejected)

[emscripten] consider using UTF-8 internally

Reported by: hfiguiere Assignee:
Priority: Unknown Milestone:
Module: Core Keywords: closed-in-favor-of-gitlab
Cc: sergz, rjeschke Blocked By:
Blocking: #6452 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by sergz)


Currently we have strings in UCS-2 because that's how JavaScript has them. It is inefficient for several reasons:

  1. we download the filter list and convert it to UCS-2
  2. we convert the filter list from UCS-2 to UTF-8 to calculate the md5 checksum
  3. statistics show that filters, even in locales like Chinese, are mostly ASCII, which mean that we use twice the memory space than necessary.
  4. we don't perform operations that would be slow with a multi-byte charset.
  5. UCS-2 doesn't even allow the proper character space for internationalization. We'd have to use UCS-4 (4 bytes) or UTF-16.

What to change

  • Change the String classes to use UTF-8
    • pay attention to String::toLower and check that there is no deviations from it.
  • Change everywhere else to use UTF-8.
  • Change the bindings to convert where needed into JavaScript. The more C++ we use the less conversion will happen.
  • Change the downloader to actually retrieve the bytes and not text.

Change History (7)

comment:1 Changed 3 years ago by fhd

  • Cc trev removed

comment:2 Changed 2 years ago by sergz

  • Cc rjeschke added
  • Owner set to sergz

I'm working on ABP_TEXT macro and choosing at compile time the type of internally stored value of String (let's see whether it will be enough).

comment:3 Changed 2 years ago by sergz

  • Blocking 6452 added
  • Description modified (diff)
  • Owner sergz deleted
  • Review URL(s) modified (diff)

It does not accomplish the issue, the added codereview merely allows String to be an UTF-8 string internally (it's even not complete because toLower is not properly implemented). So far it will be useful for #6452 - Use INI parser natively from emscripten branch of adblockpluscore.

comment:5 Changed 2 years ago by hfiguiere

  • Review URL(s) modified (diff)

comment:6 Changed 2 years ago by abpbot

A commit referencing this issue has landed:
Issue 6180 - Add a few missing ABP_TEXT() macros

comment:7 Changed 12 months ago by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Note: See TracTickets for help on using tickets.