Opened 15 months ago

Last modified 11 months ago

#6180 new change

[emscripten] consider using UTF-8 internally

Reported by: hfiguiere Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: sergz, rjeschke Blocked By:
Blocking: #6452 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29721753/
https://codereview.adblockplus.org/29723641/

Description (last modified by sergz)

Background

Currently we have strings in UCS-2 because that's how JavaScript has them. It is inefficient for several reasons:

  1. we download the filter list and convert it to UCS-2
  2. we convert the filter list from UCS-2 to UTF-8 to calculate the md5 checksum
  3. statistics show that filters, even in locales like Chinese, are mostly ASCII, which mean that we use twice the memory space than necessary.
  4. we don't perform operations that would be slow with a multi-byte charset.
  5. UCS-2 doesn't even allow the proper character space for internationalization. We'd have to use UCS-4 (4 bytes) or UTF-16.

What to change

  • Change the String classes to use UTF-8
    • pay attention to String::toLower and check that there is no deviations from it.
  • Change everywhere else to use UTF-8.
  • Change the bindings to convert where needed into JavaScript. The more C++ we use the less conversion will happen.
  • Change the downloader to actually retrieve the bytes and not text.

Change History (6)

comment:1 Changed 14 months ago by fhd

  • Cc trev removed

comment:2 Changed 11 months ago by sergz

  • Cc rjeschke added
  • Owner set to sergz

I'm working on ABP_TEXT macro and choosing at compile time the type of internally stored value of String (let's see whether it will be enough).

comment:3 Changed 11 months ago by sergz

  • Blocking 6452 added
  • Description modified (diff)
  • Owner sergz deleted
  • Review URL(s) modified (diff)

It does not accomplish the issue, the added codereview merely allows String to be an UTF-8 string internally (it's even not complete because toLower is not properly implemented). So far it will be useful for #6452 - Use INI parser natively from emscripten branch of adblockpluscore.

comment:5 Changed 11 months ago by hfiguiere

  • Review URL(s) modified (diff)

comment:6 Changed 11 months ago by abpbot

A commit referencing this issue has landed:
Issue 6180 - Add a few missing ABP_TEXT() macros

Note: See TracTickets for help on using tickets.