Opened on 12/08/2017 at 03:33:28 PM
Closed on 08/29/2019 at 05:43:52 PM
#6180 closed change (rejected)
[emscripten] consider using UTF-8 internally
Reported by: | hfiguiere | Assignee: | |
---|---|---|---|
Priority: | Unknown | Milestone: | |
Module: | Core | Keywords: | closed-in-favor-of-gitlab |
Cc: | sergz, rjeschke | Blocked By: | |
Blocking: | #6452 | Platform: | Unknown / Cross platform |
Ready: | no | Confidential: | no |
Tester: | Unknown | Verified working: | no |
Review URL(s): |
https://codereview.adblockplus.org/29721753/ |
Description (last modified by sergz)
Background
Currently we have strings in UCS-2 because that's how JavaScript has them. It is inefficient for several reasons:
- we download the filter list and convert it to UCS-2
- we convert the filter list from UCS-2 to UTF-8 to calculate the md5 checksum
- statistics show that filters, even in locales like Chinese, are mostly ASCII, which mean that we use twice the memory space than necessary.
- we don't perform operations that would be slow with a multi-byte charset.
- UCS-2 doesn't even allow the proper character space for internationalization. We'd have to use UCS-4 (4 bytes) or UTF-16.
What to change
- Change the String classes to use UTF-8
- pay attention to String::toLower and check that there is no deviations from it.
- Change everywhere else to use UTF-8.
- Change the bindings to convert where needed into JavaScript. The more C++ we use the less conversion will happen.
- Change the downloader to actually retrieve the bytes and not text.
Attachments (0)
Change History (7)
comment:1 Changed on 12/21/2017 at 11:26:28 AM by fhd
- Cc trev removed
comment:2 Changed on 03/12/2018 at 06:13:19 PM by sergz
- Cc rjeschke added
- Owner set to sergz
comment:3 Changed on 03/13/2018 at 06:33:26 PM by sergz
It does not accomplish the issue, the added codereview merely allows String to be an UTF-8 string internally (it's even not complete because toLower is not properly implemented). So far it will be useful for #6452 - Use INI parser natively from emscripten branch of adblockpluscore.
comment:4 Changed on 03/15/2018 at 03:53:19 PM by abpbot
A commit referencing this issue has landed:
Issue 6180 - use ABP_TEXT everywhere in order to let String be a UTF-8 string
comment:6 Changed on 03/16/2018 at 12:54:10 PM by abpbot
A commit referencing this issue has landed:
Issue 6180 - Add a few missing ABP_TEXT() macros
comment:7 Changed on 08/29/2019 at 05:43:52 PM by sebastian
- Keywords closed-in-favor-of-gitlab added
- Resolution set to rejected
- Status changed from new to closed
Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.
I'm working on ABP_TEXT macro and choosing at compile time the type of internally stored value of String (let's see whether it will be enough).