Opened on 12/08/2017 at 03:33:28 PM

Closed on 08/29/2019 at 05:43:52 PM

#6180 closed change (rejected)

[emscripten] consider using UTF-8 internally

Reported by: hfiguiere Assignee:
Priority: Unknown Milestone:
Module: Core Keywords: closed-in-favor-of-gitlab
Cc: sergz, rjeschke Blocked By:
Blocking: #6452 Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

https://codereview.adblockplus.org/29721753/
https://codereview.adblockplus.org/29723641/

Description (last modified by sergz)

Background

Currently we have strings in UCS-2 because that's how JavaScript has them. It is inefficient for several reasons:

  1. we download the filter list and convert it to UCS-2
  2. we convert the filter list from UCS-2 to UTF-8 to calculate the md5 checksum
  3. statistics show that filters, even in locales like Chinese, are mostly ASCII, which mean that we use twice the memory space than necessary.
  4. we don't perform operations that would be slow with a multi-byte charset.
  5. UCS-2 doesn't even allow the proper character space for internationalization. We'd have to use UCS-4 (4 bytes) or UTF-16.

What to change

  • Change the String classes to use UTF-8
    • pay attention to String::toLower and check that there is no deviations from it.
  • Change everywhere else to use UTF-8.
  • Change the bindings to convert where needed into JavaScript. The more C++ we use the less conversion will happen.
  • Change the downloader to actually retrieve the bytes and not text.

Attachments (0)

Change History (7)

comment:1 Changed on 12/21/2017 at 11:26:28 AM by fhd

  • Cc trev removed

comment:2 Changed on 03/12/2018 at 06:13:19 PM by sergz

  • Cc rjeschke added
  • Owner set to sergz

I'm working on ABP_TEXT macro and choosing at compile time the type of internally stored value of String (let's see whether it will be enough).

comment:3 Changed on 03/13/2018 at 06:33:26 PM by sergz

  • Blocking 6452 added
  • Description modified (diff)
  • Owner sergz deleted
  • Review URL(s) modified (diff)

It does not accomplish the issue, the added codereview merely allows String to be an UTF-8 string internally (it's even not complete because toLower is not properly implemented). So far it will be useful for #6452 - Use INI parser natively from emscripten branch of adblockpluscore.

comment:4 Changed on 03/15/2018 at 03:53:19 PM by abpbot

comment:5 Changed on 03/15/2018 at 04:14:08 PM by hfiguiere

  • Review URL(s) modified (diff)

comment:6 Changed on 03/16/2018 at 12:54:10 PM by abpbot

A commit referencing this issue has landed:
Issue 6180 - Add a few missing ABP_TEXT() macros

comment:7 Changed on 08/29/2019 at 05:43:52 PM by sebastian

  • Keywords closed-in-favor-of-gitlab added
  • Resolution set to rejected
  • Status changed from new to closed

Sorry, but we switched to GitLab. If this issue is still relevant, please file it again in the new issue tracker.

Add Comment

Modify Ticket

Change Properties
Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from (none).
 
Note: See TracTickets for help on using tickets.