#6648 closed change (invalid)

Support Unicode-aware regular expressions

Reported by: mjethani Assignee:
Priority: Unknown Milestone:
Module: Core Keywords:
Cc: kzar, sebastian, sergz, agiammarchi, greiner, hfiguiere, arthur Blocked By:
Blocking: Platform: Unknown / Cross platform
Ready: no Confidential: no
Tester: Unknown Verified working: no
Review URL(s):

Description (last modified by mjethani)

Background

With ES2015, JavaScript introduced Unicode-aware regular expressions. While Unicode characters are supported in ordinary JavaScript regular expressions, constructs like the . (dot) operator, quantifiers, character classes, and so on are treated differently in Unicode-aware regular expressions. There may be cases where we want filter authors to be able to specify Unicode-aware regular expressions where normal regular expressions don't cut it.
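As an illustration of the difference (a minimal sketch, not taken from the codebase), the snippet below tests the same patterns with and without the u flag against a character outside the Basic Multilingual Plane. The variable names are made up for this example.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so a JavaScript string
// stores it as two UTF-16 code units (a surrogate pair).
const emoji = "\u{1F600}";

// Dot: without u it matches one code unit, with u it matches one code point.
const dotWithoutU = /^.$/.test(emoji);
const dotWithU = /^.$/u.test(emoji);

// Quantifier: without u it applies to the trailing code unit only,
// with u it applies to the whole code point.
const twice = emoji + emoji;
const quantWithoutU = new RegExp("^" + emoji + "{2}$").test(twice);
const quantWithU = new RegExp("^" + emoji + "{2}$", "u").test(twice);

// Character class: without u the class holds two separate code units,
// with u it holds a single code point.
const classWithoutU = new RegExp("^[" + emoji + "]$").test(emoji);
const classWithU = new RegExp("^[" + emoji + "]$", "u").test(emoji);

console.log({dotWithoutU, dotWithU, quantWithoutU, quantWithU,
             classWithoutU, classWithU});
```

In each pair, only the u variant treats the surrogate pair as one character.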

What to change

There are three options:

  1. Allow a trailing u flag after the regular expression, as in /foo/u$domain=example.com
  2. Always set the u flag in the RegExp object that is created internally
  3. Introduce a new unicode option, similar to matchCase

It should be noted that this only matters for regular expression filters, not for text filters; i.e. text filters still work even if they contain Unicode characters (modulo issues like #6647).
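A rough sketch of how options 1 and 2 might look, assuming a simplified filter syntax. The helpers supportsUnicodeFlag, parseRegExpFilter and compileFilterRegExp are hypothetical names invented for this illustration; they are not the project's actual API, and the parsing here ignores edge cases such as slashes inside the options part.

```javascript
// Hypothetical helper: detect whether the engine accepts the u flag.
function supportsUnicodeFlag() {
  try {
    new RegExp("", "u");
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical parser for option 1: split "/pattern/u$options" into parts.
// The accepted syntax is an assumption made for this example only.
function parseRegExpFilter(text) {
  const match = /^\/(.*)\/(u?)(?:\$(.*))?$/.exec(text);
  if (!match) {
    return null;
  }
  return {
    source: match[1],
    unicode: match[2] === "u",
    options: match[3] || null
  };
}

// Hypothetical compiler: adds the u flag when requested and supported.
// Option 2 would amount to always passing unicode: true here.
function compileFilterRegExp(source, {unicode = false, matchCase = false} = {}) {
  let flags = matchCase ? "" : "i";
  if (unicode && supportsUnicodeFlag()) {
    flags += "u";
  }
  return new RegExp(source, flags);
}

const parsed = parseRegExpFilter("/foo/u$domain=example.com");
const regexp = compileFilterRegExp(parsed.source, {unicode: parsed.unicode});
```

Option 3 would differ only in where the unicode value comes from: the options string rather than a trailing flag.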

Resolution

After the changes in #6647, URLs are always ASCII-only: any non-ASCII characters in the host name are encoded in Punycode by the browser, and those in the path and query string are percent-encoded. As a result, no non-ASCII character is ever matched against a filter, so Unicode-aware regular expressions are not needed.
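The normalization can be observed with the standard URL constructor, using the same example URL as in comment:6. This is only a sketch of the browser behavior the resolution relies on; the variable names are made up for the example.

```javascript
// Parse a URL with non-ASCII characters in the host, path, and query string.
const url = new URL("https://ä/ä?ä");

// The host name comes back Punycode-encoded, the rest percent-encoded.
console.log(url.hostname);
console.log(url.pathname);
console.log(url.search);

// The serialized URL therefore contains ASCII characters only.
const isAsciiOnly = /^[\x00-\x7F]*$/.test(url.href);
console.log(isAsciiOnly);
```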

Change History (9)

comment:1 Changed 14 months ago by mjethani

  • Description modified (diff)

comment:2 Changed 14 months ago by arthur

  • Cc arthur added

comment:3 Changed 14 months ago by sebastian

Can we benchmark the performance of filter matching with EasyList on real websites with/without Unicode-aware regular expressions? If there are no measurable performance implications, I think it would make most sense to always add the u flag, at least for regular expression filters.

I cannot think of any scenario where non-Unicode semantics are preferable, and since there are only a few regular expression filters in EasyList we don't have to be too concerned about backwards compatibility and memory usage. Performance is still a concern, however, since those regular expressions are potentially executed quite often.

comment:4 Changed 14 months ago by agiammarchi

My 2 cents. It's easy to feature-detect u support so I think option 2 might be the best option because browsers incapable of the u flag can work without it, and browsers capable will work with u enforced as default.

Point 1 means my expectations, as filter author, might not be fulfilled if my browser is not u compatible (unless we feature detect compatibility and inform the user u flag will be ignored).

Last but not least, the u flag is meant to fix what users expect when Unicode is in the string, so having it as the default in capable browsers makes more sense than asking users to remember to use u and explaining when and why. With u left out of the filter text but set automatically when possible, filters are also shareable and backward compatible.

comment:5 follow-up: Changed 14 months ago by kzar

I agree with Sebastian, if performance isn't a problem why not just enable Unicode matching always? I guess we should test with Chrome 49 and Firefox 51 first though, to make sure it works.

comment:6 Changed 14 months ago by sebastian

Is this change still relevant?

new URL('https://ä/ä?ä').href => "https://xn--4ca/%C3%A4?%C3%A4"

It seems the browser URL-encodes non-ASCII characters given in the path and query string, and with #6647 we stopped converting the domain part back from Punycode to Unicode. So as far as I understand, there will never be any non-ASCII character that is matched against a filter?

comment:7 in reply to: ↑ 5 Changed 14 months ago by hfiguiere

Replying to kzar:

I agree with Sebastian, if performance isn't a problem why not just enable Unicode matching always? I guess we should test with Chrome 49 and Firefox 51 first though, to make sure it works.

MDN says at least Chrome 50 and Firefox 46 for the 'u' flag (and Edge 12).

comment:8 Changed 14 months ago by mjethani

@sebastian you're right, this is no longer relevant after the changes in #6647. I'm closing this if nobody minds.

comment:9 Changed 14 months ago by mjethani

  • Description modified (diff)
  • Resolution set to invalid
  • Status changed from new to closed