Opened 16 months ago

Last modified 11 months ago

#7296 closed change

Implement lightweight URL parsing — at Version 13

Reported by: mjethani Assignee: mjethani
Priority: P2 Milestone:
Module: Core Keywords:
Cc: greiner Blocked By:
Blocking: #7000 Platform: Unknown / Cross platform
Ready: yes Confidential: no
Tester: Ross Verified working: yes
Review URL(s):

https://codereview.adblockplus.org/30013574/

Description (last modified by mjethani)

Background

In the extension we use the URL object just to extract the protocol and the hostname from a URL. This is not very efficient as it must "canonicalize" (normalize) and then parse the entire URL. Extraction of the protocol and the hostname can be implemented using a regular expression instead:

let [, protocol, hostname] = /^([^:]+:)(?:\/\/(?:[^/]*@)?(\[[^\]]*\]|[^:/]+))?/
                             .exec(url);

What to change

To be determined.

See patch.

Change History (13)

comment:1 Changed 16 months ago by mjethani

  • Blocking 7000 added

comment:2 Changed 16 months ago by mjethani

  • Description modified (diff)

comment:3 Changed 16 months ago by mjethani

  • Description modified (diff)

comment:4 Changed 16 months ago by mjethani

  • Description modified (diff)

comment:5 Changed 16 months ago by greiner

  • Cc greiner added

Are we expecting any issues with IDNs, FQDNs or differently encoded (i.e. punycode or URL-encoding) URLs here? What about URLs that contain username and password or a port?

While I tried the regex and it seems to be able to handle any of those cases, we should probably make sure that our tests cover all possible cases that are allowed by the URL standard.

comment:6 Changed 16 months ago by mjethani

For the regular expression in the description (which may not be exactly what we end up using), I went for the easy route: I looked at the implementation of URL parsing in Chromium (which is apparently based on Mozilla). I may have got it wrong of course. The best way to ensure that it works as intended is to write tests that compare against the output from the native URL object.

comment:7 Changed 16 months ago by mjethani

Regarding Punycode, we dropped the assumption in Core that URLs will be anything but properly encoded (#6647) (which means the hostname part is expected to be Punycode).

comment:8 Changed 16 months ago by mjethani

  • Description modified (diff)
  • Summary changed from Implement hostname extraction using a regular expression to Implement lightweight URL parsing

comment:9 Changed 16 months ago by mjethani

  • Review URL(s) modified (diff)

comment:10 Changed 16 months ago by mjethani

I have an implementation now. I need to polish it up and add some more tests.

comment:11 Changed 16 months ago by mjethani

  • Owner set to mjethani

comment:12 Changed 16 months ago by mjethani

  • Description modified (diff)

comment:13 Changed 16 months ago by mjethani

  • Description modified (diff)
Note: See TracTickets for help on using tickets.