Opened on 02/17/2019 at 07:00:17 AM
Closed on 02/26/2019 at 05:31:44 AM
Last modified on 07/25/2019 at 02:21:59 PM
#7296 closed change (fixed)
Implement lightweight URL parsing for canonicalized URLs
Reported by: | mjethani | Assignee: | mjethani |
---|---|---|---|
Priority: | P2 | Milestone: | |
Module: | Core | Keywords: | |
Cc: | greiner | Blocked By: | |
Blocking: | #7000 | Platform: | Unknown / Cross platform |
Ready: | yes | Confidential: | no |
Tester: | Ross | Verified working: | yes |
Review URL(s): |
Description (last modified by mjethani)
Background
In the extension we use the URL object just to extract the protocol and the hostname from a URL. This is not very efficient as it must "canonicalize" (normalize) and then parse the entire URL. Extraction of the protocol and the hostname can be implemented using a regular expression instead:
let [, protocol, hostname] = /^([^:]+:)(?:\/\/(?:[^/]*@)?(\[[^\]]*\]|[^:/]+))?/ .exec(url);
Note: This would work only for URLs that are already canonicalized (i.e. http:example.com:80?foo has been converted to http://example.com/?foo), such as those received by the onBeforeRequest handler in the extension.
What to change
Implement a function parseURL() that uses the above regular expression to extract the protocol and the hostname from a URL. It should return an object containing the properties href, protocol, and hostname and a toString() method that returns the value of the href property.
For any url that is a canonicalized URL, all of the following should be true:
- parseURL(url).href === new URL(url).href
- parseURL(url).protocol === new URL(url).protocol
- parseURL(url).hostname === new URL(url).hostname
- parseURL(url).toString() === new URL(url).toString()
For the tests, pick up some ideas from Chromium. Note however that parseURL() should expect an already canonicalized URL; accordingly, the tests should use only canonicalized URLs.
Attachments (0)
Change History (20)
comment:1 Changed on 02/17/2019 at 07:01:05 AM by mjethani
- Blocking 7000 added
comment:5 Changed on 02/19/2019 at 11:29:54 AM by greiner
- Cc greiner added
comment:6 Changed on 02/19/2019 at 06:22:12 PM by mjethani
For the regular expression in the description (which may not be exactly what we end up using), I went for the easy route: I looked at the implementation of URL parsing in Chromium (which is apparently based on Mozilla). I may have got it wrong of course. The best way to ensure that it works as intended is to write tests that compare against the output from the native URL object.
comment:7 Changed on 02/19/2019 at 06:27:00 PM by mjethani
Regarding Punycode, we dropped the assumption in Core that URLs will be anything but properly encoded (#6647) (which means the hostname part is expected to be Punycode).
comment:8 Changed on 02/21/2019 at 11:31:05 AM by mjethani
- Description modified (diff)
- Summary changed from Implement hostname extraction using a regular expression to Implement lightweight URL parsing
comment:10 Changed on 02/21/2019 at 11:35:11 AM by mjethani
I have an implementation now. I need to polish it up and add some more tests.
comment:11 Changed on 02/21/2019 at 11:35:22 AM by mjethani
- Owner set to mjethani
comment:12 Changed on 02/21/2019 at 02:07:12 PM by mjethani
- Description modified (diff)
comment:13 Changed on 02/23/2019 at 07:38:15 AM by mjethani
- Description modified (diff)
comment:14 Changed on 02/23/2019 at 08:46:47 AM by mjethani
- Description modified (diff)
- Priority changed from Unknown to P2
- Ready set
comment:15 Changed on 02/23/2019 at 08:47:03 AM by mjethani
- Status changed from new to reviewing
comment:16 Changed on 02/23/2019 at 08:57:01 AM by mjethani
- Description modified (diff)
comment:17 Changed on 02/23/2019 at 08:58:18 AM by mjethani
- Summary changed from Implement lightweight URL parsing to Implement lightweight URL parsing for canonicalized URLs
comment:18 Changed on 02/26/2019 at 05:30:25 AM by abpbot
A commit referencing this issue has landed:
Issue 7296 - Implement lightweight URL parsing
comment:19 Changed on 02/26/2019 at 05:31:44 AM by mjethani
- Resolution set to fixed
- Status changed from reviewing to closed
comment:20 Changed on 07/25/2019 at 02:21:59 PM by Ross
- Tester changed from Unknown to Ross
- Verified working set
Done. Does not look to have caused any regressions.
ABP 0.9.15.2340
Microsoft Edge 44.17763.1.0 / Windows 10 1809
ABP 3.5.2.2340
Chrome 49.0.2623.75 / Windows 10 1809
Chrome 75.0.3770.142 / Windows 10 1809
Opera 36.0.2130.65 / Windows 10 1809
Opera 62.0.3331.72 / Windows 10 1809
Firefox 51.0 / Windows 10 1809
Firefox 68.0 / Windows 10 1809
Firefox Mobile 68.0 / Android 7.2.2
Are we expecting any issues with IDNs, FQDNs or differently encoded (i.e. punycode or URL-encoding) URLs here? What about URLs that contain username and password or a port?
While I tried the regex and it seems to be able to handle any of those cases, we should probably make sure that our tests cover all possible cases that are allowed by the URL standard.