Opened on 02/02/2016 at 10:28:12 AM

Closed on 10/04/2018 at 01:52:55 PM

#3610 closed defect (worksforme)

UnicodeDecodeError during generation of static content with non ASCII filename

Reported by:	saroyanm	Assignee:
Priority:	P4	Milestone:
Module:	Sitescripts	Keywords:	cms
Cc:	kzar, sebastian	Blocked By:
Blocking:		Platform:	Unknown / Cross platform
Ready:	yes	Confidential:	no
Tester:	Unknown	Verified working:	no
Review URL(s):

Description (last modified by sebastian)

Environment

Python 2.7.6

How to reproduce

1. Download attached folder and unzip
2. Make an initial commit
3. Generate static content using CMS
4. Run test server and go to http://localhost:5000/test%C3%BC.pdf.

Observed behaviour

Step #3 fails with UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13, while the test server in step #4 shows no error and works fine.

Expected behaviour

The static content generation script and the test server should behave consistently.

Change History (8)

Changed on 02/02/2016 at 10:28:58 AM by saroyanm

Attachment ascii-issue.zip added

comment:1 Changed on 02/02/2016 at 10:29:30 AM by saroyanm

Description modified (diff)

comment:2 Changed on 02/04/2016 at 01:10:39 PM by sebastian

Cc trev kzar sebastian added
Description modified (diff)
Priority changed from Unknown to P4
Ready set

comment:3 Changed on 02/04/2016 at 02:02:05 PM by trev

The problem seems to be the ZIP format. hg archive -t uzip encodes file names as UTF-8, but the zipfile module doesn't decode them. The documentation says:

There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write().

I guess that the same is true for reading the file names: they have to be decoded from UTF-8 manually. In my tests, ZipFile.namelist() returns the names without decoding them. Calls to ZipFile.getinfo() and ZipFile.read() most definitely require manual encoding.
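A minimal sketch of the manual decoding described above, assuming the archive stores UTF-8 names. The helpers decode_name, read_member and build_sample_archive are hypothetical names for illustration, not part of the CMS or of zipfile. The lookup table avoids guessing which form getinfo() or read() expects: it passes back exactly what namelist() returned. Note that Python 3's zipfile already returns text names, in which case decode_name is a no-op.

```python
import io
import zipfile


def decode_name(name, encoding="utf-8"):
    """Return a text file name, decoding only if raw bytes were given."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def read_member(archive, text_name, encoding="utf-8"):
    """Read a member by its decoded name.

    Maps each decoded name back to the value namelist() returned, so the
    archive is always queried with the form it handed out itself.
    """
    lookup = {decode_name(raw, encoding): raw for raw in archive.namelist()}
    return archive.read(lookup[text_name])


def build_sample_archive():
    """Create an in-memory ZIP holding one file with a non-ASCII name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(u"test\u00fc.pdf", b"dummy content")
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(build_sample_archive()) as sample:
    content = read_member(sample, u"test\u00fc.pdf")
```

If the archive used a different encoding, the encoding argument would have to change accordingly; the ZIP format itself does not say which one applies.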

comment:4 Changed on 02/04/2016 at 04:35:19 PM by sebastian

It seems that the reason it works when reading files from the local file system directly is that Python automatically encodes filenames, e.g. when calling open(), according to sys.getfilesystemencoding(). This, however, isn't always UTF-8.

So it seems that if we want to ensure consistent behavior here, we have to manually decode/encode filenames in both cases. But apparently Windows uses Unicode natively, so I have no idea what happens if you pass UTF-8-encoded bytes to a file API there.
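To illustrate why the codec matters, here is a small sketch, assuming the file name reaches the code as UTF-8 bytes. The helpers filesystem_encoding and try_decode are hypothetical names, not CMS functions. The "ü" in the file name is stored as the two bytes 0xc3 0xbc, and the ASCII codec rejects the first of them. The position in the error depends on the string being decoded, so it need not match the position 13 reported in the ticket.

```python
import sys


def filesystem_encoding():
    """Return the encoding Python uses for file names on this system."""
    return sys.getfilesystemencoding()


def try_decode(raw_name, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return raw_name.decode(encoding), None
    except UnicodeDecodeError as error:
        return None, error


# The file name from the ticket, as UTF-8 bytes.
raw = u"test\u00fc.pdf".encode("utf-8")

# Decoding as ASCII fails at the first byte of the encoded "ü".
ascii_text, ascii_error = try_decode(raw, "ascii")

# Decoding explicitly as UTF-8 succeeds regardless of system settings.
utf8_text, utf8_error = try_decode(raw, "utf-8")
```

Decoding explicitly at the boundary, instead of relying on whatever filesystem_encoding() happens to return, is one way to make both code paths behave the same.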

Last edited on 02/04/2016 at 04:52:11 PM by sebastian

comment:5 Changed on 12/21/2017 at 11:29:03 AM by fhd

Cc trev removed

comment:6 Changed on 10/04/2018 at 01:49:57 PM by atudor

Tried to reproduce on both Ubuntu 14.04.5 and macOS 10.12.6. Could not reproduce.
Everything works as expected. Closing the ticket now.

Last edited on 10/04/2018 at 01:50:45 PM by atudor

comment:7 Changed on 10/04/2018 at 01:52:55 PM by atudor

Resolution set to worksforme
Status changed from new to closed
