Opened on 02/02/2016 at 10:28:12 AM

Closed on 10/04/2018 at 01:52:55 PM

#3610 closed defect (worksforme)

UnicodeDecodeError during generation of static content with non ASCII filename

Reported by: saroyanm
Assignee:
Priority: P4
Milestone:
Module: Sitescripts
Keywords: cms
Cc: kzar, sebastian
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: yes
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

Description (last modified by sebastian)

Environment

Python 2.7.6

How to reproduce

  1. Download attached folder and unzip
  2. Make an initial commit
  3. Generate static content using CMS
  4. Run test server and go to http://localhost:5000/test%C3%BC.pdf.

Observed behaviour

Step 3 throws UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13, while the test server in step 4 shows no error and works fine.

Expected behaviour

The static content generation script and the test server should behave consistently.

Attachments (1)

ascii-issue.zip (874 bytes) - added by saroyanm on 02/02/2016 at 10:28:58 AM.

Change History (8)

Changed on 02/02/2016 at 10:28:58 AM by saroyanm

comment:1 Changed on 02/02/2016 at 10:29:30 AM by saroyanm

  • Description modified (diff)

comment:2 Changed on 02/04/2016 at 01:10:39 PM by sebastian

  • Cc trev kzar sebastian added
  • Description modified (diff)
  • Priority changed from Unknown to P4
  • Ready set

comment:3 Changed on 02/04/2016 at 02:02:05 PM by trev

It seems that the problem here is the ZIP format. hg archive -t uzip encodes file names as UTF-8, but the zipfile module doesn't decode them. The documentation says:

There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write().

I guess that the same is true for reading the file names - these have to be decoded from UTF-8 manually. In my tests, ZipFile.namelist() will return the names without decoding them. The calls to ZipFile.getinfo() and ZipFile.read() most definitely require manual encoding.
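A minimal sketch of the manual decoding and encoding described above, assuming the archive stores names as UTF-8 bytes. The helper names decode_zip_name, encode_zip_name and ascii_decode_fails are hypothetical and not part of the CMS or of zipfile. The last helper only illustrates why an implicit ASCII decode of such a name raises UnicodeDecodeError.

```python
def decode_zip_name(raw_name, encoding="utf-8"):
    """Turn a raw member name (e.g. from ZipFile.namelist()) into text.

    Assumption: names that arrive as bytes are UTF-8 encoded.
    Names that are already text are returned unchanged.
    """
    if isinstance(raw_name, bytes):
        return raw_name.decode(encoding)
    return raw_name


def encode_zip_name(name, encoding="utf-8"):
    """Turn a text name back into the bytes stored in the archive.

    Intended for lookups such as ZipFile.getinfo() or ZipFile.read()
    on Python 2, where member names are byte strings.
    """
    if isinstance(name, bytes):
        return name
    return name.encode(encoding)


def ascii_decode_fails(raw_name):
    """Return True if decoding raw_name with the ASCII codec would fail."""
    try:
        raw_name.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False
```

With these helpers, a caller would decode every name right after listing the archive and encode it again just before reading a member, so that the rest of the code only ever sees text.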

comment:4 Changed on 02/04/2016 at 04:35:19 PM by sebastian

It seems that the reason it works when reading files from the local file system directly is that Python automatically encodes filenames, e.g. when calling open(), according to sys.getfilesystemencoding(). This, however, isn't always UTF-8.

So it seems that if we want to ensure consistent behavior here, we have to manually decode/encode filenames in both cases. But apparently Windows uses Unicode natively, so I have no idea what happens if you pass UTF-8-encoded bytes to a file API there.
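A rough sketch of what explicit conversion for the local file system could look like, assuming the goal is to stop relying on Python's implicit behavior. The function names are hypothetical. The fallback to UTF-8 is also an assumption, needed because sys.getfilesystemencoding() may return None on Python 2.

```python
import sys


def filesystem_encoding(default="utf-8"):
    """Return the encoding Python uses for file names, or a fallback."""
    return sys.getfilesystemencoding() or default


def to_filesystem_bytes(name, encoding=None):
    """Encode a text file name explicitly instead of relying on open()."""
    if isinstance(name, bytes):
        return name
    return name.encode(encoding or filesystem_encoding())


def from_filesystem_bytes(raw_name, encoding=None):
    """Decode a byte file name (e.g. from a directory listing) into text."""
    if isinstance(raw_name, bytes):
        return raw_name.decode(encoding or filesystem_encoding())
    return raw_name
```

The same text name maps to different bytes depending on the encoding, which is why the result is only consistent across systems if the encoding is pinned rather than taken from the environment.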

Last edited on 02/04/2016 at 04:52:11 PM by sebastian

comment:5 Changed on 12/21/2017 at 11:29:03 AM by fhd

  • Cc trev removed

comment:6 Changed on 10/04/2018 at 01:49:57 PM by atudor

Tried to reproduce on both Ubuntu 14.04.5 and macOS 10.12.6. Could NOT reproduce.
Everything works as expected. Closing the ticket now.

Last edited on 10/04/2018 at 01:50:45 PM by atudor

comment:7 Changed on 10/04/2018 at 01:52:55 PM by atudor

  • Resolution set to worksforme
  • Status changed from new to closed
