Opened 5 years ago

Closed 2 years ago

#3610 closed defect (worksforme)

UnicodeDecodeError during generation of static content with non ASCII filename

Reported by: saroyanm
Assignee:
Priority: P4
Milestone:
Module: Sitescripts
Keywords: cms
Cc: kzar, sebastian
Blocked By:
Blocking:
Platform: Unknown / Cross platform
Ready: yes
Confidential: no
Tester: Unknown
Verified working: no
Review URL(s):

Description (last modified by sebastian)


Python 2.7.6

How to reproduce

  1. Download attached folder and unzip
  2. Make an initial commit
  3. Generate static content using CMS
  4. Run test server and go to http://localhost:5000/test%C3%BC.pdf.

Observed behaviour

Step #3 fails with UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13, while the test server in step #4 shows no error and works fine.
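
For reference, this is the message Python produces when UTF-8 bytes are decoded with the ASCII codec; 0xc3 is the first of the two bytes that encode "ü" in UTF-8. The sketch below only reproduces the message using the bare file name from step #4. Where the decode actually happens inside the CMS is not shown in this ticket, and the reported offset (13) depends on the full path being decoded, so the offset here differs.

```python
# The file name from step #4 as UTF-8 bytes; "ü" becomes 0xc3 0xbc.
name = u"test\u00fc.pdf".encode("utf-8")

try:
    # Decoding with the ASCII codec fails on the first non-ASCII byte.
    name.decode("ascii")
except UnicodeDecodeError as error:
    # error.start is the offset of the byte that could not be decoded.
    failed_at = error.start
    message = str(error)
```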

Expected behaviour

Both static content generation script and test server should behave consistently.

Attachments (1) (874 bytes) - added by saroyanm 5 years ago.

Change History (8)

Changed 5 years ago by saroyanm

comment:1 Changed 5 years ago by saroyanm

  • Description modified (diff)

comment:2 Changed 5 years ago by sebastian

  • Cc trev kzar sebastian added
  • Description modified (diff)
  • Priority changed from Unknown to P4
  • Ready set

comment:3 Changed 5 years ago by trev

The problem here seems to be the ZIP format. hg archive -t uzip encodes file names as UTF-8, but the zipfile module doesn't decode them. The documentation says:

There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write().

I guess that the same is true for reading the file names - these have to be decoded from UTF-8 manually. In my tests, ZipFile.namelist() will return the names without decoding them. The calls to ZipFile.getinfo() and most definitely require manual encoding.
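
A minimal sketch of the manual decoding described above, assuming the archive stores names as UTF-8. The helpers decode_name and read_members are hypothetical and not part of the CMS. Members are read through their ZipInfo objects, so the decoded name never has to be re-encoded for a lookup.

```python
import io
import zipfile


def decode_name(name, encoding="utf-8"):
    """Return a ZIP member name as text (hypothetical helper)."""
    # Python 2's zipfile can hand back byte strings; Python 3 returns text.
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def read_members(archive_bytes):
    """Map decoded member names to file contents (hypothetical helper)."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            # Passing the ZipInfo object avoids a name-based lookup.
            contents[decode_name(info.filename)] = archive.read(info)
    return contents
```

The encoding parameter defaults to UTF-8 only because that is what hg archive is reported to write in this comment; other archive sources may need a different value.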

comment:4 Changed 5 years ago by sebastian

It seems that the reason it works when reading files directly from the local file system is that Python automatically encodes filenames, e.g. when calling open(), according to sys.getfilesystemencoding(). This, however, isn't always UTF-8.

So it seems that if we want to ensure consistent behavior here, we have to manually decode/encode filenames in both cases. But apparently Windows uses Unicode natively. So no idea what happens if you pass encoded UTF-8 to a file API there.
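
One way to make the encoding explicit is sketched below, assuming UTF-8 is the desired on-disk encoding. The helper to_fs_path is hypothetical. It does not settle the Windows question above, since it only shows the conversion and does not call any file API.

```python
import sys


def to_fs_path(path, encoding="utf-8"):
    """Encode a text path to bytes explicitly (hypothetical helper)."""
    # Byte strings are assumed to be encoded already and pass through.
    if isinstance(path, bytes):
        return path
    return path.encode(encoding)


# The encoding Python would otherwise pick implicitly, e.g. inside open().
fs_encoding = sys.getfilesystemencoding()
```

Comparing fs_encoding with the fixed encoding on a given machine shows whether the implicit and explicit paths would agree there.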

Last edited 5 years ago by sebastian (previous) (diff)

comment:5 Changed 3 years ago by fhd

  • Cc trev removed

comment:6 Changed 2 years ago by atudor

Tried to reproduce on both Ubuntu 14.04.5 and macOS 10.12.6, but could NOT reproduce.
Everything works as expected. Closing the ticket now.

Last edited 2 years ago by atudor (previous) (diff)

comment:7 Changed 2 years ago by atudor

  • Resolution set to worksforme
  • Status changed from new to closed