Opened on 02/02/2016 at 10:28:12 AM
Closed on 10/04/2018 at 01:52:55 PM
#3610 closed defect (worksforme)
UnicodeDecodeError during generation of static content with non ASCII filename
Reported by: | saroyanm | Assignee: | |
---|---|---|---|
Priority: | P4 | Milestone: | |
Module: | Sitescripts | Keywords: | cms |
Cc: | kzar, sebastian | Blocked By: | |
Blocking: | | Platform: | Unknown / Cross platform |
Ready: | yes | Confidential: | no |
Tester: | Unknown | Verified working: | no |
Review URL(s): |
Description (last modified by sebastian)
Environment
Python 2.7.6
How to reproduce
1. Download the attached folder and unzip it
2. Make an initial commit
3. Generate static content using CMS
4. Run the test server and go to http://localhost:5000/test%C3%BC.pdf.
Observed behaviour
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13 is thrown on step #3, while the test server on step #4 doesn't show any error and works fine.
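The error class can be illustrated with a minimal sketch. It assumes a file name of testü.pdf (taken from the URL above) encoded as UTF-8, and the helper name decode_as_ascii is hypothetical. The reported position differs from the ticket's 13 because the full path used by the generation script isn't shown here. On Python 2 the same failure typically happens implicitly, whenever a byte string containing non-ASCII bytes is mixed with a unicode string.

```python
# -*- coding: utf-8 -*-
# Assumed file name, written as an escape so the source stays ASCII.
name_bytes = u"test\u00fc.pdf".encode("utf-8")  # UTF-8 bytes, "ü" becomes 0xc3 0xbc


def decode_as_ascii(raw):
    """Try to decode raw bytes as ASCII; return the error instead of raising."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as error:
        return error


result = decode_as_ascii(name_bytes)

# The first byte the ASCII codec cannot handle is 0xc3, the lead byte of "ü".
print(result)

# Decoding with the encoding the bytes were actually written in works.
print(name_bytes.decode("utf-8") == u"test\u00fc.pdf")
```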
Expected behaviour
Both the static content generation script and the test server should behave consistently.
Attachments (1)
Change History (8)
Changed on 02/02/2016 at 10:28:58 AM by saroyanm
comment:2 Changed on 02/04/2016 at 01:10:39 PM by sebastian
- Cc trev kzar sebastian added
- Description modified (diff)
- Priority changed from Unknown to P4
- Ready set
comment:3 Changed on 02/04/2016 at 02:02:05 PM by trev
comment:4 Changed on 02/04/2016 at 04:35:19 PM by sebastian
It seems that the reason it works when reading files directly from the local file system is that Python automatically encodes filenames, e.g. when calling open(), according to sys.getfilesystemencoding(). This, however, isn't always UTF-8.
So it seems that if we want to ensure consistent behavior here, we have to manually decode/encode filenames in both cases. But apparently Windows uses Unicode natively, so I have no idea what happens if you pass UTF-8 encoded bytes to a file API there.
comment:5 Changed on 12/21/2017 at 11:29:03 AM by fhd
- Cc trev removed
comment:6 Changed on 10/04/2018 at 01:49:57 PM by atudor
Tried to reproduce on both Ubuntu 14.04.5 and macOS 10.12.6. Could NOT reproduce.
Everything works as expected. Closing the ticket now.
comment:7 Changed on 10/04/2018 at 01:52:55 PM by atudor
- Resolution set to worksforme
- Status changed from new to closed
It seems that the problem here is the ZIP format. hg archive -t uzip encodes file names as UTF-8, but the zipfile module doesn't decode them. The documentation says:
I guess that the same is true for reading the file names - these have to be decoded from UTF-8 manually. In my tests, ZipFile.namelist() will return the names without decoding them. The calls to ZipFile.getinfo() and ZipFile.read() most definitely require manual encoding.
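A possible way to handle this, sketched under the assumption that the archive's member names are UTF-8 (as stated above for hg archive -t uzip). The names decode_name and read_member are hypothetical. Whether namelist() returns bytes or text depends on the Python version and on the flags stored in the archive, so the helper accepts both. Instead of re-encoding the name for read(), the sketch passes back the raw name exactly as namelist() returned it.

```python
import io
import zipfile

# Assumption: member names in the archive are UTF-8 encoded.
ARCHIVE_NAME_ENCODING = "utf-8"


def decode_name(name):
    """Return a member name as text, decoding it only if it is still bytes."""
    if isinstance(name, bytes):
        return name.decode(ARCHIVE_NAME_ENCODING)
    return name


def read_member(archive, text_name):
    """Read a member by its text name, whatever form namelist() returns."""
    for raw_name in archive.namelist():
        if decode_name(raw_name) == text_name:
            # Hand the raw name back unchanged so no manual encoding is needed.
            return archive.read(raw_name)
    raise KeyError(text_name)


# Usage with a small in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(u"test\u00fc.pdf", b"dummy content")

with zipfile.ZipFile(buffer) as archive:
    content = read_member(archive, u"test\u00fc.pdf")
```

The lookup is linear in the number of members, which should be acceptable for a one-off generation run; a dictionary from decoded name to raw name would be the obvious refinement.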