Opened 5 years ago

Closed 5 years ago

#2268 closed change (fixed)

Deprecate setupStderr()

Reported by: sebastian Assignee: sebastian
Priority: P3 Milestone:
Module: Sitescripts Keywords:
Cc: trev, matze Blocked By:
Blocking: Platform: Unknown
Ready: yes Confidential: no
Tester: Verified working: no
Review URL(s):

http://codereview.adblockplus.org/4790980679041024

Description (last modified by sebastian)

Background

The setupStderr() function is used in our executable scripts and WSGI controllers to make sys.stderr expect unicode instead of byte strings.

While this - IMO ugly - hack prevents UnicodeEncodeErrors when writing unicode objects with non-ascii text to stderr, it causes UnicodeDecodeErrors when you write str objects with non-ascii data to stderr.

Moreover, we have more boilerplate code importing and calling setupStderr() than code actually writing to sys.stderr.

However, it's not feasible to get rid of it at once, as this would require to test every single script. But deprecating it would be a good idea.

What to change

Deprecate setupStderr() showing a warning when running the script during development.

Change History (10)

comment:1 Changed 5 years ago by sebastian

I intentionally didn't set the issue to Ready yet. This code is in there since the initial commit. So I couldn't find any information why it was introduced. But it looks like a bad idea to me. @trev: Do you know more?

comment:2 Changed 5 years ago by sebastian

  • Description modified (diff)

comment:3 follow-up: Changed 5 years ago by trev

Yes, this hack predates the sitescripts repository. And - sure, I came up with this back when I was very new to Python. The problem back then was that sitescripts.subscriptions.subscriptionParser would parse files encoded in UTF-8 - and it would occasionally produce warnings about the content of these files. Having these warnings explode due to Unicode content was a bad thing.

it causes UnicodeDecodeErrors when you write str objects with non-ascii data to stderr.

Arguably, non-text data should never be written to stderr.

Moreover, the way it's currently implemented doesn't work for WSGI apps running in a server environment, since there is no sys.stderr. Unhandled exceptions are logged to environ["wsgi.errors"] instead.

That's why our WSGI apps call setupStderr(environ['wsgi.errors']) and print to sys.stderr then.

What's the alternative? I don't think that we should be thinking about encodings whenever we print a warning or an error message. It seems that the logging module will handle Unicode content automatically, should we use it consistently throughout the ode? Note that this means replacing both direct writing to sys.stderr and any use of warnings module.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 5 years ago by sebastian

Replying to trev:

Yes, this hack predates the sitescripts repository. And - sure, I came up with this back when I was very new to Python. The problem back then was that sitescripts.subscriptions.subscriptionParser would parse files encoded in UTF-8 - and it would occasionally produce warnings about the content of these files. Having these warnings explode due to Unicode content was a bad thing.

I would consider it a better approach to just encode those warnings properly when generating them, rather than changing the default behavior of the whole environment. Note that other pieces of code including third-party and core libraries are more likely to rely on the default behavior.

it causes UnicodeDecodeErrors when you write str objects with non-ascii data to stderr.

Arguably, non-text data should never be written to stderr.

By default sys.stderr expects str objects. So code that is compatible with the standard Python environment would do something like sys.stderr.write(text.encode("utf-8")). However, the writer you applied will decode the string again with ascii (potentially raising UnicodeDecodeError) since it expects unicode objects, and then encodes it again as utf8 to pass it to underlying file. Also note that exceptions are coerced to str, not unicode before they get written to sys.stderr as well. So your hack breaks print >>sys.stderr, "ä" as well as raise Exception('ä').

What's the alternative? I don't think that we should be thinking about encodings whenever we print a warning or an error message. It seems that the logging module will handle Unicode content automatically, should we use it consistently throughout the ode? Note that this means replacing both direct writing to sys.stderr and any use of warnings module.

$ grep -R sys.stderr | wc -l
32
$ grep -R setupStderr | wc -l
72

Currently there about twice as many calls to setupStderr() as usages of sys.stderr, of which most are sending str objects anyway. So yes, just removing setupStderr() and handling encoding manually in the very few places where we print dynamic text to stderr (if any at all) seems like the best solution for now.

comment:5 in reply to: ↑ 4 Changed 5 years ago by trev

Replying to sebastian:

So yes, just removing setupStderr() and handling encoding manually in the very few places where we print dynamic text to stderr (if any at all) seems like the best solution for now.

Frankly, I disagree. It really isn't obvious where we might be writing Unicode data to sys.stderr. For example, sitescripts.crawler.web.crawler (currently not even calling setupStderr()) will insert domain names from database or lines from a POST request into warning messages - that's a footgun. Even if that data doesn't contain Unicode right now, that might change tomorrow. Are people supposed to remember changing the warning messages then? And it's the same in many other places, deciding whether encoding is necessary requires good knowledge of the data being used there.

Given that the logging module will solve this issue for us, do you see a problem with making it mandatory whenever warnings or errors should be printed?

comment:6 Changed 5 years ago by sebastian

  • Description modified (diff)
  • Priority changed from Unknown to P3
  • Ready set

I didn't argued against using the logging module. My point just was that I consider the setupStderr() hack harmful. Using the logging module instead writing to sys.stderr directly actually sounds good.

comment:7 Changed 5 years ago by sebastian

  • Owner set to sebastian

comment:8 Changed 5 years ago by sebastian

  • Review URL(s) modified (diff)
  • Status changed from new to reviewing

comment:9 Changed 5 years ago by sebastian

  • Description modified (diff)
  • Review URL(s) modified (diff)
  • Summary changed from Get rid of setupStderr() to Deprecate setupStderr()

comment:10 Changed 5 years ago by sebastian

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.