The one feature that made me move almost all my projects from python2 to python3 is the vastly improved handling of encodings. In python2, I was never sure whether I needed to throw in a .decode or an .encode, and with which arguments, to make things work. All my üs and äs would end up as weird characters, so I would try an .encode, which sometimes solved it and sometimes made it weirder yet. So I would try .decode instead, which sometimes solved it and sometimes didn’t. It was not fun.
Now, in python3, the story is much better and cleaner: I either have unicode strings, which I can print and so on, or I have bytes, which are just bytes and need to be decoded before they can be treated as strings. Standard library functions in python3 take and return either strings or bytes. Take for example the open() call: depending on the mode, the file object it returns reads and writes either bytes or strings. If I try to write bytes to a file opened in text mode, I get a TypeError. So everything is warm and nice, I get type errors if I do stupid things, and then I immediately know whether I have to decode or encode.
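Here is a small sketch of that behaviour. The temporary file and the sample text are made up for the example; only open() and its modes come from the paragraph above.

```python
import os
import tempfile

# A throwaway file; its name and contents are just for illustration.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Text mode: write() accepts str only.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Grüße\n")
        try:
            f.write(b"raw bytes")
            text_mode_error = None
        except TypeError as exc:
            text_mode_error = exc

    # Binary mode: read() hands back bytes, which I have to decode myself.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: read() hands back str, already decoded.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, type(text).__name__)
print("TypeError raised:", text_mode_error is not None)
```

The point is that the mistake surfaces right at the write() call, not somewhere downstream.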
So I wrote a small program which takes a mail on stdin and passes it via LMTP to dovecot, using python3’s smtplib. Everything worked, no type errors anywhere, and I even tested it by sending an email with some weird characters. It worked. I deployed it to the hemio mail server. A few days later, I get an SMS in the morning: we are losing mails! Just silently dropping them. WHAT? That’s of course the worst possible thing a mail server can do. After shutting down the mail server to prevent further breakage, I check the log to see what happened. The traceback I see gives me flashbacks to python2:
Traceback (most recent call last):
  File "/usr/local/lib/lda-lmtp.py", line 163, in <module>
    exitcode = main(args)
  File "/usr/local/lib/lda-lmtp.py", line 57, in main
    msg = sys.stdin.read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13114: ordinal not in range(128)
What? How do I get a UnicodeDecodeError? I thought I was passing unicode strings around all the time, so why decode? The documentation of smtplib.SMTP.sendmail says:
…msg may be a string containing characters in the ASCII range, or a byte string. A string is encoded to bytes using the ascii codec…
So: smtplib.SMTP.sendmail wants bytes. However, if you pass a string instead, it will silently .encode it using the ‘ascii’ codec. WHY? One of the features of python3 is that you have to consciously decide if you want to encode or decode, instead of the willy-nilly casting/one-type-fits-all of python2. But, helpful as ever, smtplib just ascii-encodes your msg for you. Which will barf on interesting characters. Which ended up just dropping mails. Not nice.
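To make that concrete, here is a sketch of what the implicit conversion amounts to. It does not talk to a server: to_wire is a made-up helper that only mimics the documented behaviour quoted above (a str is encoded with the ascii codec, bytes pass through). It is not smtplib's actual code, and the sample messages are invented.

```python
def to_wire(msg):
    """Hypothetical stand-in for the documented str handling of sendmail."""
    if isinstance(msg, str):
        return msg.encode("ascii")  # the implicit step described in the docs
    return msg  # bytes are sent as given


plain = "Subject: hello\r\n\r\nall ascii here\r\n"
umlauts = "Subject: hello\r\n\r\nGrüße aus Köln\r\n"

# A pure-ASCII string goes through without complaint.
wire_plain = to_wire(plain)

# The same call with umlauts blows up in the ascii codec.
try:
    to_wire(umlauts)
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding explicitly and passing bytes sidesteps the ascii step.
as_bytes = to_wire(umlauts.encode("utf-8"))

print("str with umlauts failed:", failed)
print("bytes passed through:", isinstance(as_bytes, bytes))
```

So a test mail that happens to be pure ASCII tells you nothing about what happens to the first real mail with an ü in it.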
The fix was easy: I re-opened stdin in binary mode and just read in the mails as bytes directly, so that my program never has to think about encodings and strings. But I am very confused why smtplib is going out of its way to confuse python3 developers. If you can only deal with bytes, just accept bytes. Throw a traceback if you are given strings. Don’t silently try ascii-encoding a given string. It hurts, it loses mails and virtual kittens die!
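A minimal sketch of the bytes-only approach, not the actual lda-lmtp.py: read_message and deliver are names I made up here, the host, port, sender and recipients are placeholders, and using sys.stdin.buffer is one way to get at stdin in binary mode. The example itself feeds a BytesIO instead of real stdin so it runs without a mail server.

```python
import io
import smtplib


def read_message(stream):
    """Read the whole mail as bytes from a binary stream, without decoding."""
    return stream.read()


def deliver(raw, sender, recipients, host, port):
    """Hand the raw bytes to an LMTP server (all arguments are placeholders)."""
    with smtplib.LMTP(host, port) as lmtp:
        # sendmail() returns a dict of refused recipients (empty if none).
        return lmtp.sendmail(sender, recipients, raw)


# A BytesIO stands in for the binary layer of stdin in this example.
fake_stdin = io.BytesIO("Subject: test\r\n\r\nGrüße\r\n".encode("utf-8"))
raw = read_message(fake_stdin)

print(type(raw).__name__, len(raw))
```

In the real program the call would look something like deliver(read_message(sys.stdin.buffer), ...). Since the mail is never turned into a str, neither the ascii decode on stdin nor the ascii encode in sendmail ever gets a chance to run.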
And why didn’t my testing catch that earlier? Well, my mail program, claws-mail, encodes all outgoing mails in 7bit-printable encoding automatically, so I never actually tested 8bitmime. -_-
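For illustration, here is roughly the difference between the two kinds of test mail. I am assuming quoted-printable as the 7-bit encoding here (I did not check which one claws-mail actually picked), and the body text is invented.

```python
import quopri

body = "Grüße aus Köln\n"

# What a client that sends raw 8-bit mail would put on the wire.
eight_bit = body.encode("utf-8")

# What a client that encodes to 7-bit (here: quoted-printable) would send.
seven_bit = quopri.encodestring(eight_bit)


def is_ascii_clean(data):
    """True if every byte is in the 7-bit ASCII range."""
    return all(byte < 128 for byte in data)


print("8-bit body ascii-clean:", is_ascii_clean(eight_bit))
print("quoted-printable body ascii-clean:", is_ascii_clean(seven_bit))
```

Only the raw 8-bit variant contains bytes like 0xc3 that the ascii codec chokes on, so a proper test needs a mail of that kind piped in directly.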