[Twisted-Python] Unicode
Bob Ippolito
bob at redivi.com
Tue Oct 4 08:40:29 MDT 2005
On Oct 4, 2005, at 1:37 AM, glyph at divmod.com wrote:
> On Mon, 03 Oct 2005 18:19:44 -0600, Ken Kinder <ken at kenkinder.com> wrote:
>
>> The purpose of Python's unicode type is transparent exchange of
>> string objects, whether those string objects are of type str or
>> type unicode. Pretending that isn't so and raising a TypeError is
>> not helpful. I would urge you to AT LEAST provide a detailed
>> explanation in that error, explaining the philosophical
>> disagreement you have with Python's unicode-string conversion
>> behavior and have a flag you can set to disable that check.
>
>> From http://docs.python.org/api/stringObjects.html:
>
> "Only string objects are supported; no Unicode objects should be
> passed."
>
> So there is a precedent for this in the very APIs you are citing :).
>
> You seem to have misunderstood the intent of Python's unicode
> support. Python allows byte strings to be treated in the same way
> as character strings in the areas where such a transposition is
> useful and semantically valid; in some cases it
> (uncharacteristically) guesses based on the default encoding. I
> say "uncharacteristically" because Python refuses the temptation to
> guess when presented with, say, an array object containing bytes,
> integers, or a list of smaller strings. Automatic conversion is
> not the norm in Python.
>
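
(An aside from me: here's what that default-encoding guess looks like
in a Python 2 session, for anyone following along. The default
encoding is ascii here.)

>>> 'bullet: ' + u'\u2022'   # ascii-only bytes are promoted silently
u'bullet: \u2022'
>>> '\xe2\x80\xa2' + u''     # the same guess blowing up on non-ascii bytes
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
0: ordinal not in range(128)
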
> I see others have already referred you to the FAQ. Please read the
> articles attached to it.
>
> As long as I'm writing a list post about this though, let me
> include another example which may explain why this is an absolutely
> horrible idea. There are basically 2 modes that .write() could use
> to accept a unicode object; one where it would cause random
> exceptions at runtime based on input, or one where it would
> generate corrupt data on the network.
>
> Let's say I've got a very simple protocol that writes 2 bytes
> indicating the length of a string, then a string, like so:
>
> def writeChunk(self, x):
>     self.transport.write(struct.pack("!H", len(x)))
>     self.transport.write(x)
>
> If 'x' were a unicode object in this case, we could do one of 2
> things:
>
> A - Write it to the transport as UTF-8/UTF-16 (an encoding that can
> accept any unicode data)
> B - Write it to the transport using ascii/charmap (the default
> encoding, or an encoding that will only produce single-byte
> characters)
>
> Given option A, this code will appear to work until it is passed a
> unicode string with a code point > '\u00ff'. At that point, the
> 'length' prefix will be incorrect; since len() works in terms of
> code points and not bytes, a phrase like u'Shoot me with a \u2022'
> will be truncated by the receiving end, possibly into a string
> which can't even be decoded.
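
To make that concrete, here's a quick sketch of my own (not from the
message above; it assumes option A with UTF-8 as the wire encoding):

import struct

text = u'Shoot me with a \u2022'   # 17 code points
payload = text.encode('utf-8')     # 19 bytes; u'\u2022' is 3 bytes in UTF-8

# The length prefix counts code points (17), but 19 bytes follow it.
frame = struct.pack("!H", len(text)) + payload

# A receiver that trusts the prefix reads only 17 of the 19 bytes,
# stopping part-way through the bullet's multi-byte sequence...
(claimed,) = struct.unpack("!H", frame[:2])
received = frame[2:2 + claimed]

# ...and ends up holding bytes it can't decode at all:
received.decode('utf-8')           # raises UnicodeDecodeError
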
What you mean is a code point > '\u007f', not '\u00ff'... but yeah,
I agree with all this stuff. Explicit is better than implicit, and
str -> unicode implicit conversion is just wrong in almost all cases
(except when it's knowably pure 7-bit ascii, like a constant or
symbol in your code).
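
The explicit spelling is cheap, too. A hypothetical writeChunk that
encodes up front (my sketch; it assumes the protocol settles on
UTF-8) makes the length prefix count bytes on the wire rather than
code points:

import struct

def writeChunk(self, text):
    # Encode first; everything past this line deals only in bytes.
    data = text.encode('utf-8')
    self.transport.write(struct.pack("!H", len(data)))
    self.transport.write(data)
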
-bob