[Twisted-Python] Unicode
Bob Ippolito
bob at redivi.com
Tue Oct 4 08:40:29 MDT 2005
On Oct 4, 2005, at 1:37 AM, glyph at divmod.com wrote:
> On Mon, 03 Oct 2005 18:19:44 -0600, Ken Kinder <ken at kenkinder.com> wrote:
>
>> The purpose of Python's unicode type is transparent exchange of
>> string objects, whether those string objects are of type str or
>> type unicode. Pretending that isn't so and raising a TypeError is
>> not helpful. I would urge you to AT LEAST provide a detailed
>> explanation in that error, explaining the philosophical
>> disagreement you have with Python's unicode-string conversion
>> behavior and have a flag you can set to disable that check.
>
>> From http://docs.python.org/api/stringObjects.html:
>
> "Only string objects are supported; no Unicode objects should be
> passed."
>
> So there is a precedent for this in the very APIs you are citing :).
>
> You seem to have misunderstood the intent of Python's unicode
> support. Python allows byte strings to be treated in the same way
> as character strings in the areas where such a transposition is
> useful and semantically valid; in some cases it
> (uncharacteristically) guesses based on the default encoding. I
> say "uncharacteristically" because Python refuses the temptation to
> guess when presented with, say, an array object containing bytes,
> integers, or a list of smaller strings. Automatic conversion is
> not the norm in Python.
>
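
(An aside from me: here's what that default-encoding guess looks like
in a Python 2 session, for anyone following along. The default
encoding is ascii here.)

>>> 'bullet: ' + u'\u2022'   # ascii-only bytes are promoted silently
u'bullet: \u2022'
>>> '\xe2\x80\xa2' + u''     # the same guess blowing up on non-ascii bytes
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
0: ordinal not in range(128)
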
> I see others have already referred you to the FAQ. Please read the
> articles attached to it.
>
> As long as I'm writing a list post about this though, let me
> include another example which may explain why this is an absolutely
> horrible idea. There are basically 2 modes that .write() could use
> to accept a unicode object; one where it would cause random
> exceptions at runtime based on input, or one where it would
> generate corrupt data on the network.
>
> Let's say I've got a very simple protocol that writes 2 bytes
> indicating the length of a string, then a string, like so:
>
> def writeChunk(self, x):
>     self.transport.write(struct.pack("!H", len(x)))
>     self.transport.write(x)
>
> If 'x' were a unicode object in this case, we could do one of 2
> things:
>
> A - Write it to the transport as UTF-8/UTF-16 (an encoding that can
> accept any unicode data)
> B - Write it to the transport using ascii/charmap (the default
> encoding, or an encoding that will only produce single-byte
> characters)
>
> Given option A, this code will appear to work until it is passed a
> unicode string with a code point > '\u00ff'. At that point, the
> 'length' prefix will be incorrect; since len() works in terms of
> code points and not bytes, a phrase like u'Shoot me with a \u2022'
> will be truncated by the receiving end, possibly into a string
> which can't even be decoded.
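
To make that concrete, here's a quick sketch of my own (not from the
message above; it assumes option A with UTF-8 as the wire encoding):

import struct

text = u'Shoot me with a \u2022'   # 17 code points
payload = text.encode('utf-8')     # 19 bytes; u'\u2022' is 3 bytes in UTF-8

# The length prefix counts code points (17), but 19 bytes follow it.
frame = struct.pack("!H", len(text)) + payload

# A receiver that trusts the prefix reads only 17 of the 19 bytes,
# stopping part-way through the bullet's multi-byte sequence...
(claimed,) = struct.unpack("!H", frame[:2])
received = frame[2:2 + claimed]

# ...and ends up holding bytes it can't decode at all:
received.decode('utf-8')           # raises UnicodeDecodeError
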
What you mean is a code point > '\u007f', not '\u00ff'... but yeah,
I agree with all this stuff. Explicit is better than implicit, and
str -> unicode implicit conversion is just wrong in almost all cases
(except when it's knowably pure 7-bit ascii, like a constant or
symbol in your code).
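
The explicit spelling is cheap, too. A hypothetical writeChunk that
encodes up front (my sketch; it assumes the protocol settles on
UTF-8) makes the length prefix count bytes on the wire rather than
code points:

import struct

def writeChunk(self, text):
    # Encode first; everything past this line deals only in bytes.
    data = text.encode('utf-8')
    self.transport.write(struct.pack("!H", len(data)))
    self.transport.write(data)
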
-bob