Anyone who lives on this planet long enough knows that more than one language is spoken on it. Anyone who programs long enough knows that data comes in more forms than simple, English ASCII. It was not always so. Thirty years ago, anyone who worked on computers had only the basic English character range to use. If they needed to talk about a piñata, they would type it "pinata" and need to disambiguate it as a part of a Mexican celebration by other means.
Eventually, different sets of characters came to be used to represent the sundry language communities in the world. They were, however, often mapped over the same numeric values as one another. These values are usually written in hexadecimal, like "0xdf". In a Western European encoding such as ISO 8859-1, this value is mapped to the German letter 'ß'. But in the Greek encoding ISO 8859-7, the same value is mapped to the letter 'ί'.
The trouble in multilingual programming comes because, when processing text, the computer operates on these values, not the letters. By default, Python uses ASCII, the basic English character set. However, it readily handles any other alphabet in the world; you simply need to tell it to do so. This tutorial shows you how to save character strings as Unicode, how to convert Unicode strings to standard string format, and how to convert standard strings to Unicode.
The mapping of a value such as "0x67" to the letter "g" is called an encoding. The letter is encoded as the value. Depending on the encoding, a given value could stand for a Latin letter such as "g" (as in ASCII and many North American and European encodings), for the Bengali "Ãâ", or for the Georgian "პ".
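As a rough illustration of why the encoding matters, the sketch below decodes one and the same byte value under three different encodings. The byte 0xdf and the three codec names are choices made for this example, not values prescribed by the tutorial. The code prints code points rather than glyphs so that it runs on Python 3 as well as on the Python 2 used here.

```python
# One byte value, read under three different encodings.
# The byte and the codec names are illustrative choices.
raw = b"\xdf"

as_latin1 = raw.decode("iso8859-1")    # Western European
as_greek = raw.decode("iso8859-7")     # Greek
as_cyrillic = raw.decode("cp1251")     # Windows Cyrillic

for codec, ch in (("iso8859-1", as_latin1),
                  ("iso8859-7", as_greek),
                  ("cp1251", as_cyrillic)):
    # Show the resulting Unicode code point instead of the glyph itself.
    print("%-10s -> U+%04X" % (codec, ord(ch)))
```

The byte never changes; only the table used to interpret it does.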
For multilingual encoding, Python uses Unicode. Unicode is a system that provides a unique number for every character of a language, no matter what the language.
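The "unique number" can be inspected directly. A minimal sketch, assuming the standard library's unicodedata module and the Hebrew letter shin as the sample character (written as an escape so that the source stays ASCII):

```python
import unicodedata

shin = u"\u05e9"   # Hebrew letter shin, written as an escape

code_point = ord(shin)                   # the character's Unicode number
official_name = unicodedata.name(shin)   # its name in the Unicode database

print("U+%04X %s" % (code_point, official_name))

# Plain ASCII letters have numbers too; they match their ASCII values.
print("U+%04X %s" % (ord(u"g"), unicodedata.name(u"g")))
```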
In order to print a Unicode character, Python requires you to do two things:
- Define the character string to be printed as Unicode, and
- Declare the type of encoding you would like Python to use in the output.
To define a string as Unicode, one simply prefixes a 'u' to the opening quotation mark of the assignment. So, for example, to define variable x as holding the Hebrew letter shin, we simply write:
x = u"ש"
The value within quotation marks can be literally anything. The basic structure of the statement remains the same:
y = u"Ãâ"
z = u"პ"
One important aspect of working in multiple languages and Unicode is that just because you do not see the letter does not mean that it is not there. If the editor in which you are writing these pieces of code does not support the encoding or does not have a font to reflect the alphabet being used, you will see either blank space, rectangular boxes, or aberrant output that looks a bit like comic strip profanity.
Nonetheless, the assignment is made. But a value saved is not worth very much unless you do something with it. While Python easily handles the variable with the Unicode value, we still need to produce some output.
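If your editor cannot show a character, one option is to write it as a \u escape, which uses only ASCII in the source file. The sketch below assumes the code points U+05E9 for shin and U+10DE for the Georgian letter; the variable names simply mirror the ones used above. In a Python 2 source file, a literal non-ASCII character would also need a coding declaration (for example "# -*- coding: utf-8 -*-") at the top of the file, which the escapes avoid.

```python
# Escapes keep the source file pure ASCII.
x = u"\u05e9"   # Hebrew letter shin (assumed code point U+05E9)
z = u"\u10de"   # Georgian letter (assumed code point U+10DE)

# Each string holds exactly one character, whether or not it is visible.
print("lengths: %d and %d" % (len(x), len(z)))
print("code points: U+%04X and U+%04X" % (ord(x), ord(z)))
```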
To output a Unicode string, we must first encode the Unicode string as a standard string. To do this, we use Python's encode method to tell Python which encoding to use. For the Unicode strings defined above, we could write:
a = x.encode("utf-8")
b = y.encode("utf-8")
c = z.encode("utf-8")
Then we can print them as usual or otherwise write them to a file-like object (e.g., local file, web page, etc.). Now that the values are assigned to other variables, we can reference those handles like any other variable:
>>> converted = (a, b, c)
>>> for i in converted:
...     print i
...
ש
Ãâ
პ
This is one way of converting Unicode strings to standard strings in Python. There is, however, a much easier way.
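To see what encode actually produces, the following sketch compares a Unicode string with its UTF-8 form. It assumes the same one-letter strings as above, written as escapes; the lengths differ because UTF-8 uses more than one byte for these characters.

```python
x = u"\u05e9"   # Hebrew letter shin (assumed code point U+05E9)
z = u"\u10de"   # Georgian letter (assumed code point U+10DE)

a = x.encode("utf-8")
c = z.encode("utf-8")

# One character each, but two and three bytes once encoded.
print("%d character -> %d bytes" % (len(x), len(a)))
print("%d character -> %d bytes" % (len(z), len(c)))

# Decoding with the same encoding recovers the original string.
assert a.decode("utf-8") == x
assert c.decode("utf-8") == z
```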
Another way to use the encode method is to convert the value on the fly, including it in the print statement.
print x.encode("utf-8")
print y.encode("utf-8")
print z.encode("utf-8")
If this is still too much typing, you will be glad to know that you can loop over a series of Unicode strings and encode each one as it is printed, like this:
>>> convert_on_output = (x, y, z)
>>> for j in convert_on_output:
...     print j.encode("utf-8")
...
ש
Ãâ
პ
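When the destination is a file, another option is to let the file object do the encoding. The sketch below is not the method this tutorial describes: it uses io.open from the standard library with an explicit encoding, and the temporary file and variable names are made up for the example.

```python
import io
import os
import tempfile

strings = (u"\u05e9", u"\u10de")   # illustrative Unicode strings

# Create a scratch file; the path exists only for this example.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # The file object encodes each string to UTF-8 as it is written.
    with io.open(path, "w", encoding="utf-8") as out:
        for s in strings:
            out.write(s + u"\n")

    # Reading with the same encoding returns Unicode strings again.
    with io.open(path, "r", encoding="utf-8") as src:
        read_back = tuple(line.rstrip(u"\n") for line in src)
finally:
    os.remove(path)

print(read_back == strings)
```

The trade-off is that the encoding is named once, when the file is opened, instead of at every print.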
But what if we want to convert a standard string to Unicode? There are some who think that, when it is spoken slowly enough and loudly enough, everyone understands English. This method of communication works even less well for computers than it does for humans.
As mentioned earlier, Python's default encoding is usually ASCII. It can, however, be set to other encodings. To find out which encoding a given Python installation is using, call the function sys.getdefaultencoding(). In the Python shell, it will look like this:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
To convert a standard, ASCII string to Unicode, we simply use Python's built-in unicode function:
t = "Hello"
x = unicode(t)
Now, x contains the Unicode form of t. You might ask: What's the difference? On your local system, probably none. If, however, you are a consultant programming in Missoula, Montana for a Greek-speaking client in Athens, there is a substantial difference.
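The conversion of "Hello" works because it contains only ASCII characters. As a hedged sketch of what happens otherwise, the example below uses the decode method of a byte string, which exists in both Python 2 and Python 3 (the unicode function exists only in Python 2). The Greek sample bytes are an assumption made for the example.

```python
ascii_bytes = b"Hello"
# UTF-8 bytes for the Greek word for Athens (sample data for this sketch).
greek_bytes = b"\xce\x91\xce\xb8\xce\xae\xce\xbd\xce\xb1"

# ASCII-only data decodes cleanly with the ASCII codec.
hello = ascii_bytes.decode("ascii")

# Non-ASCII data does not, so the ASCII codec raises an error.
try:
    greek_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding gives the intended five-letter word.
athens = greek_bytes.decode("utf-8")

print("%s %s %d" % (hello, ascii_failed, len(athens)))
```

In other words, once text leaves the ASCII range, the encoding has to be stated rather than assumed.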
In order to show the difference more clearly, let's get a bit more technical. ASCII values span a range of 128 numbers, from 0x00 to 0x7f ("0x00" is the hexadecimal form of "0"; "0x7f" is the hexadecimal equivalent of "127"; the integers from 0 to 127 are 128 in number). As I stated earlier, characters outside basic English do not fit in this range or this set. For many years, different encodings were used as a workaround. This is where the ISO 8859 series came from. However, many of these encodings could not "talk" to each other. This made programming incredibly complex and, in turn, hampered the ability of users to interact internationally.
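The incompatibility between the older encodings can be sketched in a few lines. The example assumes two members of the ISO 8859 family, ISO 8859-1 and ISO 8859-7, chosen only for illustration: they agree on the 128 ASCII values but read the higher values differently, so text written under one comes out wrong under the other.

```python
# All 128 ASCII byte values, 0x00 through 0x7f.
ascii_range = bytes(bytearray(range(0x00, 0x80)))

# Both encodings decode the ASCII range identically.
same_in_ascii = (ascii_range.decode("iso8859-1")
                 == ascii_range.decode("iso8859-7"))

# Greek alpha, beta, gamma, stored using the Greek encoding.
greek = u"\u03b1\u03b2\u03b3"
stored = greek.encode("iso8859-7")

# Reading those bytes as Western European text yields other letters.
misread = stored.decode("iso8859-1")

print("ASCII range agrees: %s" % same_in_ascii)
print("Round trip intact:  %s" % (misread == greek))
```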
Then the Unicode Consortium developed a system whereby every character of every alphabet (save for Old Chinese) has its own unique identifier. So, the ASCII set became the beginning of a series that now includes over 100,000 characters. The ASCII set is reflected in code points U+0000 through U+007F of that series. So, when "Hello" is converted from ASCII to Unicode, the computer stops "seeing" ASCII (that range from 0x00 to 0x7f) and starts reading Unicode. This offers greater interoperability with other languages and protects the program itself from premature obsolescence. Instead of multiple encodings, Unicode effectively offers different parts of one giant encoding.
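The relationship between ASCII and the start of the Unicode range can be checked directly. A minimal sketch, using "Hello" as the sample text; UTF-8 is chosen here as one common way of storing Unicode, which is an assumption of the example rather than something this page specifies.

```python
text = u"Hello"

# Unicode code points of each character.
code_points = [ord(ch) for ch in text]

# ASCII byte values of the same text.
ascii_values = list(bytearray(text.encode("ascii")))

# The numbers coincide because ASCII occupies U+0000 through U+007F.
print(code_points == ascii_values)

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(text.encode("utf-8") == text.encode("ascii"))
```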