# HG changeset patch
# User Atul Varma
# Date 1212757884 25200
# Node ID 7bd041383b455cce4282233b57f4e5a104968a61
# Parent  ace6f51e84e043801006f23c4e402a3533adeb4e
Added section on strings/unicode.

diff -r ace6f51e84e0 -r 7bd041383b45 PythonForJsProgrammers.txt
--- a/PythonForJsProgrammers.txt	Fri Jun 06 05:35:04 2008 -0700
+++ b/PythonForJsProgrammers.txt	Fri Jun 06 06:11:24 2008 -0700
@@ -151,6 +151,57 @@
 .. _`sha`: http://docs.python.org/lib/module-sha.html
 .. _`Modules`: http://docs.python.org/tut/node8.html
 
+Strings
+=======
+
+Strings are, unfortunately, the bane of Python programming. Unlike
+JavaScript, in which every string is unicode, strings in Python are
+really more like immutable arrays of bytes. Unicode strings are an
+entirely different type, and unicode literals must be prepended with a
+``u``, like so:
+
+    >>> u"I am a unicode string."
+    u'I am a unicode string.'
+    >>> "I am a non-unicode string."
+    'I am a non-unicode string.'
+
+The non-intuitiveness of this is due to historical reasons: Python is
+an older language than JavaScript and dates back to 1991, so the
+language didn't originally support unicode. When support was added,
+it was added in a way that didn't break backwards compatibility. This
+situation will be resolved in `Python 3000`_, the first version of
+Python to break backwards compatibility with previous versions.
+
+A string with a character encoding may be converted to a ``unicode``
+object through the ``decode()`` method, like so:
+
+    >>> "Here is an ellipsis: \xe2\x80\xa6".decode("utf-8")
+    u'Here is an ellipsis: \u2026'
+
+Conversely, you can convert a ``unicode`` object into a string via the
+``encode()`` method:
+
+    >>> u"Here is an ellipsis: \u2026".encode("utf-8")
+    'Here is an ellipsis: \xe2\x80\xa6'
+
+An exception will be raised if there are characters that aren't
+supported by the encoding you specify, though:
+
+    >>> u"hello\u2026".encode("ascii")
+    Traceback (most recent call last):
+    ...
+    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 5: ordinal not in range(128)
+
+As such, it's a good idea to optionally specify an algorithm to deal
+with characters that aren't supported by the encoding:
+
+    >>> u"hello\u2026".encode("ascii", "ignore")
+    'hello'
+    >>> u"hello\u2026".encode("ascii", "xmlcharrefreplace")
+    'hello&#8230;'
+
+.. _`Python 3000`: http://www.python.org/dev/peps/pep-3000/
+
 Expressions
 ===========
 
@@ -588,5 +639,4 @@
 
 .. _`doctest`: http://docs.python.org/lib/module-doctest.html
 
-.. TODO: Add section on strings/unicode
 .. TODO: Add final section on where to go from here
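
The ``encode()`` error handlers discussed in the new Strings section can be compared side by side. The following is a minimal sketch and is not part of the changeset: the helper name ``encode_with_handlers`` is hypothetical, and the expected byte strings are written with a ``b`` prefix, which assumes Python 2.6 or later (or Python 3).

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of the patch: encode one unicode string
# with several of the standard codec error handlers and collect the results.

def encode_with_handlers(text, encoding="ascii"):
    """Return a dict mapping each error-handler name to the encoded bytes."""
    handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
    return dict((name, text.encode(encoding, name)) for name in handlers)


# Same sample string as in the patch: "hello" followed by U+2026 (ellipsis).
results = encode_with_handlers(u"hello\u2026")

for name in sorted(results):
    print("%-18s %r" % (name, results[name]))
```

With ``ignore`` the ellipsis is dropped silently, whereas ``xmlcharrefreplace`` and ``backslashreplace`` keep enough information to recover the original character, which may matter if the output is later decoded again.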