Mercurial > python-for-js-programmers
changeset 33:7bd041383b45
Added section on strings/unicode.
author | Atul Varma <varmaa@toolness.com> |
---|---|
date | Fri, 06 Jun 2008 06:11:24 -0700 |
parents | ace6f51e84e0 |
children | 1e3903e6968a |
files | PythonForJsProgrammers.txt |
diffstat | 1 files changed, 51 insertions(+), 1 deletions(-) |
--- a/PythonForJsProgrammers.txt Fri Jun 06 05:35:04 2008 -0700
+++ b/PythonForJsProgrammers.txt Fri Jun 06 06:11:24 2008 -0700
@@ -151,6 +151,57 @@
 .. _`sha`: http://docs.python.org/lib/module-sha.html
 .. _`Modules`: http://docs.python.org/tut/node8.html
 
+Strings
+=======
+
+Strings are, unfortunately, the bane of Python programming. Unlike
+JavaScript, in which every string is unicode, strings in Python are
+really more like immutable arrays of bytes. Unicode strings are an
+entirely different type, and unicode literals must be prepended with a
+``u``, like so:
+
+  >>> u"I am a unicode string."
+  u'I am a unicode string.'
+  >>> "I am a non-unicode string."
+  'I am a non-unicode string.'
+
+The non-intuitiveness of this is due to historical reasons: Python is
+an older language than JavaScript and dates back to 1991, so the
+language didn't originally support unicode. When support was added,
+it was added in a way that didn't break backwards compatibility. This
+situation will be resolved in `Python 3000`_, the first version of
+Python to break backwards compatibility with previous versions.
+
+A string with a character encoding may be converted to a ``unicode``
+object through the ``decode()`` method, like so:
+
+  >>> "Here is an ellipsis: \xe2\x80\xa6".decode("utf-8")
+  u'Here is an ellipsis: \u2026'
+
+Conversely, you can convert a ``unicode`` object into a string via the
+``encode()`` method:
+
+  >>> u"Here is an ellipsis: \u2026".encode("utf-8")
+  'Here is an ellipsis: \xe2\x80\xa6'
+
+An exception will be raised if there are characters that aren't
+supported by the encoding you specify, though:
+
+  >>> u"hello\u2026".encode("ascii")
+  Traceback (most recent call last):
+  ...
+  UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 5: ordinal not in range(128)
+
+As such, it's a good idea to optionally specify an algorithm to deal
+with characters that aren't supported by the encoding:
+
+  >>> u"hello\u2026".encode("ascii", "ignore")
+  'hello'
+  >>> u"hello\u2026".encode("ascii", "xmlcharrefreplace")
+  'hello&#8230;'
+
+.. _`Python 3000`: http://www.python.org/dev/peps/pep-3000/
+
 Expressions
 ===========
 
@@ -588,5 +639,4 @@
 
 .. _`doctest`: http://docs.python.org/lib/module-doctest.html
 
-.. TODO: Add section on strings/unicode
 .. TODO: Add final section on where to go from here
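
The doctests added in this changeset are written for Python 2. As a rough sketch only, and not part of the changeset itself, the same round trip might look like this in Python 3 (the "Python 3000" the new section refers to), where ``str`` is the unicode type and ``bytes`` is the separate byte-string type. The names ``ELLIPSIS_UTF8`` and ``to_ascii`` are hypothetical and exist only for illustration:

```python
# Python 3 sketch of the decode()/encode() round trip from the new section.
# Assumption: Python 3, where str holds unicode text and bytes holds raw bytes.

ELLIPSIS_UTF8 = b"Here is an ellipsis: \xe2\x80\xa6"  # hypothetical sample data

# bytes -> str: interpret the raw bytes as UTF-8 text.
text = ELLIPSIS_UTF8.decode("utf-8")

# str -> bytes: serialize the text back to UTF-8.
round_tripped = text.encode("utf-8")


def to_ascii(value, errors="strict"):
    """Encode text as ASCII bytes with the given error handler (hypothetical helper)."""
    return value.encode("ascii", errors)


# With the default "strict" handler, an unsupported character raises.
failure = None
try:
    to_ascii("hello\u2026")
except UnicodeEncodeError as exc:
    failure = exc

# Other handlers drop or escape the unsupported character instead.
ignored = to_ascii("hello\u2026", "ignore")
escaped = to_ascii("hello\u2026", "xmlcharrefreplace")
```

Passing the error handler as a parameter keeps the strict behaviour as the default, so dropping or escaping characters is always an explicit choice by the caller.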