changeset 33:7bd041383b45

Added section on strings/unicode.
author Atul Varma <varmaa@toolness.com>
date Fri, 06 Jun 2008 06:11:24 -0700
parents ace6f51e84e0
children 1e3903e6968a
files PythonForJsProgrammers.txt
diffstat 1 files changed, 51 insertions(+), 1 deletions(-)
--- a/PythonForJsProgrammers.txt	Fri Jun 06 05:35:04 2008 -0700
+++ b/PythonForJsProgrammers.txt	Fri Jun 06 06:11:24 2008 -0700
@@ -151,6 +151,57 @@
 .. _`sha`: http://docs.python.org/lib/module-sha.html
 .. _`Modules`: http://docs.python.org/tut/node8.html
 
+Strings
+=======
+
+Strings are, unfortunately, the bane of Python programming.  Unlike
+JavaScript, in which every string is Unicode, strings in Python are
+really more like immutable arrays of bytes.  Unicode strings are an
+entirely different type, and Unicode literals must be prefixed with a
+``u``, like so:
+
+    >>> u"I am a unicode string."
+    u'I am a unicode string.'
+    >>> "I am a non-unicode string."
+    'I am a non-unicode string.'
+
+This unintuitive split exists for historical reasons: Python is
+an older language than JavaScript and dates back to 1991, so the
+language didn't originally support Unicode.  When support was added,
+it was added in a way that didn't break backwards compatibility.  This
+situation will be resolved in `Python 3000`_, the first version of
+Python to break backwards compatibility with previous versions.
+
+A byte string in a known character encoding may be converted to a
+``unicode`` object through the ``decode()`` method, like so:
+
+    >>> "Here is an ellipsis: \xe2\x80\xa6".decode("utf-8")
+    u'Here is an ellipsis: \u2026'
+
+Conversely, you can convert a ``unicode`` object into a string via the
+``encode()`` method:
+
+    >>> u"Here is an ellipsis: \u2026".encode("utf-8")
+    'Here is an ellipsis: \xe2\x80\xa6'
+
+An exception will be raised if there are characters that aren't
+supported by the encoding you specify, though:
+
+    >>> u"hello\u2026".encode("ascii")
+    Traceback (most recent call last):
+    ...
+    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 5: ordinal not in range(128)
+
+To avoid this, you can pass an optional error handler that deals
+with characters that aren't supported by the encoding:
+
+    >>> u"hello\u2026".encode("ascii", "ignore")
+    'hello'
+    >>> u"hello\u2026".encode("ascii", "xmlcharrefreplace")
+    'hello&#8230;'
+
+.. _`Python 3000`: http://www.python.org/dev/peps/pep-3000/
+
 Expressions
 ===========
 
@@ -588,5 +639,4 @@
 
 .. _`doctest`: http://docs.python.org/lib/module-doctest.html
 
-.. TODO: Add section on strings/unicode
 .. TODO: Add final section on where to go from here
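
Note (not part of the changeset): the decode/encode round trip and the error handlers described in the added section can be sketched as a plain script. This is only an illustration under stated assumptions: the helper name ``to_ascii`` and the constant ``ELLIPSIS_UTF8`` are hypothetical, and the literals use ``b"..."`` and ``u"..."`` prefixes so the same file should run on Python 2.6+ as well as Python 3.3+.

```python
# Sketch only; names here are illustrative and do not come from the tutorial.

# Raw UTF-8 bytes, the same data as the byte string in the section's example.
ELLIPSIS_UTF8 = b"Here is an ellipsis: \xe2\x80\xa6"

# Bytes -> unicode text, then back to bytes.
text = ELLIPSIS_UTF8.decode("utf-8")
round_tripped = text.encode("utf-8")


def to_ascii(value, errors="xmlcharrefreplace"):
    """Encode unicode text to ASCII bytes with the given error handler."""
    return value.encode("ascii", errors)


ignored = to_ascii(u"hello\u2026", "ignore")    # drops the ellipsis
replaced = to_ascii(u"hello\u2026", "replace")  # substitutes "?"
escaped = to_ascii(u"hello\u2026")              # XML character reference

# The default handler, "strict", raises instead of losing data.
try:
    to_ascii(u"hello\u2026", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Which handler is appropriate depends on where the bytes end up: ``xmlcharrefreplace`` only makes sense for HTML or XML output, while ``ignore`` silently discards data.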