<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Perl.com</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/" />
    <link rel="self" type="application/atom+xml" href="http://www.perl.com/pub/atom.xml" />
    <id>tag:www.perl.com,2010-07-21:/pub//2</id>
    <updated>2012-05-14T19:30:00Z</updated>
    <subtitle>news and views of the Perl programming language</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 5.13-en</generator>

<entry>
    <title>Perl Unicode Cookbook: Disable Unicode-awareness in Builtin Character Classes</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-disable-unicode-awareness-in-builtin-character-classes.html" />
    <id>tag:www.perl.com,2012:/pub//2.2026</id>

    <published>2012-05-14T13:00:01Z</published>
    <updated>2012-05-14T19:30:00Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 24: Disabling Unicode-awareness in builtin charclasses Many...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Disabling-Unicode-awareness-in-builtin-charclasses">℞ 24: Disabling Unicode-awareness in builtin charclasses</h2>

<p>Many regex tutorials gloss over the fact that builtin character classes include far more than ASCII characters. In particular, classes such as "word character" (<code>\w</code>), "word boundary" (<code>\b</code>), "whitespace" (<code>\s</code>), and "digit" (<code>\d</code>) respect Unicode.</p>

<p>Perl 5.14 added the <code>/a</code> regex modifier, which disables the Unicode awareness of <code>\w</code>, <code>\b</code>, <code>\s</code>, <code>\d</code>, and the <small>POSIX</small> classes. This restricts these classes to match only ASCII characters. Use the <a href="http://perldoc.perl.org/re.html">re</a> pragma to restrict these character classes in a lexical scope:</p>

<pre><code> use v5.14;
 use re &quot;/a&quot;;</code></pre>

<p>... or use the <code>/a</code> modifier to affect a single regex:</p>

<pre><code> my($num) = $str =~ /(\d+)/a;</code></pre>
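
<p>As a minimal sketch of the difference (the sample character is an assumption chosen for illustration), an Arabic-Indic digit satisfies <code>\d</code> under Unicode rules but not under <code>/a</code>:</p>

<pre><code>    use v5.14;

    my $arabic_three = &quot;\x{0663}&quot;;   # ARABIC-INDIC DIGIT THREE, written as an escape

    my $unicode_match = $arabic_three =~ /\d/  ? 1 : 0;   # Unicode rules
    my $ascii_match   = $arabic_three =~ /\d/a ? 1 : 0;   # ASCII-only rules
    my $plain_match   = &quot;3&quot; =~ /\d/a           ? 1 : 0;   # ASCII digits still match

    say &quot;unicode=$unicode_match ascii=$ascii_match plain=$plain_match&quot;;
    # prints &quot;unicode=1 ascii=0 plain=1&quot;</code></pre>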

<p>You may always use speciﬁc un-Unicode properties, such as <code>\p{ahex}</code> and <code>\p{POSIX_Digit}</code>. Properties still work normally no matter what charset modiﬁers (<code>/d /u /l /a /aa</code>) are in eﬀect.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Get Character Categories</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-get-character-categories.html" />
    <id>tag:www.perl.com,2012:/pub//2.2024</id>

    <published>2012-05-11T13:00:01Z</published>
    <updated>2012-05-12T22:33:08Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 23: Get character category Unicode is a...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Get-character-category">℞ 23: Get character category</h2>

<p>Unicode is a set of characters and a list of rules and properties applied to
those characters. The <a href="http://www.unicode.org/ucd/">Unicode Character
Database</a> collects those properties. The core module <a
href="http://search.cpan.org/perldoc?Unicode::UCD">Unicode::UCD</a> provides
access to these properties.</p>

<p>The general category property sorts characters into groups, such as upper- or
lowercase letters, punctuation, math symbols, and more. (See
<code>Unicode::UCD</code>'s <code>general_categories()</code> for more
information.)</p>

<p>The <code>charinfo()</code> function returns a hash reference containing a
wealth of information about the Unicode character in question. In particular,
its <code>category</code> value contains the short name of a character's
category.</p>

<p>To find the general category of a numeric codepoint:</p>

<pre><code> use Unicode::UCD qw(charinfo);
 my $cat = charinfo(0x3A3)-&gt;{category};  # &quot;Lu&quot;</code></pre>
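
<p>The same hash reference carries other fields as well. As a small sketch, with code points picked arbitrarily for illustration, this loop prints the <code>name</code> and <code>category</code> fields for a few characters:</p>

<pre><code>    use strict;
    use warnings;
    use Unicode::UCD qw(charinfo);

    # LATIN CAPITAL LETTER A, GREEK SMALL LETTER SIGMA, DIGIT NINE, EURO SIGN
    for my $codepoint (0x41, 0x3C3, 0x39, 0x20AC) {
        my $info = charinfo($codepoint);
        printf &quot;U+%04X  %-28s %s\n&quot;, $codepoint, $info-&gt;{name}, $info-&gt;{category};
    }
    # the categories reported are Lu, Ll, Nd, and Sc, in that order</code></pre>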

<p>To translate this category into something more human friendly:</p>

<pre><code> use Unicode::UCD qw( charinfo general_categories );
 my $categories = general_categories();
 my $cat        = charinfo(0x3A3)-&gt;{category};  # &quot;Lu&quot;
 my $full_cat   = $categories-&gt;{ $cat }; # &quot;UppercaseLetter&quot;</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Match Unicode Linebreak Sequence</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-match-unicode-linebreak-sequence.html" />
    <id>tag:www.perl.com,2012:/pub//2.2022</id>

    <published>2012-05-10T13:00:01Z</published>
    <updated>2012-05-10T19:18:43Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 22: Match Unicode linebreak sequence in regex...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Match-Unicode-linebreak-sequence-in-regex">℞ 22: Match Unicode linebreak sequence in regex</h2>

<p>Unicode defines several characters as providing vertical whitespace, like
the carriage return or newline characters. Unicode also gathers several
characters under the banner of a <em>linebreak sequence</em>. A Unicode
linebreak matches the two-character <small>CRLF</small> grapheme or any of
the seven vertical whitespace characters.</p>

<p>As documented in <a
href="http://perldoc.perl.org/perlrebackslash.html">perldoc
perlrebackslash</a>, the <code>\R</code> regex backslash sequence matches any
Unicode linebreak sequence. (Similarly, the <code>\v</code> sequence matches
any single character of vertical whitespace.)</p>

<p>This is useful for dealing with textﬁles coming from diﬀerent operating
systems:</p>

<pre><code> s/\R/\n/g;  # normalize all linebreaks to \n</code></pre>
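
<p>A short sketch of that normalization, using a made-up string that mixes Windows, classic Mac OS, Unicode, and Unix line endings:</p>

<pre><code>    use v5.10;   # \R needs Perl 5.10 or later

    my $text = &quot;one\r\ntwo\rthree\x{2028}four\n&quot;;
    (my $normalized = $text) =~ s/\R/\n/g;

    # \R consumes CRLF as a single linebreak, so no empty lines appear
    my @lines = split /\n/, $normalized;
    say scalar @lines;   # prints 4</code></pre>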
]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Case-insensitive Comparisons</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-case-insensitive-comparisons.html" />
    <id>tag:www.perl.com,2012:/pub//2.2020</id>

    <published>2012-05-09T13:00:01Z</published>
    <updated>2012-05-10T04:17:53Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 21: Unicode case-insensitive comparisons Unicode is more...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Unicode-case-insensitive-comparisons">℞ 21: Unicode case-insensitive comparisons</h2>

<p>Unicode is more than an expanded character set. Unicode is a set of rules
about how characters behave and a set of properties about each character.</p>

<p>Comparing strings for equivalence often requires normalizing them to a
standard form. That normalized form often requires that all characters be in a
specific case. <a
href="http://www.perl.com/pub/2012/05/perl-unicook-unicode-casing.html">℞ 20:
Unicode casing</a> demonstrated that converting between upper- and lower-case
Unicode characters is more complicated than simply mapping <code>[A-Z]</code>
to <code>[a-z]</code>. (Remember also that many characters have a title case
form!)</p>

<p>The proper solution for normalized comparisons is to perform <a href="http://www.w3.org/International/wiki/Case_folding">casefolding</a>
instead of mapping a subset of some characters to another. Perl 5.16 added a
new feature, <code>fc()</code>, or "foldcase", which performs the same Unicode
casefolding that the <code>/i</code> pattern modiﬁer has always used. This feature is available
for older Perls thanks to the <small>CPAN</small> module <a
href="http://search.cpan.org/perldoc?Unicode::CaseFold"><code>Unicode::CaseFold</code></a>:</p>

<pre><code> use feature &quot;fc&quot;; # fc() function is from v5.16
 # OR
 use Unicode::CaseFold;

 # sort case-insensitively
 my @sorted = sort { fc($a) cmp fc($b) } @list;

 # both are true:
 fc(&quot;tschüß&quot;)  eq fc(&quot;TSCHÜSS&quot;)
 fc(&quot;Σίσυφος&quot;) eq fc(&quot;ΣΊΣΥΦΟΣ&quot;)</code></pre>

<p><a href="http://www.effectiveperlprogramming.com/blog/1507">Fold cases properly</a> goes into more detail about case folding in Perl.</p>
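
<p>To see why lowercasing is not a substitute for casefolding, here is a small sketch (assuming Perl 5.16 or later and a source file saved as UTF-8) that compares <code>lc()</code> with <code>fc()</code>:</p>

<pre><code>    use v5.16;    # enables the fc feature along with say
    use utf8;     # the string literals below are UTF-8 source text

    my ($lower, $upper) = (&quot;tschüß&quot;, &quot;TSCHÜSS&quot;);

    my $lc_equal = lc($lower) eq lc($upper) ? 1 : 0;   # 0: lc() leaves ß alone
    my $fc_equal = fc($lower) eq fc($upper) ? 1 : 0;   # 1: fc() folds ß to ss

    say &quot;lc: $lc_equal, fc: $fc_equal&quot;;   # prints &quot;lc: 0, fc: 1&quot;</code></pre>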
]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Unicode Casing</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perl-unicook-unicode-casing.html" />
    <id>tag:www.perl.com,2012:/pub//2.2018</id>

    <published>2012-05-08T13:00:01Z</published>
    <updated>2012-05-08T21:57:48Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 20: Unicode casing Unicode casing is very...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Unicode-casing">℞ 20: Unicode casing</h2>

<p>Unicode casing is very diﬀerent from <small>ASCII</small> casing. Some of
the complexity of Unicode comes about because Unicode characters may change
dramatically when changing from upper to lower case and back. For example, the
Greek language has two characters for the lower case sigma, depending on
whether the letter is in a medial (σ) or final (ς) position in a word. Greek
only has a single upper case sigma (Σ). (Some classical Greek texts from the Hellenistic period use a crescent-shaped variant of the sigma called the lunate sigma, or &#x3f2;.)</p>

<p>Unicode casing is important for changing case <em>and</em> for performing case-insensitive matching:</p>

<pre><code> uc(&quot;henry ⅷ&quot;)  # &quot;HENRY Ⅷ&quot;
 uc(&quot;tschüß&quot;)   # &quot;TSCHÜSS&quot;  notice ß =&gt; SS

 # both are true:
 &quot;tschüß&quot;  =~ /TSCHÜSS/i   # notice ß =&gt; SS
 &quot;Σίσυφος&quot; =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness</code></pre>
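
<p>Casing does not always survive a round trip. In this sketch (the characters are written as escapes so the source encoding does not matter), a final sigma comes back as a medial sigma after a trip through upper case:</p>

<pre><code>    use v5.14;

    my $final  = &quot;\x{03C2}&quot;;   # GREEK SMALL LETTER FINAL SIGMA
    my $medial = &quot;\x{03C3}&quot;;   # GREEK SMALL LETTER SIGMA

    # both lower case forms share one upper case form
    my $same_upper = uc($final) eq uc($medial) ? 1 : 0;

    # lc() maps the capital sigma to the medial form
    my $round_trip = lc(uc($final));
    my $is_medial  = $round_trip eq $medial ? 1 : 0;

    say &quot;same upper: $same_upper, came back medial: $is_medial&quot;;
    # prints &quot;same upper: 1, came back medial: 1&quot;</code></pre>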
]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Specify a File&apos;s Encoding</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-specify-a-files-encoding.html" />
    <id>tag:www.perl.com,2012:/pub//2.2016</id>

    <published>2012-05-04T13:00:01Z</published>
    <updated>2012-05-09T16:50:59Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 19: Open ﬁle with speciﬁc encoding While...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Open-file-with-specific-encoding">℞ 19: Open ﬁle with speciﬁc encoding</h2>

<p>While <a href="http://www.perl.com/pub/2012/05/perlunicook-make-file-io-default-to-utf-8.html">setting the default Unicode encoding for IO is sensible</a>, sometimes the default encoding is not correct. In this case, specify the encoding for a filehandle manually in the mode option to <a href="http://perldoc.perl.org/functions/open.html">open</a> or with the <a href="http://perldoc.perl.org/functions/binmode.html">binmode</a> operator. Perl's IO layers will handle encoding and decoding for you. This is the normal way to deal with encoded text, not by calling low-level functions.</p>

<p>To specify the encoding of a filehandle opened for input:</p>

<pre><code>     open(my $in_file, &quot;&lt; :encoding(UTF-16)&quot;, &quot;wintext&quot;);
     # OR
     open(my $in_file, &quot;&lt;&quot;, &quot;wintext&quot;);
     binmode($in_file, &quot;:encoding(UTF-16)&quot;);

     # ...
     my $line = &lt;$in_file&gt;;</code></pre>

<p>To specify the encoding of a filehandle opened for output:</p>

<pre><code>     open(my $out_file, &quot;&gt; :encoding(cp1252)&quot;, &quot;wintext&quot;);
     # OR
     open(my $out_file, &quot;&gt;&quot;, &quot;wintext&quot;);
     binmode($out_file, &quot;:encoding(cp1252)&quot;);

     # ...
     print $out_file &quot;some text\n&quot;;</code></pre>
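
<p>Putting both directions together, the following is a minimal round-trip sketch. The scratch file from <code>File::Temp</code> and the sample text are assumptions for illustration; substitute your own path and encoding:</p>

<pre><code>    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # scratch file for the demonstration only
    my ($tmp_fh, $path) = tempfile(UNLINK =&gt; 1);
    close($tmp_fh);

    open(my $out, &quot;&gt; :encoding(UTF-16)&quot;, $path) or die &quot;Cannot write $path: $!&quot;;
    print $out &quot;caf\x{e9}\n&quot;;
    close($out) or die &quot;Cannot close $path: $!&quot;;

    open(my $in, &quot;&lt; :encoding(UTF-16)&quot;, $path) or die &quot;Cannot read $path: $!&quot;;
    my $line = &lt;$in&gt;;
    close($in);

    chomp $line;
    print length($line), &quot;\n&quot;;   # prints 4: the layer hands back characters, not bytes</code></pre>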

<p>More layers than just the encoding can be speciﬁed here. For example, the
incantation <code>&quot;:raw :encoding(UTF-16LE) :crlf&quot;</code> includes
implicit <small>CRLF</small> handling. See <a
href="http://perldoc.perl.org/PerlIO.html">PerlIO</a> for more details.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Make All I/O Default to UTF-8</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-make-all-io-default-to-utf-8.html" />
    <id>tag:www.perl.com,2012:/pub//2.2014</id>

    <published>2012-05-03T13:00:01Z</published>
    <updated>2012-05-03T23:01:33Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 18: Make all I/O and args default...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Make-all-I-O-and-args-default-to-utf8">℞ 18: Make all I/O and args default to utf8</h2>

<p>The core rule of Unicode handling in Perl is "always encode and decode at the edges of your program".</p>

<p>If you've configured everything such that all incoming and outgoing data
uses the <small>UTF</small>-8 encoding, you can make Perl perform the
appropriate encoding and decoding for you. As documented in <a
href="http://perldoc.perl.org/perlrun.html">perldoc perlrun</a>, the
<code>-C</code> flag and the <code>PERL_UNICODE</code> environment variable are
available. Use the <code>S</code> option to make the standard input, output,
and error filehandles use <small>UTF</small>-8 encoding.  Use the
<code>D</code> option to make all other filehandles use <small>UTF</small>-8
encoding. Use the <code>A</code> option to decode <code>@ARGV</code> elements
as <small>UTF</small>-8:</p>

<pre><code>     $ perl -CSDA ...
     # or
     $ export PERL_UNICODE=SDA</code></pre>

<p>Within your program, you can achieve the same effects with the <a
href="http://perldoc.perl.org/open.html">open</a> pragma to set default
encodings on filehandles and the <a
href="http://perldoc.perl.org/Encode.html">Encode</a> module to decode the
elements of <code>@ARGV</code>:</p>

<pre><code>     use open qw(:std :utf8);
     use Encode qw(decode_utf8);
     @ARGV = map { decode_utf8($_, 1) } @ARGV;</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Make File I/O Default to UTF-8</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/05/perlunicook-make-file-io-default-to-utf-8.html" />
    <id>tag:www.perl.com,2012:/pub//2.2012</id>

    <published>2012-05-01T13:00:01Z</published>
    <updated>2012-05-01T19:32:03Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 17: Make ﬁle I/O default to utf8...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Make-ﬁle-I-O-default-to-utf8">℞ 17: Make ﬁle I/O default to utf8</h2>

<p>If you've ever had the misfortune of seeing the Unicode warning "wide
character in print", you may have realized that something forgot to set the
appropriate Unicode-capable encoding on a filehandle somewhere in your program.
Remember that the rule of Unicode handling in Perl is "always encode and decode
at the edges of your program".</p>

<p>You can easily <a
href="http://www.perl.com/pub/2012/04/perlunicook-decode-standard-filehandles-as-utf-8.html">decode
<code>STDIN</code>, <code>STDOUT</code>, and <code>STDERR</code> as UTF-8 by
default</a> or <a
href="http://www.perl.com/pub/2012/04/perlunicook-decode-standard-filehandles-as-locale-encoding.html">decode
<code>STDIN</code>, <code>STDOUT</code>, and <code>STDERR</code> per locale
settings</a>, or you can use <a
href="http://perldoc.perl.org/functions/binmode.html"><code>binmode</code></a>
to set the encoding on a specific filehandle.</p>

<p>Alternatively, you can set the default encoding on all filehandles throughout the
entire program, or on a lexical basis. As documented in <a
href="http://perldoc.perl.org/perlrun.html">perldoc perlrun</a>, the
<code>-C</code> flag and the <code>PERL_UNICODE</code> environment variable are
available. Use the <code>D</code> option to make all filehandles default to
UTF-8 encoding. That is, files opened without an encoding argument will be in
<small>UTF</small>-8:</p>

<pre><code>     $ perl -CD ...
     # or
     $ export PERL_UNICODE=D</code></pre>

<p>The <a href="http://perldoc.perl.org/open.html">open</a> pragma configures
the default encoding of all filehandle operations in its lexical scope:</p>

<pre><code>     use open qw(:utf8);</code></pre>
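
<p>To illustrate the lexical scope, here is a minimal sketch. The scratch file from <code>File::Temp</code> is an assumption for the demonstration; only the <code>open</code> call inside the block picks up the UTF-8 default:</p>

<pre><code>    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # scratch file for the demonstration only
    my ($tmp_fh, $path) = tempfile(UNLINK =&gt; 1);
    close($tmp_fh);

    {
        use open qw(:utf8);   # applies only inside this block

        open(my $out, &quot;&gt;&quot;, $path) or die &quot;Cannot write $path: $!&quot;;
        print $out &quot;\x{263A}&quot;;   # WHITE SMILING FACE
        close($out) or die &quot;Cannot close $path: $!&quot;;
    }

    my $size = -s $path;
    print &quot;$size\n&quot;;   # prints 3: U+263A occupies three bytes in UTF-8</code></pre>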

<p>Note that the <code>open</code> pragma is currently incompatible with the <a
href="http://perldoc.perl.org/autodie.html"><code>autodie</code></a>
pragma.</p>
]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Decode Standard Filehandles as Locale Encoding</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicook-decode-standard-filehandles-as-locale-encoding.html" />
    <id>tag:www.perl.com,2012:/pub//2.2010</id>

    <published>2012-04-30T13:00:01Z</published>
    <updated>2012-04-30T19:39:01Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 16: Declare STD{IN,OUT,ERR} to be in locale...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Declare-STD-IN-OUT-ERR-to-be-in-locale-encoding">℞ 16: Declare <code>STD{IN,OUT,ERR}</code> to be in locale encoding</h2>

<p>Always convert to and from your desired encoding at the edges of your
programs. This includes the standard filehandles <code>STDIN</code>,
<code>STDOUT</code>, and <code>STDERR</code>. While it may be most common for
modern operating systems to <a
href="http://www.perl.com/pub/2012/04/perlunicook-decode-standard-filehandles-as-utf-8.html">support
UTF-8 in filehandle settings</a>, you may need to use other encodings.</p>

<p>Perl can respect your current locale settings for its default filehandles.
Start by installing the <a
href="http://search.cpan.org/perldoc?Encode::Locale">Encode::Locale</a> module
from the CPAN.</p>

<pre><code>    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # use the console encodings as layers for binmode or open
    binmode STDIN,  &quot;:encoding(console_in)&quot;  if -t STDIN;
    binmode STDOUT, &quot;:encoding(console_out)&quot; if -t STDOUT;
    binmode STDERR, &quot;:encoding(console_out)&quot; if -t STDERR;</code></pre>

<p>The <code>Encode::Locale</code> module allows you to use "whatever encoding
the attached terminal expects" for input and output filehandles attached to
terminals. It also allows you to specify "whatever encoding the file system
uses for file names"; see the documentation for more.</p>
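
<p>For file names, a minimal sketch might look like the following. It assumes <code>Encode::Locale</code> is installed and relies on the <code>locale_fs</code> encoding alias that module registers; reading the current directory is only an illustration:</p>

<pre><code>    use strict;
    use warnings;
    use Encode qw(decode);
    use Encode::Locale;   # registers locale, locale_fs, console_in, and console_out

    opendir(my $dh, &quot;.&quot;) or die &quot;Cannot open current directory: $!&quot;;
    my @names = map { decode(locale_fs =&gt; $_) } readdir($dh);
    closedir($dh);

    # @names now holds decoded character strings rather than raw bytes</code></pre>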
]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Decode Standard Filehandles as UTF-8</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicook-decode-standard-filehandles-as-utf-8.html" />
    <id>tag:www.perl.com,2012:/pub//2.2008</id>

    <published>2012-04-27T13:00:01Z</published>
    <updated>2012-04-27T23:46:49Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 15: Declare STD{IN,OUT,ERR} to be UTF-8 Always...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Declare-STD-IN-OUT-ERR-to-be-utf8">℞ 15: Declare <code>STD{IN,OUT,ERR}</code> to be UTF-8</h2>

<p>Always convert to and from your desired encoding at the edges of your
programs. This includes the standard filehandles <code>STDIN</code>,
<code>STDOUT</code>, and <code>STDERR</code>.</p>

<p>As documented in <a href="http://perldoc.perl.org/perlrun.html">perldoc
perlrun</a>, the <code>PERL_UNICODE</code> environment variable and the
<code>-C</code> command-line flag each tell Perl to decode input from and encode
output to these filehandles as UTF-8 when given the <code>S</code> option:</p>

<pre><code>     $ perl -CS ...
     # or
     $ export PERL_UNICODE=S</code></pre>

<p>Within your program, the <a
href="http://perldoc.perl.org/open.html">open</a> pragma allows you to set the
default encoding of these filehandles all at once:</p>

<pre><code>     use open qw(:std :utf8);</code></pre>

<p>Because Perl uses IO layers to implement encoding and decoding, you may also use the <a href="http://perldoc.perl.org/perlfunc.html#binmode">binmode</a> operator on filehandles directly:</p>
<pre><code>     binmode(STDIN,  &quot;:utf8&quot;);
     binmode(STDOUT, &quot;:utf8&quot;);
     binmode(STDERR, &quot;:utf8&quot;);</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Decode @ARGV as Local Encoding</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicookbook-decode-argv-as-local-encoding.html" />
    <id>tag:www.perl.com,2012:/pub//2.2006</id>

    <published>2012-04-26T13:00:01Z</published>
    <updated>2012-04-27T11:55:15Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 14: Decode program arguments as locale encoding...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Decode-program-arguments-as-locale-encoding">℞ 14: Decode program arguments as locale encoding</h2>

<p>While it may be most common in modern operating systems for your command-line arguments to be encoded as UTF-8, <code>@ARGV</code> may use other encodings. If you have configured your system with a proper locale, you may need to decode <code>@ARGV</code> appropriately. Unlike <a href="http://www.perl.com/pub/2012/04/perlunicookbook-decode-argv-as-utf8.html">automatic UTF-8 <code>@ARGV</code> decoding</a>, you must do this manually.</p>

<p>Install the <a href="http://search.cpan.org/perldoc?Encode::Locale">Encode::Locale</a> module from the CPAN:</p>

<pre><code>    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use &quot;locale&quot; as an arg to encode/decode
    @ARGV = map { decode(locale =&gt; $_, 1) } @ARGV;</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Decode @ARGV as UTF-8</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicookbook-decode-argv-as-utf8.html" />
    <id>tag:www.perl.com,2012:/pub//2.2004</id>

    <published>2012-04-24T13:00:01Z</published>
    <updated>2012-04-27T11:55:11Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 13: Decode program arguments as utf8 While...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Decode-program-arguments-as-utf8">℞ 13: Decode program arguments as utf8</h2>

<p>While the <a
href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">standard
Perl Unicode preamble</a> makes Perl's filehandles use UTF-8 encoding by
default, filehandles aren't the only sources and sinks of data. The command-line arguments to your programs, available through <code>@ARGV</code>, may also need decoding.</p>

<p>You can have Perl handle this operation for you automatically in two ways, or you may do it yourself. As documented in <a href="http://perldoc.perl.org/perlrun.html">perldoc perlrun</a>, the <code>-C</code> flag controls Unicode features. Use the <code>A</code> option for Perl to treat your arguments as UTF-8 strings:</p>

<pre><code>     $ perl -CA ...</code></pre>

<p>You may, of course, use <code>-C</code> on the shebang line of your programs.</p>

<p>The second approach is to use the <code>PERL_UNICODE</code> environment variable. It takes the same values as the <code>-C</code> flag; to get the same effect as <code>-CA</code>, write:</p>

<pre><code>     $ export PERL_UNICODE=A</code></pre>

<p>You may temporarily <em>disable</em> this automatic Unicode treatment with <code>PERL_UNICODE=0</code>.</p>

<p>Finally, you may decode the contents of <code>@ARGV</code> yourself with the <a href="http://search.cpan.org/perldoc?Encode">Encode</a> module:</p>

<pre><code>    use Encode qw(decode_utf8);
    @ARGV = map { decode_utf8($_, 1) } @ARGV;</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Explicit encode/decode</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicook-explicit-encode-decode.html" />
    <id>tag:www.perl.com,2012:/pub//2.2002</id>

    <published>2012-04-23T13:00:01Z</published>
    <updated>2012-04-24T21:59:47Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 12: Explicit encode/decode While the standard Perl...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Explicit-encode-decode">℞ 12: Explicit encode/decode</h2>

<p>While the <a
href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">standard
Perl Unicode preamble</a> makes Perl's filehandles use UTF-8 encoding by
default, filehandles aren't the only sources and sinks of data.  On rare
occasions, such as a database read, you may be given encoded text you need to
decode.</p>

<p>The core <a href="http://perldoc.perl.org/Encode.html">Encode</a> module
offers two functions to handle these conversions. (Remember that
<code>decode()</code> means to convert octets from a known encoding into Perl's
internal Unicode form and <code>encode()</code> means to convert from Perl's
internal form into a known encoding.)</p>

<pre><code>  use Encode qw(encode decode);

  # given $bytes, containing octets in a known encoding;
  # the final argument 1 (Encode::FB_CROAK) makes malformed input fatal
  my $chars = decode(&quot;shiftjis&quot;, $bytes, 1);

  # given $chars, a string of characters in Perl's internal format
  my $bytes = encode(&quot;MIME-Header-ISO_2022_JP&quot;, $chars, 1);</code></pre>
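
<p>As a minimal sketch of a full round trip, the following uses a hypothetical helper named <code>round_trip</code> (it is not part of Encode). One caveat worth knowing: when you pass a check value such as <code>1</code> (<code>FB_CROAK</code>) without <code>LEAVE_SRC</code>, Encode may modify the source string in place, so this sketch adds <code>LEAVE_SRC</code> to keep its inputs intact.</p>

<pre><code>  use strict;
  use warnings;
  use Encode qw(encode decode FB_CROAK LEAVE_SRC);

  # round_trip() is an illustrative helper, not an Encode function
  sub round_trip {
      my ($encoding, $chars) = @_;
      my $check = FB_CROAK | LEAVE_SRC;   # die on bad data, keep the input intact
      my $bytes = encode($encoding, $chars, $check);   # characters to octets
      my $again = decode($encoding, $bytes, $check);   # octets back to characters
      return ($bytes, $again);
  }

  my $chars = &quot;\x{6771}\x{4EAC}&quot;;    # the two characters of 東京
  my ($bytes, $again) = round_trip(&quot;shiftjis&quot;, $chars);
  printf &quot;%d characters became %d octets\n&quot;, length($chars), length($bytes);
  print  &quot;round trip preserved the text\n&quot; if $again eq $chars;</code></pre>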

<p>For streams all in the same encoding, don't use encode/decode; instead set the file encoding when you open the file, or immediately afterward with <code>binmode</code>, as a later recipe in this series describes. Remember the canonical rule of Unicode: always encode/decode at the edges of your application.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Names of CJK Codepoints</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicook-names-of-cjk-codepoints.html" />
    <id>tag:www.perl.com,2012:/pub//2.2000</id>

    <published>2012-04-20T13:00:01Z</published>
    <updated>2012-04-20T22:35:14Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 11: Names of CJK codepoints CJK refers...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Names-of-CJK-codepoints">℞ 11: Names of <small>CJK</small> codepoints</h2>

<p><a href="http://www.unicode.org/faq/han_cjk.html">CJK</a> refers to Chinese, Japanese, and Korean. In the context of Unicode, it usually refers to the Han ideographs used in the modern Chinese and Japanese writing systems. As you might expect, logographic writing systems such as Chinese make Unicode handling more complex.</p>

<p>Sinograms like "東京" come back with character names of <small><code>CJK UNIFIED IDEOGRAPH-6771</code></small> and <small><code>CJK UNIFIED IDEOGRAPH-4EAC</code></small>, because their "names" vary between languages. The <small>CPAN</small> 
<a href="http://search.cpan.org/perldoc?Unicode::Unihan"><code>Unicode::Unihan</code></a> module has a large database for decoding these (and a whole lot more), provided you know how to understand its output.</p>

<pre><code> # cpan -i Unicode::Unihan
 use Unicode::Unihan;
 my $str   = &quot;東京&quot;;
 my $unhan = Unicode::Unihan-&gt;new;
 for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
     printf &quot;CJK $str in %-12s is &quot;, $lang;
     say $unhan-&gt;$lang($str);
 }</code></pre>

<p>prints:</p>

<pre><code> CJK 東京 in Mandarin     is DONG1JING1
 CJK 東京 in Cantonese    is dung1ging1
 CJK 東京 in Korean       is TONGKYENG
 CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
 CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO</code></pre>
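
<p>For comparison, the generic names mentioned earlier are what the core <code>charnames</code> module reports. As a small sketch (the helper name <code>names_of</code> is made up for illustration), you can ask <code>charnames::viacode</code> for the name of each character in a string:</p>

<pre><code> use strict;
 use warnings;
 use charnames ();    # load the functions only; no aliases are imported

 # names_of() is an illustrative helper, not part of charnames
 sub names_of {
     my ($str) = @_;
     # split the string into characters and look up each code point's name
     return map { charnames::viacode(ord $_) } split //, $str;
 }

 # \x{6771}\x{4EAC} spells 東京 without relying on the source file's encoding
 print &quot;$_\n&quot; for names_of(&quot;\x{6771}\x{4EAC}&quot;);</code></pre>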

<p>If you have a specific romanization scheme in mind, use the module dedicated to it:</p>

<pre><code> # cpan -i Lingua::JA::Romanize::Japanese
 use Lingua::JA::Romanize::Japanese;
 my $k2r = Lingua::JA::Romanize::Japanese-&gt;new;
 my $str = &quot;東京&quot;;
 say &quot;Japanese for $str is &quot;, $k2r-&gt;chars($str);</code></pre>

<p>prints:</p>

<pre><code> Japanese for 東京 is toukyou</code></pre>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Custom Named Characters</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/04/perlunicook-custom-named-characters.html" />
    <id>tag:www.perl.com,2012:/pub//2.1998</id>

    <published>2012-04-19T13:00:01Z</published>
    <updated>2012-04-19T18:29:08Z</updated>

    <summary>Editor&apos;s note: Perl guru Tom Christiansen created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks. ℞ 10: Custom named characters As several other...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Editor's note:</em> Perl guru <a href="http://training.perl.com/">Tom Christiansen</a> created and maintains a list of 44 recipes for working with Unicode in Perl 5. Perl.com is pleased to serialize this list over the coming weeks.</p>

<h2 id="Custom-named-characters">℞ 10: Custom named characters</h2>

<p>As several other recipes demonstrate, the <a
href="http://perldoc.perl.org/charnames.html">charnames</a> pragma offers
considerable power to use and manipulate Unicode characters by their names. Its
<code>:alias</code> option allows you to give your own lexically scoped
nicknames to existing characters, or even to give unnamed private-use
characters useful names:</p>

<pre><code> use charnames &quot;:full&quot;, &quot;:alias&quot; =&gt; {
     ecute =&gt; &quot;LATIN SMALL LETTER E WITH ACUTE&quot;,
     &quot;APPLE LOGO&quot; =&gt; 0xF8FF, # private use character
 };

 &quot;\N{ecute}&quot;
 &quot;\N{APPLE LOGO}&quot;</code></pre>

<p>You may even override existing names (lexically, of course) with different
characters.</p>

<p>This feature has some limitations. For best effect, aliases should hew to
the rules of ASCII identifiers and must not resemble regex quantifiers. You can
only alias one character at a time; other options exist to give a character
sequence an alias.</p>
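
<p>To check what an alias resolves to, a short sketch like the following prints the code point behind each name. It reuses the two aliases from this recipe; the variable names are only illustrative, and it assumes a Perl recent enough to accept a numeric code point as an alias target.</p>

<pre><code> use strict;
 use warnings;
 use charnames &quot;:full&quot;, &quot;:alias&quot; =&gt; {
     ecute =&gt; &quot;LATIN SMALL LETTER E WITH ACUTE&quot;,
     &quot;APPLE LOGO&quot; =&gt; 0xF8FF, # private use character
 };

 # illustrative variables holding the aliased characters
 my $e_acute = &quot;\N{ecute}&quot;;
 my $logo    = &quot;\N{APPLE LOGO}&quot;;

 # ord() returns the code point, printed here in U+XXXX notation
 printf &quot;ecute      is U+%04X\n&quot;, ord $e_acute;
 printf &quot;APPLE LOGO is U+%04X\n&quot;, ord $logo;</code></pre>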

<p>As always, the documentation of the <code>charnames</code> pragma offers
more details.</p>]]>
        
    </content>
</entry>

</feed>

