Perl Unicode Cookbook: Match Unicode Linebreak Sequence

May 10, 2012 by Tom Christiansen

℞ 22: Match Unicode linebreak sequence in regex

Unicode defines several characters as providing vertical whitespace, such as the carriage return and newline characters. It also groups these characters under the notion of a linebreak sequence: a Unicode linebreak matches either the two-character CRLF grapheme or any one of the seven vertical whitespace characters (LF, VT, FF, CR, NEL U+0085, LINE SEPARATOR U+2028, and PARAGRAPH SEPARATOR U+2029).

As documented in perldoc perlrebackslash, the \R regex backslash sequence matches any Unicode linebreak sequence. (Similarly, the \v sequence matches any single character of vertical whitespace.)
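The difference between the two sequences shows up most clearly on a CRLF pair. The following is a minimal sketch, assuming Perl 5.10 or later (where \R and \v are available); the variable names are only illustrative:

```perl
use strict;
use warnings;

my $crlf = "\r\n";

# \R treats CRLF as one linebreak, so a single match consumes both characters.
my @linebreaks = $crlf =~ /\R/g;

# \v matches one vertical whitespace character at a time,
# so CR and LF are matched separately.
my @vertical = $crlf =~ /\v/g;

printf "\\R matched %d time(s), \\v matched %d time(s)\n",
    scalar @linebreaks, scalar @vertical;
# prints: \R matched 1 time(s), \v matched 2 time(s)
```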

This is useful for dealing with text files coming from different operating systems:

 s/\R/\n/g;  # normalize all linebreaks to \n
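As a slightly fuller illustration of that substitution, here is a sketch that normalizes a string containing several kinds of line endings and then splits it into lines. The sample text and variable names are hypothetical, chosen only to exercise the different cases:

```perl
use strict;
use warnings;

# Hypothetical sample mixing Windows (CRLF), Unix (LF), classic Mac (CR),
# and Unicode LINE SEPARATOR (U+2028) line endings.
my $text = "one\r\ntwo\nthree\rfour\x{2028}five";

# Copy the text, then normalize every linebreak sequence to \n in the copy.
(my $normalized = $text) =~ s/\R/\n/g;

my @lines = split /\n/, $normalized;
print scalar(@lines), " lines\n";   # prints: 5 lines
```

Because \R consumes CRLF as a single unit, the Windows line ending becomes one \n rather than two. If the normalized text itself is not needed, splitting directly with `split /\R/, $text` should give the same list of lines.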

Previous: ℞ 21: Case-insensitive Comparisons

Series Index: The Standard Preamble

Next: ℞ 23: Get Character Categories
