Perl Unicode Cookbook: Match Unicode Linebreak Sequence

℞ 22: Match Unicode linebreak sequence in regex

Unicode defines several characters as vertical whitespace, such as the carriage return and newline characters. It also groups several of them under the banner of a linebreak sequence: a Unicode linebreak is either the two-character CRLF grapheme or any one of the seven vertical whitespace characters.

As documented in perldoc perlrebackslash, the \R regex backslash sequence matches any Unicode linebreak sequence. (Similarly, the \v sequence matches any single character of vertical whitespace.)
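To make the difference between \R and \v concrete, here is a minimal sketch that counts matches of each against the same string. The sample text and variable names are invented for illustration. It assumes Perl 5.10 or later, where \R and \v are available.

```perl
use strict;
use warnings;

# Hypothetical sample: a CRLF pair, a LINE SEPARATOR (U+2028), and a bare LF.
my $text = "one\r\ntwo\x{2028}three\n";

# \R treats CRLF as a single linebreak sequence.
my $linebreaks = () = $text =~ /\R/g;

# \v matches one vertical whitespace character at a time,
# so CR and LF are counted separately.
my $vertical = () = $text =~ /\v/g;

print "linebreaks=$linebreaks vertical=$vertical\n";   # linebreaks=3 vertical=4
```

The counts differ by one because the CRLF pair is one linebreak sequence but two vertical whitespace characters.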

This is useful for dealing with text files that come from different operating systems:

 s/\R/\n/g;  # normalize all linebreaks to \n
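As a slightly fuller sketch of the same substitution, the example below normalizes a string containing mixed line endings and then splits it into lines. The sample data and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input mixing Windows (CRLF), old Mac (CR), Unix (LF),
# and PARAGRAPH SEPARATOR (U+2029) line endings.
my $mixed = "alpha\r\nbeta\rgamma\ndelta\x{2029}";

# Copy the string, then normalize every linebreak sequence to \n in the copy.
(my $normalized = $mixed) =~ s/\R/\n/g;

# Once normalized, a plain split on \n recovers the lines.
my @lines = split /\n/, $normalized;

print scalar(@lines), " lines\n";   # 4 lines
```

Because \R consumes CRLF as a unit, the Windows-style ending becomes one \n rather than two.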

