Future-proofing strings in PHP

For many characters, UTF-8 uses more than one byte, because a single byte can hold only 256 distinct values and cannot represent thousands of characters. However, because PHP does not currently support Unicode natively, its string functions treat each byte as its own character. If you’ve ever worked with non-English text in PHP, you know that it’s quite a mess. If you want to find out how many characters long a sentence written in Chinese is, strlen() will tell you how many bytes it takes up, not how many actual characters. PHP 6.0 intends to change this with native Unicode string processing.
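To make the byte-versus-character gap concrete, here is a minimal sketch. The two-character string "中文" is written out as explicit UTF-8 bytes so the result does not depend on how the file is saved. The helper utf8_char_count() is a hypothetical name of my own, not a PHP built-in, and it assumes its input is valid UTF-8.

```php
<?php
// "中文" as explicit UTF-8 bytes: each character takes three bytes.
$text = "\xE4\xB8\xAD\xE6\x96\x87";

// strlen() counts bytes, not characters.
$bytes = strlen($text);

// Hypothetical helper (not a built-in): counts characters by skipping
// UTF-8 continuation bytes, which always look like 10xxxxxx.
// Assumes $s is valid UTF-8.
function utf8_char_count($s)
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

echo $bytes, "\n";                  // 6
echo utf8_char_count($text), "\n";  // 2
```

If the mbstring extension happens to be installed, mb_strlen($text, 'UTF-8') should give the same character count without a hand-rolled loop.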

However, a problem arises once PHP 6 is released. Right now, if you’re working with binary data, strlen() returns exactly what you want to know: a string’s byte count. With Unicode support, PHP would attempt to count characters instead. Luckily, as of PHP 5.2.1, you can mark your strings as ‘binary’ so that when PHP 6.0 comes around, string functions should work on them the same way they do now (or at least so I assume). Until 6.0 is finished and deployed, marking a string as binary makes no difference, but it has no ill effect either.

To mark a string as binary, prefix the string literal with the letter “b”, like so:

$string = b"This is a binary string!";
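As a quick check that the prefix is harmless today, here is a small sketch, assuming PHP 5.2.1 or later. The variable name $header and the choice of the 8-byte PNG file signature are just for illustration. On current PHP the prefixed literal should behave exactly like a plain string.

```php
<?php
// The 8-byte PNG file signature, marked as a binary string with the b prefix.
$header = b"\x89PNG\r\n\x1a\n";

// Same literal without the prefix, for comparison.
$plain = "\x89PNG\r\n\x1a\n";

// The byte count is unaffected by the prefix.
var_dump(strlen($header));    // int(8)

// Today the prefixed and unprefixed literals are identical strings.
var_dump($header === $plain); // bool(true)
```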
