Doc Bug #40395 PCRE engine unable to output NULL characters
Submitted: 2007-02-08 00:04 UTC Modified: 2007-02-11 19:50 UTC
From: jfrim at idirect dot com Assigned: nlopess (profile)
Status: Closed Package: Documentation problem
PHP Version: * OS: *
Private report: No CVE-ID: None
 [2007-02-08 00:04 UTC] jfrim at idirect dot com
Description:
------------
The Perl-compatible regular expression engine is unable to output NULL characters correctly.  This is evident with the preg_replace() function (tested), and likely affects other PCRE functions as well (untested), according to other bug reports already submitted.  Instead of returning a NULL character, a literal '\0' sequence is returned.


Reproduce code:
---------------
<?php
$inputstring = "ASCII NUL\0, SOH\01, STX\02, ETX\03";
echo preg_replace('/([\\x00-\\x02])/e',"'['.ord('\\1').']'",$inputstring);
?>

Expected result:
----------------
ASCII NUL[0], SOH[1], STX[2], ETX

(Note that "ETX" is immediately followed by ctrl char #3)


Actual result:
--------------
ASCII NUL[92], SOH[1], STX[2], ETX

(Note that "ETX" is immediately followed by ctrl char #3)

The "92" is present in place of what should be "0" because preg_replace() incorrectly returns a literal '\0' sequence instead of a NULL character, and the ord() function then returns the value of the literal backslash.


 [2007-02-08 00:26 UTC] tony2001@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.


 [2007-02-08 05:32 UTC] jfrim at idirect dot com
If the regular expression were /([\x00-\xFF])/ , you would think EVERY possible byte value would be matched.  In fact, all of them do get matched.  However, all of them EXCEPT for byte value 0x00 are returned unchanged in the \1 back reference.  Any 0x00 bytes are returned as two bytes, 0x5C followed by 0x30.

I have not found in any Perl regular expression documentation an explanation for why the 0x00 byte is handled like this, so could you please tell me why this is NOT a bug with PCRE.

Thanks.
 [2007-02-08 06:01 UTC] jfrim at idirect dot com
I'd also like to present bug #16590:

http://bugs.php.net/bug.php?id=16590

Note the following example they list as a SOLUTION to specifying NULLs in the pattern:

preg_match("/\\x00/", "foo\0bar")

And note the following statement from bug report #16590:

"...The docs state that PCRE is binary safe..."


So if PCRE is binary safe, and you can specify NULLs in the pattern with \x00, why are back references unable to return these matched NULLs?!?!?

How is this NOT a bug?!??
 [2007-02-08 13:17 UTC] nlopess@php.net
Ok, so the problem here is that preg_do_eval() calls php_addslashes_ex(), that escapes "'", "\" and "\0".
So we should either not escape the \0 or reflect the behaviour in the docs.
Assigning to the extension maintainer.
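
The escaping described above can be observed from userland with addslashes(). This is only a minimal sketch, assuming addslashes() treats the NUL byte the same way the internal php_addslashes_ex() call does; the thread itself does not confirm that the two behave identically.

```php
<?php
// addslashes() turns a single NUL byte into two bytes: a backslash and "0".
$escaped = addslashes("\0");

echo strlen($escaped), "\n";   // 2
echo ord($escaped[0]), "\n";   // 92 (the backslash)
echo ord($escaped[1]), "\n";   // 48 (the digit "0")
```

Under that assumption, ord() applied to the escaped back-reference sees the backslash first, which would account for the "92" in the reported output.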
 [2007-02-08 19:47 UTC] jfrim at idirect dot com
I have verified that along with 0x00 being escaped, 0x22 (the double-quote character) is also escaped.  No other byte values are affected.

Even if the documentation was changed to reflect this escaped behaviour of 0x00 and 0x22, there would still be a bug with this behaviour since 0x5C (the backslash character) is NOT escaped!

This would create an ambiguity if the input string to a preg_replace() contained a literal backslash followed by the digit zero, or a backslash followed by a double-quote.  There would be no way to tell from the resulting preg_replace'd data whether those sequences are escaped NULLs and escaped double-quotes, or literal sequences from the input string.

So the only way to fix this bug is to either...
...A: Escape the backslash as well, and change the documentation to state that 0x00, 0x22, and 0x5C are escaped, or...
...B: Do not escape any characters.

I would say method B is preferred, since no stripslashes() would have to be performed on the resulting output from a preg_replace(), and it's far more intuitive to always know that a regular expression back-reference will always contain the exact byte value that was matched, without having to worry about special exceptions.
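
The ambiguity argued above can be sketched as follows. The function escape_as_observed() is a hypothetical stand-in that escapes only NUL and double-quote, as observed in this report; it is not the engine's actual code. addslashes() is shown for comparison because it also escapes the backslash (option A).

```php
<?php
// Hypothetical stand-in for the behaviour observed in this report:
// NUL and double-quote are escaped, the backslash is not.
function escape_as_observed(string $s): string
{
    return strtr($s, ["\0" => '\\0', '"' => '\\"']);
}

$fromNul     = escape_as_observed("\0");   // input is a real NUL byte
$fromLiteral = escape_as_observed('\\0');  // input is a literal backslash followed by "0"

// Both inputs produce the same two bytes, so the result is ambiguous.
var_dump($fromNul === $fromLiteral);               // bool(true)

// When the backslash is escaped too, the two inputs stay distinguishable.
var_dump(addslashes("\0") === addslashes('\\0'));  // bool(false)
```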
 [2007-02-08 19:59 UTC] jfrim at idirect dot com
The following code demonstrates 0x00 and 0x22 being escaped, without 0x5C being escaped.
It creates an 8-bit ASCII text output, with the character value (in DECIMAL) enclosed within braces (except for escaped chars, in which case it ends up as "92"), followed by the actual character, then a CRLF, for all 256 characters.

Note how the backslash (0x5C, decimal 92) is NOT escaped, and contrary to what nlopess@php.net posted, the single-quote (0x27, decimal 39) is NOT escaped either.  (The double-quote (0x22, decimal 34) is escaped instead.)

<?php
header('Content-Type: text/plain; charset=US-ASCII');
header('Content-Disposition: inline; filename=PCRE.txt');
header('Pragma: no-cache');
header('Expires: 0');
header('Cache-Control: no-cache; must-revalidate');
$teststring='';
for ($i=0; $i<=255; $i++) {
	$teststring.=chr($i);
}
echo preg_replace('/([\\x00-\\xFF])/e',"'{'.ord('\\1').'}\\1'.chr(13).chr(10)",$teststring);
?>
 [2007-02-08 21:55 UTC] jfrim at idirect dot com
Another reason why it would be best to return NULL and DOUBLE-QUOTE (0x00 and 0x22 respectively) in regular expression back-references WITHOUT being escaped:


If this bug was fixed by escaping the backslash as well...

...then the context of the resulting output string would be a mix of escaped and non-escaped data.  (Since the input string is non-escaped, but back-references are escaped.)  This would make it impossible to safely un-escape without risk of data corruption.  The only way to handle this would be to use the "e" modifier in the regular expression and embed stripslashes() into the replacement string.  That's extra processing overhead, and basically makes the entire preg_replace() function useless without the "e" modifier.  It also defeats any possible purpose for escaping the back-references in the first place.  Boo to this solution!


Alternatively, if this bug was fixed by returning NULL and DOUBLE-QUOTE without being escaped...

When using preg_replace, the resulting string will always be in a non-encoded context.  If a slash-encoded string is ever desired, the entire thing can be wrapped in addslashes() by the user, without ever risking destroying the integrity of the data.
 [2007-02-09 17:37 UTC] nlopess@php.net
ok, so after talking with Andrei, we came up with the decision to document it rather than changing the behaviour (e.g. because of bug #5676).
BTW, probably you'll want to consider using preg_replace_callback().
note to self: need to review again the escaped chars (at least NULL, single-quote and double-quote are)
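
A minimal sketch of the preg_replace_callback() suggestion, applied to the reproduce code from the original report. The helper name annotate_control_bytes() is made up for illustration. The callback receives the matched bytes directly, with no evaluated replacement string involved, so no escaping step sits between the match and ord().

```php
<?php
// Hypothetical helper: replace each byte in 0x00-0x02 with "[<ordinal>]".
function annotate_control_bytes(string $input): string
{
    return preg_replace_callback(
        '/[\x00-\x02]/',
        function (array $m): string {
            // $m[0] holds the raw matched byte, including a real NUL.
            return '[' . ord($m[0]) . ']';
        },
        $input
    );
}

$inputstring = "ASCII NUL\0, SOH\01, STX\02, ETX\03";
echo annotate_control_bytes($inputstring);
// ASCII NUL[0], SOH[1], STX[2], ETX (followed by ctrl char #3)
```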
 [2007-02-09 19:28 UTC] jfrim at idirect dot com
The code from my [2007-02-08 19:59:04] post shows only 0x00 and 0x22 being escaped.  Maybe single-quote (0x27) only gets escaped depending on the PHP.INI settings?  I may check into this later.

If nothing is changed in the PHP code, the best work-around I could come up with is this:

Always use the "e" modifier, and in the replacement string for preg_replace(), surround the back-reference with a pair of str_replace(), one to handle 0x00 and one for 0x22.

Example:
<?php
echo preg_replace('/([\\x00-\\xFF])/e',"'0x'.sprintf('%02X',ord(str_replace('\\\\0',\"\\0\",str_replace('\\\\\"','\"','\\1'))))",$inputstring);
?>

This example takes a string and turns each byte into "0x" followed by the two-digit hex code.  Note that the outer str_replace() turns the \0 into a proper NULL, and the nested inner str_replace() turns the \" into just " .

It's a very dirty work-around, because preg_replace() is useless without the "e" modifier (adds processing overhead), and str_replace() has to be called twice (adds processing overhead again), and the number of backslashes in the source code is tremendous and can get confusing!

And we still have a potential problem remaining.  If the set of characters that get escaped depends on the PHP.INI settings (i.e. regarding double-quote and single-quote), then it's impossible for this dirty work-around to be portable, unless the entire thing is encapsulated in an if() or switch() block.  That's REALLY dirty!

The reason why stripslashes() can't be used on the back-reference is that the backslash character, if matched in the pattern, is NOT escaped when returned in the back-reference!  stripslashes() ends up returning an empty string when only a single backslash is passed to it.
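
The stripslashes() limitation can be checked in isolation. A short sketch; the variable names are illustrative only:

```php
<?php
// Escaped sequences are restored as expected.
$nul   = stripslashes('\\0');  // backslash + "0" becomes a NUL byte
$quote = stripslashes('\\"');  // backslash + '"' becomes '"'

// A lone, unescaped backslash (as delivered in the back-reference) is lost.
$lone  = stripslashes('\\');

// All three comparisons are true.
var_dump($nul === "\0", $quote === '"', $lone === '');
```

So stripslashes() undoes the two escapes correctly but silently drops a matched backslash, which is why the work-around above falls back to two targeted str_replace() calls.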


If we can't change the behaviour of preg_replace() without breaking compatibility, then I suggest introducing a new function called something like preg_replace_ex() or preg_replace_binsafe() or something, which fixes the bug properly.

The ideal bug fix would be for the back-reference to never escape any returned characters, since the input string fed to preg_replace() is NOT in an escaped context, and you should not mix escaped data with unescaped data.
 [2007-02-11 19:50 UTC] nlopess@php.net
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.


 [2020-02-07 06:10 UTC] phpdocbot@php.net
Automatic comment on behalf of nlopess
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=cce3db21a11236d935dc93d601897302a8e7afe8
Log: fix bug #40395: document which chars are escaped when running with '/e' modifier
 