Bug #60884 htmlentities() behaves differently and thus breaks existing code
Submitted: 2012-01-25 15:29 UTC
Modified: 2012-01-27 10:00 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: t dot nickl at exse dot de
Assigned: aharvey
Status: Closed
Package: Documentation problem
PHP Version: 5.4.0RC6
OS: CentOS 4.4
Private report: No
CVE-ID: None
 [2012-01-25 15:29 UTC] t dot nickl at exse dot de
Description:
------------
// This code must be run via the web server:

// This is a string from, e.g., some database, containing the German umlaut 'ä'. Note that the encoding really is ISO-8859-1. It is assigned here as a literal only to be concise.
$a = "Rechnungsadresse ändern";

// This works (an empty string activates charset autodetection):
var_dump(htmlentities($a, ENT_COMPAT | ENT_HTML401, ''));

// This works too (it generates the same output):
var_dump(htmlentities($a, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'));

// This does NOT work (it outputs an empty string):
var_dump(htmlentities($a));

// Reason: PHP changed the charset that htmlentities() uses when you do NOT pass one (as in 90% of the code out there):

//determine_charset() :
///////////////////////////////////////////////////////
// php-5.2.1/ext/standard/html.c :
//    /* Guarantee default behaviour for backwards compatibility */
//    if (charset_hint == NULL)
//        return cs_8859_1;
/////////////////////////////////////////////////////
// php-5.4.0RC4/ext/standard/html.c :
//   /* Default is now UTF-8 */
//   if (charset_hint == NULL)
//        return cs_utf_8;

// This breaks the meaning of existing German code. For example, TYPO3 outputs an empty string if end users entered German umlauts in the rich text editor in the backend.

// Please change determine_charset() back to using cs_8859_1 if the third parameter of htmlentities() is omitted.

Test script:
---------------
See description.

Expected result:
----------------
See description.

Actual result:
--------------
See description.
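
The behaviour described above can be reproduced without relying on the encoding of the script file. The following is a minimal sketch, assuming PHP 5.4 or later (ENT_HTML401 does not exist before 5.4). The escape sequence "\xE4" stands in for the ISO-8859-1 byte of 'ä', and the variable names are chosen only for illustration.

```php
<?php
// "\xE4" is the single ISO-8859-1 byte for 'ä', so this file's own encoding does not matter.
$a = "Rechnungsadresse \xE4ndern";

// Explicit charset: the result does not depend on whatever default PHP uses.
$latin1Result = htmlentities($a, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

// Treating the same bytes as UTF-8 makes the input invalid, so an empty string comes back.
$utf8Result = htmlentities($a, ENT_COMPAT | ENT_HTML401, 'UTF-8');

var_dump($latin1Result); // string(28) "Rechnungsadresse &auml;ndern"
var_dump($utf8Result);   // string(0) ""
```

The second call fails for the same reason htmlentities($a) does on 5.4: the byte 0xE4 followed by "nd" is not a valid UTF-8 sequence.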

History
 [2012-01-25 18:01 UTC] johannes@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

In PHP 5.4 the default_charset php.ini option was set to utf-8. You can override this in php.ini, in .htaccess, or similar.
 [2012-01-25 18:01 UTC] johannes@php.net
-Status: Open
+Status: Bogus
 [2012-01-25 22:52 UTC] rasmus@php.net
I know it hurts, but we really need to move away from ISO-8859-1 and towards 
UTF-8 as the default charset of the Web. We have chosen to take the hit in 5.4. 
The documentation has carried a warning about this impending change for quite a 
while urging people to specify a charset.

For PHP 5.4 compatibility Typo3 should either hardcode iso-8859-1 or they should 
change their calls to:

  htmlentities($a,NULL,'')

to pick up the default script-encoding charset.
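
For a code base with many call sites, the "hardcode iso-8859-1" option could be centralised in one wrapper. This is only a sketch: escape_latin1() is a hypothetical helper name, not part of PHP or TYPO3, and it assumes every string passed to it really is ISO-8859-1. It passes the flags explicitly rather than NULL, since newer PHP versions (8.1 and later) deprecate passing null for that parameter.

```php
<?php
// Hypothetical helper (not part of PHP or TYPO3): escape text known to be ISO-8859-1.
function escape_latin1($text)
{
    // Naming the charset keeps the result independent of PHP's default.
    return htmlentities($text, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
}

// "\xE4" is the ISO-8859-1 byte for 'ä'.
echo escape_latin1("Rechnungsadresse \xE4ndern"), "\n"; // Rechnungsadresse &auml;ndern
```

Calls to htmlentities($a) can then be replaced by escape_latin1($a), and the charset changed in one place if the data is later migrated to UTF-8.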
 [2012-01-26 09:18 UTC] t dot nickl at exse dot de
@johannes@php.net:
Setting default_charset to latin1 does not work. An empty string is still output when htmlentities() is called with only one argument.
Your copy-and-paste preamble does not help. Changing the meaning of existing code is a bug, don't worry.

@rasmus@php.net:
Thank you. Sadly, I will change every htmlentities($a) to htmlentities($a,NULL,'') before deploying PHP 5.4.
 [2012-01-26 23:00 UTC] sixd@php.net
Re-opening as a Doc bug.  The htmlentities doc needs to be clearer about this.  Its "this default is very likely to change" text is outdated.
 [2012-01-26 23:00 UTC] sixd@php.net
-Status: Bogus
+Status: Re-Opened
-Package: *General Issues
+Package: Documentation problem
 [2012-01-27 10:00 UTC] aharvey@php.net
Automatic comment from SVN on behalf of aharvey
Revision: http://svn.php.net/viewvc/?view=revision&revision=322842
Log: Fix doc bug #60884 (htmlentities() behaves differently and thus breaks existing
code) by clarifying the new behaviour in PHP 5.4.0.
 [2012-01-27 10:00 UTC] aharvey@php.net
-Status: Re-Opened
+Status: Closed
-Assigned To:
+Assigned To: aharvey
 [2012-01-27 10:00 UTC] aharvey@php.net
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.


 [2012-02-24 09:27 UTC] t dot nickl at exse dot de
foreach(array(chr(195).chr(132) /*utf8*/, chr(196) /*latin1*/) as $germanUmlaut)
{
	$str= 'a'.$germanUmlaut.'de';
	$pos= strpos($str, $germanUmlaut);
	var_dump(htmlentities($str));
	var_dump($pos);
	var_dump(substr($str, $pos, 1));
}
/*
on php5.3.10:
string(12) "a&Atilde;<secondbyteofutf8umlaut>de"
int(1)
string(1) "<firstbyteofutf8umlaut>"
string(9) "a&Auml;de"
int(1)
string(1) "<latin1umlaut>"


on php5.4:
string(9) "a&Auml;de"
int(1)
string(1) "<firstbyteofutf8umlaut>"
string(0) ""
int(1)
string(1) "<latin1umlaut>"
*/

I find it very funny that htmlentities() assumes UTF-8 now, but substr() is still thinking in latin1, as it returns only the first byte of a multibyte character (seen above).
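
The byte-versus-character difference seen above can be made visible with the mbstring functions. This is a small sketch, assuming the mbstring extension is loaded (it is optional, so mb_substr() and mb_strlen() may not be available everywhere). The variable names are for illustration only.

```php
<?php
// 'Ä' encoded as UTF-8 is the two bytes 0xC3 0x84.
$str = "a\xC3\x84de";

// substr() and strlen() count bytes, so they split the two-byte character.
$byteSlice = substr($str, 1, 1);

// mb_substr() and mb_strlen() count characters in the encoding they are given.
$charSlice = mb_substr($str, 1, 1, 'UTF-8');

var_dump(strlen($str));             // int(5)
var_dump(mb_strlen($str, 'UTF-8')); // int(4)
var_dump(bin2hex($byteSlice));      // string(2) "c3"
var_dump(bin2hex($charSlice));      // string(4) "c384"
```

bin2hex() is used for the output so that the result does not depend on the encoding of the terminal or browser.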
 