PHP :: Bug #60884 :: htmlentities() behaves differently and thus breaks existing code

Submitted:	2012-01-25 15:29 UTC
Modified:	2012-01-27 10:00 UTC
Votes:	1
Avg. Score:	5.0 ± 0.0
Reproduced:	1 of 1 (100.0%)
Same Version:	0 (0.0%)
Same OS:	0 (0.0%)
From:	t dot nickl at exse dot de
Assigned:	aharvey
Status:	Closed
Package:	Documentation problem
PHP Version:	5.4.0RC6
OS:	CentOS 4.4
Private report:
CVE-ID:	None

[2012-01-25 15:29 UTC] t dot nickl at exse dot de

Description:
------------
//This code must be run via web:

//This is a string from e.g. some database containing a German umlaut 'ä'. Note that the encoding really is ISO-8859-1; it is just assigned literally here to be concise.
$a = "Rechnungsadresse ändern";

//this output works: (An empty string activates some autodetection)
var_dump(htmlentities($a, ENT_COMPAT | ENT_HTML401, ''));

//this works too (the same output is generated):
var_dump(htmlentities($a, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'));

//this does NOT work (outputs empty string)
var_dump(htmlentities($a));

// Reason: PHP changed the charset htmlentities() uses when you do NOT pass one (as 90% of the code out there does):

//determine_charset() :
///////////////////////////////////////////////////////
// php-5.2.1/ext/standard/html.c :
//    /* Guarantee default behaviour for backwards compatibility */
//    if (charset_hint == NULL)
//        return cs_8859_1;
/////////////////////////////////////////////////////
// php-5.4.0RC4/ext/standard/html.c :
//   /* Default is now UTF-8 */
//   if (charset_hint == NULL)
//        return cs_utf_8;

// This breaks the meaning of existing German code. For example, TYPO3 outputs an empty string if end users used German umlauts in the rich text editor in the backend.

// Please change determine_charset() back to using cs_8859_1 if the third parameter of htmlentities() is omitted.
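
Editor's sketch of the behaviour described above, not part of the original report. It assumes PHP 5.4 or later and writes the umlaut as the byte escape "\xE4" so the result does not depend on how this file is saved. Flags and encoding are passed explicitly because the defaults for both have changed between PHP versions.

```php
<?php
// "\xE4" is the single ISO-8859-1 byte for "ä"; the escape keeps this file ASCII-only.
$a = "Rechnungsadresse \xE4ndern";

// Explicit ISO-8859-1: the byte is recognised and converted to an entity.
$latin1Result = htmlentities($a, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

// Explicit UTF-8: "\xE4" followed by "n" is not a valid UTF-8 sequence, so with
// these flags htmlentities() returns an empty string.
$utf8Result = htmlentities($a, ENT_COMPAT | ENT_HTML401, 'UTF-8');

var_dump($latin1Result); // string(28) "Rechnungsadresse &auml;ndern"
var_dump($utf8Result);   // string(0) ""
```

The second call reproduces what a bare htmlentities($a) does on 5.4 when the input is really ISO-8859-1.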

Test script:
---------------
See description.

Expected result:
----------------
See description.

Actual result:
--------------
See description.

[2012-01-25 18:01 UTC] johannes@php.net

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

In PHP 5.4 the default_charset php.ini option was set to utf-8. You can override this in php.ini or .htaccess or such.
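
Editor's sketch of the override suggested here, with an important caveat: it assumes PHP 5.6 or later, where the manual's changelog states that an omitted encoding argument falls back to the default_charset setting. On 5.4 itself the reporter found that changing default_charset had no effect. The sketch also assumes the internal_encoding setting is left empty, since it would otherwise take precedence.

```php
<?php
// Assumption: PHP >= 5.6, where htmlentities() without an encoding argument
// consults default_charset. On PHP 5.4/5.5 this setting is not consulted.
ini_set('default_charset', 'ISO-8859-1');

// "\xE4" is the ISO-8859-1 byte for "ä".
$latin1 = "Rechnungsadresse \xE4ndern";

// Encoding omitted on purpose; flags given explicitly because their default
// differs between PHP versions.
$escaped = htmlentities($latin1, ENT_COMPAT | ENT_HTML401);

var_dump($escaped); // string(28) "Rechnungsadresse &auml;ndern"
```

The same value could be set in php.ini or .htaccess instead of calling ini_set() at runtime.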

[2012-01-25 18:01 UTC] johannes@php.net

-Status: Open
+Status: Bogus

[2012-01-25 22:52 UTC] rasmus@php.net

I know it hurts, but we really need to move away from ISO-8859-1 and towards 
UTF-8 as the default charset of the Web. We have chosen to take the hit in 5.4. 
The documentation has carried a warning about this impending change for quite a 
while urging people to specify a charset.

For PHP 5.4 compatibility Typo3 should either hardcode iso-8859-1 or they should 
change their calls to:

  htmlentities($a,NULL,'')

to pick up the default script-encoding charset.
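
Editor's sketch of the "hardcode iso-8859-1" option, assuming PHP 5.4 or later. The constant APP_CHARSET and the function app_htmlentities() are hypothetical names chosen for illustration; they are not part of PHP or Typo3. Routing every call through one helper means the charset is stated in a single place instead of at each call site.

```php
<?php
// Hypothetical project-wide setting; not a PHP or Typo3 constant.
define('APP_CHARSET', 'ISO-8859-1');

// Hypothetical helper that replaces bare htmlentities($text) calls.
function app_htmlentities($text)
{
    // Explicit flags and encoding, so the result no longer depends on
    // whichever defaults the running PHP version uses.
    return htmlentities($text, ENT_COMPAT | ENT_HTML401, APP_CHARSET);
}

// "\xE4" is the ISO-8859-1 byte for "ä".
$out = app_htmlentities("Rechnungsadresse \xE4ndern");
var_dump($out); // string(28) "Rechnungsadresse &auml;ndern"
```

If the data is later migrated to UTF-8, only the constant needs to change.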

[2012-01-26 09:18 UTC] t dot nickl at exse dot de

@johannes@php.net:
Setting default_charset to latin1 does not work. An empty string is still output when calling htmlentities() with only one argument.
Your copy&paste preamble does not help. Changing the meaning of the written code is a bug, don't worry.

@rasmus@php.net:
Thank you, I sadly will change every htmlentities($a) to htmlentities($a,NULL,'') before deploying php5.4.

[2012-01-26 23:00 UTC] sixd@php.net

Re-opening as a Doc bug.  The htmlentities doc needs to be clearer about this.  Its "this default is very likely to change" text is outdated.

[2012-01-26 23:00 UTC] sixd@php.net

-Status: Bogus
+Status: Re-Opened
-Package: *General Issues
+Package: Documentation problem

[2012-01-27 10:00 UTC] aharvey@php.net

Automatic comment from SVN on behalf of aharvey
Revision: http://svn.php.net/viewvc/?view=revision&revision=322842
Log: Fix doc bug #60884 (htmlentities() behaves differently and thus breaks existing
code) by clarifying the new behaviour in PHP 5.4.0.

[2012-01-27 10:00 UTC] aharvey@php.net

-Status: Re-Opened
+Status: Closed
-Assigned To:
+Assigned To: aharvey

[2012-01-27 10:00 UTC] aharvey@php.net

This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.

[2012-02-24 09:27 UTC] t dot nickl at exse dot de

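// Run the same checks on the UTF-8 and the latin1 encoding of the German umlaut "Ä".
foreach (array(chr(195).chr(132) /* utf8 */, chr(196) /* latin1 */) as $germanUmlaut)
{
	$str = 'a'.$germanUmlaut.'de';
	$pos = strpos($str, $germanUmlaut);   // byte offset of the umlaut
	var_dump(htmlentities($str));         // encoding argument omitted on purpose
	var_dump($pos);
	var_dump(substr($str, $pos, 1));      // one byte starting at $pos
}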
/*
on php5.3.10:
string(12) "a&Atilde;<secondbyteofutf8umlaut>de"
int(1)
string(1) "<firstbyteofutf8umlaut>"
string(9) "a&Auml;de"
int(1)
string(1) "<latin1umlaut>"


on php5.4:
string(9) "a&Auml;de"
int(1)
string(1) "<firstbyteofutf8umlaut>"
string(0) ""
int(1)
string(1) "<latin1umlaut>"


*/

I find it very funny that htmlentities() assumes utf8 now, but substr() still works on single bytes as if the string were latin1: it returns only the first byte of a multibyte character (seen above).
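
Editor's sketch of the byte versus character distinction raised here. strpos() and substr() count bytes, so a two-byte UTF-8 character is cut in half. The helper utf8_chars() is a hypothetical name, not a PHP built-in; it uses PCRE's /u modifier to split on characters. Where the mbstring extension is available, mb_substr() with an explicit 'UTF-8' encoding is the more usual choice.

```php
<?php
// Hypothetical helper (not a PHP built-in): split a UTF-8 string into
// characters using PCRE's /u modifier, which matches on code points.
function utf8_chars($s)
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? array() : $chars; // false: $s is not valid UTF-8
}

$umlaut = "\xC3\x84";            // "Ä" encoded as UTF-8 (two bytes)
$str    = 'a' . $umlaut . 'de';

$bytePos   = strpos($str, $umlaut);      // byte offset, not character offset
$firstByte = substr($str, $bytePos, 1);  // exactly one byte
$chars     = utf8_chars($str);

var_dump(strlen($str));          // int(5), counted in bytes
var_dump(bin2hex($firstByte));   // string(2) "c3", only half of the umlaut
var_dump(count($chars));         // int(4), counted in characters
var_dump($chars[1] === $umlaut); // bool(true), the whole umlaut
```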
