Bug #66484 double encoding in htmlentities()
Submitted: 2014-01-14 10:39 UTC Modified: 2014-01-14 19:48 UTC
From: spam2 at rhsoft dot net Assigned:
Status: Not a bug Package: Scripting Engine problem
PHP Version: 5.5.8 OS:
Private report: No CVE-ID: None
 [2014-01-14 10:39 UTC] spam2 at rhsoft dot net
Description:
------------
Input:  Hasta ve Çalışan Güvenliği
Output: Hasta ve Çalışan Güvenliği

&amp;#305; must be &#305;



Test script:
---------------
htmlentities($text, ENT_QUOTES, 'ISO8859-1');

$text comes from a MySQL database.
Yes, I know that this should not be handled as ISO-8859-1,
but that is no reason for double entities at all.


 [2014-01-14 18:28 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2014-01-14 18:28 UTC] requinix@php.net
Go into your database and look at the value directly. I bet you it has
  Hasta ve Çal&#305;&#351;an Güvenli&#287;i
 [2014-01-14 19:10 UTC] spam2 at rhsoft dot net
you lose that bet (at least partly)

well, there is only one entity, and only god knows
where it came from, because the entered data
is for sure not entity-encoded in the source;
otherwise ü would also be stored as an entity in the DB

+---------------------------------+
| s2titel                         |
+---------------------------------+
| Hasta Güvenli&#287;i Platformu  |
+---------------------------------+
1 row in set (0.00 sec)

htmlentities: Hasta ve Çalışan Güvenliği
browser display in the input field: Hasta ve Çalışan Güvenliği

as you can see there are more entities than expected
 [2014-01-14 19:29 UTC] requinix@php.net
-Status: Feedback +Status: Not a bug
 [2014-01-14 19:29 UTC] requinix@php.net
Looks like I totally won that, actually: I didn't say *all* the entities would be encoded, just those three (on that input). More specifically, there'll be numeric entities in place of characters that aren't present in ISO 8859-1.

I'm also betting you entered the data through an HTML form on a page that's encoded (or interpreted) as ISO 8859-1? When you entered characters that the character set doesn't support, the browser entity-encoded them, using the only encoding scheme it knew would work: numeric entities. [1]

So not a bug: PHP diligently encoded the already-partially-encoded input as you requested. Your two best options are
a) Use an encoding that supports the characters you want to use
b) Pass $double_encode=false to htmlentities()

[1] Ratified in HTML 5 as http://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm step 4.3. It's also the behavior for HTML 4 but I don't see where it's declared in the standard.
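
A minimal sketch of options a) and b) above, assuming the stored value contains one numeric entity as in the database row shown earlier in this thread. The strings are written with byte escapes so the result does not depend on how the script file itself is saved; the variable names are illustrative only.

```php
<?php
// Stored value as ISO-8859-1 bytes: "\xFC" is "ü"; "&#287;" is already in the data.
$stored = "Hasta G\xFCvenli&#287;i Platformu";

// Default ($double_encode = true): the "&" of the existing entity is escaped again.
$default = htmlentities($stored, ENT_QUOTES, 'ISO-8859-1');

// Option b): entities that are already present are left alone.
$noDouble = htmlentities($stored, ENT_QUOTES, 'ISO-8859-1', false);

echo $default, "\n";  // Hasta G&uuml;venli&amp;#287;i Platformu
echo $noDouble, "\n"; // Hasta G&uuml;venli&#287;i Platformu

// Option a): keep the text in UTF-8 end to end, so no character ever has to be
// stored as a numeric entity. "\xC3\xBC" is "ü" and "\xC4\x9F" is "ğ" in UTF-8.
$utf8 = "Hasta G\xC3\xBCvenli\xC4\x9Fi Platformu";
$escaped = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // contains no "&amp;", because the input had no "&" to escape
```

Option b) only hides the symptom: a literal "&#287;" that a user really typed would also be left unescaped. Option a) removes the cause.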
 [2014-01-14 19:37 UTC] spam2 at rhsoft dot net
how do you imagine that you have totally won?

DATABASE: Hasta Güvenli&#287;i Platformu
ENCODED:  Hasta ve Çalışan Güvenliği

however, starting with PHP 5.4 htmlentities() is *completely broken* anyway, because the stupid change of the default to UTF-8 breaks *any code* not working with UTF-8, and then PHP upstream wonders why nobody upgrades servers

well, we did, because we have been using wrappers in at least 98% of all code from the last 10 years, since htmlentities() never correctly encoded the whole ASCII table
 [2014-01-14 19:48 UTC] requinix@php.net
The bet is beside the point: characters submitted through an HTML form that are not representable in the character encoding used will be numeric-entity-encoded by the browser.
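
A rough sketch of that browser behavior and what htmlentities() then does with the result. It assumes the mbstring extension is available and that the script file is saved as UTF-8; the function name simulateLatin1FormSubmission() is hypothetical, and this is a simplified model, not what any particular browser runs.

```php
<?php
// Hypothetical helper: mimic a form submission from an ISO-8859-1 page.
// Characters above U+00FF cannot be represented, so they become &#NNN;.
function simulateLatin1FormSubmission(string $utf8): string
{
    // Convert every code point from U+0100 upward to a decimal numeric entity.
    $convmap = [0x100, 0x10FFFF, 0, 0xFFFFFF];
    $withEntities = mb_encode_numericentity($utf8, $convmap, 'UTF-8');

    // Everything that is left fits into ISO-8859-1.
    return mb_convert_encoding($withEntities, 'ISO-8859-1', 'UTF-8');
}

$typed = "Hasta ve Çalışan Güvenliği"; // what the user typed (UTF-8 source file)
$submitted = simulateLatin1FormSubmission($typed);
// $submitted is now "Hasta ve \xC7al&#305;&#351;an G\xFCvenli&#287;i" in ISO-8859-1.

// With the default $double_encode = true, the "&" of each entity is escaped again.
echo htmlentities($submitted, ENT_QUOTES, 'ISO-8859-1'), "\n";
// Hasta ve &Ccedil;al&amp;#305;&amp;#351;an G&uuml;venli&amp;#287;i
```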

There is no possible way you gave that input string to htmlentities() and got that output string returned so I'm going to extrapolate:

Input:  Hasta Güvenli&#287;i Platformu
Output: Hasta G&uuml;venli&amp;#287;i Platformu

Input:  Hasta ve Çal&#305;&#351;an Güvenli&#287;i
Output: Hasta ve &Ccedil;al&amp;#305;&amp;#351;an G&uuml;venli&amp;#287;i

In both cases htmlentities() did exactly what it was supposed to do. The fact that the input is already partially encoded is irrelevant because you didn't pass $double_encode=false to the function.
 
Last updated: Thu Mar 28 16:01:29 2024 UTC