PHP :: Bug #21027 :: htmlspecialchars() misbehaviour

Bug #21027	htmlspecialchars() misbehaviour
Submitted:	2002-12-15 10:08 UTC	Modified:	2002-12-15 10:19 UTC
From:	flying at dom dot natm dot ru	Assigned:
Status:	Not a bug	Package:	Scripting Engine problem
PHP Version:	4.3.0RC3	OS:	All
Private report:	No	CVE-ID:	None


[2002-12-15 10:08 UTC] flying at dom dot natm dot ru

htmlspecialchars() handles the '&' character incorrectly: it doesn't check whether the '&' is already part of an entity. This produces very "funny" results when the function is called several times on the same string. For example:

echo htmlspecialchars(htmlspecialchars(htmlspecialchars(htmlspecialchars(htmlspecialchars('text & text')))));

will produce: 
text &amp;amp;amp;amp; text 
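
A minimal sketch that reproduces the nesting one pass at a time and prints the raw string after each call. The variable names are illustrative only. The raw result after five passes contains five "amp;" groups; the output quoted above shows four, presumably because it was read from a rendered HTML page, where the browser decodes one level of escaping.

```php
<?php
// Apply htmlspecialchars() five times and show the raw string after each pass.
$s = 'text & text';
for ($i = 1; $i <= 5; $i++) {
    $s = htmlspecialchars($s);   // each pass turns every '&' into '&amp;'
    echo $i, ': ', $s, "\n";
}
// Final raw value: text &amp;amp;amp;amp;amp; text
```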

The most correct behaviour would be to check whether the '&' is followed by a valid entity as described in the HTML specification. However, that can be quite hard to do, because there are lots of entities. So another way is also possible (it should be faster, but dirtier): just check whether the '&' character starts something shaped like an entity. Here are 2 regular expressions that implement correct '&' handling:

1. This is the correct way to handle entities:
// The x modifier lets the pattern span several lines; '#' is escaped because x would otherwise start a comment there.
preg_replace('/\&(?!((\#\d{1,5})|(\#(x|X)[\dA-Fa-f]{1,4})|[aA]acute|[aA]circ|acute|(ae|AE)lig|
[aA]grave|alefsym|[aA]lpha|amp|an[dg]|[aA]ring|asymp|[aA]tilde|[aA]uml|
bdquo|[bB]eta|brvbar|bull|cap|[cC]cedil|cedil|cent|[cC]hi|circ|clubs|cong|
copy|crarr|cup|curren|[dD]agger|d[aA]rr|deg|[dD]elta|diams|divide|[eE]acute|
[eE]circ|[eE]grave|empty|e[mn]sp|[eE]psilon|equiv|[eE]ta|eth|ETH|[eE]uml|
euro|exist|fnof|forall|frac1[24]|frac34|frasl|[gG]amma|g[et]|h[aA]rr|hearts|
hellip|[iI]acute|[iI]circ|iexcl|[iI]grave|image|infin|int|[iI]ota|iquest|
isin|[iI]uml|[kK]appa|[lL]ambda|lang|laquo|l[aA]rr|lceil|ldquo|le|lfloor|
lowast|loz|lrm|lsa?quo|lt|macr|mdash|micro|middot|minus|[mM]u|nabla|nbsp|
ndash|n[ei]|not(in)?|nsub|[nN]tilde|[nN]u|[oO]acute|[oO]circ|(oe|OE)lig|
[oO]grave|oline|[oO]mega|[oO]micron|oplus|or|ord[fm]|[oO]slash|[oO]tilde|
otimes|[oO]uml|par[at]|permil|perp|[pP]hi|[pP]i|piv|plusmn|pound|[pP]rime|
pro[dp]|[pP]si|quot|radic|rang|raquo|r[aA]rr|rceil|rdquo|real|reg|rfloor|
[rR]ho|rlm|rsaquo|rsquo|sbquo|[sS]caron|sdot|sect|shy|[sS]igma|sigmaf|sim|
spades|sube?|sum|sup[123e]?|szlig|[tT]au|there4|[tT]heta|thetasym|thinsp|
thorn|THORN|tilde|times|trade|[uU]acute|u[aA]rr|[uU]circ|[uU]grave|uml|
upsih|[uU]psilon|[uU]uml|weierp|[xX]i|[yY]acute|yen|[yY]uml|[zZ]eta|zwn?j);)/x','&amp;',$str);

2. This is a less correct, but still better, way to handle them:
preg_replace('/&(?!(([A-Za-z_:][A-Za-z0-9\.\-_:]*)|(#\d+)|(#(x|X)[\dA-Fa-f]+));)/','&amp;',$str);

A good thing about the second regexp is that, if htmlspecialchars() were to implement this approach, it could also be used to handle XML entities.
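
To illustrate what the second pattern does, here is a hedged sketch that wraps it in a helper. The function name `escape_bare_ampersands` is hypothetical and not part of PHP. It only touches '&'; unlike htmlspecialchars(), it does not escape '<', '>' or quotes.

```php
<?php
// Hypothetical helper (not a PHP built-in): escape only those '&' characters
// that do not already start something shaped like an entity reference.
function escape_bare_ampersands(string $str): string
{
    return preg_replace(
        '/&(?!(([A-Za-z_:][A-Za-z0-9\.\-_:]*)|(#\d+)|(#(x|X)[\dA-Fa-f]+));)/',
        '&amp;',
        $str
    );
}

$once  = escape_bare_ampersands('text & text &lt; &#38; &bogus;');
$twice = escape_bare_ampersands($once);

echo $once, "\n";   // text &amp; text &lt; &#38; &bogus;
echo $twice, "\n";  // identical: a second pass changes nothing
```

Note that `&bogus;` is left alone even though it is not a real HTML entity. That is the "less correct" part of this approach: anything that merely looks like an entity is treated as one.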


[2002-12-15 10:12 UTC] derick@php.net

What is wrong with that output of your test script?

Derick

[2002-12-15 10:17 UTC] flying at dom dot natm dot ru

> What is wrong with that output of your test script?

Only one thing: it produces:
text &amp;amp;amp;amp; text 

while it must be:
text &amp; text 

regardless of the number of times I call htmlspecialchars().

[2002-12-15 10:19 UTC] derick@php.net

No, that's not true.

If you want to htmlspecialchars "&amp;" it comes out as "&amp;amp;" which is exactly as it should be. Encoding "&amp;amp;" will result in "&amp;amp;amp;", again as it should do.

Also htmlspecialchars("&lt;") will return "&amp;lt;" which is expected too.

Derick
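
The behaviour described in this reply can be checked with a short sketch. The first half shows the default escaping and why it matters: the result decodes back to exactly the input. The second half relies on the `double_encode` argument that, per the PHP manual, was added to htmlspecialchars() in later releases (5.2.3 and up), so it does not apply to the 4.3.0RC3 version in this report. Variable names are illustrative.

```php
<?php
// Default behaviour: every '&' is escaped, even one that already starts an entity.
$a = htmlspecialchars('&amp;');   // '&amp;amp;'
$b = htmlspecialchars('&lt;');    // '&amp;lt;'

// Because of that, escaping is reversible: decoding returns the original text.
$roundTrip = htmlspecialchars_decode(htmlspecialchars('&lt;'));   // '&lt;'

// Later PHP versions only: double_encode = false (fourth argument)
// leaves existing entities untouched, so repeated calls are stable.
$c = htmlspecialchars('text & text &lt;', ENT_QUOTES, 'UTF-8', false);
$d = htmlspecialchars($c, ENT_QUOTES, 'UTF-8', false);

echo $a, "\n", $b, "\n", $roundTrip, "\n", $c, "\n", $d, "\n";
```

With `double_encode` disabled, the literal text "&lt;" can no longer be told apart from an escaped "<", which is the trade-off behind keeping full escaping as the default.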



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Wed Jul 02 00:01:34 2025 UTC