[2010-04-15 16:06 UTC] zdenis at free dot fr
Description:
------------
When using mb_detect_encoding, the detected charset depends on how many "é" characters (or any characters above 127) are present in the string, so the result is inconsistent and sometimes wrong.

Test script:
---------------
// little example
php -r "echo mb_detect_encoding(\"é\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"éé\", 'UTF-8,ISO-8859-1');"

// real life example
php -r "echo mb_detect_encoding(\"Produit commandé\", 'UTF-8,ISO-8859-1');"
php -r "echo mb_detect_encoding(\"Société\", 'UTF-8,ISO-8859-1');"

Expected result:
----------------
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1

Actual result:
--------------
UTF-8
ISO-8859-1
UTF-8
ISO-8859-1
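
The test script above depends on the encoding the shell uses for "é". As a rough sketch of a reproduction that does not, the snippet below writes the bytes explicitly and checks UTF-8 validity with mb_check_encoding() instead of asking mb_detect_encoding() to choose. The helper name guess_utf8_or_latin1 is hypothetical and only distinguishes the two encodings from this report; it is not how mbstring itself detects encodings.

```php
<?php
// Hypothetical helper (not part of mbstring): report UTF-8 only if the whole
// string is valid UTF-8, otherwise assume ISO-8859-1.
function guess_utf8_or_latin1(string $bytes): string
{
    return mb_check_encoding($bytes, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

// Explicit byte escapes: "\xE9" is "é" in ISO-8859-1, "\xC3\xA9" is "é" in UTF-8.
$samples = [
    "\xE9",
    "\xE9\xE9",
    "Produit command\xE9",
    "Soci\xE9t\xE9",
    "Soci\xC3\xA9t\xC3\xA9",
];

foreach ($samples as $s) {
    echo bin2hex($s), ' => ', guess_utf8_or_latin1($s), PHP_EOL;
}

// mb_detect_encoding() also accepts a documented third argument, $strict.
// The result is printed rather than stated here, because it depends on the
// PHP version in use.
var_dump(mb_detect_encoding("Test\xE9", ['UTF-8', 'ISO-8859-1'], true));
```

With this check the first four samples are reported as ISO-8859-1 and the last as UTF-8. Note that pure ASCII input is reported as UTF-8, and that some ISO-8859-1 byte sequences (for example "\xC3\xA9", which reads "Ã©" in ISO-8859-1) are also valid UTF-8, so the guess is a heuristic.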
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Mon Oct 27 09:00:02 2025 UTC
I am getting Russian spam through my email forms. Strangely enough, mb_detect_encoding() reports my form mail content string as ASCII, even though the characters are around the Unicode Ѐ range. This prevents me from detecting foreign-language characters in my form mail. Please fix. My code is:

//detect foreign languages
$arr[0] = "ASCII";
$arr[1] = "US-ASCII";
if (false===mb_detect_encoding($comment,$arr,true))
{
    echo "<div style='color:red;'>ERRORB:".mb_detect_encoding($comment)."</div>";
    return true; //error
}

and using the string I generated from charmap

ЋϊЁγϋГИБЫЫЏωАДрмдп

I get ASCII as the result of that last mb_detect_encoding($comment) call.

I would like to add some details. The issue appears when the only non-ASCII character is at the end of the string. So my guess is that, while looping over the filters for encoding detection, the last character of the given string is lost.

See this little example:

$utf8String = urldecode('Test%C3%A9');
$encoding = (mb_detect_encoding($utf8String,array('UTF-8','ISO-8859-1')));
echo $encoding.PHP_EOL;
$isoString = urldecode('Test%E9');
$encoding = (mb_detect_encoding($isoString,array('UTF-8','ISO-8859-1')));
echo $encoding.PHP_EOL;

Will print:
-------------
UTF-8
UTF-8

Whereas:

$utf8String = urldecode('Test%C3%A9');
$encoding = (mb_detect_encoding($utf8String .' ', array('UTF-8','ISO-8859-1')));
// ^- Just add a stuffing char to the end of the string
echo $encoding.PHP_EOL;
$isoString = urldecode('Test%E9');
$encoding = (mb_detect_encoding($isoString .' ', array('UTF-8','ISO-8859-1')));
// ^- Just add a stuffing char to the end of the string
echo $encoding.PHP_EOL;

Will print (as expected):
----------------------------
UTF-8
ISO-8859-1

--
I don't know if it will help, but I searched the PHP source for the reason for this behavior. In PHP_FUNCTION(mb_detect_encoding) in https://github.com/php/php-src/blob/9d3cedbd880b5d7bce2c18b799b6ca8b3e58d257/ext/mbstring/mbstring.c, I wonder whether the call

zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s|zb", &str, &str_len, &encoding_list, &strict)

fails to initialize the str_len value correctly, because this value is used afterwards to initialize the mbfl_string object:

mbfl_string_init(&string);
string.no_language = MBSTRG(language);
string.val = (unsigned char *)str;
string.len = str_len;

If not, I suppose the error is in the mbfl_identify_encoding_name function, but I haven't got any leads.
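
For the form-mail check in the first comment above, one possible workaround that avoids encoding detection altogether is to look for any byte outside the 7-bit ASCII range. This is only a sketch: the helper name has_non_ascii is hypothetical, and it assumes the script file and the submitted text are stored as raw bytes (for example UTF-8).

```php
<?php
// Hypothetical helper: true if the string contains any byte outside 0x00-0x7F.
// The pattern is matched byte-wise because the /u modifier is not used.
function has_non_ascii(string $text): bool
{
    return preg_match('/[^\x00-\x7F]/', $text) === 1;
}

// Sample string from the comment above, assuming this file is saved as UTF-8.
$comment = 'ЋϊЁγϋГИБЫЫЏωАДрмдп';

var_dump(has_non_ascii($comment));            // bool(true)
var_dump(has_non_ascii('Plain ASCII text'));  // bool(false)
```

If the form content actually arrives as HTML numeric entities such as &#1035;, the bytes really are ASCII, and this check would not flag them either.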