Bug #77907 mb-functions do not respect default_encoding
Submitted: 2019-04-16 14:06 UTC Modified: 2019-04-17 12:12 UTC
From: n dot scheer at binserv dot de Assigned: nikic (profile)
Status: Closed Package: *Unicode Issues
PHP Version: 7.2.17 OS: CentOS 7.6.1810
Private report: No CVE-ID: None
 [2019-04-16 14:06 UTC] n dot scheer at binserv dot de
Description:
------------
The documentation states (cf. https://www.php.net/manual/en/ini.core.php#ini.default-charset):

"In PHP 5.6 onwards, "UTF-8" is the default value and [...] The value of default_charset
will also be used to set the default character set for [...] and for mbstring functions
if the mbstring.http_input mbstring.http_output mbstring.internal_encoding
configuration option is unset."

As such, I'd expect to be able to set default_charset to iso-8859-1 and have mbstring pick up that same setting for its internal encoding (provided the mentioned directives are unset).

In the test script below, mb_strlen() should return 2 in both cases, because ISO-8859-1 is a single-byte encoding. That is not what happens: the two bytes are instead treated as the single UTF-8 character "ö".
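
To make the difference concrete, here is a minimal sketch that passes the encoding explicitly, so the result does not depend on any ini setting. The variable name $bytes is only illustrative; the byte sequence is the one used in the test script.

```php
<?php
// "\xc3\xb6" is the two-byte UTF-8 encoding of "ö" (U+00F6).
$bytes = "\xc3\xb6";

// Interpreted as UTF-8, the two bytes form one character.
var_dump(mb_strlen($bytes, 'UTF-8'));      // int(1)

// Interpreted as ISO-8859-1, every byte is its own character.
var_dump(mb_strlen($bytes, 'ISO-8859-1')); // int(2)

// '8bit' and plain strlen() both count raw bytes.
var_dump(mb_strlen($bytes, '8bit'));       // int(2)
var_dump(strlen($bytes));                  // int(2)
```

A default-encoding result of 1 therefore indicates that mbstring is still decoding the string as UTF-8.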

If mb_internal_encoding is set explicitly, the mb-functions work as expected.

However, it should not be necessary to call mb_internal_encoding() or to set the corresponding ini directive, because default_charset is documented as the fallback.
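
Until mbstring follows default_charset on its own, one possible workaround is to copy the value across by hand. The sketch below assumes the mbstring.* directives are unset, as in this report; the helper name sync_mb_encoding_with_default_charset() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): make mbstring's internal
// encoding match the current default_charset setting.
function sync_mb_encoding_with_default_charset(): string
{
    $charset = ini_get('default_charset');

    // Leave mbstring untouched if default_charset is empty or unavailable.
    if ($charset === false || $charset === '') {
        return mb_internal_encoding();
    }

    // Explicitly apply the charset, which is the step that works as
    // expected according to the report.
    mb_internal_encoding($charset);

    // Return the canonical name mbstring now reports.
    return mb_internal_encoding();
}

ini_set('default_charset', 'iso-8859-1');
$encoding = sync_mb_encoding_with_default_charset();

var_dump($encoding);             // string(10) "ISO-8859-1"
var_dump(mb_strlen("\xc3\xb6")); // int(2)
```

The helper has to be called again after every later change to default_charset, which is exactly the manual step the documented behavior should make unnecessary.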





Test script:
---------------
<?php

ini_set('default_charset', 'iso-8859-1');
var_dump(ini_get("mbstring.internal_encoding"));
var_dump(ini_get("mbstring.http_input"));
var_dump(ini_get("mbstring.http_output"));
var_dump(mb_internal_encoding());
var_dump(mb_strlen( "\xc3\xb6" ));
var_dump(mb_strlen( "\xc3\xb6", '8bit' ));

mb_internal_encoding('iso-8859-1');
var_dump(mb_internal_encoding());
var_dump(mb_strlen( "\xc3\xb6" ));
var_dump(mb_strlen( "\xc3\xb6", '8bit' ));

Expected result:
----------------
string(0) ""
string(0) ""
string(0) ""
string(10) "ISO-8859-1"
int(2)
int(2)
string(10) "ISO-8859-1"
int(2)
int(2)

Actual result:
--------------
string(0) ""
string(0) ""
string(0) ""
string(5) "UTF-8"
int(1)
int(2)
string(10) "ISO-8859-1"
int(2)
int(2)

History

 [2019-04-16 14:13 UTC] nikic@php.net
-Assigned To: +Assigned To: nikic
 [2019-04-16 14:13 UTC] nikic@php.net
Working on this right now...
 [2019-04-16 14:40 UTC] nikic@php.net
https://github.com/php/php-src/pull/4035

Probably 7.4 only, this is a non-trivial change.
 [2019-04-17 12:08 UTC] nikic@php.net
-Status: Assigned +Status: Closed
 [2019-04-17 12:12 UTC] n dot scheer at binserv dot de
That was fast - thank you! :)
 