Bug #79455 scandir() mishandles UTF-8 character(s)
Submitted: 2020-04-07 14:04 UTC Modified: 2020-04-16 10:25 UTC
From: pguest at meaa dot mea dot com Assigned: cmb (profile)
Status: Not a bug Package: Directory function related
PHP Version: 7.4.4 OS: Windows 10
Private report: No CVE-ID: None
 [2020-04-07 14:04 UTC] pguest at meaa dot mea dot com
Description:
------------
A Windows filesystem directory containing the character:
á, 225, 0xE1

is returned from scandir as two characters:
Ã, 195, 0xC3
¡, 161, 0xA1

In other words:
'Arquivo dos Gráficos'
becomes:
'Arquivo dos GrÃ¡ficos'
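
The byte values quoted above can be checked with a short sketch. It only illustrates how the two UTF-8 bytes of á turn into Ã and ¡ when each byte is read as a separate Windows-1252 character; it does not determine where in the toolchain that misreading happens. The helper name misread_as_cp1252() is hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: treat each byte of a UTF-8 string as one
// Windows-1252 character and re-encode the result as UTF-8.
function misread_as_cp1252(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
}

$a_acute = "\u{00E1}";                  // á, code point 225 (0xE1)
echo bin2hex($a_acute), "\n";           // UTF-8 bytes: c3a1
echo misread_as_cp1252($a_acute), "\n"; // the two characters Ã (0xC3) and ¡ (0xA1)
echo misread_as_cp1252('Arquivo dos Gr' . $a_acute . 'ficos'), "\n";
```

If the output of such a script already looks wrong in a particular console, the console's own encoding is a likely suspect as well.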

Test script:
---------------
<?php

$rootdir = 'C:/temp/';
$portuguese_string = 'Arquivo dos Gráficos';
$newdir = $rootdir . $portuguese_string;
if (is_dir($newdir))
{
   rmdir ($newdir);
}
if (is_dir($rootdir))
{
   rmdir ($rootdir);
}

mkdir($rootdir);
mkdir($newdir);

echo sprintf("'%s' encoding: %s\n", $portuguese_string, mb_detect_encoding($portuguese_string));
$scandir_return = scandir($rootdir);
echo "scandir() returns: " . print_r($scandir_return, true);
echo sprintf("'%s' encoding: %s\n", $scandir_return[2], mb_detect_encoding($scandir_return[2]));

?>


Expected result:
----------------
The expectation is that scandir() should read back the directory string with the same UTF-8 characters involved in its creation.

Actual result:
--------------
'Arquivo dos Gráficos'
becomes:
'Arquivo dos GrÃ¡ficos'

 [2020-04-07 14:18 UTC] salathe@php.net
-Package: *Directory Services problems +Package: Directory function related
 [2020-04-07 14:47 UTC] cmb@php.net
-Status: Open +Status: Feedback -Assigned To: +Assigned To: cmb
 [2020-04-07 14:47 UTC] cmb@php.net
I cannot reproduce this.  Please provide the output of
the following script:

<?php
var_dump(
    ini_get('internal_encoding'),
    ini_get('default_charset'),
    ini_get('zend.multibyte'),
    sapi_windows_cp_get('ansi'),
    sapi_windows_cp_get('oem')
);
?>
 [2020-04-07 14:53 UTC] pguest at meaa dot mea dot com
-Status: Feedback +Status: Assigned
 [2020-04-07 14:53 UTC] pguest at meaa dot mea dot com
cmb,
Here is output from:
<?php
var_dump(
    ini_get('internal_encoding'),
    ini_get('default_charset'),
    ini_get('zend.multibyte'),
    sapi_windows_cp_get('ansi'),
    sapi_windows_cp_get('oem')
);
?>

C:\Workspace\PDT\__CORE\php_cmb.php:7:
string(0) ""
C:\Workspace\PDT\__CORE\php_cmb.php:7:
string(5) "UTF-8"
C:\Workspace\PDT\__CORE\php_cmb.php:7:
string(1) "0"
C:\Workspace\PDT\__CORE\php_cmb.php:7:
int(1252)
C:\Workspace\PDT\__CORE\php_cmb.php:7:
int(437)
 [2020-04-08 07:37 UTC] cmb@php.net
-Status: Assigned +Status: Feedback
 [2020-04-08 07:37 UTC] cmb@php.net
Thanks for the info!  So apparently your script is UTF-8 encoded,
and you're using the default INI settings, which are supposed to
produce the desired filenames, but it works as if there were a call
to `sapi_windows_cp_set(1252)` at the beginning of the script.  Is
there an auto_prepend_file that makes such a call?

Anyhow, I suppose that adding

    sapi_windows_cp_set(65001);

at the top of the script should enforce the desired behavior.
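
A minimal sketch of that suggestion, guarded so that the same script also runs on non-Windows systems, might look as follows. The wrapper name use_utf8_code_page() is hypothetical; sapi_windows_cp_set() and sapi_windows_cp_get() are only available in Windows builds of PHP, hence the function_exists() check.

```php
<?php
// Hypothetical wrapper: switch the process code page to UTF-8 (65001)
// where that is possible, and report whether the switch was made.
function use_utf8_code_page(): bool
{
    if (PHP_OS_FAMILY !== 'Windows' || !function_exists('sapi_windows_cp_set')) {
        return false; // not a Windows build, nothing to switch
    }
    return sapi_windows_cp_set(65001);
}

if (use_utf8_code_page()) {
    // Show the code page now in effect for this process.
    echo 'Code page now: ', sapi_windows_cp_get(), "\n";
}
```

Calling this before any filesystem function such as mkdir() or scandir() keeps the whole script on a single code page.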

By the way, which SAPI do you use?
 [2020-04-09 14:47 UTC] pguest at meaa dot mea dot com
I use CLI SAPI within Eclipse PDT IDE.  I do not explicitly involve an auto_prepend_file within php.ini.  I am not sure whether the Eclipse environment invokes certain configuration files prior to a run configuration launch.
However I am thankful for your suggestion for use of sapi_windows_cp_set() within my scripts.  This dialog with you has helped educate me in regard to character encoding and code pages.  Thank you!
 [2020-04-16 10:25 UTC] cmb@php.net
-Status: Feedback +Status: Not a bug
 [2020-04-16 10:25 UTC] cmb@php.net
> I use CLI SAPI within Eclipse PDT IDE.

Ah, that custom console layer likely explains the issue.  Great
that you have been able to resolve the problem!