Bug #74589 __DIR__ wrong for unicode character
Submitted: 2017-05-14 07:07 UTC Modified: 2017-05-15 16:25 UTC
From: ganlvtech at qq dot com Assigned: ab (profile)
Status: Closed Package: *General Issues
PHP Version: 7.1.5 OS: Windows
Private report: No CVE-ID: None
 [2017-05-14 07:07 UTC] ganlvtech at qq dot com
Description:
------------
Save the test script to "D:\新建文件夹\test.php". (The path must contain at least one non-ASCII character.)

Then execute `php test.php`.

On Windows, it prints:

D:\
D:\新建文件夹
bool(false)


Saved to "D:\新建文件夹a\test.php", it prints:

D:\新建文件夹a
D:\新建文件夹a
bool(true)


Saved to "D:\a新建文件夹\test.php", it prints:

D:\
D:\a新建文件夹
bool(false)


Saved to "D:\a新建文件夹\a新建文件夹\test.php", it prints:

D:\
D:\a新建文件夹\a新建文件夹
bool(false)


So it appears that if a directory name ends with a non-ASCII character, __DIR__ drops that path component, and it keeps dropping components until it reaches a directory whose name does not end with a non-ASCII character.

Sorry for my poor English.


Test script:
---------------
<?php
echo __DIR__, "\n";
echo dirname(__FILE__), "\n";
var_dump(__DIR__ === dirname(__FILE__));
?>


 [2017-05-14 14:37 UTC] ab@php.net
-Status: Open +Status: Feedback
 [2017-05-14 14:37 UTC] ab@php.net
Thanks for the report. Please also post:

- default_charset INI
- internal_encoding INI
- whether zend_multibyte is used
- whether you're on a DBCS system (codepage 932, 936 or similar)

I have to say that so far I've tried the snippet on a cp 437 system with default php.ini settings, and it works correctly there. I suspect the issue is tied to a particular system codepage combined with non-UTF-8 settings in php.ini. It would be helpful if you could do a bit more research in this direction.

Thanks.
 [2017-05-14 17:52 UTC] ganlvtech at qq dot com
php.ini is php.ini-development

default_charset => UTF-8 => UTF-8
internal_encoding => no value => no value
zend.multibyte => Off => Off

PHP version: PHP 7.1.5 (cli) (built: May  9 2017 19:48:36) ( NTS MSVC14 (Visual C++ 2015) x64 )
System: Windows 10 Home (64bit). (x64 processor)
Default language: zh-CN (There is no other system language supported in my system. My system cannot switch into English mode)
Default code page: 936(GBK)

I tried `chcp 65001` and `chcp 437`; neither made any difference.

=====

I also tried the interactive shell, where it seems to work correctly.

D:\新建文件夹>php -a
Interactive shell

php > echo __FILE__;
php shell code
php > echo __DIR__;
D:\新建文件夹

=====

I have also tried PHP 5.4 and PHP 5.6 (both using php.ini-development).

"D:\新建文件夹\test.php" shows (different from php7)

D:\
D:\
bool(true)


So I tried the following.

Script:

<?php
echo __DIR__, "\n";
echo dirname(__FILE__), "\n";
var_dump(__DIR__ === dirname(__FILE__));

echo __FILE__, "\n";
echo str_replace('\\', '/', __FILE__), "\n";
echo dirname(str_replace('\\', '/', __FILE__)), "\n";
?>

Result: (php 5.4)

D:\
D:\
bool(true)
D:\新建文件夹\test.php
D:/新建文件夹/test.php
D:/新建文件夹

=====

Everything works correctly on Ubuntu Server 16.04 (PHP 7.0 and PHP 5.5 were tested).

I think it may be caused by the backslash.
 [2017-05-14 18:16 UTC] ganlvtech at qq dot com
I tested GBK and UTF-8 as default_charset with PHP 5.4, 5.6 and 7.1.

Test Report:


php: 5.4 or 5.6
default_charset: GBK or UTF-8

Results:

D:\
D:\
bool(true)


-----

php: 7.1
default_charset: GBK or UTF-8

Results:

D:\
D:\新建文件夹
bool(false)

=====

CJK characters, and even \u00a1, can cause a wrong result.

It seems the result is correct only if the trailing character of the directory name is an ASCII character.

=====

Hope that the tests above can help you.

Thanks.
 [2017-05-14 19:33 UTC] ganlvtech at qq dot com
<?php
echo __FILE__, "\n";
echo strlen(__FILE__), "\n";
echo mb_strlen(__FILE__), "\n";
?>

(php 7.1, cp 936)
D:\新建文件夹>php test.php
D:\新建文件夹\test.php
27
17


(php 5.4, cp 936)
D:\新建文件夹>php54 test.php
D:\新建文件夹\test.php
22
22
 [2017-05-14 20:31 UTC] ganlvtech at qq dot com
Probable cause:

When the PHP core gets the filename from the system, my system returns a string encoded in cp936. PHP 5 does not convert the charset automatically, so strlen and mb_strlen both return 22 (one Chinese character is two bytes). PHP 7 converts the charset automatically, so strlen is 27 (one Chinese character is 3 bytes in UTF-8) and mb_strlen is 17.
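
The byte counts above can be reproduced on any platform. This is a minimal sketch, assuming the mbstring extension is loaded and that its `CP936` encoding name corresponds to the Windows code page 936. The path is built from `\u{...}` escapes (PHP 7.0 or later), so the result does not depend on how the script file itself is saved.

```php
<?php
// "新建文件夹" written as code point escapes, which PHP expands to UTF-8 bytes.
$dir  = "\u{65B0}\u{5EFA}\u{6587}\u{4EF6}\u{5939}";
$utf8 = "D:\\{$dir}\\test.php";

// Re-encode to the code page a zh-CN Windows system would use
// (assumption: mbstring's CP936 matches it).
$gbk = mb_convert_encoding($utf8, 'CP936', 'UTF-8');

echo strlen($utf8), "\n";             // 27 bytes: each Chinese character is 3 bytes in UTF-8
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 17 characters
echo strlen($gbk), "\n";              // 22 bytes: each Chinese character is 2 bytes in GBK
```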

The zend_dirname function uses the macro IS_SLASH_P, and that macro calls the Win32 API IsDBCSLeadByte on the byte before a backslash. For cp936 (GBK), both bytes of a Chinese character are typically above 0x80, so IsDBCSLeadByte returns non-zero even when it is testing the second byte. The backslash that follows such a byte is then not treated as a separator.

In PHP 7, dirname(__FILE__) passes a converted UTF-8 string to zend_dirname, so it works well. But there might not be an automatic conversion in the Zend engine when __DIR__ is used directly.

In PHP 5 no conversion is ever applied automatically, so both forms fail.
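
The lead-byte effect can be sketched in userland. This is not the actual zend_dirname code: `is_cp936_lead_byte()` and `naive_dirname()` are hypothetical helpers, the 0x81-0xFE lead-byte range for code page 936 is an assumption standing in for the Win32 call, and drive-root handling is reduced to a single special case.

```php
<?php
// Hypothetical stand-in for IsDBCSLeadByte() under code page 936
// (assumed lead-byte range: 0x81-0xFE).
function is_cp936_lead_byte(string $byte): bool
{
    $value = ord($byte);
    return $value >= 0x81 && $value <= 0xFE;
}

// Hypothetical, simplified directory-name search: scan backwards for a
// backslash, but skip it when the preceding byte looks like a lead byte.
function naive_dirname(string $path): string
{
    for ($i = strlen($path) - 1; $i > 0; $i--) {
        if ($path[$i] === '\\' && !is_cp936_lead_byte($path[$i - 1])) {
            // Keep the separator for a drive root such as "D:\".
            return ($i === 2 && $path[1] === ':')
                ? substr($path, 0, 3)
                : substr($path, 0, $i);
        }
    }
    return $path;
}

$dir = "\u{65B0}\u{5EFA}\u{6587}\u{4EF6}\u{5939}"; // "新建文件夹" as UTF-8

echo naive_dirname("D:\\{$dir}\\test.php"), "\n";  // D:\
echo naive_dirname("D:\\{$dir}a\\test.php"), "\n"; // D:\新建文件夹a
```

With the path in UTF-8, the byte before the separator is the last byte of "夹" (0xB9), which falls inside the assumed lead-byte range, so the separator is skipped. A trailing ASCII "a" avoids that, which matches the outputs in the original report.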

Summary:
Everything is caused by my system returning a cp936 (GBK) encoded path.

This may not be a bug in PHP, but it should be mentioned in the PHP docs.

Thanks.
 [2017-05-15 10:46 UTC] ab@php.net
Thanks for this deep investigation. Yes, zend_dirname is where my debug session leads me as well. I think that is the exact point. I still couldn't reproduce this on a cp 437 system, so I'm getting a VM with cp 936, which might take some time. If you're able to debug internals or even produce a patch, I can also evaluate and test that.

Basically, there is 7.1, which has UTF-8 support, and there are the versions before it. At the time of the initial patch, I explicitly left zend_dirname() as is and integrated the new API elsewhere instead, for example in the userland dirname(). The one catch with the new API is that it needs the INI to have been initialized beforehand. So I need to check this and reevaluate; if everything is OK, zend_dirname can simply be switched to the new API on Windows.

Thanks.
 [2017-05-15 13:00 UTC] ganlvtech at qq dot com
in ext/standard/string.c:1647

1629 PHP_FUNCTION(dirname)
...
1646 #ifdef PHP_WIN32
1647                 ZSTR_LEN(ret) = php_win32_ioutil_dirname(ZSTR_VAL(ret), str_len);
1648 #else
1649                 ZSTR_LEN(ret) = zend_dirname(ZSTR_VAL(ret), str_len);
1650 #endif

php_win32_ioutil_dirname is used if PHP_WIN32 is defined.


but in Zend/zend_compile.c:6505

6501                 case T_DIR:
6502                 {
6503                         zend_string *filename = CG(compiled_filename);
6504                         zend_string *dirname = zend_string_init(ZSTR_VAL(filename), ZSTR_LEN(filename), 0);
6505                         zend_dirname(ZSTR_VAL(dirname), ZSTR_LEN(dirname));

Here zend_dirname is always used, regardless of platform.


I'm not very sure about php-src's code structure. It may be a little difficult for me to produce a patch.
 [2017-05-15 14:38 UTC] ab@php.net
Automatic comment on behalf of ab
Revision: http://git.php.net/?p=php-src.git;a=commit;h=ae3f975c5d58f891359a72ad3df84d845e70cdcc
Log: Fixed bug #74589 __DIR__ wrong for unicode character
 [2017-05-15 14:38 UTC] ab@php.net
-Status: Feedback +Status: Closed
 [2017-05-15 14:49 UTC] ab@php.net
-Status: Closed +Status: Feedback
 [2017-05-15 14:49 UTC] ab@php.net
Finally got the VM, issue confirmed. I've pushed a change in this regard. Any 7.1 or master snapshot starting with ae3f975c5d58f891359a72ad3df84d845e70cdcc is suitable for a test, please fetch one from http://windows.php.net/snapshots/

Thanks.
 [2017-05-15 15:22 UTC] ganlvtech at qq dot com
php-7.1-rae3f975 passed the test.

But what about PHP < 7.1?

There should be a caution in the __DIR__ and dirname() docs.

CAUTION! If you are using a Windows server and PHP < 7.1, make sure that every character in your PHP script's full path is an ASCII character.
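
One way to act on that caution is to check the script path at startup. This is a minimal sketch; `path_is_ascii()` is a hypothetical helper, not a PHP built-in.

```php
<?php
// Hypothetical helper: true when every byte of $path is in the ASCII range.
function path_is_ascii(string $path): bool
{
    return preg_match('/[^\x00-\x7F]/', $path) === 0;
}

if (!path_is_ascii(__FILE__)) {
    // On Windows with PHP < 7.1, __DIR__ and dirname() may be unreliable here.
    trigger_error('Script path contains non-ASCII characters', E_USER_WARNING);
}
```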
 [2017-05-15 16:25 UTC] ab@php.net
-Status: Feedback +Status: Closed -Assigned To: +Assigned To: ab
 [2017-05-15 16:25 UTC] ab@php.net
Thanks for checking. In effect, PHP 7.1 is a huge rewrite of the FS functions and of UTF-8 support on Windows in general. Please also check http://git.php.net/?p=php-src.git;a=blob;f=UPGRADING;h=9e23d7a247e0a8f76e1a29f333e08b41237b3d5c;hb=refs/heads/PHP-7.1#l442  Earlier versions use ANSI APIs only, so wherever parsing can fail, it does. That is a known issue which is likely to be met on other platforms too, for example on systems using a non-ASCII multibyte encoding such as BIG5.

On Windows this is a well-known and long-standing issue, and it was finally fixed in 7.1 thanks to the wide char APIs. In PHP < 7.1 there is no support for non-ANSI filenames anyway. Some uncritical or hard-to-catch places like the one you've found might still be present, and they are being cleaned up as they turn up. I think I'll just close the ticket for now. You can still reopen it and change it to a doc bug, create a new one, or post a doc patch to https://edit.php.net/ . This is however general behavior in earlier PHP versions; it doesn't concern only __DIR__.

Thanks!
 