Bug #54028 Directory::read() cannot handle non-unicode chars properly
Submitted: 2011-02-15 16:51 UTC Modified: 2011-02-25 15:23 UTC
From: schmale at froglogic dot com Assigned:
Status: Not a bug Package: Directory function related
PHP Version: 5.3.5 OS: Windows 7
Private report: No CVE-ID: None
 [2011-02-15 16:51 UTC] schmale at froglogic dot com
Description:
------------
Note: This problem affects ONLY the CLI interpreter, NOT the CGI.

When using dir('path/to/dir'), the read() method does not return UTF-8 if the directory contains, e.g., umlauts (ä, ö, ü). I tested this on Linux and Windows, with both CGI and CLI, and the problem only occurs with Windows/CLI.

Test script:
---------------
$path = 'path/to/directory/which/contains/umlauts';

$directory = dir($path);
while (false !== ($content = $directory->read())) {
    if (mb_check_encoding($content, 'UTF-8') === false) {
        fprintf(STDERR, 'Returned non-utf-8 (%s)', $content);
    }
}


Expected result:
----------------
The expected result, of course, was that the return value of read() is always encoded in UTF-8, i.e. no messages are printed when we run the script.

Actual result:
--------------
If a subdirectory name contains umlauts (or, I guess, any non-ASCII character), a message is printed, i.e. the return value is not encoded in UTF-8.

 [2011-02-15 16:54 UTC] pajoye@php.net
-Status: Open +Status: Bogus
 [2011-02-15 16:54 UTC] pajoye@php.net
There is already a feature request for unicode filesystem support.

Btw, Windows does not use UTF-8 for its encoding.
 [2011-02-15 17:10 UTC] schmale at froglogic dot com
Well, I don't know what encoding Windows uses, but I do know that it works properly with the Windows CGI version. The point is that a directory called 'Startmenü' is returned as 'Startmenü' with Linux/CGI, Linux/CLI, and Windows/CGI, but NOT with Windows/CLI, which returns 'Startmenñæ' (or something similar). In other words: the behaviour with Windows/CLI is broken, whereas the other versions return the exact name of the directory, as expected.

So I think it has nothing (little) to do with unicode filesystem support or the encoding of Windows, but with differences between CGI and CLI.
 [2011-02-25 13:29 UTC] carsten_sttgt at gmx dot de
| and the problem does only occur with Windows/CLI.

I see no difference between CGI and CLI (both executed from the shell).

Of course, something is curious:
<?php
$directory = dir(getenv('USERPROFILE'));
while (false !== ($content = $directory->read())) {
    if (mb_check_encoding($content, 'UTF-8') === false) {
        printf('Returned non-utf-8 (%s)', $content);
        printf(" Encoding: %s\r\n", mb_detect_encoding($content));
    }
}
?>

And the output is:
Returned non-utf-8 (Startmenü) Encoding: UTF-8


Regards,
Carsten
 [2011-02-25 13:32 UTC] pajoye@php.net
There is no UTF-8 support in Windows APIs or in PHP for the file system APIs.

Windows supports UCS-2 internally via the wide char APIs. PHP relies on the ANSI 
APIs, so the encoding is the runtime encoding (whatever is set for the 
running process or system-wide).

The feature request I was referring to is about making PHP use the wide char API 
and accept UTF-8 as input (and output).
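
As an illustration of what "the encoding is the runtime encoding" means for calling code, here is a minimal sketch of normalizing directory entries to UTF-8 in userland. The helper names toUtf8Name() and listDirUtf8() are hypothetical, and the default of Windows-1252 is only an assumption about the process code page; it has to be replaced with whatever the process really uses. A byte sequence that happens to be valid UTF-8 is passed through unchanged, so the check is a heuristic, not a guarantee.

```php
<?php
// Hypothetical helper: return $name as UTF-8, assuming that any name which
// is not already valid UTF-8 is encoded in $assumedEncoding.
function toUtf8Name($name, $assumedEncoding = 'Windows-1252')
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($name, 'UTF-8', $assumedEncoding);
}

// Hypothetical helper: list a directory and normalize every entry name.
function listDirUtf8($path, $assumedEncoding = 'Windows-1252')
{
    $names = array();
    $directory = dir($path);
    if ($directory === false) {
        return $names; // directory could not be opened
    }
    while (false !== ($entry = $directory->read())) {
        $names[] = toUtf8Name($entry, $assumedEncoding);
    }
    $directory->close();
    return $names;
}
```

Usage would be along the lines of listDirUtf8(getenv('USERPROFILE')). This does not help with names containing characters outside the ANSI code page, since those are already lost before PHP sees them.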
 [2011-02-25 13:52 UTC] carsten_sttgt at gmx dot de
> Windows supports UCS-2 internally via the wide char APIs.
I know... I'm just wondering why:

"mb_detect_encoding($content)" is returning 'UTF-8'
and
"mb_check_encoding($content, 'UTF-8')" is returning FALSE?


Also I think there is another problem:
| C:\Users\Carsten Wiedmann>php -r "echo realpath('.');"
| C:\Users\Carsten Wiedmann
| C:\Users\Carsten Wiedmann>cd Startmenü
| 
| C:\Users\Carsten Wiedmann\Startmenü>php -r "echo realpath('.');"
| 
| C:\Users\Carsten Wiedmann\Startmenü>

Regards,
Carsten
 [2011-02-25 13:56 UTC] pajoye@php.net
I'm not sure what else I should say to explain what is possible and what not. 

Last attempt: Unless you know with 100% certainty which runtime encoding is actually used by 
the process where PHP runs, you are out of luck and have to use ASCII (if 
you are lucky, maybe ANSI too).

But anything related to Unicode does not work, period. Even if one can have the 
feeling that it works from time to time due to the joy of similar encoding, or 
close enough.
 [2011-02-25 14:16 UTC] carsten_sttgt at gmx dot de
> PHP relies on the ANSI APIs and the encoding is then the runtime encoding
> (whatever is set for the running process or system-wide).

"Startmenü" can be accessed without any problems through the ANSI API. An "ü" exists in CP437, CP850 and CP1252 (just use the chcp command), so I'm not talking about Unicode. You can also test this with a small C program. So why is
| php -r "echo realpath('.');"
returning false?
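
To make the code page point concrete, the following sketch prints the byte that represents 'ü' in two of the code pages named above. It assumes the mbstring extension is loaded; CP437 is left out because the sketch sticks to encodings that mbstring is known to support. The variable names are illustrative only.

```php
<?php
// 'ü' written as explicit UTF-8 bytes, so the result does not depend on
// the encoding of this source file.
$utf8Umlaut = "\xC3\xBC";

// Encode the same character in each code page and record the byte as hex.
$bytesByCodePage = array();
foreach (array('CP850', 'Windows-1252') as $codePage) {
    $encoded = mb_convert_encoding($utf8Umlaut, $codePage, 'UTF-8');
    $bytesByCodePage[$codePage] = bin2hex($encoded);
}

foreach ($bytesByCodePage as $codePage => $hex) {
    printf("%s: 0x%s\n", $codePage, strtoupper($hex));
}
```

The character exists in both code pages but is stored as a different byte in each, which is why the bytes only read back correctly when interpreted with the code page that wrote them.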
 [2011-02-25 15:23 UTC] pajoye@php.net
Sorry, but I don't have any more ways to explain why it could work for one case or 
another. There is no bug but a feature request for Unicode support.
 [2011-02-25 15:39 UTC] carsten_sttgt at gmx dot de
> There is no bug but a feature request for Unicode support.

You are right. It's not a Unicode (or ANSI) issue/bug. "%USERPROFILE%\Startmenü" is a link to "%APPDATA%\Microsoft\Windows\Start Menu"
| mklink /j "%USERPROFILE%\Startmenü" "%APPDATA%\Microsoft\Windows\Start Menu"
and tsrm_realpath seems to have a problem with this; getcwd(), for example, works in the same directory.
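
A minimal way to observe the difference described here is to compare both calls from inside the affected directory. The function name compareCwdResolution() is hypothetical, and the sketch only reports what the two functions return; it makes no claim about why realpath() fails on the junction.

```php
<?php
// Hypothetical helper: report how getcwd() and realpath('.') resolve the
// current working directory, and whether realpath() failed.
function compareCwdResolution()
{
    $cwd = getcwd();
    $real = realpath('.');
    return array(
        'getcwd' => $cwd,
        'realpath' => $real,
        'realpath_failed' => ($real === false),
    );
}

var_dump(compareCwdResolution());
```

Run from a regular directory, both values should be strings; in the junction case reported above, 'realpath_failed' would be expected to be true while 'getcwd' still holds a path.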


But back to the first question:
How can 
"mb_detect_encoding($content)" return 'UTF-8'
and
"mb_check_encoding($content, 'UTF-8')" return FALSE for the same $content?
 [2011-02-25 17:57 UTC] carsten_sttgt at gmx dot de
> How can "mb_detect_encoding($content)" return 'UTF-8'

OK, with strict encoding detection it works as expected.
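
For reference, strict detection is enabled through the third argument of mb_detect_encoding(). The sketch below assumes the name arrives as CP1252 bytes and passes an explicit candidate list so that the result does not depend on the configured detect order. Non-strict mode is a heuristic and, as seen earlier in this thread, can report UTF-8 for bytes that are not valid UTF-8.

```php
<?php
// "Startmenü" as CP1252 bytes: 'ü' is the single byte 0xFC.
$name = "Startmen\xFC";

// Strict mode: a candidate is only reported if the whole string is valid
// in that encoding. Neither ASCII nor UTF-8 accepts a lone 0xFC byte.
$strictResult = mb_detect_encoding($name, array('ASCII', 'UTF-8'), true);

// Adding the assumed code page as a candidate gives strict mode something
// that can actually match.
$withCodePage = mb_detect_encoding(
    $name,
    array('ASCII', 'UTF-8', 'Windows-1252'),
    true
);

var_dump($strictResult, $withCodePage);
```

With only ASCII and UTF-8 as candidates, strict detection returns false, which is consistent with mb_check_encoding($name, 'UTF-8') returning false.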


@schmale
> The expected result, of course, was that the return value of read is always
> encoded in UTF-8, i.e. no messages are printed, when we run the script.
The return value of filesystem functions in PHP should be encoded in CP1252 on most (Western European) systems. A string such as 'abc' in CP1252 happens to be a valid UTF-8 string as well, because both encodings agree on ASCII. But a CP1252 'ü' (the single byte 0xFC) is not a valid UTF-8 'ü' (the bytes 0xC3 0xBC).

And of course, with PHP you will have problems accessing files whose names contain characters outside CP1252 (even though you can see them in Explorer with the correct names). You can use the COM extension to manage such files.
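
The CP1252 versus UTF-8 point can be checked directly with mb_check_encoding(). This is a small sketch that spells the bytes out explicitly so that it does not depend on the encoding of the source file; the variable names are illustrative only.

```php
<?php
// "Startmenü" as CP1252 bytes: 'ü' is the single byte 0xFC.
$cp1252Name = "Startmen\xFC";
// The same name as UTF-8: 'ü' is the two-byte sequence 0xC3 0xBC.
$utf8Name = "Startmen\xC3\xBC";

// Pure ASCII is byte-identical in CP1252 and UTF-8, so it passes.
$asciiIsValidUtf8 = mb_check_encoding('abc', 'UTF-8');
// A lone 0xFC byte is not a valid UTF-8 sequence, so this fails.
$cp1252IsValidUtf8 = mb_check_encoding($cp1252Name, 'UTF-8');
// The properly encoded UTF-8 name passes.
$utf8IsValidUtf8 = mb_check_encoding($utf8Name, 'UTF-8');

var_dump($asciiIsValidUtf8, $cp1252IsValidUtf8, $utf8IsValidUtf8);
```

This is the situation the original test script runs into: the name read from the directory is a perfectly good CP1252 string, it just is not UTF-8.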
 
Last updated: Tue Mar 19 05:01:29 2024 UTC