Bug #47096 move_uploaded_file not OS encoding aware
Submitted: 2009-01-14 09:26 UTC Modified: 2009-01-15 14:57 UTC
Votes:64
Avg. Score:4.6 ± 0.7
Reproduced:60 of 60 (100.0%)
Same Version:31 (51.7%)
Same OS:33 (55.0%)
From: nuabaranda at web dot de Assigned:
Status: Open Package: Filesystem function related
PHP Version: 5.2.8 OS: win32 only - Windows XP
Private report: No CVE-ID:

 [2009-01-14 09:26 UTC] nuabaranda at web dot de
Description:
------------
Files whose names contain non-ASCII characters, such as German umlauts, are corrupted when saved with move_uploaded_file(). The UTF-8 bytes of each special character are reinterpreted one by one as CP1251 characters when the Windows file name is determined, which destroys the original characters.



History

 [2009-02-06 20:21 UTC] mindfreakthemon at gmail dot com
The bug also exists on Windows 7 and Vista under Apache 2.2.
 [2009-02-26 09:46 UTC] mm107137 at spamcorptastic dot com
I have the same problem on a Debian host (hosted at OVH).
File names with French accents passed to move_uploaded_file() are corrupted.
There is no problem if the file name is not passed as UTF-8.

Very annoying
 [2011-09-23 03:02 UTC] xd-yang at qq dot com
Since basename() is locale aware, why not move_uploaded_file()?
A common workaround is to use iconv() to explicitly convert the destination file name, usually from UTF-8 to the ANSI encoding in use (for example GB2312). But this becomes complicated and impractical in a multilingual CMS such as WordPress. Can this issue be solved in the future?
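The iconv() workaround mentioned above can be sketched roughly as follows. This is only an illustration: the function name encode_filename_for_os(), the default target encoding CP1252, and the paths in the usage comment are assumptions, and the right target encoding depends on the server's code page.

```php
<?php
/**
 * Convert a UTF-8 file name to the byte encoding the local file system API expects.
 * Returns null when iconv() reports that the name cannot be converted.
 * The function name and the default encoding are illustrative assumptions.
 */
function encode_filename_for_os(string $utf8Name, string $targetEncoding = 'CP1252'): ?string
{
    // Without //TRANSLIT or //IGNORE, iconv() is documented to return false
    // when a character cannot be represented in the target charset.
    $converted = @iconv('UTF-8', $targetEncoding, $utf8Name);
    return $converted === false ? null : $converted;
}

// Possible use inside an upload handler (field name and directory are placeholders):
// $name = encode_filename_for_os(basename($_FILES['doc']['name']));
// if ($name !== null) {
//     move_uploaded_file($_FILES['doc']['tmp_name'], 'C:/uploads/' . $name);
// }
```

As the comment says, this only helps when every character of the name exists in the target code page, which is exactly what a multilingual site cannot guarantee.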
 [2012-03-17 18:19 UTC] salsi at icosaedro dot it
Because PHP runs under Windows as a "non-Unicode aware program", file names are bare arrays of bytes, represented in PHP as "string"; Windows converts these strings to and from Unicode according to the currently selected code page (see "Control Panel", "Regional and Language Options", "Administrative" tab, "Language for non-Unicode programs"). Unfortunately, UTF-8 is not available there, so whatever locale you choose, some Unicode file names may remain inaccessible to PHP.

For example, if your system locale uses a Western European encoding (code page 1252), there is no way to refer to a file whose name is "日本語"; only on a Windows system with the Japanese locale set (code page 932) can you access such a name, provided that the "string" representing that name is properly encoded as code page 932 requires, that is "\x93\xfa\x96\x7b\x8c\xea".
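The byte sequence quoted above can be checked with a short script. It assumes the iconv extension is built against a library that knows the name "CP932" (glibc does); the variable names are only illustrative.

```php
<?php
// "日本語" written as explicit UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Encode the name the way a Windows system with the Japanese code page expects it.
// Assumption: the underlying iconv library recognises "CP932".
$cp932 = iconv('UTF-8', 'CP932', $utf8);

echo bin2hex($cp932), "\n"; // prints 93fa967b8cea
```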

So, if you have an arbitrary file name (along with its path) as a Unicode string $u (for example UTF-8 encoded) and you want to save a file under that name on Windows, you must first call setlocale(LC_CTYPE, 0) to retrieve the current code page, and then convert $u to an array of bytes according to that code page; if one or more code points have no counterpart in the current code page, the file cannot be saved under that name from PHP. Period.
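A minimal sketch of that procedure is shown below. The helper names are hypothetical, and it assumes that iconv accepts the code page under the name "CP<number>", which does not hold for every Windows code page.

```php
<?php
/**
 * Extract the code page number from a locale string such as
 * "English_United States.1252". Returns null if there is none.
 * (Hypothetical helper, for illustration only.)
 */
function code_page_from_locale(string $locale): ?int
{
    return preg_match('/\.(\d+)$/', $locale, $m) === 1 ? (int) $m[1] : null;
}

/**
 * Convert the UTF-8 name $u to the bytes required by the code page of $locale.
 * Returns null when no code page is found or the name cannot be converted.
 * Assumption: iconv knows the code page as "CP<number>".
 */
function utf8_to_locale_bytes(string $u, string $locale): ?string
{
    $codePage = code_page_from_locale($locale);
    if ($codePage === null) {
        return null;
    }
    $bytes = @iconv('UTF-8', 'CP' . $codePage, $u);
    return $bytes === false ? null : $bytes;
}

// Passing "0" queries the current LC_CTYPE setting without changing it.
$current = setlocale(LC_CTYPE, '0');
$bytes = is_string($current) ? utf8_to_locale_bytes("\xC3\x84.txt", $current) : null;
```

If $bytes is null, the name should be rejected or replaced before any file is created.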

To complicate the implementation of such an algorithm, neither mbstring nor iconv is aware of all the Windows code pages, so you must write these conversion routines yourself. This is what I have done experimentally in PHP, and it appears to work nicely (http://www.icosaedro.it/phplint/libraries.cgi?lib=stdlib/it/icosaedro/io/FileName.html). Hopefully something similar will one day be available in the PHP core library, or some other abstraction layer of classes will provide full access to the Unicode realm.

References:

http://en.wikipedia.org/wiki/Windows_code_page

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
 [2012-04-03 15:12 UTC] salsi at icosaedro dot it
Just to complete my little survey of the file name encoding issue:

1. Under Windows Vista, in the "Regional and Language Settings" control panel, the "Formats" panel must also be set to match the language selected in the "Advanced" panel in order to set the LC_CTYPE property; the "Advanced" panel only selects the translation mapping between Unicode and the multi-byte encoding and does not set the locale properties.
For example, in a Western country LC_CTYPE="english_United States.1252", while in Japan it might be LC_CTYPE="Japanese_Japan.1252".

2. Windows applies the "best fit" conversion table (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/) when translating Unicode file names to multi-byte file names (http://msdn.microsoft.com/en-us/library/windows/desktop/dd374047%28v=vs.85%29.aspx); characters that have no best fit are replaced by a question mark "?".
So, for example, when the Japanese locale is set (code page 932), the Latin capital letter A with diaeresis ("Ä") might map to the plain capital letter "A", and accented vowels like "àèìòù" might be translated to the plain ASCII letters "aeiou".
This means that, from inside PHP, file names retrieved from the file system via dir() or getcwd() are only APPROXIMATIONS of the real path, and there is no way to detect whether they really match the actual name.


Conclusions
===========

Under Unix and Linux with a properly set locale, PHP programs can access and retrieve any file name that matches the current locale; UTF-8 is the best choice here.

Under Windows, PHP programs can create and access any file or file path that contains only characters included in the current code page; however, PHP programs cannot trust file names retrieved from the file system, because these might be arbitrarily mangled and there is no way to detect such mangling.
 [2012-08-23 12:32 UTC] nicolas dot grekas+php at gmail dot com
Well, if you really need it, there may be one possibility using a COM object:

$fs = new \COM('Scripting.FileSystemObject', null, CP_UTF8);
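As a rough, untested sketch of how that object might be used for an upload: the wrapper name move_upload_with_fso() is hypothetical, and it assumes a Windows build with the COM extension enabled and that MoveFile() of Scripting.FileSystemObject accepts the UTF-8 destination when the object is created with CP_UTF8.

```php
<?php
/**
 * Move an uploaded file to a UTF-8 encoded destination through
 * Scripting.FileSystemObject. Returns false when COM is unavailable,
 * when $tmpPath is not an uploaded file, or when the move fails.
 * (Hypothetical wrapper; behaviour on a real Windows host is an assumption.)
 */
function move_upload_with_fso(string $tmpPath, string $destUtf8): bool
{
    // The COM class only exists on Windows builds with the COM extension loaded.
    if (!class_exists('COM') || !is_uploaded_file($tmpPath)) {
        return false;
    }
    try {
        // CP_UTF8 asks PHP to exchange strings with the COM object as UTF-8.
        $fs = new \COM('Scripting.FileSystemObject', null, CP_UTF8);
        $fs->MoveFile($tmpPath, $destUtf8);
        return true;
    } catch (\Throwable $e) {
        return false;
    }
}
```

Keeping the is_uploaded_file() check preserves the safety guarantee that move_uploaded_file() normally provides.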
 
Last updated: Thu Apr 24 02:02:10 2014 UTC