Bug #62119 basename broken with non-ASCII-chars
Submitted: 2012-05-23 08:38 UTC
Modified: 2017-06-21 09:42 UTC
Votes:15
Avg. Score:3.8 ± 1.2
Reproduced:11 of 11 (100.0%)
Same Version:2 (18.2%)
Same OS:2 (18.2%)
From: thomas dot hebinck at digionline dot de
Assigned:
Status: Analyzed
Package: *Directory/Filesystem functions
PHP Version: 5.3.13
OS: Linux/Ubuntu
Private report: No
CVE-ID: None

 [2012-05-23 08:38 UTC] thomas dot hebinck at digionline dot de
Description:
------------
With the default locale setting "C", basename() drops non-ASCII-chars at the beginning of a filename.

Test script:
---------------
<?php
$path='/test/äaä.txt';
echo $path."\n";                      // original path, for reference
setlocale(LC_ALL,'C');                // the default "C" locale
echo dirname($path).'/'.basename($path)."\n";
setlocale(LC_ALL,'en_US.iso885915');  // bash: locale -a
echo dirname($path).'/'.basename($path)."\n";


Expected result:
----------------
/test/äaä.txt
/test/äaä.txt
/test/äaä.txt

Actual result:
--------------
/test/äaä.txt
/test/aä.txt
/test/äaä.txt


History

 [2012-07-03 15:29 UTC] pollita@php.net
-Status: Open +Status: Verified
 [2012-07-03 15:29 UTC] pollita@php.net
Verified on Debian, but since this is the behavior of the underlying libc 
implementation, I'm not sure it's PHP's role to fix it.

Leaving open for now since we could potentially detect this case and deal with it, 
but on initial look I'm inclined to push it off on the OS.
 [2014-08-04 10:36 UTC] bugs dot php dot net at dw-perspective dot org dot uk
I have this problem too, on a Fedora 20 (=current) system.

Interestingly, the system's "basename" binary, which I'd assume is making the same glibc call, does not have this problem:

# LANG=C basename '/test/äaä.txt'
äaä.txt

So perhaps the problem is more subtle than a simple glibc bug?
 [2014-08-04 17:04 UTC] bugs dot php dot net at dw-perspective dot org dot uk
In https://bugzilla.redhat.com/show_bug.cgi?id=1126399, a glibc developer says that glibc's basename() is not locale-dependent - and therefore that if PHP's basename() is locale dependent, then that points to a PHP issue.
 [2015-06-10 08:56 UTC] christiansen dot jacob at gmail dot com
This is still an issue in PHP 5.6, and it is PHP's problem, since PHP rolls its own implementation of basename.

The problem seems to occur when running basename on a string that has a multibyte character as its first character while LC_CTYPE is set to POSIX, which seems to be the default for PHP.

One way to solve this is to set LC_CTYPE to a UTF-8 locale, but I guess that PHP should handle this.
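
The workaround mentioned above could look roughly like this. It is only a sketch: it assumes that at least one of the listed UTF-8 locales is installed (check with `locale -a`), and basename_utf8() is a made-up helper name, not a PHP function.

<?php
// Hypothetical helper: switch LC_CTYPE to a UTF-8 locale for the duration
// of the basename() call, then restore the previous setting.
function basename_utf8(string $path, string $suffix = ''): string
{
    // Passing '0' only queries the current LC_CTYPE without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // The locale names are assumptions; setlocale() tries each in turn and
    // returns false if none of them is available on the system.
    $switched = setlocale(LC_CTYPE, ['C.UTF-8', 'en_US.UTF-8', 'en_US.utf8']);

    try {
        return basename($path, $suffix);
    } finally {
        // Restore the caller's locale only if we actually changed it.
        if ($switched !== false && $previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

echo basename_utf8('/test/äaä.txt'), "\n";

Note that setlocale() is process-wide, so in a threaded SAPI this can affect other requests while the call is running.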
 [2016-10-14 17:20 UTC] cmb@php.net
-Status: Verified +Status: Analyzed -Assigned To: +Assigned To: cmb
 [2016-10-14 17:20 UTC] cmb@php.net
> This is still an issue in PHP 5.6, and it is PHP's problem, since
> PHP rolls its own implementation of basename.

Yes. The actual culprit is that php_basename() uses mblen(3) if
available, and that is locale dependent. If an invalid character
is passed to mblen(3), -1 is returned, and the length is assumed
to be 1[1], which seems doubtful. Bailing out and returning
FALSE, or at least raising a notice/warning, might be more useful.

> One way to solve this is to set LC_CTYPE to a UTF-8 locale, but I
> guess that PHP should handle this.

As of November 2010 it is documented[2]:

| basename() is locale aware, so for it to see the correct
| basename with multibyte character paths, the matching locale
| must be set using the setlocale() function.

So clearly, setting the appropriate locale is the job of callers
of basename().

I'm going to move the notes up on the page, and will suggest
adding a notice/warning in case of unrecognized characters.

[1] <https://github.com/php/php-src/blob/PHP-7.0.12/ext/standard/string.c#L1531-L1532>
[2] <http://php.net/manual/en/function.basename.php#refsect1-function.basename-notes>
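
For callers who cannot control the locale, one possible alternative is to skip basename() and split the path byte-wise, so that mblen(3) is never involved. The following is a minimal sketch, not what php_basename() does. It assumes '/'-separated paths only (no backslash handling) and an ASCII-compatible encoding such as UTF-8, where the byte 0x2F never occurs inside a multibyte character. basename_bytewise() is a hypothetical name.

<?php
// Hypothetical locale-independent replacement for basename().
function basename_bytewise(string $path, string $suffix = ''): string
{
    // Ignore trailing slashes, as basename() does.
    $trimmed = rtrim($path, '/');
    if ($trimmed === '') {
        return '';
    }

    // Take everything after the last slash, or the whole string if none.
    $pos  = strrpos($trimmed, '/');
    $name = $pos === false ? $trimmed : substr($trimmed, $pos + 1);

    // Strip the suffix only if it is a proper suffix of the name.
    if ($suffix !== ''
        && $name !== $suffix
        && substr($name, -strlen($suffix)) === $suffix
    ) {
        $name = substr($name, 0, -strlen($suffix));
    }

    return $name;
}

echo basename_bytewise('/test/äaä.txt'), "\n";

Because the function only ever compares bytes, its result does not depend on setlocale().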
 [2016-10-14 17:28 UTC] cmb@php.net
Automatic comment from SVN on behalf of cmb
Revision: http://svn.php.net/viewvc/?view=revision&revision=340481
Log: Fix #62119: basename broken with non-ASCII-chars

We move the important notes up into the description section, and elevate
the note regarding the locale awareness to a caution.
 [2017-01-02 12:37 UTC] jeanseb@php.net
Automatic comment from SVN on behalf of jeanseb
Revision: http://svn.php.net/viewvc/?view=revision&revision=341589
Log: Related to #62119

This will do the same change for pathinfo and dirname as already made for basename.

--
Provided by anonymous 78971 (tobias.nyholm@gmail.com)
 [2017-05-25 15:38 UTC] megaone at yandex dot ru
"One way to solve this is to set LC_CTYPE to a UTF-8 locale, but I guess that PHP should handle this"

I'm actually having this problem with UTF-8. What's more, it's not just a single letter; an entire word gets cut. Basically, it cuts everything up to the first space. If there are no spaces, the entire filename gets cut.

'имя файла.txt' becomes ' файла.txt'
'имяфайла.txt' becomes ' .txt'
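
To narrow down a case like this, it may help to print the active LC_CTYPE together with the raw bytes that go in and come out, since a terminal can hide encoding mismatches. This is a small diagnostic sketch; basename_report() is a hypothetical helper and the path is only an example.

<?php
// Hypothetical diagnostic helper: report the current LC_CTYPE setting and
// hex dumps of the input path and of what basename() returns for it.
function basename_report(string $path): array
{
    return [
        'lc_ctype' => setlocale(LC_CTYPE, '0'), // '0' queries without changing
        'input'    => bin2hex($path),
        'result'   => bin2hex(basename($path)),
    ];
}

print_r(basename_report('/tmp/имяфайла.txt'));

If the 'result' bytes are not a tail of the 'input' bytes, basename() dropped or altered something under the reported locale.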
 [2017-06-21 09:42 UTC] cmb@php.net
-Assigned To: cmb +Assigned To:
 
Last updated: Fri May 24 14:01:26 2019 UTC