PHP :: Bug #33898 :: basename() misbehaves on multibyte characters

Bug #33898	basename() misbehaves on multibyte characters
Submitted:	2005-07-28 10:59 UTC	Modified:	2005-07-28 11:52 UTC
From:	feldgendler at mail dot ru	Assigned:
Status:	Not a bug	Package:	Filesystem function related
PHP Version:	5.0.4	OS:	Debian GNU/Linux i686
Private report:	No	CVE-ID:	None

[2005-07-28 10:59 UTC] feldgendler at mail dot ru

Description:
------------
The source file of my test case is itself saved in UTF-8, and the quoted string contains Cyrillic letters. If I save the source in KOI8-R (a single-byte Cyrillic encoding) and change the second argument to setlocale() to "ru_RU.KOI8-R", the observed result is what I expect. This suggests that the bug only occurs with multi-byte characters, because in KOI8-R every character is a single byte.

Relevant PHP configuration options:
--enable-mbstring=all
(--enable-zend-multibyte was not specified)

Relevant environment variables:
LANG=en_US.UTF-8
(LC_* are not set)

Reproduce code:
---------------
<?php

setlocale(LC_CTYPE, "en_US.UTF-8");
echo basename("english/???????");

?>

Expected result:
----------------
???????

Actual result:
--------------
english

[2005-07-28 11:15 UTC] feldgendler at mail dot ru

I've explored the source code of the php_basename() function, and here is what I found:

When a multi-byte character (inc_len > 1) immediately follows a slash, state is not set to 1, because the code that does so sits in the single-byte branch (case 1) and is skipped.

The following code:

					if (state == 0) {
						comp = c;
						state = 1;
					}

...needs to be inserted at the point marked below:

	while (cnt > 0) {
		inc_len = (*c == '\0' ? 1: php_mblen(c, cnt));

		switch (inc_len) {
			case -2:
			case -1:
				inc_len = 1;
				php_mblen(NULL, 0);
				break;
			case 0:
				goto quit_loop;
			case 1:
#if defined(PHP_WIN32) || defined(NETWARE)
				if (*c == '/' || *c == '\\') {
#else
				if (*c == '/') {
#endif
					if (state == 1) {
						state = 0;
						cend = c;
					}
				} else {
					if (state == 0) {
						comp = c;
						state = 1;
					}
				}
			default:
-- HERE IT GOES -->
				break;
		}
		c += inc_len;
		cnt -= inc_len;
	}
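
To make the effect of the proposed change easier to check, here is a minimal standalone sketch of the same state machine. It is not the PHP source: utf8_len() is a hypothetical stand-in for php_mblen() that assumes UTF-8 input, last_component() and its patched flag are names invented for this illustration, and the handling after the loop (closing the component if state is still 1) is an assumption about the surrounding code, which is not shown above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for php_mblen(), assuming UTF-8 input.
 * Returns the byte length of the sequence starting at s, 0 if no bytes
 * remain, or -1 for an invalid lead byte or a truncated sequence.
 * Continuation bytes are not validated; this is only a sketch. */
static int utf8_len(const char *s, size_t n)
{
	unsigned char b;
	int len;

	if (n == 0) return 0;
	b = (unsigned char)*s;
	if (b < 0x80) len = 1;
	else if ((b & 0xE0) == 0xC0) len = 2;
	else if ((b & 0xF0) == 0xE0) len = 3;
	else if ((b & 0xF8) == 0xF0) len = 4;
	else return -1;
	return ((size_t)len <= n) ? len : -1;
}

/* Locates the last path component of s[0..len).
 * patched == 0: multi-byte characters never start a component.
 * patched != 0: the multi-byte branch also starts a component. */
static void last_component(const char *s, size_t len, int patched,
                           const char **start, size_t *out_len)
{
	const char *c = s, *comp = s, *cend = s;
	size_t cnt = len;
	int state = 0; /* 0 = outside a component, 1 = inside one */

	while (cnt > 0) {
		int inc_len = (*c == '\0') ? 1 : utf8_len(c, cnt);

		switch (inc_len) {
			case -1:
				inc_len = 1; /* invalid byte: step over it */
				break;
			case 0:
				goto done;
			case 1:
				if (*c == '/') {
					if (state == 1) {
						state = 0;
						cend = c;
					}
				} else if (state == 0) {
					comp = c;
					state = 1;
				}
				break; /* keep the slash case out of the branch below */
			default:
				if (patched && state == 0) {
					comp = c;
					state = 1;
				}
				break;
		}
		c += inc_len;
		cnt -= (size_t)inc_len;
	}
done:
	if (state == 1) {
		cend = c; /* assumed: an open component runs to the end */
	}
	*start = comp;
	*out_len = (size_t)(cend - comp);
}
```

For the path "english/" followed by two Cyrillic letters in UTF-8, calling last_component() with patched set to 0 yields the 7 bytes "english", and with patched set to 1 it yields the 4 bytes after the slash. Note that case 1 ends with its own break in this sketch. In the excerpt above, case 1 falls through into default, so if the block were inserted literally at the marked point without a break before default, a slash would reset state to 0 and then immediately start a new component at the slash itself.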

Can I expect that this bug will be fixed in CVS?

[2005-07-28 11:31 UTC] tony2001@php.net

See bug #33260.

[2005-07-28 11:41 UTC] feldgendler at mail dot ru

A message in that bug says "Sorry, but this is not supported yet. You'll have to wait for PHP that supports unicode."

What do you mean? Doesn't PHP 5.0.4, with all its multi-byte capabilities, support Unicode?

I've searched the bug database and found that there were similar bugs (30105, 30014, 28981) that are currently in "No feedback", not "Bogus" state. Why is this bug Bogus?

And lastly, what's wrong with my proposed modification? Doesn't it fix the bug?

[2005-07-28 11:52 UTC] tony2001@php.net

>What do you mean? Doesn't PHP 5.0.4, with all its multi-byte capabilities, support Unicode?

Yes, full multibyte support is planned for 5.2.

>And last, what's wrong with my proposed modification? 

We don't need a workaround for a particular problem while patches fixing all multibyte-related problems are ready and being tested.
Also, please do use `diff -u` next time when you post patches. Thanks.
