Bug #71034 Can't add UTF-8 filenames as UTF-8 to ZIP archive
Submitted: 2015-12-04 21:36 UTC Modified: 2015-12-07 07:49 UTC
From: slavikca at gmail dot com Assigned:
Status: Not a bug Package: Zip Related
PHP Version: 5.6.16 OS: Ubuntu 14.04
Private report: No CVE-ID: None
 [2015-12-04 21:36 UTC] slavikca at gmail dot com
Description:
------------
*Steps to reproduce*
- Use ZipArchive to archive a file whose name contains UTF-8 (non-ASCII) characters.


Test script:
---------------
<?php
$zip = new ZipArchive();
if ($zip->open("test.zip", ZipArchive::CREATE) !== true) {
    die("Failed to create archive");
}

// The filename below contains UTF-8 (Cyrillic) characters.
if (!$zip->addFile("/var/www/test/web/тест123")) {
    print("Failure adding file");
}

$zip->close();

Expected result:
----------------
The name of the archived file is stored as UTF-8.

Actual result:
--------------
The file is archived successfully, but its name is not stored as UTF-8.

To check whether a filename was stored as UTF-8, you can use the "-U" option of the unzip utility:

Here is the output for the archive created in PHP:
# unzip -l -U test.zip
...
   /var/www/test/web/тест123

Here is the output for an archive created by the Linux zip utility:
# unzip -l -U test2.zip
...
   /var/www/test/web/#U0442#U0435#U0441#U0442123

 [2015-12-04 21:45 UTC] slavikca at gmail dot com
In case it was not clear from my original description:

The "unzip" output for the archive created on Linux shows UTF-8 characters, so that is an example of a filename added correctly, in UTF-8.

The "unzip" output for the archive created in PHP shows the filename seemingly correctly (same as the original), but it is not in UTF-8.
That is bad, because the filename will display correctly only on systems with the same locale settings. With a different locale it will be garbled. A UTF-8 name, by contrast, works correctly on any system that supports UTF-8, regardless of locale.
 [2015-12-05 02:07 UTC] ab@php.net
-Status: Open +Status: Feedback
 [2015-12-05 02:07 UTC] ab@php.net
Please check the following:

- Is the current locale UTF-8?
- Is the terminal charset UTF-8?
- Is the actual filename string used in PHP UTF-8? In particular, if the filename is hardcoded, is your script a UTF-8 encoded file?

Only if all of these hold could it be a bug in PHP or in libzip.
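
The third check can be made independently of the terminal. The following is a minimal sketch, assuming the PCRE extension is available; the helper name is_valid_utf8 is hypothetical and the byte strings are written as hex escapes so the result does not depend on the encoding of the script file:

```php
<?php
// Hypothetical helper: returns true when $s is a well-formed UTF-8 byte sequence.
// The /u modifier makes PCRE validate the subject as UTF-8; for invalid input
// preg_match() returns false instead of 1.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$utf8   = "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82123"; // "тест123" encoded as UTF-8
$cp1251 = "\xF2\xE5\xF1\xF2123";                 // the same text encoded as CP1251

var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($cp1251)); // bool(false)
```

Passing the exact string that is handed to ZipArchive::addFile() through such a check shows whether the name is UTF-8 before it ever reaches ext/zip.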

Thanks.
 [2015-12-05 02:24 UTC] slavikca at gmail dot com
1) is the current locale UTF-8
2) is the terminal charset UTF-8

Yes, all my locales are UTF-8:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"

3) is the actual filename string used in PHP UTF-8, especially if the filename is hardcoded - is your script a UTF-8 encoded file

If you are asking about the file that contains my PHP code, then yes:

$ file index.php
index.php: PHP script, UTF-8 Unicode text

If you are saying that I need to force PHP to use a UTF-8 string, I do not know how to do that. Do I have to use some commands or flags?
 [2015-12-05 06:48 UTC] slavikca at gmail dot com
I also tested today on this system:
- PHP 7.0.0
- Ubuntu 14.04
- Zip version 1.13.0
- Libzip version 1.0.1

Same results. The archive is created with filenames that are not stored in UTF-8.

Here is the phpinfo page (may be removed later):
http://dev.slavikf.com/info.php
 [2015-12-06 16:37 UTC] ab@php.net
-Status: Feedback +Status: Not a bug
 [2015-12-06 16:37 UTC] ab@php.net
Thanks for the additional info. It seems that this is not a bug in PHP. In 7.0.0, I use

$zip->addFile(__FILE__, "\u{442}\u{435}\u{441}\u{442}123.txt");

which effectively adds a file named тест123.txt to the zip archive. This is what I see in my terminal:

$ unzip -l -U bug71034.zip
Archive:  bug71034.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      319  2015-12-06 17:25   тест123.txt
---------                     -------
      319                     1 file

I get the same with your original piece of code.


Please note that in the description you report seeing the following:
/var/www/test/web/#U0442#U0435#U0441#U0442123

That means the terminal is not in UTF-8. This is not critical for PHP's behavior, but a UTF-8 terminal avoids possible misinterpretation when you list the results, so it would be worth configuring.

So, after all, the issue on your side appears to be that the filename you pass to ext/zip is not UTF-8. Consequently, that name is stored in the zip file as is. This is odd, given that your whole locale is UTF-8 and so is the PHP file with the hardcoded filename, but it is the only explanation for the behavior. The filename can be forced to UTF-8 if you, for example, convert it with iconv and pass it as the second argument to ZipArchive::addFile().
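
The iconv suggestion could look roughly like the sketch below. It assumes the iconv extension is available and that the name on disk is in a known non-UTF-8 charset (CP1251 is used purely as an illustration). The helper names to_utf8_name and add_file_as_utf8 are hypothetical. If the name is already UTF-8, as turned out to be the case in this report, no conversion is needed.

```php
<?php
// Hypothetical helper: convert a file name from an assumed source charset to UTF-8.
function to_utf8_name($name, $fromCharset)
{
    $converted = iconv($fromCharset, 'UTF-8', $name);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert name from $fromCharset to UTF-8");
    }
    return $converted;
}

// Hypothetical helper: add $dir/$file to the archive under a UTF-8 entry name.
function add_file_as_utf8(ZipArchive $zip, $dir, $file, $fromCharset)
{
    // The second argument of addFile() is the name stored inside the archive.
    return $zip->addFile($dir . '/' . $file, to_utf8_name($file, $fromCharset));
}

// Usage (assumes the name on disk is CP1251-encoded; adjust the charset to your system):
// $zip = new ZipArchive();
// if ($zip->open('test.zip', ZipArchive::CREATE) === true) {
//     add_file_as_utf8($zip, '/var/www/test/web', "\xF2\xE5\xF1\xF2123", 'CP1251');
//     $zip->close();
// }
```

Keeping the on-disk path and the stored entry name separate means the file is still opened under its real name, while the archive records the converted one.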

Thanks.
 [2015-12-06 16:52 UTC] ab@php.net
Hmm ... or am I mistaken, and the zip utility actually converts to pure Unicode? That could explain the difference in the output, though it is still confusing. Anyway, adding filenames in UTF-8 works and is not an issue.

Thanks.
 [2015-12-06 17:14 UTC] slavikca at gmail dot com
Yes, the reason I see this in the terminal:
/var/www/test/web/#U0442#U0435#U0441#U0442123
is the "-U" option:
 -U  use escapes for all non-ASCII Unicode

So your terminal output confirms that the filename in the archive is NOT in UTF-8.
For UTF-8, you would see escaped characters, such as "#U0442#U0435#U0441#U0442123".
 [2015-12-06 17:17 UTC] slavikca at gmail dot com
For comparison, here is how the native Linux utility works:

root@php7:/var/www/html# zip tzip.zip тест1
  adding: тест1 (deflated 38%)

root@php7:/var/www/html# unzip -l -U tzip.zip
Archive:  tzip.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       16  2015-12-04 18:10   #U0442#U0435#U0441#U04421

So in that case I see that the filename is stored in UTF-8.
 [2015-12-06 18:28 UTC] ab@php.net
@slavikca, I think we're both confused by that zip Unicode thing. But no: if your terminal is UTF-8, it will correctly show a UTF-8 string like "тест". I explicitly used the string "\u{442}\u{435}\u{441}\u{442}123.txt" as the file name in PHP, and I explicitly see the string "тест123.txt" in the terminal when using "unzip -l -U".

I've just checked your description, and that seems to be the case for you, too. So your script is a UTF-8 encoded file with the hardcoded path, and it already adds the UTF-8 encoded filename. If your terminal is UTF-8 and you output a non-UTF-8 string, the best case would be to see some gibberish, or nothing. That was the reason I was asking about the terminal charset. For example, do a small experiment, given that your terminal is in UTF-8:

php -r '$s0 = "тест"; $s1 = iconv("UTF-8", "cp1251", $s0); echo "$s0", " ", strlen($s0), " ", "$s1", " ", strlen($s1), "\n"; '

In my terminal, I only see "тест 8  4" as output, so the terminal is unable to print the string $s1, which is in a single-byte charset and is invalid as UTF-8.
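
A variant of the same experiment that does not depend on what the terminal can render is to print the raw bytes as hex. This is only a sketch, assuming the iconv extension supports CP1251 on the system:

```php
<?php
$s0 = "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82"; // "тест" encoded as UTF-8
$s1 = iconv('UTF-8', 'CP1251', $s0);      // the same text as single-byte CP1251

// bin2hex() shows the actual bytes, strlen() counts bytes (not characters).
echo bin2hex($s0), ' ', strlen($s0), "\n"; // d182d0b5d181d182 8
echo bin2hex($s1), ' ', strlen($s1), "\n"; // f2e5f1f2 4
```

Each Cyrillic letter takes two bytes in UTF-8 and one byte in CP1251, which is where the lengths 8 and 4 come from.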

Now, when coming back to the zip entries file names, please consider this:

<?php
$zip = new ZipArchive;
if ($zip->open('test.zip') === TRUE) {
    // Read the stored name of the first entry and print it with its byte length.
    $s0 = $zip->getNameIndex(0);
    echo $s0, " ", strlen($s0), "\n";
    $zip->close();
    echo 'ok';
} else {
    echo 'Error';
}

This shows me "тест123 11" on the UTF-8 terminal when I use it on a zip file created by PHP (8 bytes for the four Cyrillic letters plus 3 for the digits).

So you see that there is no issue in PHP with creating zips with UTF-8 filenames. Whatever the unzip command shows, PHP appears to behave correctly here. I have no idea which conversions the zip utility performs on filenames, but that does not seem relevant to the current case. Furthermore, what PHP does seems to be correct, in contrast to the regular zip utility.

Thanks.
 [2015-12-07 07:49 UTC] slavikca at gmail dot com
I spent hours troubleshooting this issue.
I went through the ZIP binary format definition
and found that the bug is really in ... the Linux unzip utility.

So yes, PHP (both 5.6 and 7.0) really does store filenames in the archive in UTF-8.
So this is not a bug.
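
For anyone who wants to look at the archive bytes directly rather than trust a listing tool: the ZIP format specification defines bit 11 of the general purpose bit flag as the marker that an entry's name and comment are UTF-8. The sketch below reads that flag and the raw name from the first local file header. It is not necessarily the check performed above, the helper name first_entry_info is hypothetical, and tools may record Unicode names in other ways that this sketch does not cover.

```php
<?php
// Hypothetical helper: inspect the first local file header in raw ZIP bytes.
// Returns array('name' => string, 'utf8_flag' => bool), or null when $bytes
// does not start with a local file header (e.g. an empty archive).
function first_entry_info($bytes)
{
    // Local file header: signature "PK\x03\x04", then 26 bytes of fixed fields.
    if (strlen($bytes) < 30 || substr($bytes, 0, 4) !== "PK\x03\x04") {
        return null;
    }
    // 'v' = unsigned 16-bit little-endian.
    $flags   = unpack('v', substr($bytes, 6, 2));  // general purpose bit flag
    $nameLen = unpack('v', substr($bytes, 26, 2)); // file name length
    return array(
        'name'      => substr($bytes, 30, $nameLen[1]),
        'utf8_flag' => ($flags[1] & 0x0800) !== 0, // bit 11
    );
}

// Usage with a real archive (assumes test.zip exists in the current directory):
// $info = first_entry_info(file_get_contents('test.zip'));
// var_dump($info['utf8_flag'], bin2hex($info['name']));
```

Printing the name through bin2hex() shows the stored bytes themselves, independent of any terminal or locale.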
 
Last updated: Tue May 21 22:01:26 2019 UTC