Bug #65815 ZipArchive reads filenames with UTF-8 characters wrong
Submitted: 2013-10-02 15:51 UTC Modified: 2015-05-05 14:55 UTC
Votes:6
Avg. Score:4.0 ± 0.8
Reproduced:6 of 6 (100.0%)
Same Version:1 (16.7%)
Same OS:1 (16.7%)
From: matti dot jarvinen at nitroid dot fi Assigned: cmb (profile)
Status: Closed Package: Zip Related
PHP Version: 5.4.20 OS: Fedora 3.8.6-203.fc18.x86_64
Private report: No CVE-ID: None
 [2013-10-02 15:51 UTC] matti dot jarvinen at nitroid dot fi
Description:
------------
I have a valid Zip file, created on Windows 8 with iZarc, that contains filenames such as 12-päivä.pdf and 13-päivä.pdf.

ZipArchive reads these filenames incorrectly.

At least getNameIndex and extractTo are affected.

Test script:
---------------
<?php 
mb_internal_encoding('UTF-8');
ini_set('default_charset', 'UTF-8');

$Zip = new ZipArchive();

$open = $Zip->open('test.zip');

$length = $Zip->numFiles;

for($i = 0; $i < $length; $i++)
{
  $brokenImportName = $Zip->getNameIndex($i);

  print $brokenImportName;

  die();

  // this is a specific workaround. Some characters are stuck in ASCII apparently
  //$fixedImportName = str_replace(chr(132),'ä',$brokenImportName);

  //print $fixedImportName;
}

?>

Expected result:
----------------
12-päivä.pdf

Actual result:
--------------
12-p�iv�.pdf

 [2013-10-03 10:31 UTC] matti dot jarvinen at nitroid dot fi
If the zip file contains the following files:

test3/12-päivä.pdf
test3/ä中华人民共和国.PDF
test3/Российская Федерация.PDF
test3/中华人民共和国.PDF


ZipArchive will read them as:

test3/12-p�iv�.pdf
test3/ä中华人民共和国.PDF
test3/Российская Федерация.PDF
test3/中华人民共和国.PDF

Broken file names can be changed to correct UTF-8 characters with:

<?php

// correct UTF-8 should hold together through this
if($filename === mb_convert_encoding(mb_convert_encoding($filename, "UTF-32", "UTF-8"), "UTF-8", "UTF-32"))
{
  $fixedFilename = $filename;
}else
{
  // otherwise assume the name is CP850-encoded and convert it to UTF-8
  $fixedFilename = mb_convert_encoding($filename, 'UTF-8','CP850');
}

?>

APPENDIX D - Language Encoding (EFS) of the .ZIP File Format Specification, version 6.3.3, might explain how to read the file name encoding correctly from the zip file.
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

If I understood the spec correctly, the code page should be CP437 when the name is not UTF-8, although that encoding is not supported in PHP. I got good results with CP850, but I cannot verify that this workaround handles every character in CP850 and CP437.
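
The fallback described above can be sketched as a small helper. This is only an illustration: zipNameToUtf8() is a hypothetical name, not part of ZipArchive, and it assumes the mbstring extension is loaded and that the local iconv build knows the CP437 code page (pass 'CP850' as the second argument to match the workaround above instead).

```php
<?php
// Hypothetical helper, not part of the ZipArchive API.
// Assumes ext/mbstring and an iconv build that supports CP437.
function zipNameToUtf8(string $rawName, string $fallbackCodePage = 'CP437'): string
{
    // A name that is already valid UTF-8 is returned unchanged.
    if (mb_check_encoding($rawName, 'UTF-8')) {
        return $rawName;
    }

    // Otherwise treat the bytes as the legacy code page and convert them.
    $converted = @iconv($fallbackCodePage, 'UTF-8', $rawName);

    // iconv() returns false on failure; keep the raw bytes in that case.
    return $converted === false ? $rawName : $converted;
}

// chr(132) is the byte that the test script in the report replaces with "ä".
echo zipNameToUtf8('12-p' . chr(132) . 'iv' . chr(132) . '.pdf'), PHP_EOL;
```

The validity check is a heuristic: a short legacy-encoded name can happen to be valid UTF-8 and would then be passed through unconverted. According to the spec, the language encoding flag (general purpose bit 11) is what actually marks a name as UTF-8.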
 [2015-05-05 14:55 UTC] cmb@php.net
-Status: Open +Status: Closed -Assigned To: +Assigned To: cmb
 [2015-05-05 14:55 UTC] cmb@php.net
This issue is supposed to be fixed with libzip 0.11. As of PHP
5.6.0 libzip 0.11.2 or newer is bundled. For older versions PECL
provides up-to-date zip extension packages:
<https://pecl.php.net/package/zip>
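
To check how a given build behaves, the decoded name can be compared with the bytes stored in the archive. The sketch below assumes a zip extension recent enough to expose the ZipArchive::FL_ENC_RAW constant; listEntryNames() is a hypothetical helper name, and test.zip is the archive from the original report.

```php
<?php
// Hypothetical helper: returns each entry name as decoded by libzip
// ("guessed") and as the unmodified bytes stored in the archive ("raw").
// Assumes ZipArchive::FL_ENC_RAW is available in the installed zip extension.
function listEntryNames(string $path): array
{
    $zip = new ZipArchive();

    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($path) !== true) {
        return [];
    }

    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = [
            'guessed' => $zip->getNameIndex($i),
            'raw'     => $zip->getNameIndex($i, ZipArchive::FL_ENC_RAW),
        ];
    }

    $zip->close();

    return $names;
}

// Print the decoded name next to a hex dump of the stored bytes.
foreach (listEntryNames('test.zip') as $entry) {
    printf("%s (raw bytes: %s)\n", $entry['guessed'], bin2hex($entry['raw']));
}
```

If the decoded name is still wrong on an up-to-date build, the raw bytes can be converted by hand with whichever code page the archiver actually used.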
 