[2002-02-19 15:56 UTC] robert dot marchand at umontreal dot ca
Hi,

When trying to use the IMP Webmail client with an Exchange 2000 Server, folders with accents in the name are mangled. It appears that the Exchange IMAP server sends modified UTF-7 names as explained in RFC 2060. IMP uses a call to the imap_utf7_decode function. It should work, but it doesn't. Here's a sample:

What Exchange sends: "&AMk-l&AOk-ments envoy&AOk-s"
What it means: Éléments envoyés
What IMP shows: <nothing>

My setup is:

SGI Irix 6.5
Imap-2001a
PHP 4.1.1
IMP 2.2.7

I have set up a small script that shows the problem:

<?php
error_reporting(63);

$folder = '&AMk-l&AOk-ments envoy&AOk-s';
$plain = 'Boîte de réception';
$unicode = mb_convert_encoding($plain, "UNICODE", "ISO-8859-1");
$br = "<br>";

echo "<html><head><title>test UTF7</title></head><body>";
echo "folder (modified UTF-7): ", $folder, $br;
echo "plain (Latin1): ", $plain, $br, $br;

echo "<strong>mb_convert_encoding test</strong>", $br;
$test = mb_convert_encoding($folder, "auto", "UTF7-IMAP");
echo " folder decoded: ", $test, $br;
$test = mb_convert_encoding($test, "UTF7-IMAP", "ISO-8859-1");
echo " encoded again: ", $test, $br;
$test = mb_convert_encoding($test, "auto", "UTF7-IMAP");
echo " decoded again: ", $test, $br, $br;
$test = mb_convert_encoding($plain, "UTF7-IMAP", "ISO-8859-1");
echo " plain encoded: ", $test, $br;
$test = mb_convert_encoding($test, "auto", "UTF7-IMAP");
echo " decoded: ", $test, $br, $br;

echo "<strong>imap_utf7_decode test</strong>", $br;
$test = imap_utf7_decode($folder);
echo "folder decoded: ", $test, $br;
$test = imap_utf7_encode($test);
echo " encoded again: ", $test, $br;
$test = imap_utf7_decode($test);
echo " decoded again: ", $test, $br, $br;
$test = imap_utf7_encode($plain);
echo " plain encoded: ", $test, $br;
//$test = imap_utf7_encode($unicode);
//echo "unicode encoded: ", $test, $br;
$test = imap_utf7_decode($test);
echo " decoded: ", $test, $br;
echo "</body></html>";
?>

And here is the output:

folder (modified UTF-7): &AMk-l&AOk-ments envoy&AOk-s
plain (Latin1): Boîte de réception

mb_convert_encoding test
folder decoded: Éléments envoyés
encoded again: &AMk-l&AOk-ments envoy&AOk-s
decoded again: Éléments envoyés
plain encoded: Bo&AO4-te de r&AOk-ception
decoded: Boîte de réception

imap_utf7_decode test
folder decoded: ?l?ments envoy?s
encoded again: &A8l-l&A+p-ments envoy&AfM-s
decoded again: ?l?ments envoy?s
plain encoded: Bo&7s-te de r&6g-ception
decoded: Bo?te de r?ception

--------------

As you can see, I've found a workaround by using the function mb_convert_encoding. Is imap_utf7_* really broken or what?

Thanks.
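For readers who want to check by hand what a name such as "&AMk-l&AOk-ments envoy&AOk-s" stands for, here is a minimal C sketch of an RFC 2060 modified UTF-7 decoder. It is not the code from ext/imap or from c-client: the names mutf7_to_utf8, unb64 and put_utf8 are made up for this illustration, the choice of UTF-8 as the output encoding is an assumption, and surrogate pairs are deliberately not combined.

```c
#include <stddef.h>

/* 6-bit value of a modified-base64 character, or -1.
 * RFC 2060 uses ',' where standard base64 uses '/'. */
static int unb64(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == ',') return 63;
    return -1;
}

/* Write one 16-bit code unit as UTF-8; returns bytes written (1 to 3). */
static size_t put_utf8(unsigned int cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: decode a modified UTF-7 mailbox name into UTF-8.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
int mutf7_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t o = 0;

    if (outsize == 0) return -1;
    while (*in) {
        if (*in != '&') {                  /* plain ASCII passes through */
            if (o + 1 >= outsize) return -1;
            out[o++] = *in++;
            continue;
        }
        in++;                              /* skip the '&' shift character */
        if (*in == '-') {                  /* "&-" is a literal '&' */
            if (o + 1 >= outsize) return -1;
            out[o++] = '&';
            in++;
            continue;
        }
        unsigned long bits = 0;            /* pending bits, newest at the bottom */
        int nbits = 0;
        while (*in && *in != '-') {
            int v = unb64((unsigned char)*in++);
            if (v < 0) return -1;
            bits = (bits << 6) | (unsigned long)v;
            nbits += 6;
            if (nbits >= 16) {             /* a full UTF-16 code unit is ready */
                nbits -= 16;
                unsigned int cu = (unsigned int)((bits >> nbits) & 0xFFFF);
                bits &= (1UL << nbits) - 1;
                if (o + 3 >= outsize) return -1;
                o += put_utf8(cu, out + o);
            }
        }
        if (*in != '-') return -1;         /* shift sequence never closed */
        in++;
    }
    out[o] = '\0';
    return 0;
}
```

Called on the folder name from the report, this sketch yields the UTF-8 bytes c3 89 6c c3 a9 6d 65 6e 74 73 20 65 6e 76 6f 79 c3 a9 73, that is, "Éléments envoyés".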
I have the same problem with imap_utf7_decode() in PHP v4.2.2. The script 'test_utf7.php3' is attached to this posting to illustrate the problem.

According to the PHP documentation, imap_utf7_decode() returns "the decoded 8bit data", but the documentation says nothing about the encoding of the returned "8bit data".

When I try to decode a folder named 'test&AN9ZJw-', imap_utf7_decode() returns the following string:

0x74, 0x65, 0x73, 0x74, 0x00, 0xDF, 0x59, 0x27

It looks like a UTF-16 (UCS-2) string with the 0x00 bytes missing for the ASCII characters. If I'm right and imap_utf7_decode() returns a UTF-16 string, this string should be represented as:

0x00, 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00, 0xDF, 0x59, 0x27

To fix this problem I wrote a patch for ext/imap/php_imap.c and attach it to this posting.

Best regards,
Gamid Isayev

--- test_utf7.php3 ------------------------------------
<HTML>
<HEAD>
<TITLE>Test UTF7</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-8">
</HEAD>
<BODY>
<?
$folder = 'test&AN9ZJw-';
echo "folder (modified UTF-7): $folder<BR><BR>\n";

echo "<strong>mb_convert_encoding test</strong><BR>\n";
$test = $folder;
$test = mb_convert_encoding($test, "UTF-8", "UTF7-IMAP");
echo " folder decoded: [$test]<BR>\n";
$test = mb_convert_encoding($test, "UTF7-IMAP", "UTF-8");
echo "encoded again: [", $test, "]<BR>\n";
$test = mb_convert_encoding($test, "UTF-8", "UTF7-IMAP");
echo "decoded again: [", $test, "]<BR><BR>\n";

echo "<strong>imap_utf7_decode test</strong><BR>\n";
$test = $folder;
$test = imap_utf7_decode($test);
echo "folder decoded: [", $test, "]<BR>\n";
$test = imap_utf7_encode($test);
echo "encoded again: [", $test, "]<BR>\n";
$test = imap_utf7_decode($test);
echo "decoded again: [", $test, "]<BR><BR>\n";
?>
</BODY>
</HTML>
--- end of test_utf7.php3 -----------------------------

--- ext/imap/php_imap.c -------------------------------
--- php_imap.c Fri Jul 26 17:25:10 2002
+++ php_imap.c Fri Jul 26 17:26:28 2002
@@ -2215,7 +2215,7 @@
                 php_error(E_WARNING, "imap_utf7_decode: Invalid modified UTF-7 character: `%c'", *inp);
                 RETURN_FALSE;
             } else if (*inp != '&') {
-                outlen++;
+                outlen += 2;
             } else if (inp + 1 == endp) {
                 php_error(E_WARNING, "imap_utf7_decode: Unexpected end of string");
                 RETURN_FALSE;
@@ -2272,8 +2272,11 @@
             if (*inp == '&' && inp[1] != '-') {
                 state = ST_DECODE0;
             }
-            else if ((*outp++ = *inp) == '&') {
-                inp++;
+            else {
+                *outp++ = 0x00;
+                if ((*outp++ = *inp) == '&') {
+                    inp++;
+                }
             }
         }
         else if (*inp == '-') {
--- end of ext/imap/php_imap.c ------------------------

This is the updated patch for 'ext/imap/php_imap.c':

--- php_imap.c Mon Jul 29 15:17:45 2002
+++ php_imap.c Mon Jul 29 15:18:27 2002
@@ -2215,14 +2215,14 @@
                 php_error(E_WARNING, "imap_utf7_decode: Invalid modified UTF-7 character: `%c'", *inp);
                 RETURN_FALSE;
             } else if (*inp != '&') {
-                outlen++;
+                outlen += 2;
             } else if (inp + 1 == endp) {
                 php_error(E_WARNING, "imap_utf7_decode: Unexpected end of string");
                 RETURN_FALSE;
             } else if (inp[1] != '-') {
                 state = ST_DECODE0;
             } else {
-                outlen++;
+                outlen += 2;
                 inp++;
             }
         } else if (*inp == '-') {
@@ -2272,8 +2272,11 @@
             if (*inp == '&' && inp[1] != '-') {
                 state = ST_DECODE0;
             }
-            else if ((*outp++ = *inp) == '&') {
-                inp++;
+            else {
+                *outp++ = 0x00;
+                if ((*outp++ = *inp) == '&') {
+                    inp++;
+                }
             }
         }
         else if (*inp == '-') {

JFYI:

cvs -d :pserver:cvsread@cvs.php.net:/repository co php4
cd php4/ext/imap/
cvs update -r 1.112.2.1 php_imap.c
patch php_imap.c _file_with_my_patch_
cvs update -A php_imap.c
cvs ci php_imap.c

As a result:

--- php_imap.c 5 Aug 2002 21:53:09 -0000 1.134
+++ php_imap.c 6 Aug 2002 23:00:31 -0000
@@ -2077,14 +2077,14 @@
                 php_error(E_WARNING, "%s(): Invalid modified UTF-7 character: `%c'", get_active_function_name(TSRMLS_C), *inp);
                 RETURN_FALSE;
             } else if (*inp != '&') {
-                outlen++;
+                outlen += 2;
             } else if (inp + 1 == endp) {
                 php_error(E_WARNING, "%s(): Unexpected end of string", get_active_function_name(TSRMLS_C));
                 RETURN_FALSE;
             } else if (inp[1] != '-') {
                 state = ST_DECODE0;
             } else {
-                outlen++;
+                outlen += 2;
                 inp++;
             }
         } else if (*inp == '-') {
@@ -2134,8 +2134,11 @@
             if (*inp == '&' && inp[1] != '-') {
                 state = ST_DECODE0;
             }
-            else if ((*outp++ = *inp) == '&') {
-                inp++;
+            else {
+                *outp++ = 0x00;
+                if ((*outp++ = *inp) == '&') {
+                    inp++;
+                }
             }
         }
         else if (*inp == '-') {

Robert,

The following patch fixes both imap_utf7_encode() and imap_utf7_decode() to work with UTF-16.

PS: this patch is for PHP 4.2.2; the patch for CVS is posted in php.dev.

Gamid Isayev

--- php_imap.c Wed Aug 7 15:45:53 2002
+++ php_imap.c Thu Aug 8 14:24:16 2002
@@ -2215,14 +2215,14 @@
                 php_error(E_WARNING, "imap_utf7_decode: Invalid modified UTF-7 character: `%c'", *inp);
                 RETURN_FALSE;
             } else if (*inp != '&') {
-                outlen++;
+                outlen += 2;
             } else if (inp + 1 == endp) {
                 php_error(E_WARNING, "imap_utf7_decode: Unexpected end of string");
                 RETURN_FALSE;
             } else if (inp[1] != '-') {
                 state = ST_DECODE0;
             } else {
-                outlen++;
+                outlen += 2;
                 inp++;
             }
         } else if (*inp == '-') {
@@ -2272,8 +2272,11 @@
             if (*inp == '&' && inp[1] != '-') {
                 state = ST_DECODE0;
             }
-            else if ((*outp++ = *inp) == '&') {
-                inp++;
+            else {
+                *outp++ = 0x00;
+                if ((*outp++ = *inp) == '&') {
+                    inp++;
+                }
             }
         }
         else if (*inp == '-') {
@@ -2349,29 +2352,42 @@
     outlen = 0;
     state = ST_NORMAL;
     endp = (inp = in) + inlen;
-    while (inp < endp) {
+    while (inp < endp || state != ST_NORMAL) {
         if (state == ST_NORMAL) {
-            if (SPECIAL(*inp)) {
+            if (*inp == 0x00 && *(inp+1) < 0x80) {
+                /* ASCII character */
+                outlen++; // for ASCII char
+                if (*(inp+1) == '&')
+                    outlen++; // for '-'
+                inp += 2;
+            } else {
+                /* begin encoding */
                 state = ST_ENCODE0;
-                outlen++;
-            } else if (*inp++ == '&') {
+                outlen++; // for '&'
+            }
+        } else if (inp == endp || (*inp == 0x00 && *(inp+1) < 0x80)) {
+            /* flush overflow and terminate region */
+            if (state != ST_ENCODE0) {
                 outlen++;
             }
-            outlen++;
-        } else if (!SPECIAL(*inp)) {
+            outlen++; // for '-'
             state = ST_NORMAL;
         } else {
-            /* ST_ENCODE0 -> ST_ENCODE1 - two chars
-             * ST_ENCODE1 -> ST_ENCODE2 - one char
-             * ST_ENCODE2 -> ST_ENCODE0 - one char
-             */
-            if (state == ST_ENCODE2) {
-                state = ST_ENCODE0;
-            }
-            else if (state++ == ST_ENCODE0) {
-                outlen++;
+            switch (state) {
+                case ST_ENCODE0:
+                    outlen++;
+                    state = ST_ENCODE1;
+                    break;
+                case ST_ENCODE1:
+                    outlen++;
+                    state = ST_ENCODE2;
+                    break;
+                case ST_ENCODE2:
+                    outlen += 2;
+                    state = ST_ENCODE0;
+                case ST_NORMAL:
+                    break;
             }
-            outlen++;
             inp++;
         }
     }
@@ -2388,14 +2404,17 @@
     endp = (inp = in) + inlen;
     while (inp < endp || state != ST_NORMAL) {
         if (state == ST_NORMAL) {
-            if (SPECIAL(*inp)) {
+            if (*inp == 0x00 && *(inp+1) < 0x80) {
+                /* ASCII character */
+                inp++;
+                if ((*outp++ = *inp++) == '&')
+                    *outp++ = '-';
+            } else {
                 /* begin encoding */
                 *outp++ = '&';
                 state = ST_ENCODE0;
-            } else if ((*outp++ = *inp++) == '&') {
-                *outp++ = '-';
             }
-        } else if (inp == endp || !SPECIAL(*inp)) {
+        } else if (inp == endp || (*inp == 0x00 && *(inp+1) < 0x80)) {
             /* flush overflow and terminate region */
             if (state != ST_ENCODE0) {
                 *outp++ = B64(*outp);

Robert Marchand wrote:
> this will not work without changing current applications.

Right now it does not work at all for non-ASCII characters. Example:

For the IMAP folder name "test&WSc-" ("test" + a Chinese character), the current imap_utf7_decode() returns "testY'".
For the IMAP folder name "testY'", the current imap_utf7_decode() also returns "testY'".

So, what will you do in this case?

> The problem is that these function try to encode and decode without
> knowing the charset used.

1) imap_utf7_decode() does not need to know the charset of the input string, because the input string is encoded in modified UTF-7.
2) If you specify a charset for the imap_utf7_decode() output string, what will you do when an IMAP folder name has characters from different charsets (example: "test&BCQA31kn-" - ASCII, Russian, German, Chinese)?

> As it is now, 8 bit is expected from imap_utf7_decode.
<...skipped...>
> It should really be:
> imap_utf7_utf8_decode
> imap_utf7_utf16_decode (patched version)
> imap_utf8_utf7_encode
> imap_utf16_utf7_encode (patched version)

I think you are confusing "8 bit" and UTF-8. A UTF-8 encoded character is "8 bit" only for ASCII characters. For non-ASCII characters, UTF-8 takes two or more bytes. So, imap_utf7_decode() != imap_utf7_utf8_decode().

Gamid Isayev

Hi,

You're right about the "utf8" thing. I meant "8bit". For the rest, I cannot change the software I use (Horde/IMP) because it is not me who wrote it. I can assure you it will break if you go with your mods. That is the general problem.

Now it seems I also have a specific problem here with my SGI platform. Here is what I get with your patch:

folder (modified UTF-7): test&AN9ZJw-

mb_convert_encoding test
folder decoded: [testß大]
(hexa: 74 65 73 74 c3 9f e5 a4 a7 )
encoded again: [test&AN9ZJw-]
decoded again: [testß大]
(hexa: 74 65 73 74 c3 9f e5 a4 a7 )

imap_utf7_decode test
folder decoded: [
(hexa: 0 74 0 65 0 73 0 74 0 d0 59 24 )
encoded again: [test&ANBZJA-]
decoded again: [
(hexa: 0 74 0 65 0 73 0 74 5 d9 59 24 )

Here is another sample:

folder (modified UTF-7): &AMk-l&AOk-ments envoy&AOk-s

mb_convert_encoding test
folder decoded: [Éléments envoyés]
(hexa: c3 89 6c c3 a9 6d 65 6e 74 73 20 65 6e 76 6f 79 c3 a9 73 )
encoded again: [&AMk-l&AOk-ments envoy&AOk-s]
decoded again: [Éléments envoyés]
(hexa: c3 89 6c c3 a9 6d 65 6e 74 73 20 65 6e 76 6f 79 c3 a9 73 )

imap_utf7_decode test
folder decoded: [ ?
(hexa: 6 d3 0 6c 0 e0 0 6d 0 65 0 6e 0 74 0 73 0 20 0 65 0 6e 0 76 0 6f 0 79 0 e0 0 73 )
encoded again: [&BPp-l&A,g-ments envoy&Afa-s]
decoded again: [ ?
(hexa: 4 fb 0 6c 0 fb 0 6d 0 65 0 6e 0 74 0 73 0 20 0 65 0 6e 0 76 0 6f 0 79 0 fc 0 73 )

Here is the PHP test page that generates this output:

<HTML>
<HEAD>
<TITLE>Test UTF7</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=utf-16">
</HEAD>
<BODY>
<?
function hexstr($s) {
    echo "(hexa: ";
    for ($i=0;$i<strlen($s);$i++) {
        echo dechex(ord($s[$i])), " ";
    }
    echo ")<br>";
}

//$folder = 'test&AN9ZJw-';
$folder = '&AMk-l&AOk-ments envoy&AOk-s';
echo "folder (modified UTF-7): $folder<BR><BR>\n";

echo "<strong>mb_convert_encoding test</strong><BR>\n";
$test = $folder;
$test = mb_convert_encoding($test, "UTF-8", "UTF7-IMAP");
echo " folder decoded: [$test]<BR>\n";
hexstr($test);
$test = mb_convert_encoding($test, "UTF7-IMAP", "UTF-8");
echo "encoded again: [", $test, "]<BR>\n";
$test = mb_convert_encoding($test, "UTF-8", "UTF7-IMAP");
echo "decoded again: [", $test, "]<BR>\n";
hexstr($test);

echo "<BR><strong>imap_utf7_decode test</strong><BR>\n";
$test = $folder;
$test = imap_utf7_decode($test);
echo "folder decoded: [", $test, "]<BR>\n";
hexstr($test);
$test = imap_utf7_encode($test);
echo "encoded again: [", $test, "]<BR>\n";
$test = imap_utf7_decode($test);
echo "decoded again: [", $test, "]<BR>\n";
hexstr($test);
?>
</BODY>
</HTML>

I am on a 64-bit platform. Could this be related to a wrapping shift? There is definitely something wrong here.

Thanks.

Hi,

I've found the culprit for the SGI problem. It is related to the auto-increment operator combined with a compound assignment.
This doesn't work on SGI:

*outp++ |= outp[1] >> 2;

Here is my patch that corrects the two functions on SGI with the SGI compiler (MIPSpro):

--- php_imap.c.nowarn Tue Jul 30 10:04:24 2002
+++ php_imap.c Tue Aug 13 11:44:50 2002
@@ -2187,6 +2187,7 @@
     zval **arg;
     const unsigned char *in, *inp, *endp;
     unsigned char *out, *outp;
+    unsigned char c;
     int inlen, outlen;
     enum {
         ST_NORMAL, /* printable text */
@@ -2289,13 +2290,15 @@
                 break;
             case ST_DECODE1:
                 outp[1] = UNB64(*inp);
-                *outp++ |= outp[1] >> 4;
+                c = outp[1] >> 4;
+                *outp++ |= c;
                 *outp <<= 4;
                 state = ST_DECODE2;
                 break;
             case ST_DECODE2:
                 outp[1] = UNB64(*inp);
-                *outp++ |= outp[1] >> 2;
+                c = outp[1] >> 2;
+                *outp++ |= c;
                 *outp <<= 6;
                 state = ST_DECODE3;
                 break;
@@ -2329,6 +2332,7 @@
     zval **arg;
     const unsigned char *in, *inp, *endp;
     unsigned char *out, *outp;
+    unsigned char c;
     int inlen, outlen;
     enum {
         ST_NORMAL, /* printable text */
@@ -2399,7 +2403,8 @@
         } else if (inp == endp || !SPECIAL(*inp)) {
             /* flush overflow and terminate region */
             if (state != ST_ENCODE0) {
-                *outp++ = B64(*outp);
+                c = B64(*outp);
+                *outp++ = c;
             }
             *outp++ = '-';
             state = ST_NORMAL;
@@ -2412,12 +2417,14 @@
                 state = ST_ENCODE1;
                 break;
             case ST_ENCODE1:
-                *outp++ = B64(*outp | *inp >> 4);
+                c = B64(*outp | *inp >> 4);
+                *outp++ = c;
                 *outp = *inp++ << 2;
                 state = ST_ENCODE2;
                 break;
             case ST_ENCODE2:
-                *outp++ = B64(*outp | *inp >> 6);
+                c = B64(*outp | *inp >> 6);
+                *outp++ = c;
                 *outp++ = B64(*inp++);
                 state = ST_ENCODE0;
             case ST_NORMAL:

This patch was made against the original php_imap.c (4.2.2), but it can also be applied to the new version from Gamid Isayev.

Thanks.
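A note on why the temporary helps: in `*outp++ |= outp[1] >> 2;` the right-hand side reads outp while the left-hand side increments it, and C places no sequence point between the two, so the result is undefined and may differ between compilers. The sketch below isolates the decode steps from the patch in a standalone function so the ordering can be tried outside PHP. It is only an illustration: decode_sextets and sextet_value are hypothetical names, and the numeric states stand in for ST_DECODE0 to ST_DECODE3.

```c
#include <stddef.h>

/* 6-bit value of a modified-base64 character (RFC 2060 alphabet), or -1. */
static int sextet_value(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == ',') return 63;
    return -1;
}

/* Hypothetical helper: unpack base64 sextets into bytes in place.
 * Each new sextet is parked in outp[1], its top bits are OR-ed into
 * *outp, and the remainder is shifted into position for the next step.
 * `out` needs room for strlen(in) + 2 bytes.
 * Returns the number of complete bytes, or -1 on an invalid character. */
int decode_sextets(const char *in, unsigned char *out)
{
    unsigned char *outp = out;
    unsigned char c;
    int state = 0;
    int v;

    for (; *in; in++) {
        v = sextet_value((unsigned char)*in);
        if (v < 0) return -1;
        switch (state) {
        case 0:                       /* first sextet: top 6 bits of a byte */
            *outp = (unsigned char)(v << 2);
            state = 1;
            break;
        case 1:
            outp[1] = (unsigned char)v;
            c = outp[1] >> 4;         /* read through outp first ... */
            *outp++ |= c;             /* ... then advance it in a separate statement */
            *outp <<= 4;
            state = 2;
            break;
        case 2:
            outp[1] = (unsigned char)v;
            c = outp[1] >> 2;
            *outp++ |= c;
            *outp <<= 6;
            state = 3;
            break;
        default:                      /* fourth sextet completes the third byte */
            *outp++ |= (unsigned char)v;
            state = 0;
            break;
        }
    }
    return (int)(outp - out);
}
```

Fed the sextets "AN9ZJw" from the folder name test&AN9ZJw-, this gives the bytes 00 df 59 27. If the two OR steps contribute nothing, the same run leaves 00 d0 59 24, which matches the bytes in the SGI output quoted earlier in the thread.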