[2002-12-04 08:16 UTC] flying at dom dot natm dot ru
It would be very useful if iconv() accepted optional arguments equivalent to the -c and -s options of the iconv command-line tool. For XML-related code it would also be especially useful to have an option that converts all unconvertible characters into numeric entities. Thank you all for your work!
You can achieve that by appending "//IGNORE" to the name of the codeset to which the string is going to be converted. For example:

<?php
$bar = iconv("UTF-8", "KOI8-R//IGNORE", $foo);
?>

Note that this is not portable, since most iconv implementations don't support it. As far as I know, only glibc's iconv can handle this.

Below is a PHP example of what such code might look like. It converts a given string from UTF-8 into the specified encoding. Note the difference between utf8ToEntities() and utf8ToEntitiesMultibyte(): the first function converts every non-ASCII character in the string into a numeric entity, while the second converts only characters with codes of 0x0800 and above. This makes it possible, for example, to get back an ordinary string containing a single numeric entity when only one character in it is unconvertible.

// Convert a string from UTF-8 into the specified encoding and substitute
// numeric entities for unconvertible characters.
// Parameters:
//   $str      - string to convert
//   $encoding - target encoding
function fromUTF8($str, $encoding)
{
    if ($str === null) return(null);
    $t = iconv('utf-8', $encoding, $str);
    if (($t == '') && ($str != ''))   // iconv() is unable to convert this string into the requested encoding.
    {
        // First try to convert only the multibyte characters. This may let us return text in the requested
        // encoding with only a few very special characters as entities, instead of converting the whole text.
        $str2 = utf8ToEntitiesMultibyte($str);
        $t = iconv('utf-8', $encoding, $str2);
        if ($t != '') return($t);
        else return(utf8ToEntities($str));
    }
    return($t);
}

// Convert the characters of a UTF-8 encoded string (as described in RFC 2044)
// that take three or more bytes into numeric entities.
// Parameters:
//   $str - UTF-8 encoded string
function utf8ToEntitiesMultibyte($str)
{
    if (!is_string($str)) return('');
    $i = 0;
    $output = '';
    while ($i < strlen($str))
    {
        // Note: $str[$i] replaces the old $str{$i} offset syntax, which was removed in PHP 8.0.
        $char = $str[$i];
        if ((ord($char) & 0x80) == 0)                        // 0000 0000-0000 007F  0xxxxxxx
        {
            $output .= $char;
            $i++;
        }
        elseif ((ord($char) > 0xC0) && (ord($char) <= 0xDF)) // 0000 0080-0000 07FF  110xxxxx 10xxxxxx
        {
            $output .= substr($str, $i, 2);
            $i += 2;
        }
        else
        {
            $num = 0;
            if ((ord($char) & 0xFC) == 0xFC)      // 0400 0000-7FFF FFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+5]) & 0x3F) |
                       ((ord($str[$i+4]) & 0x3F) << 6 ) |
                       ((ord($str[$i+3]) & 0x3F) << 12) |
                       ((ord($str[$i+2]) & 0x3F) << 18) |
                       ((ord($str[$i+1]) & 0x3F) << 24) |
                       ((ord($str[$i+0]) & 0x01) << 30);
                $i += 6;
            }
            elseif ((ord($char) & 0xF8) == 0xF8)  // 0020 0000-03FF FFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+4]) & 0x3F) |
                       ((ord($str[$i+3]) & 0x3F) << 6 ) |
                       ((ord($str[$i+2]) & 0x3F) << 12) |
                       ((ord($str[$i+1]) & 0x3F) << 18) |
                       ((ord($str[$i+0]) & 0x03) << 24);
                $i += 5;
            }
            elseif ((ord($char) & 0xF0) == 0xF0)  // 0001 0000-001F FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+3]) & 0x3F) |
                       ((ord($str[$i+2]) & 0x3F) << 6 ) |
                       ((ord($str[$i+1]) & 0x3F) << 12) |
                       ((ord($str[$i+0]) & 0x07) << 18);
                $i += 4;
            }
            elseif ((ord($char) & 0xE0) == 0xE0)  // 0000 0800-0000 FFFF  1110xxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+2]) & 0x3F) |
                       ((ord($str[$i+1]) & 0x3F) << 6 ) |
                       ((ord($str[$i+0]) & 0x0F) << 12);
                $i += 3;
            }
            else
            // We should never get here unless the string passed in is not valid UTF-8,
            // but without this branch we risk falling into an endless loop.
            {
                $num = ord($char);
                $i++;
            }
            $output .= '&#'.$num.';';
        }
    }
    return($output);
}

// Convert all non-ASCII characters of a UTF-8 encoded string (as described in RFC 2044)
// into numeric entities.
// Parameters:
//   $str - UTF-8 encoded string
function utf8ToEntities($str)
{
    if (!is_string($str)) return('');
    $i = 0;
    $output = '';
    while ($i < strlen($str))
    {
        $char = $str[$i];
        if ((ord($char) & 0x80) == 0)             // 0000 0000-0000 007F  0xxxxxxx
        {
            $output .= $char;
            $i++;
        }
        else
        {
            $num = 0;
            if ((ord($char) & 0xFC) == 0xFC)      // 0400 0000-7FFF FFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+5]) & 0x3F) |
                       ((ord($str[$i+4]) & 0x3F) << 6 ) |
                       ((ord($str[$i+3]) & 0x3F) << 12) |
                       ((ord($str[$i+2]) & 0x3F) << 18) |
                       ((ord($str[$i+1]) & 0x3F) << 24) |
                       ((ord($str[$i+0]) & 0x01) << 30);
                $i += 6;
            }
            elseif ((ord($char) & 0xF8) == 0xF8)  // 0020 0000-03FF FFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+4]) & 0x3F) |
                       ((ord($str[$i+3]) & 0x3F) << 6 ) |
                       ((ord($str[$i+2]) & 0x3F) << 12) |
                       ((ord($str[$i+1]) & 0x3F) << 18) |
                       ((ord($str[$i+0]) & 0x03) << 24);
                $i += 5;
            }
            elseif ((ord($char) & 0xF0) == 0xF0)  // 0001 0000-001F FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+3]) & 0x3F) |
                       ((ord($str[$i+2]) & 0x3F) << 6 ) |
                       ((ord($str[$i+1]) & 0x3F) << 12) |
                       ((ord($str[$i+0]) & 0x07) << 18);
                $i += 4;
            }
            elseif ((ord($char) & 0xE0) == 0xE0)  // 0000 0800-0000 FFFF  1110xxxx 10xxxxxx 10xxxxxx
            {
                $num = (ord($str[$i+2]) & 0x3F) |
                       ((ord($str[$i+1]) & 0x3F) << 6 ) |
                       ((ord($str[$i+0]) & 0x0F) << 12);
                $i += 3;
            }
            elseif ((ord($char) & 0xC0) == 0xC0)  // 0000 0080-0000 07FF  110xxxxx 10xxxxxx
            {
                $num = (ord($str[$i+1]) & 0x3F) |
                       ((ord($str[$i+0]) & 0x1F) << 6 );
                $i += 2;
            }
            else
            // We should never get here unless the string passed in is not valid UTF-8,
            // but without this branch we risk falling into an endless loop.
            {
                $num = ord($char);
                $i++;
            }
            $output .= '&#'.$num.';';
        }
    }
    return($output);
}
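
For comparison, here is a minimal sketch of the same idea done one character at a time, so that only the characters iconv() actually rejects become entities. It is not part of the code above, and it rests on several assumptions: the names convertWithEntities() and utf8CodePoint() are hypothetical, the input is valid UTF-8, the local iconv implementation returns false for a character it cannot convert (as glibc's does) rather than substituting a placeholder, and it supports the "UTF-32BE" codeset name.

```php
<?php
// Hypothetical helper: return the Unicode code point of a single UTF-8 character.
// Assumes the local iconv knows the "UTF-32BE" codeset.
function utf8CodePoint(string $char): int
{
    // UTF-32BE stores each code point as one big-endian 32-bit integer.
    $utf32 = iconv('UTF-8', 'UTF-32BE', $char);
    return unpack('N', $utf32)[1];
}

// Hypothetical helper: convert UTF-8 text to $encoding, replacing every
// character that iconv() cannot convert with a decimal numeric entity.
function convertWithEntities(string $str, string $encoding): string
{
    // The /u modifier makes PCRE split on whole UTF-8 characters, not bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }

    $out = '';
    foreach ($chars as $char) {
        // @ hides the diagnostic iconv() raises for an unconvertible character.
        $converted = @iconv('UTF-8', $encoding, $char);
        if ($converted === false || $converted === '') {
            $out .= '&#' . utf8CodePoint($char) . ';';
        } else {
            $out .= $converted;
        }
    }
    return $out;
}
```

Calling iconv() once per character is slow on long strings, so a real implementation would probably try the whole string first and fall back to this loop only on failure, as fromUTF8() does. On a glibc-based system, converting a string such as "5 €" to ISO-8859-1 this way should leave "5 " untouched and turn only the euro sign into an entity.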