Request #20809 iconv() options
Submitted: 2002-12-04 08:16 UTC Modified: 2003-07-02 14:16 UTC
From: flying at dom dot natm dot ru Assigned:
Status: Closed Package: Feature/Change Request
PHP Version: 4.3.0RC2 OS: All
Private report: No CVE-ID: None
 [2002-12-04 08:16 UTC] flying at dom dot natm dot ru
 It would be very useful if iconv() accepted optional arguments equivalent to the -c and -s options of the iconv command-line tool.
 It would also be especially useful for XML-related code to have an option that converts all unconvertible characters into numeric entities.

 Thank you all for your work!

 [2002-12-04 23:43 UTC] moriyoshi@php.net
You can achieve that by appending "//IGNORE" to the name of the codeset the string is being converted to.

For example:
<?php
  $bar = iconv("UTF-8", "KOI8-R//IGNORE", $foo);
?>

Note that this is not portable, since most iconv implementations don't support it. As far as I know, only glibc's iconv can handle this.
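
Since support for the suffix depends on the iconv implementation, it seems safer to check the return value than to assume the conversion worked. A minimal sketch, assuming the iconv extension is loaded; the helper name iconvIgnore() is hypothetical and not part of PHP:

```php
<?php
// Hypothetical helper (not part of PHP): convert UTF-8 text and ask iconv to
// drop characters that the target encoding cannot represent.
function iconvIgnore(string $utf8, string $encoding): ?string
{
    // "//IGNORE" is an implementation-specific suffix. Where it is not
    // supported, iconv() may fail, so a false result is mapped to null.
    $result = @iconv('UTF-8', $encoding . '//IGNORE', $utf8);
    return $result === false ? null : $result;
}
```

A caller would then test the result against null before using it, for example falling back to a different strategy when the conversion is reported as failed.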

 [2003-07-02 13:47 UTC] Xuefer at 21cn dot com
libxml2 is said to do it this way (converting into numeric entities) on top of iconv, which suggests it is possible, though I'm not sure.

If it is possible, I think it would be fine for PHP itself to implement "//IGNORE":
simply scan for the "//IGNORE" suffix, then keep copying whenever iconv reports an unconvertible character.

This is badly needed to avoid truncating the whole content when only one character is unconvertible.
Many thanks.
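
One way to read the idea above is to convert one character at a time and substitute something whenever iconv reports an error. A rough sketch under several assumptions: the iconv and PCRE extensions are loaded, the target encoding is ASCII-compatible, and the iconv implementation fails (rather than silently substituting) on unconvertible characters. The helper name iconvWithEntities() is hypothetical.

```php
<?php
// Hypothetical helper: convert UTF-8 text one character at a time and replace
// each character iconv() cannot convert with a decimal numeric entity.
// Assumes an ASCII-compatible target encoding.
function iconvWithEntities(string $utf8, string $encoding): ?string
{
    // Split into characters; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return null;
    }
    $out = '';
    foreach ($chars as $ch) {
        $converted = @iconv('UTF-8', $encoding, $ch);
        if ($converted !== false && $converted !== '') {
            $out .= $converted;
            continue;
        }
        // Conversion failed: obtain the code point via big-endian UCS-4.
        $ucs4 = @iconv('UTF-8', 'UCS-4BE', $ch);
        if ($ucs4 === false || strlen($ucs4) !== 4) {
            return null;
        }
        $out .= '&#' . unpack('N', $ucs4)[1] . ';';
    }
    return $out;
}
```

Calling iconv() once per character is slow for long strings, so this is only meant to show the shape of the approach, not a tuned implementation.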
 [2003-07-02 14:17 UTC] flying at dom dot natm dot ru
Below is a PHP example of what such code might look like. It converts a given string from UTF-8 into the specified encoding.
Note the difference between utf8ToEntities() and utf8ToEntitiesMultibyte(): the first function converts every non-ASCII character
in a string into a numeric entity, while the second converts only characters with code points of 0x0800 and above (those encoded in three or more bytes). This makes it possible, for example,
to get back a normal string with a single numeric entity when only one character in it is unconvertible.
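
To illustrate the difference between the two functions, here is a compact sketch that is not the code from this report: it reproduces the two behaviours with a single threshold parameter. It assumes the PCRE extension with UTF-8 support and well-formed input of up to four bytes per character; the names utf8CodePoint() and entitiesFrom() are hypothetical.

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character (1-4 bytes)
// into its code point.
function utf8CodePoint(string $ch): int
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: replace every non-ASCII character whose code point is
// at least $minCodePoint with a decimal numeric entity.
function entitiesFrom(string $utf8, int $minCodePoint): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m) use ($minCodePoint): string {
            $cp = utf8CodePoint($m[0]);
            return ($cp >= $minCodePoint) ? ('&#' . $cp . ';') : $m[0];
        },
        $utf8
    );
    return $result ?? ''; // null means the input was not valid UTF-8
}

$sample = "caf\xC3\xA9 \xE2\x82\xAC5";   // "café €5" encoded as UTF-8
echo entitiesFrom($sample, 0x80), "\n";  // caf&#233; &#8364;5
echo entitiesFrom($sample, 0x800), "\n"; // café &#8364;5
```

With the threshold at 0x80 every non-ASCII character becomes an entity, as in utf8ToEntities(); with 0x800 the two-byte "é" passes through untouched, as in utf8ToEntitiesMultibyte().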

// Convert a string from UTF-8 into the specified encoding, substituting numeric entities for unconvertible characters
// Parameters:
//   $str - string to convert
//   $encoding - name of the target encoding
    function fromUTF8($str,$encoding)
    {
        if ($str===null)
            return(null);
        $t = iconv('utf-8',$encoding,$str);
        if (($t=='') && ($str!=''))
// iconv() was unable to convert this string into the requested encoding.
        {
// First, try converting only the multibyte characters into entities. This may let us return text in the requested
// encoding with just a few special characters replaced, instead of having every non-ASCII character turned into an entity.
            $str2 = utf8ToEntitiesMultibyte($str);
            $t = iconv('utf-8',$encoding,$str2);
            if ($t!='')
                return($t);
            else
                return(utf8ToEntities($str));
        };
        return($t);
    }

// Convert multibyte characters in a UTF-8 encoded string (UTF-8 as described in RFC 2044) into numeric entities;
// two-byte characters are left as-is
// Parameters:
//   $str - UTF-8 encoded string
    function utf8ToEntitiesMultibyte($str)
    {
        if (!is_string($str))
            return('');
        $i = 0;
        $output = '';
        while($i<strlen($str))
        {
            $char = $str[$i];
            if ((ord($char) & 0x80)==0)
//   0000 0000-0000 007F   0xxxxxxx
                {
                    $output .= $char;
                    $i++;
                }
            elseif ((ord($char)>=0xC0) && (ord($char)<=0xDF))
//   0000 0080-0000 07FF   110xxxxx 10xxxxxx
                {
                    $output .= substr($str,$i,2);
                    $i += 2;
                }
            else
                {
                    $num = 0;
                    if ((ord($char) & 0xFC)==0xFC)
//   0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+5]) & 0x3F) |
                                  ((ord($str[$i+4]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+3]) & 0x3F) << 12) |
                                  ((ord($str[$i+2]) & 0x3F) << 18) |
                                  ((ord($str[$i+1]) & 0x3F) << 24) |
                                  ((ord($str[$i+0]) & 0x01) << 30);
                            $i += 6;
                        }
                    elseif ((ord($char) & 0xF8)==0xF8)
//   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+4]) & 0x3F) |
                                  ((ord($str[$i+3]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+2]) & 0x3F) << 12) |
                                  ((ord($str[$i+1]) & 0x3F) << 18) |
                                  ((ord($str[$i+0]) & 0x03) << 24);
                            $i += 5;
                        }
                    elseif ((ord($char) & 0xF0)==0xF0)
//   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+3]) & 0x3F) |
                                  ((ord($str[$i+2]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+1]) & 0x3F) << 12) |
                                  ((ord($str[$i+0]) & 0x07) << 18);
                            $i += 4;
                        }
                    elseif ((ord($char) & 0xE0)==0xE0)
//   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+2]) & 0x3F) |
                                  ((ord($str[$i+1]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+0]) & 0x0F) << 12);
                            $i += 3;
                        }
                    else
// We should never get here unless the passed string is not valid UTF-8,
// but without this branch we risk falling into an endless loop
                        {
                            $num = ord($char);
                            $i++;
                        };
                    $output .= '&#'.$num.';';
                };
        };
        return($output);
    }

// Convert all non-ASCII characters in a UTF-8 encoded string (UTF-8 as described in RFC 2044) into numeric entities
// Parameters:
//   $str - UTF-8 encoded string
    function utf8ToEntities($str)
    {
        if (!is_string($str))
            return('');
        $i = 0;
        $output = '';
        while($i<strlen($str))
        {
            $char = $str[$i];
            if ((ord($char) & 0x80)==0)
//   0000 0000-0000 007F   0xxxxxxx
                {
                    $output .= $char;
                    $i++;
                }
            else
                {
                    $num = 0;
                    if ((ord($char) & 0xFC)==0xFC)
//   0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+5]) & 0x3F) |
                                  ((ord($str[$i+4]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+3]) & 0x3F) << 12) |
                                  ((ord($str[$i+2]) & 0x3F) << 18) |
                                  ((ord($str[$i+1]) & 0x3F) << 24) |
                                  ((ord($str[$i+0]) & 0x01) << 30);
                            $i += 6;
                        }
                    elseif ((ord($char) & 0xF8)==0xF8)
//   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+4]) & 0x3F) |
                                  ((ord($str[$i+3]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+2]) & 0x3F) << 12) |
                                  ((ord($str[$i+1]) & 0x3F) << 18) |
                                  ((ord($str[$i+0]) & 0x03) << 24);
                            $i += 5;
                        }
                    elseif ((ord($char) & 0xF0)==0xF0)
//   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+3]) & 0x3F) |
                                  ((ord($str[$i+2]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+1]) & 0x3F) << 12) |
                                  ((ord($str[$i+0]) & 0x07) << 18);
                            $i += 4;
                        }
                    elseif ((ord($char) & 0xE0)==0xE0)
//   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+2]) & 0x3F) |
                                  ((ord($str[$i+1]) & 0x3F) << 6 ) |
                                  ((ord($str[$i+0]) & 0x0F) << 12);
                            $i += 3;
                        }
                    elseif ((ord($char) & 0xC0)==0xC0)
//   0000 0080-0000 07FF   110xxxxx 10xxxxxx
                        {
                            $num = (ord($str[$i+1]) & 0x3F) |
                                  ((ord($str[$i+0]) & 0x1F) << 6 );
                            $i += 2;
                        }
                    else
// We should never get here unless the passed string is not valid UTF-8,
// but without this branch we risk falling into an endless loop
                        {
                            $num = ord($char);
                            $i++;
                        };
                    $output .= '&#'.$num.';';
                };
        };
        return($output);
    }
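
The functions above assume well-formed UTF-8 input; given a truncated sequence they would index past the end of the string. A minimal sketch of a check that could be run first, assuming the PCRE extension is built with UTF-8 support; the helper name isValidUtf8() is hypothetical:

```php
<?php
// Hypothetical helper: report whether a string is well-formed UTF-8.
// With the "u" modifier PCRE validates the whole subject before matching;
// preg_match() returns false on invalid input and 1 when the empty
// pattern matches.
function isValidUtf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(isValidUtf8("caf\xC3\xA9")); // bool(true)
var_dump(isValidUtf8("\xE2\x82"));    // bool(false): truncated 3-byte sequence
```

Note that this accepts only sequences of up to four bytes, so the five- and six-byte forms handled by the code above would be rejected by such a check.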
 