[2004-06-05 23:55 UTC] php at richardneill dot org
Description:
------------
Feature request: str_demoronise()

On my website, I often find users pasting content that was written in Microsoft Word, and which contains undisplayable "ASCII" characters where there should be single/double quotes. Anyone viewing the result on a non-MS platform gets to see rectangles instead of quotes.

The problem has been solved in perl here:
http://www.fourmilab.ch/webtools/demoroniser/

I quote:
============
Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.
=============

I'd like to suggest the addition of a str_demoronise() function which fixes these wrong characters, and replaces them by the correct ASCII.

Reproduce code:
---------------
From the source of demoroniser, here are the substitutions made. The MS column is what Microsoft use (in hex); the FIX column is the replacement:

MS      FIX
0x82    ,
0x83    <em>f</em>
0x84    ,,
0x85    ...
0x88    ^
0x89    ' ?/??'   <-- whitespace; no '' quotes
0x8B    <
0x8C    Oe
0x91    `
0x92    '
0x93    "
0x94    "
0x95    *
0x96    -
0x97    --
0x98    <sup>~</sup>
0x99    <sup>TM</sup>
0x9B    >
0x9C    oe
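
The substitution table in the request could be sketched in userland PHP roughly as follows. This is only an illustration: the function name demoronise_ascii() is hypothetical (it is not a PHP built-in), the input is assumed to be a single-byte CP1252 string, and the 0x89 row is left out because its replacement is not legible in the report.

```php
<?php
// Hypothetical userland helper; NOT a PHP built-in.
// Assumes $s is a single-byte (CP1252) string, as in the report.
function demoronise_ascii(string $s): string
{
    // Keys are single CP1252 bytes; values are the FIX column from the report.
    // 0x89 is omitted: its replacement is garbled in the report.
    static $map = [
        "\x82" => ',',
        "\x83" => '<em>f</em>',
        "\x84" => ',,',
        "\x85" => '...',
        "\x88" => '^',
        "\x8B" => '<',
        "\x8C" => 'Oe',
        "\x91" => '`',
        "\x92" => "'",
        "\x93" => '"',
        "\x94" => '"',
        "\x95" => '*',
        "\x96" => '-',
        "\x97" => '--',
        "\x98" => '<sup>~</sup>',
        "\x99" => '<sup>TM</sup>',
        "\x9B" => '>',
        "\x9C" => 'oe',
    ];

    // strtr() with an array replaces every key found in the string
    // and never rescans text it has already replaced.
    return strtr($s, $map);
}
```

Because the replacement works byte by byte, it should not be applied to text that is already UTF-8: bytes in the 0x80-0xBF range also occur inside multi-byte UTF-8 sequences and would be mangled.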
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Mon Nov 17 05:00:01 2025 UTC
"No report body text" meaning we can't see the original report either. Still, wontfix. It was an issue with character encoding and is solved by picking and using one encoding for the website and database. When properly instructed, the browser will observe your encoding request and transparently recode the input. To the problem itself, what is needed is not some new built-in "demoronizing" function but something that understands the real underlying issue: the fact that Windows' "extension" to Latin1 is its own encoding named Windows-1252/CP1252. mbstring and iconv handle that. The userland solution is effectively a one-liner: function str_demoronize($string) { return mb_convert_encoding($string, "utf-8", "cp1252"); // return iconv("cp1252", "utf-8", $string); }