Request #28646 RFE: function to fix Microsoft "smart quotes" and other wrong characters
Submitted: 2004-06-05 23:55 UTC Modified: 2014-07-17 18:41 UTC
Votes:9
Avg. Score:4.4 ± 0.7
Reproduced:9 of 9 (100.0%)
Same Version:3 (33.3%)
Same OS:3 (33.3%)
From: php at richardneill dot org Assigned:
Status: Wont fix Package: *General Issues
PHP Version: 4.3.6 OS:
Private report: No CVE-ID: None
 [2004-06-05 23:55 UTC] php at richardneill dot org
Description:
------------
Feature request: str_demoronise()

On my website, I often find users pasting content that was written in Microsoft Word, and which contains undisplayable "ASCII" characters where there should be single/double quotes. Anyone viewing the result on a non-MS platform gets to see rectangles instead of quotes.

The problem has been solved in perl here:
http://www.fourmilab.ch/webtools/demoroniser/
I quote: 
============
Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.
=============

I'd like to suggest the addition of a str_demoronise() function which fixes these wrong characters, and replaces them by the correct ASCII.




Reproduce code:
---------------
From the source of demoroniser, here are the substitutions made. The MS column is what Microsoft use (in Hex); the FIX column is the replacement:

MS      FIX

0x82    ,
0x83    <em>f</em>
0x84    ,,
0x85    ...
0x88    ^
0x89    ' °/°°'            <-- leading whitespace; the '' quotes are not part of the replacement
0x8B    <
0x8C    Oe
0x91    `
0x92    '
0x93    "
0x94    "
0x95    *
0x96    -
0x97    --
0x98    <sup>~</sup>
0x99    <sup>TM</sup>
0x9B    >
0x9C    oe


 [2004-06-06 00:59 UTC] php at richardneill dot org
For safety's sake, it's probably wiser to have

&lt; 
&gt; 
\`
\'
\"

as the replacements.

Otherwise, we have a nice big security hole, since magic_quotes gets bypassed.
 [2004-06-06 23:07 UTC] papercrane at reversefold dot com
If you're really worried about magic_quotes (which I don't use anyway...), then str_demoroniser should be magic_quotes aware, escaping quotes only if magic_quotes_runtime is on.

Or perhaps it should be a second parameter to the function to escape quotes or not. Making it do one or the other would break *someone's* scripts.
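
The second-parameter idea could look roughly like the sketch below. The function name and the $escape flag are hypothetical, and only the quote and angle-bracket bytes from the earlier comment are handled. The sketch does not consult magic_quotes at all, since that feature was removed in PHP 5.4; the caller decides explicitly.

```php
<?php
// Hypothetical helper: name and $escape parameter are illustrative only.
// Assumes $s is single-byte CP1252 text.
function str_demoronise_quotes(string $s, bool $escape = true): string
{
    // Plain ASCII replacements, as in the original table.
    $plain = [
        "\x8B" => '<',  "\x9B" => '>',
        "\x91" => '`',  "\x92" => "'",
        "\x93" => '"',  "\x94" => '"',
    ];

    // Escaped replacements, as suggested in the follow-up comment.
    $escaped = [
        "\x8B" => '&lt;', "\x9B" => '&gt;',
        "\x91" => '\\`',  "\x92" => "\\'",
        "\x93" => '\\"',  "\x94" => '\\"',
    ];

    return strtr($s, $escape ? $escaped : $plain);
}

echo str_demoronise_quotes("\x8Bb\x9B"), "\n";        // prints: &lt;b&gt;
echo str_demoronise_quotes("\x8Bb\x9B", false), "\n"; // prints: <b>
```

Making the behaviour an explicit argument avoids the problem raised above, where a hidden dependency on an ini setting would change results between installations.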
 [2014-07-15 11:33 UTC] yohgaki@php.net
-Status: Open +Status: Wont fix -Package: Feature/Change Request +Package: *General Issues
 [2014-07-15 11:33 UTC] yohgaki@php.net
No report body text.
If you still have an issue, please open a new report.
 [2014-07-17 17:02 UTC] php at richardneill dot org
Sorry, can I ask you what you mean by "no report body text"? 
I did file this bug with a great amount of detail, and the details were even emailed back to me from the PHP bug website when you updated this bug as wontfix. However, I can't see the original description here, so I'll paste it back in below:

(This bug is now somewhat less relevant than it was, since abuse of Latin-1 is less common, and we can work around it by using UTF-8 everywhere).


Description:
------------
Feature request: str_demoronise()

On my website, I often find users pasting content that was written in Microsoft Word, and which contains undisplayable "ASCII" characters where there should be single/double quotes. Anyone viewing the result on a non-MS platform gets to see rectangles instead of quotes.

The problem has been solved in perl here:
http://www.fourmilab.ch/webtools/demoroniser/
I quote: 
============
Microsoft use their own "extension" to Latin-1, in which a variety of characters which do not appear in Latin-1 are inserted in the range 0x82 through 0x95--this having the merit of being incompatible with both Latin-1 and Unicode, which reserve this region for additional control characters.
=============

I'd like to suggest the addition of a str_demoronise() function which fixes these wrong characters, and replaces them by the correct ASCII.




Reproduce code:
---------------
From the source of demoroniser, here are the substitutions made. The MS column is what Microsoft use (in Hex); the FIX column is the replacement:

MS      FIX

0x82    ,
0x83    <em>f</em>
0x84    ,,
0x85    ...
0x88    ^
0x89    ' °/°°'            <-- leading whitespace; the '' quotes are not part of the replacement
0x8B    <
0x8C    Oe
0x91    `
0x92    '
0x93    "
0x94    "
0x95    *
0x96    -
0x97    --
0x98    <sup>~</sup>
0x99    <sup>TM</sup>
0x9B    >
0x9C    oe
 [2014-07-17 18:41 UTC] requinix@php.net
"No report body text" meaning we can't see the original report either.

Still, wontfix. It was an issue with character encoding and is solved by picking and using one encoding for the website and database. When properly instructed, the browser will observe your encoding request and transparently recode the input.
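
As a rough illustration of "picking one encoding", a page might declare UTF-8 as sketched below. The form markup and the field name "body" are made up for the example, and a real site would also need to configure its database connection to match.

```php
<?php
// Sketch: declare UTF-8 once, before any output is sent.
ini_set('default_charset', 'UTF-8');
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// A form can also state the encoding it expects submissions in.
// The field name "body" is illustrative.
$form = '<form method="post" accept-charset="UTF-8">'
      . '<textarea name="body"></textarea>'
      . '</form>';

echo $form, "\n";
```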

To the problem itself, what is needed is not some new built-in "demoronizing" function but something that understands the real underlying issue: the fact that Windows' "extension" to Latin1 is its own encoding named Windows-1252/CP1252. mbstring and iconv handle that.

The userland solution is effectively a one-liner:

  function str_demoronize($string) {
    return mb_convert_encoding($string, "utf-8", "cp1252");
    // return iconv("cp1252", "utf-8", $string);
  }
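
A usage sketch for the one-liner, assuming the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escapes). The wrapper name to_utf8_from_cp1252() is hypothetical; it skips input that is already valid UTF-8 so the conversion is not applied twice.

```php
<?php
// Hypothetical wrapper: converts only input that is not already valid UTF-8.
function to_utf8_from_cp1252(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8 (this includes pure ASCII)
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

$fixed = to_utf8_from_cp1252("\x93quoted\x94");
var_dump($fixed === "\u{201C}quoted\u{201D}"); // bool(true)
```

The validity check is only a heuristic: some CP1252 byte sequences happen to be valid UTF-8 as well, so input whose encoding is known should be converted unconditionally.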
 