Bug #81349 mb_detect_encoding misdetects ASCII in some cases
Submitted: 2021-08-11 09:24 UTC Modified: 2021-08-11 09:31 UTC
From: phofstetter at sensational dot ch Assigned: nikic (profile)
Status: Closed Package: mbstring related
PHP Version: 8.1.0beta2 OS: macOS
Private report: No CVE-ID: None
 [2021-08-11 09:24 UTC] phofstetter at sensational dot ch
Description:
------------
mb_detect_encoding() seems to have changed quite a bit between PHP 8.0 and 8.1 and mostly for the better, TBH.

However, here's a test case where it misdetects a string as ASCII when it absolutely cannot be, because byte 0 is outside the 7-bit range of ASCII.

If byte 1 is any non-letter character other than a space, the misdetection happens. If it is a letter or a space, detection is fine.

While the function had its quirks in PHP <= 8.0, it would never detect a string containing a byte with its high bit set as ASCII.

Test script:
---------------
<?php

echo(mb_detect_encoding("\xe4,a", ['ASCII', 'UTF-8', 'ISO-8859-1'])."\n");
echo(mb_detect_encoding("\xe4 a", ['ASCII', 'UTF-8', 'ISO-8859-1'])."\n");


Expected result:
----------------
ISO-8859-1
ISO-8859-1

Actual result:
--------------
ASCII
ISO-8859-1
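
Cross-check (illustrative):
---------------
The claim that the string cannot be ASCII can be verified without mb_detect_encoding() at all. The sketch below assumes a byte-wise PCRE match (no /u modifier); is_seven_bit_ascii() is a hypothetical helper name used only for this illustration and is not part of mbstring.

```php
<?php

// Hypothetical helper, not an mbstring function:
// returns true only if every byte of $bytes is below 0x80.
function is_seven_bit_ascii(string $bytes): bool
{
    // Without the /u modifier, PCRE matches byte by byte,
    // so this class matches any byte with its high bit set.
    return preg_match('/[\x80-\xFF]/', $bytes) === 0;
}

// The two strings from the test script, plus a plain ASCII control.
foreach (["\xe4,a", "\xe4 a", "a,b"] as $sample) {
    echo bin2hex($sample), ' => ',
        is_seven_bit_ascii($sample) ? 'fits in 7-bit ASCII' : 'not ASCII', "\n";
}
```

Both strings from the test script contain byte 0xE4, so a check like this rejects them as ASCII regardless of what byte 1 is.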

History

 [2021-08-11 09:31 UTC] nikic@php.net
-Assigned To: +Assigned To: nikic
 [2021-08-11 09:37 UTC] git@php.net
Automatic comment on behalf of nikic
Revision: https://github.com/php/php-src/commit/28500fe4ef1218e04e830ae94d889c4b6c67940d
Log: Fixed bug #81349
 [2021-08-11 09:37 UTC] git@php.net
-Status: Assigned +Status: Closed
 
Last updated: Sun Nov 28 07:03:13 2021 UTC