Bug #77937 preg_match failed
Submitted: 2019-04-24 20:33 UTC Modified: 2019-05-07 21:26 UTC
From: v-altruo at microsoft dot com Assigned: cmb (profile)
Status: Closed Package: *General Issues
PHP Version: 7.3.5RC1 OS: Windows 10
Private report: No CVE-ID: None
 [2019-04-24 20:33 UTC] v-altruo at microsoft dot com
Description:
------------
The test fails regardless of whether OPcache is enabled or disabled, and on both TS and NTS builds.
Test file location: ext\pcre\tests\locales.phpt

Test script:
---------------
setlocale(LC_ALL, 'pt_PT', 'pt', 'pt_PT.ISO8859-1', 'portuguese');
var_dump(preg_match('/^\w{6}$/', 'aאבחיט'));


Expected result:
----------------
int(1)

Actual result:
--------------
int(0)

 [2019-04-24 20:45 UTC] requinix@php.net
-Status: Open +Status: Not a bug
 [2019-04-24 20:45 UTC] requinix@php.net
Last I knew Portuguese does not cover Hebrew characters.
 [2019-04-24 22:20 UTC] a at b dot c dot de
Adding the /u modifier to the pattern would help (assuming the source is encoded in UTF-8 - you can't even SAY "aאבחיט" in ISO 8859-1).
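
To illustrate the /u suggestion, here is a minimal sketch. It assumes the string from the cited test ("aàáçéè") is encoded as UTF-8 and that PHP's PCRE build has Unicode property support, which the bundled library normally does. The variable names are only for illustration.

```php
<?php
// "aàáçéè" written with Unicode escapes, so the bytes are UTF-8 no matter
// how this file itself is saved (the \u{...} syntax needs PHP 7.0+).
$subject = "a\u{E0}\u{E1}\u{E7}\u{E9}\u{E8}";

// Without /u the subject is matched byte by byte. It is 11 bytes long,
// so "exactly six word characters" cannot match.
$byteMode = preg_match('/^\w{6}$/', $subject);

// With /u the subject is read as six UTF-8 code points, and \w is decided
// by Unicode properties rather than by the locale's character tables.
$utf8Mode = preg_match('/^\w{6}$/u', $subject);

var_dump($byteMode, $utf8Mode);
```

This sidesteps setlocale() entirely, but only applies to UTF-8 input; the original test deliberately uses an ISO 8859-1 string to exercise the locale tables.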
 [2019-04-24 22:24 UTC] a at b dot c dot de
Incidentally, the test cited in the original report uses the string "aàáçéè", not Hebrew characters.
 [2019-04-25 09:14 UTC] cmb@php.net
-Status: Not a bug +Status: Re-Opened -Assigned To: +Assigned To: cmb
 [2019-04-25 09:14 UTC] cmb@php.net
Thanks for reporting!  I can reproduce the *test* *failure*. The
problem is that setlocale()[1] claims to support "pt_PT", but
it actually does not.  The locale names that really work would be
"pt-PT" and "portuguese".

I'm not sure yet what to do about this.  Simply fixing the test
case for Windows would be an option, but that would not fix the
underlying issue which may affect existing userland code.

[1] <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=vs-2019>
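
As a userland stopgap, one could pick the candidate names by platform. The following is only a sketch: portuguese_locale_candidates() is a hypothetical helper, the names are the ones mentioned in this report, and whether a given system accepts them still depends on the C runtime and the installed locales.

```php
<?php
// Hypothetical helper: returns locale names to try, in order of preference.
// The Windows names are the ones reported as really working in this thread;
// the others are the POSIX-style names from the original test.
function portuguese_locale_candidates(string $osFamily = PHP_OS_FAMILY): array
{
    if ($osFamily === 'Windows') {
        return ['pt-PT', 'portuguese'];
    }
    return ['pt_PT.ISO8859-1', 'pt_PT', 'pt'];
}

// setlocale() accepts an array and uses the first name the system accepts;
// it returns false if none of them is accepted.
$locale = setlocale(LC_ALL, portuguese_locale_candidates());
var_dump($locale);
```

PHP_OS_FAMILY requires PHP 7.2 or later. This does not address the underlying issue, since existing code that passes "pt_PT" on Windows is unaffected.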
 [2019-04-25 10:03 UTC] requinix@php.net
Hmm, yes, it seems Windows will quite happily accept any "language" or "language_country" string regardless of whether either part exists, as long as the language code is 2 or 3 characters.

var_dump(setlocale(LC_ALL, "xjq_ASDF")); // returns xjq_ASDF
var_dump(setlocale(LC_ALL, "0")); // still xjq_ASDF

FFS.

So for maximum portability it seems you have to list Windows-specific strings before the normal strings. Or at least list the names Windows handles correctly before any short codes it merely accepts.

setlocale(LC_ALL,
  "Portuguese_Portugal.28591", // windows okay (28591 is the codepage for ISO 8859-1), linux ignored
  "Portuguese_Portugal",       // windows okay, linux ignored
  "Portuguese",                // windows okay, linux ignored
  "pt_PT.ISO8859-1",           // windows ignored (bad codepage), linux okay
  "pt_PT",                     // windows okay (wrong), linux okay
  "pt"                         // windows okay (wrong), linux okay
);
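
Since the return value of setlocale() cannot be trusted here, another option is to probe the locale after setting it. A rough sketch, assuming the ctype extension is loaded: set_verified_locale() is a hypothetical helper, and the probe byte 0xE0 ("à" in ISO 8859-1) is an assumption that fits a Latin-1 Portuguese locale only.

```php
<?php
// Hypothetical helper: tries each candidate and keeps it only if the
// character classification really changed, i.e. $probe counts as alphabetic.
function set_verified_locale(array $candidates, string $probe): ?string
{
    // '0' asks for the current locale without changing it.
    $previous = setlocale(LC_ALL, '0');

    foreach ($candidates as $candidate) {
        $accepted = setlocale(LC_ALL, $candidate);
        if ($accepted !== false && ctype_alpha($probe)) {
            return $accepted;
        }
    }

    // Nothing passed the probe: restore the locale we started with.
    setlocale(LC_ALL, $previous);
    return null;
}

$locale = set_verified_locale(
    ['Portuguese_Portugal.28591', 'pt_PT.ISO8859-1', 'pt_PT', 'pt'],
    "\xE0"
);
var_dump($locale);
```

With this check the order of the candidates matters less, because a name that is accepted but does not update the ctype tables is skipped instead of silently winning.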
 [2019-04-25 17:03 UTC] cmb@php.net
-Package: PCRE related +Package: *General Issues
 [2019-04-25 17:03 UTC] cmb@php.net
This issue is neither directly PCRE nor testing related.  Consider
the following script:

    <?php
    var_dump(17.4);
    var_dump(setlocale(LC_ALL, 'pt_PT'));
    var_dump(17.5);
    var_dump(ctype_alpha(224));
    ?>

This outputs on my Windows system:

    float(17.4)
    string(5) "pt_PT"
    float(17,5)
    bool(false)

The first three lines suggest that pt_PT is properly supported
(setlocale() succeeds and the decimal separator changes), but the
failing ctype_alpha() shows that it is not really.

The following C program confirms that the issue is not directly
related to PHP:

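    #include <stdio.h>
    #include <ctype.h>
    #include <locale.h>

    int main(void)
    {
        /* numeric formatting info before switching the locale */
        struct lconv *lc1 = localeconv();
        /* returns the accepted locale name, or NULL on failure */
        char *loc = setlocale(LC_ALL, "pt_PT");
        /* numeric formatting info after switching the locale */
        struct lconv *lc2 = localeconv();
        /* 224 (0xE0) is 'a' with grave accent in ISO 8859-1 */
        int alpha = isalpha(224);
        printf("%s %s %s %d\n", lc1->decimal_point, loc, lc2->decimal_point, alpha);
        return 0;
    }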

Outputs on my Windows system (when built with VC15):

    . pt_PT , 0

Again, ctype fails to properly recognize the locale (which is the
reason for the failing test, since PCRE2 calls ctype functions to
build the character tables).

If I build with VC11, I get:

    . (null) . 0

Apparently, AppVeyor behaves either like this, or it properly
recognizes pt_PT for the ctype functions.
 [2019-05-07 21:26 UTC] cmb@php.net
I've made a patch[1] which is supposed to be as backward
compatible as reasonably possible.  To ease testing, respective
binary snapshots[2] are available as well.  I hope to get some
feedback on that before proceeding.

Thanks!

[1] <https://github.com/cmb69/php-src/commit/fa35882831010861aea3c4b2d12dd4d3d0fb64a7>
[2] <https://windows.php.net/downloads/snaps/ostc/77937/>
 [2019-05-16 09:11 UTC] cmb@php.net
The following pull request has been associated:

Patch Name: Fix #77937: preg_match failed
On GitHub:  https://github.com/php/php-src/pull/4169
Patch:      https://github.com/php/php-src/pull/4169.patch
 [2019-06-11 06:46 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f3ff72e54b2f6c2fa1ac924ad95455a5309099d5
Log: Fix #77937: preg_match failed
 [2019-06-11 06:46 UTC] cmb@php.net
-Status: Re-Opened +Status: Closed
 