[2012-02-08 18:43 UTC] dey101+php at gmail dot com
Description:
------------
PHP VC9 x86 Thread Safe (from http://windows.php.net/download/)

I am using a regex to validate whether a string is a valid hostname (host or FQDN). For strings of certain lengths, trying to match a literal period at the end causes preg_match to return false if the string does not contain a period. It also returns false if the string has a period at the end and the regex does not try to match it.

The regex uses subpatterns () to apply the zero-or-more repetition quantifier *. I tried both capturing and non-capturing (?:) subpatterns; both yield the same result. However, if I use the one-or-more quantifier +, it does not return bool(false). Using {0,} instead of * does not change the outcome.

The cutoff length for the string seems to be about 20 characters. Below that, the result is int(0) or int(1) depending on whether the regex matches; above that, bool(false) is returned.

If the subpattern is part of a longer string, it does work as anticipated. Matching a literal period at the beginning of the pattern does not yield an error. Substituting a-zA-Z0-9 for the [:alnum:] character class does not affect the results.

error_get_last() does not return anything, and nothing shows up in the logs with error_reporting(-1) set either.
Test script:
---------------
<?php
$regexs = array (
    '/^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/',
    '/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/',
    '/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/',
    '/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/'
);

$hosts = array (
    'ABCDEFGHIJ1234567890.',          // long string with period at end
    'ABCDEFGHI234567890.',            // slightly shorter string with period at end
    'ABCDEFGHIJ1234567890',           // long string no period
    'ABCDEFGHI1234567890',            // a little shorter
    'ABCDEFGHI123456789',             // even shorter
    'ABCDEFGHIJ-1234567890',          // long with hyphen
    'ABCDEFGHIJ-123456789',           // shorter with hyphen
    'ABCDEFGHI-123456789',            // even shorter with hyphen
    'WWW.ABCDEFGHIJ-1234567890.COM',  // a FQDN with long string and hyphen
    'WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM' // a really long FQDN
);

foreach ($regexs as $regex) {
    echo "\nRegex: $regex\n";
    foreach ($hosts as $host) {
        echo " Host: $host\n";
        $result = preg_match($regex, $host);
        echo ' Result: ';
        if ($result === false) {
            echo '(error) ';
            print_r(error_get_last()); // never prints anything?
        } else {
            echo ($result) ? '(match) ' : '(no match) ';
        }
        var_dump($result);
    }
}

Expected result:
----------------
none of the results should yield bool(false)

Actual result:
--------------
// just the output from the last regex, but others yield bool(false)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHI234567890.
 Result: (error) bool(false)
 Host: .ABCDEFGHIJ1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHIJ1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI123456789
 Result: (match) int(1)
 Host: ABCDEFGHIJ-1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHIJ-123456789
 Result: (error) bool(false)
 Host: ABCDEFGHI-123456789
 Result: (match) int(1)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (match) int(1)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (match) int(1)
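
The test script checks error_get_last(), which prints nothing. PCRE-level failures are reported through a separate channel, preg_last_error(). The following is a minimal sketch of how the "(error)" branch could report the PCRE error code instead. describe_preg_result() is a hypothetical helper that is not part of the original script, and the sketch assumes PHP 7.0 or later. Which error (if any) a given pattern triggers depends on the PHP/PCRE version and on settings such as pcre.backtrack_limit and pcre.recursion_limit, so no particular output is claimed for the pattern from this report.

```php
<?php
// Sketch only: describe_preg_result() is a hypothetical helper, not part of
// the original test script. Assumes PHP 7.0+ (scalar type hints, the ??
// operator, PREG_JIT_STACKLIMIT_ERROR).

function describe_preg_result(string $regex, string $subject): string
{
    $result = preg_match($regex, $subject);
    if ($result !== false) {
        return $result === 1 ? 'match' : 'no match';
    }

    // preg_match() returned false: ask the PCRE extension for the reason.
    $names = [
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    $code = preg_last_error();

    // Fall back to the numeric code if the constant is not in the map.
    return 'error: ' . ($names[$code] ?? "code $code");
}

// Usage with the last regex and the first host from the report.
$regex = '/^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/';
echo describe_preg_result($regex, 'ABCDEFGHIJ1234567890.'), "\n";
```

On a build where the call fails, the printed constant name should indicate which PCRE limit was reached, which is information error_get_last() does not provide here.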
Thank you for your report and helping to make php better.

When I ran your script on Windows 2008 and Linux (using TS build of php5.3.10), it looks like the output is the same on both OSes. I don't think this is a PHP on Windows bug.

If you would like, I can reclassify this bug as a general bug, not specific to Windows. Or, am I missing something? Is this really a PHP on Windows problem?

win2008 sp1 x64 output (TS Build):

Regex: /^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHI234567890.
 Result: (no match) int(0)
 Host: ABCDEFGHIJ1234567890
 Result: (match) int(1)
 Host: ABCDEFGHI1234567890
 Result: (match) int(1)
 Host: ABCDEFGHI123456789
 Result: (match) int(1)
 Host: ABCDEFGHIJ-1234567890
 Result: (match) int(1)
 Host: ABCDEFGHIJ-123456789
 Result: (match) int(1)
 Host: ABCDEFGHI-123456789
 Result: (match) int(1)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (no match) int(0)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (no match) int(0)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (match) int(1)
 Host: ABCDEFGHI234567890.
 Result: (match) int(1)
 Host: ABCDEFGHIJ1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI123456789
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHIJ-123456789
 Result: (error) bool(false)
 Host: ABCDEFGHI-123456789
 Result: (no match) int(0)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (error) bool(false)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (error) bool(false)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/
 Host: ABCDEFGHIJ1234567890.
 Result: (match) int(1)
 Host: ABCDEFGHI234567890.
 Result: (match) int(1)
 Host: ABCDEFGHIJ1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHI1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHI123456789
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-123456789
 Result: (no match) int(0)
 Host: ABCDEFGHI-123456789
 Result: (no match) int(0)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (error) bool(false)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (error) bool(false)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHI234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHIJ1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI123456789
 Result: (match) int(1)
 Host: ABCDEFGHIJ-1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHIJ-123456789
 Result: (error) bool(false)
 Host: ABCDEFGHI-123456789
 Result: (match) int(1)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (match) int(1)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (match) int(1)

Linux-x64-gentoo output:

Regex: /^[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHI234567890.
 Result: (no match) int(0)
 Host: ABCDEFGHIJ1234567890
 Result: (match) int(1)
 Host: ABCDEFGHI1234567890
 Result: (match) int(1)
 Host: ABCDEFGHI123456789
 Result: (match) int(1)
 Host: ABCDEFGHIJ-1234567890
 Result: (match) int(1)
 Host: ABCDEFGHIJ-123456789
 Result: (match) int(1)
 Host: ABCDEFGHI-123456789
 Result: (match) int(1)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (no match) int(0)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (no match) int(0)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (match) int(1)
 Host: ABCDEFGHI234567890.
 Result: (match) int(1)
 Host: ABCDEFGHIJ1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI123456789
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHIJ-123456789
 Result: (error) bool(false)
 Host: ABCDEFGHI-123456789
 Result: (no match) int(0)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (error) bool(false)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (error) bool(false)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)+$/
 Host: ABCDEFGHIJ1234567890.
 Result: (match) int(1)
 Host: ABCDEFGHI234567890.
 Result: (match) int(1)
 Host: ABCDEFGHIJ1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHI1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHI123456789
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-1234567890
 Result: (no match) int(0)
 Host: ABCDEFGHIJ-123456789
 Result: (no match) int(0)
 Host: ABCDEFGHI-123456789
 Result: (no match) int(0)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (error) bool(false)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (error) bool(false)

Regex: /^(?:[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*\.)*[[:alnum:]](?:[[:alnum:]\-]*[[:alnum:]])*$/
 Host: ABCDEFGHIJ1234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHI234567890.
 Result: (error) bool(false)
 Host: ABCDEFGHIJ1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHI123456789
 Result: (match) int(1)
 Host: ABCDEFGHIJ-1234567890
 Result: (error) bool(false)
 Host: ABCDEFGHIJ-123456789
 Result: (error) bool(false)
 Host: ABCDEFGHI-123456789
 Result: (match) int(1)
 Host: WWW.ABCDEFGHIJ-1234567890.COM
 Result: (match) int(1)
 Host: WWW.SUB-SUBDOMAIN.SUBDOMAIN.ABCD-EFGH-IJKL-MNOP-QRST-UVWX-YZ-12345-67890-abcd-efgh-hijk.COM
 Result: (match) int(1)

I have simplified the error to the following:

<?php
$string = 'ABCDEFGHIJ12345678.';
var_dump(preg_match('/^(?:\w*)*$/i', $string));
$string = 'ABCDEFGHIJ1234567.';
var_dump(preg_match('/^(?:\w*)*$/i', $string));
?>

Outputs:
boolean false
int 0

Saying /(\w*)*/ is VERY inefficient as it must try every combination before failing, i.e. matching:

'ABCDEFGHIJ12345678', ''
'ABCDEFGHIJ1234567', '8', ''
'ABCDEFGHIJ1234567', '', '8', ''
'ABCDEFGHIJ123456', '78', ''
'ABCDEFGHIJ123456', '7', '8', ''
'ABCDEFGHIJ123456', '7', '', '8', ''
'ABCDEFGHIJ123456', '', '78', ''
...
'', 'A', '', 'B', '', 'C', '', 'D', '', 'E', '', 'F', '', 'G', '', 'H', '', 'I', '', 'J', '', '1', '', '2', '', '3', '', '4', '', '5', '', '6', '', '7', '', '8', ''

It is most likely running out of memory before it completes. I would suggest that this is not a bug as it will use exponentially more memory the longer the input string gets.

You should try something like '/^(?:(?>\w*))*$/i' instead to avoid undesired backtracking.
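
To illustrate the suggestion above, here is a minimal sketch, assuming PHP 7.0 or later. It first runs the atomic-group pattern on the simplified input, then shows one possible way to restructure the hostname regex from the report so that no quantified group sits inside another quantified group. LABEL and is_hostname() are hypothetical names introduced only for this example, and replacing the inner * with ? is a substitute technique, not the commenter's own proposal.

```php
<?php
// Sketch only: LABEL and is_hostname() are hypothetical names introduced for
// illustration; they do not appear in the report. Assumes PHP 7.0+.

// One label: an alphanumeric, optionally followed by a run of alphanumerics
// or hyphens that ends in an alphanumeric. The inner group is optional (?)
// instead of repeated (*), which removes the nested quantifier.
const LABEL = '[[:alnum:]](?:[[:alnum:]-]*[[:alnum:]])?';

function is_hostname(string $host): bool
{
    // Zero or more "label." prefixes, then a final label. The D modifier
    // makes $ match only at the very end of the subject.
    $pattern = '/^(?:' . LABEL . '\.)*' . LABEL . '$/D';
    return preg_match($pattern, $host) === 1;
}

// The atomic-group pattern suggested above, on the simplified input.
var_dump(preg_match('/^(?:(?>\w*))*$/i', 'ABCDEFGHIJ12345678.'));

// Two hosts from the report's test data.
var_dump(is_hostname('WWW.ABCDEFGHIJ-1234567890.COM'));
var_dump(is_hostname('ABCDEFGHIJ1234567890.'));
```

Like the fourth regex in the report, this sketch rejects a trailing period. If a fully qualified name with a trailing dot should be accepted, the pattern would need an optional \. before the end anchor.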