Bug #64948 FILTER_VALIDATE_URL does not see urls with underscores as valid URLs.
Submitted: 2013-05-30 14:59 UTC Modified: 2018-10-17 03:17 UTC
Votes:33
Avg. Score:4.2 ± 0.9
Reproduced:33 of 33 (100.0%)
Same Version:9 (27.3%)
Same OS:14 (42.4%)
From: neclimdul at gmail dot com Assigned:
Status: Open Package: Filter related
PHP Version: 7.2 OS: Ubuntu
Private report: No CVE-ID: None

 [2013-05-30 14:59 UTC] neclimdul at gmail dot com
Description:
------------
FILTER_VALIDATE_URL does not see urls with underscores as valid URLs. Underscores 
are however valid and common in urls in the wild.

Test script:
---------------
<?php
$url = 'http://foo_bar.example.com';
var_dump(filter_var($url, FILTER_VALIDATE_URL));


Expected result:
----------------
string(26) "http://foo_bar.example.com"

Actual result:
--------------
bool(false)

 [2013-05-30 17:46 UTC] aharvey@php.net
This is a tricky one. I think the current behaviour is technically correct here, 
but I don't particularly want to summarily close this either.

Technically speaking, host names can't include underscores. 
http://stackoverflow.com/a/2183140 links to the various RFCs that define this — 
domain names (such as those used in SRV records) can contain underscores, but 
host names have a more restrictive character set.

That said, RFC 3986 (which is presumably what a URL validation routine is 
ultimately beholden to) is specified more loosely to cover non-DNS name 
registries. Hosts are reg-name elements there, which allows percent encoded 
characters, hyphens, dots, underscores, tildes, and a range of characters 
defined as sub-delims.
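
To make the RFC 3986 point concrete, here is a minimal sketch of a check against the `reg-name` production (`*( unreserved / pct-encoded / sub-delims )`). The function name `is_rfc3986_reg_name` is hypothetical and not part of PHP; it only illustrates that the generic URI grammar accepts an underscore in the host, and says nothing about what FILTER_VALIDATE_URL does internally.

```php
<?php
/**
 * Hypothetical helper: does $host match the RFC 3986 "reg-name" production?
 *
 * unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
 * pct-encoded = "%" HEXDIG HEXDIG
 * sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
 */
function is_rfc3986_reg_name(string $host): bool
{
    // The grammar uses "*", so an empty reg-name also matches.
    $pattern = '/\A(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2}|[!$&\'()*+,;=])*\z/';

    return preg_match($pattern, $host) === 1;
}

var_dump(is_rfc3986_reg_name('foo_bar.example.com')); // underscore is "unreserved"
var_dump(is_rfc3986_reg_name('foo bar.example.com')); // space is not allowed
```

Matching this grammar is a much weaker condition than being a valid DNS hostname, which is the distinction the rest of this comment turns on.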

Given that underscores do have implementation issues in the wild (IE's cookie 
issues, for instance), my inclination is to leave this, as I said at the start, 
but I'd like a second opinion.

tl;dr: RFC lawyer material; probably Won't Fix; need a second opinion.
 [2015-08-21 17:14 UTC] rich at social5 dot com
The very Stack Overflow link aharvey@php.net referenced seems to suggest that FILTER_VALIDATE_URL *should* allow underscores, does it not?

Regardless, ultimately, FILTER_VALIDATE_URL is used "in the wild" to verify whether or not accessing a URL will load a web resource.  There certainly *are* many sites that use underscores in both the main domain as well as the subdomain.  Regardless of the "lawyer material" (which one could argue is in favor of "allowing" underscores, anyway), doesn't it make sense to have FILTER_VALIDATE_URL reflect URLs that are actually used?

This means that FILTER_VALIDATE_URL is useless for us since it causes our system to reject customers that happen to have an underscore in their URL.  Unless this bug is fixed, we'll have to implement a separate solution that will allow the underscore-laden URLs.

This doesn't sound like the best solution to me.  Wouldn't you agree?
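
A "separate solution" of the kind mentioned above might look like the following userland sketch. The function `validate_url_allowing_underscores` is a hypothetical name, not a PHP API. It assumes that underscores in the host are the only thing you want to relax: it swaps each host underscore for a letter in a throwaway probe URL and lets FILTER_VALIDATE_URL judge the rest.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): validate a URL with
 * FILTER_VALIDATE_URL while tolerating underscores in the host.
 *
 * @return string|false The original URL if accepted, false otherwise.
 */
function validate_url_allowing_underscores(string $url)
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts['host'])
        || strpos($parts['host'], '_') === false) {
        // No host underscore to work around: defer to the filter.
        return filter_var($url, FILTER_VALIDATE_URL);
    }

    $host = $parts['host'];

    // Start searching after any "user[:pass]@" so userinfo is not rewritten.
    $offset = 0;
    if (isset($parts['user'])) {
        $at = strpos($url, '@');
        $offset = $at === false ? 0 : $at + 1;
    }
    $pos = strpos($url, $host, $offset);
    if ($pos === false) {
        return false;
    }

    // Build a probe URL whose host has each "_" swapped for a letter,
    // then let the filter judge everything else about the URL.
    $probe = substr_replace($url, str_replace('_', 'x', $host), $pos, strlen($host));

    return filter_var($probe, FILTER_VALIDATE_URL) === false ? false : $url;
}

var_dump(validate_url_allowing_underscores('http://foo_bar.example.com'));
```

If the host cannot be located unambiguously, the probe still contains an underscore and the helper fails closed. Note that it also accepts hosts that begin or end with an underscore, which may be looser than you want.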
 [2018-10-15 13:47 UTC] luke at reviews dot co dot uk
Definitely won't fix? parse_url() handles underscores quite nicely.
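
The discrepancy can be seen side by side with a short script. This is only an illustration using the URL from the original report; the `filter_var()` result noted in the comment is the one stated in this report, not something guaranteed for every PHP version.

```php
<?php
$url = 'http://foo_bar.example.com';

// parse_url() only splits the string into components; it does not
// validate the host, so the underscore passes through untouched.
$host = parse_url($url, PHP_URL_HOST);
var_dump($host);

// FILTER_VALIDATE_URL additionally validates the host; per this report
// it yields bool(false) for the same input.
var_dump(filter_var($url, FILTER_VALIDATE_URL));
```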
 [2018-10-15 14:30 UTC] spam2 at rhsoft dot net
you MUST NOT use underscores in your DNS names - it's that simple
 [2018-10-15 15:13 UTC] neclimdul at gmail dot com
Woooh blast from the past.

> Definitely won't fix? parse_url() handles underscores quite nicely.

It's been so long, but it feels like that discrepancy (that parse_url is fine with underscores and filter_var isn't) was connected to me reporting this. Like, some tool used directory names to automate subdomains, and parse_url, browsers, and everything else had been happy; then a stray filter_var with VALIDATE_URL killed it all. So... not really nicely.

> you MUST NOT use underscores in your DNS names - it's that simple

I don't think it is that simple. As aharvey noted in his initial response, this is spread out over lots of RFCs and muddy implementations, so I think we need to be 100% clear and document why this works the way it does.

First I want to make the DNS argument 100% clear. DNS hostnames disallow underscores, but DNS itself allows (and in some cases requires) names with underscores. These links make it clearer than I could here: http://domainkeys.sourceforge.net/underscore.html http://ietf.org/rfc/rfc2782.txt

If the argument is that URL authorities should be A & AAAA record hostnames, you probably have a point, and my RFC lawyering is not up to arguing against it, but DNS would not be the reason; URLs would.

Second, the point from the initial bug report is that these URLs can happen in reality. The fact is that support for them is more in the "it works" category than the "it doesn't", with this implementation mostly just falling in with IE.

While I appreciate filter_var's adherence to RFCs, this is a case where the real world and the RFCs have a disconnect, and at the least it should be clearly documented for people stumbling into this because, as Luke pointed out, it's not even consistent within this language.
 [2018-10-15 15:22 UTC] spam2 at rhsoft dot net
it is that simple - there are RFCs covering that the underscore is not allowed and there are clients which behave completely weird if you insist on using them

if i could only have the lifetime a few people wasted of mine because they never remember things longer than a few months, leading to sitting again with a local development URL containing an underscore and wasting hours of debugging why things don't work reliably in MSIE and so on

just don't use underscores - it's that simple
 [2018-10-15 16:18 UTC] neclimdul at gmail dot com
"if I could only have the lifetime a few people" First, I feel your pain. I'm remembering more and more exactly why I got here and the wasted days of annoyance tracking down why all of a site was working great when I used a value but the form failed to validate, and then refactoring an entire automation to support stripping underscores. Sure, it would have been better if things had immediately not worked when I started, but I wasn't thinking "oh, Linux directories don't comply with the server section of RFC 2396 and I better strip those underscores". I was getting things done like most developers and just thought "paths are paths, let's just pass these around and yeah that's working, I've got bigger problems."

We make these things clear so our future selves and other developers aren't losing that time and live slightly happier lives, and I'm just trying to make someone's life easier because filter_var is kinda out on its own even if it is "correct" and for good reasons.

After your response I wanted to make sure I was 100% clear on what FILTER_VALIDATE_URL claimed, so I read through the documentation again and the related RFCs to be sure there wasn't a nuance that I missed. PHP's documentation is brief and filled with exceptions:
http://php.net/manual/en/filter.filters.validate.php
"Validates value as URL (according to » http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."

RFC2396, exclude the protocol, and only ASCII. Kinda weird but sure.

In that RFC there are 2 sections describing the authority component in question. The second (Server-based Naming Authority) requires the hostname to conform to the DNS specification. Being strict to the RFC and ignoring the implementations that allow it, we should not allow underscores.

The first section however is "Registry-based Naming Authority" which is _clearly_ not supported by FILTER_VALIDATE_URL.

To use the RFC's definitions, the authority component of the RFC is described as:

```
      authority     = server | reg_name

      reg_name      = 1*( unreserved | escaped | "$" | "," |
                          ";" | ":" | "@" | "&" | "=" | "+" )

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
```
Right there in mark, underscore is explicitly supported, along with (I assume without looking) some other characters that would probably blow up.

So, there's a bug. filter_var doesn't support protocols, non-ASCII, _or_ Registry-based naming authorities. Again, maybe it's just more documentation, but something _is_ wrong. Additionally, discrepancies with most URL parsing implementations and other parts of PHP's URL parsing would be really great to document as well, because they could lead to real-life software bugs.
 [2018-10-15 16:25 UTC] spam2 at rhsoft dot net
you can have underscores in the URL but not in the hostname/domain part, that's it
 [2018-10-15 16:31 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2018-10-15 16:31 UTC] requinix@php.net
@neclimdul: You're looking at RFC 2396 which was obsoleted by RFC 3986. If you look there you'll see the grammar still allows it, however it comes with a caveat:
> A registered name intended for lookup in the DNS uses the syntax defined in
> Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].

As has been said, the situation for underscores is complicated and somewhat contradictory, both for URIs and for DNS. Some RFCs allow them and others do not. I'd give a primer but we're talking about more than just a couple RFCs involved in this. Basically, URIs allow underscores in hostnames by way of the grammar but with the caveat I mentioned above, DNS as a whole does not allow it for labels according to that grammar, yet some DNS record types do allow them (notably SRV and TXT).

But remember here we're specifically talking about URLs.

Can anyone link a functioning *website* that uses an underscore in the domain name? I'm thinking we tie the status of this request to that: if there is one and it works in browsers then we allow underscores, otherwise not.
 [2018-10-16 09:29 UTC] luke at reviews dot co dot uk
As requested, the below URL is what caused me to stumble across this "bug"

https://electrictobacconist_co_uk.secure-cdn.visualsoft.co.uk/images/gamucci-refills-p86-15678_image.jpg
 [2018-10-16 09:34 UTC] spam2 at rhsoft dot net
so blame that fool, even wikipedia knows
https://en.wikipedia.org/wiki/Hostname

The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner), the digits '0' through '9', and the minus sign ('-'). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a minus sign, and must not end with a minus sign. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or white space are permitted.

While a hostname may not contain other characters, such as the underscore character (_), other DNS names may contain the underscore.[4][5] Systems such as DomainKeys and service records use the underscore as a means to assure that their special character is not confused with hostnames
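
The hostname rule quoted above (letters, digits and hyphens only, with no leading or trailing hyphen in a label) can be sketched as a small check. The function name `is_ldh_hostname` is hypothetical. The 63-character label limit and 253-character overall limit come from the DNS specifications rather than from the quoted text, and a trailing root dot is deliberately not handled here.

```php
<?php
/**
 * Hypothetical helper: check a name against the "letters, digits, hyphen"
 * hostname rule, allowing a leading digit as RFC 1123 does.
 */
function is_ldh_hostname(string $host): bool
{
    if ($host === '' || strlen($host) > 253) {
        return false;
    }

    foreach (explode('.', $host) as $label) {
        // 1 to 63 characters, alphanumeric at both ends, hyphens inside only.
        if (preg_match('/\A[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\z/', $label) !== 1) {
            return false;
        }
    }

    return true;
}

var_dump(is_ldh_hostname('foo-bar.example.com'));   // hyphen is permitted
var_dump(is_ldh_hostname('foo_bar.example.com'));   // underscore is not
var_dump(is_ldh_hostname('_sip._tcp.example.com')); // valid DNS name, not a hostname
```

The last call shows the distinction the quote draws: SRV-style names are legitimate DNS names but fail the stricter hostname rule.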
 [2018-10-16 16:02 UTC] requinix@php.net
-Status: Feedback +Status: Open -PHP Version: 5.4.15 +PHP Version: 7.2
 [2018-10-16 16:02 UTC] requinix@php.net
That URL clearly shows that underscores are supported. At least in my browser. Whether RFCs technically allow for it or not is beside the point now.

I think FILTER_VALIDATE_URL should allow underscores.
 [2018-10-16 16:05 UTC] spam2 at rhsoft dot net
yeah they are somehow supported

then get MSIE and try to verify that a web-app you developed with chrome/firefox works on it too, and wonder about all sorts of randomly lost cookies and session errors

my co-worker repeatedly wasted hours by adding a hostname with an underscore to /etc/hosts on his development machine and then searching for issues which would not have happened by following RFCs

just because some fools ignore RFCs doesn't make it a good idea to follow them
 [2018-10-17 03:17 UTC] yohgaki@php.net
Hostname shouldn't have "_".

However, NETBIOS allows "_" in hostname. DomainKeys/etc use "_" to distinguish service name and hostname.

This isn't a bug, but a feature request for an additional option like FILTER_VALIDATE_URL_WITH_UNDERSCORE or FILTER_VALIDATE_SERVICE_URL.
 [2018-10-17 14:33 UTC] rh at tfli dot co dot uk
FWIW, the URL https://electrictobacconist_co_uk.secure-cdn.visualsoft.co.uk/images/gamucci-refills-p86-15678_image.jpg loads on both MSIE 11 and MS Edge (both on Win10).

I haven't gone as far as checking how well it works (e.g. cookies, etc.), but it does load the resource, so I would be inclined to agree with requinix.
 