php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #62032 filter_var incorrectly strips characters from strings after "<"
Submitted: 2012-05-14 17:19 UTC Modified: 2016-12-31 18:20 UTC
Votes:4
Avg. Score:4.8 ± 0.4
Reproduced:3 of 3 (100.0%)
Same Version:2 (66.7%)
Same OS:1 (33.3%)
From: iamcraigcampbell at gmail dot com Assigned:
Status: Verified Package: Filter related
PHP Version: 5.4.3 OS: Mac OS X
Private report: No CVE-ID: None
Welcome back! If you're the original bug submitter, here's where you can edit the bug or add additional notes.
If you forgot your password, you can retrieve your password here.
Password:
Status:
Package:
Bug Type:
Summary:
From: iamcraigcampbell at gmail dot com
New email:
PHP Version: OS:

 

 [2012-05-14 17:19 UTC] iamcraigcampbell at gmail dot com
Description:
------------
Noticed that for strings with < in them outside of html tags, filter_var will 
strip out all characters that come after the <.

Test script:
---------------
<?php
$string = 'i want to say that 5 < 10, but it won\'t let me!';
$filtered_string = filter_var($string, FILTER_SANITIZE_STRING);

var_dump($filtered_string);

$filtered_string_strip_tags = strip_tags($string);

var_dump($filtered_string_strip_tags);

Expected result:
----------------
string(47) "i want to say that 5 < 10, but it won't let me!"
string(47) "i want to say that 5 < 10, but it won't let me!"

Actual result:
--------------
string(21) "i want to say that 5 "
string(47) "i want to say that 5 < 10, but it won't let me!"

Patches

Add-strip_tags-third-optional-parameter (last revision 2012-05-15 14:26 UTC by reeze dot xia at gmail dot com)

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2012-05-15 14:24 UTC] reeze dot xia at gmail dot com
Hi, 
  I think it's a document problem. you could refer this commit: 
http://svn.php.net/viewvc?view=revision&revision=225196

strip_tags() didn't allow space after < so strip_tags didn't trait it as a 
invalid
tag so it didn't get striped.

filter_var allow space after < so,  it striped everything after <.


I think we could add an extra paramater to strip_tags() allow space after <
and document it eg:

string strip_tags(string str [, string allowable_tags = null [, bool 
allow_tag_spaces = false]])
 [2012-05-15 14:26 UTC] iamcraigcampbell at gmail dot com
Well I can understand stripping it if there is a closing > somewhere, but if it is 
a < that is not followed by a matching > then it should be allowed in the string 
and not stripped.  I think strip_tags works as expected.
 [2012-05-15 14:36 UTC] reeze dot xia at gmail dot com
strip_tags will strip it even without the ending '>' if  '<' followed by a
non-space char.

If we need to check whether is a closed tag it is a feature request to change it's 
behavior. it will break BC.
 [2012-05-15 14:42 UTC] reeze dot xia at gmail dot com
PS: the reason why strip_tags() didn't strip it is '<' is followed by a
space char but not without ending '>', this is the key point.

look deep into the source code, there difference is switch whether or 
not to trait '<' followed by a(or more) spaces a tag or not.
 [2012-05-15 14:45 UTC] iamcraigcampbell at gmail dot com
So in that case I think strip_tags and filter_var are both broken.  In this context: 
"It is true that 5<10"
"It is true that 5 < 10"  

Neither of these are html tags so the string should not be touched regardless of if 
there is a space or not.
 [2012-05-15 14:49 UTC] pajoye@php.net
> or < should be encoded then, see 

http://www.php.net/manual/en/filter.filters.sanitize.php

btw, any option should be added using the option array or defaults, as it is the 
case already.
 [2012-05-15 14:51 UTC] aleksey dot v dot korzun at gmail dot com
How is stripping anything after < with a space is a valid operation? That seems 
like a lazy man's html stripper.

Let's just blindly strip everything that can possibly be made into an html tag of 
any sort. Not.
 [2012-05-15 15:06 UTC] iamcraigcampbell at gmail dot com
@pajoye I agree with you, but there is a use case that encoding will not solve.

Let's say there is a forum where users are posting.  If the user posts:

"This is <strong>NOT</strong> good!" and the tags get encoded then that means the 
HTML tags will be displayed in the forum as plain text.  I think it is more expected 
behavior to display this string as "This is NOT good!".

So one option would be encoding the < only if it is not followed by a > but that is a 
lot of extra work to figure that out.


At the end of the day the point is that no matter how you look at it I still think 
this is a bug.

$string = 'This is true: 2<5';
strip_tags($string); and filter_var($string, FILTER_SANITIZE_STRING);

Should not strip out <5 since that is not an HTML tag.
 [2012-05-15 21:37 UTC] anon at anon dot anon
Well I never heard of this "SANITIZE_STRING" filter before, but it looks just as retarded as it sounds, and about as retarded as strip_tags. 99.99% of the time, strip_tags just should not be used at all because it's horribly broken. The real bugs are (1) strip_tags exists, and (2) that PHP should imply that any kind of magical all-purpose "string sanitization" process could exist.

@iamcraigcampbell:
>Well I can understand stripping it if there is a closing > somewhere, but if it is 
a < that is not followed by a matching > then it should be allowed in the string 
and not stripped.
In that case:
(1) Unclosed tags will eat extra page content, breaking page layout.
(2) Pages consist of many echo statements. By your logic, "<script" is a possibly legal string to echo, but if some later string contains a ">", we need to implement a delayed-choice quantum eraser to make all the parallel universes in which the earlier echo statement occurred cease to exist.

>I think it is more expected behavior to display this string as "This is NOT good!".
No. Display what users type. Don't delete text from their posts based on the quirks of what just happens to be the underlying display format on a particular day. Suppose your hypothetical forum also displays posts in another format, e.g., it has a Flash or iPhone-based app, or it tweets posts, or a few years from now we're all using a completely different markup language. Should it then also strip HTML-like tags from all text in perpetuity from all media just because HTML happened to be a relevant format to someone somewhere once upon a time, or should user-submitted text throw integrity to the wind and change depending on what kind of device someone is attempting to use to view it, whether or not that device's markup was invented when the post was made? What if someone is trying to use text that legitimately resembles an HTML tag (it happens), or, more likely, they're trying to quote or talk about HTML -- no filter can handle this. No no no no no. Display what they type and don't confuse the poor souls. I.e., use htmlspecialchars() if outputting to HTML; or if not, use whatever other escaping method is appropriate to the particular output format that still preserves the integrity of the user-typed text in that format, while making exception for the formatting markup that is legitimately supported and documented to be supported by the forum, such as markdown or bbcode syntax (and probably not HTML, since besides the fact that HTML is ugly and over-complicated for most forum post needs, strip_tags with an allowed tags parameter is the most dangerous of the lot and allows blatant abuse via attributes).

And don't get me started on entities.

tl;dr: no amount of wrapping it in flashy filter functions changes the fact that strip_tags confuses countless souls, is brain-damaged, and ought to be deprecated to death.
 [2012-05-15 21:52 UTC] iamcraigcampbell at gmail dot com
@anon I agree with many of your sentiments :)

Just wanted to point out one thing.  The issue of unclosed script tags or other tags 
would not be a problem assuming the output is escaped which it should be.  Therefore 
if you had "<script" in the string it would end up being output as &lt;script and 
would not cause the issues that you mentioned.

As for displaying what the user typed I could see an argument either way on that.  
The fact still remains that this is a bug.
 [2016-12-31 18:20 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2016-12-31 18:20 UTC] cmb@php.net
I agree insofar as FILTER_SANITIZE_STRING and strip_tags() should
work identical, and I would have supposed that both share a common
implementation, what apparently isn't the case.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Wed Dec 04 19:01:32 2024 UTC