Doc Bug #36112 preg_replace example suggests poor patterns, which are harmful if really used
Submitted: 2006-01-20 23:54 UTC Modified: 2006-06-16 13:44 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: pornel at despammed dot com Assigned: colder (profile)
Status: Closed Package: Documentation problem
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2006-01-20 23:54 UTC] pornel at despammed dot com
Description:
------------
The code on http://uk.php.net/preg_replace:

$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags

doesn't work as advertised. For example, it will leave the contents of:
<script>xxx</script       >
and worse, it will output valid script tags if given:
<<>script>evil<<>/script>

If these patterns were used on a website (for example, to strip markup from user comments), they would allow an XSS attack.
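The two failure cases above can be reproduced with a short sketch. It assumes both patterns are replaced with an empty string; the variable names are illustrative only.

```php
<?php
// The two patterns quoted in the report. The empty-string replacements
// are an assumption made for this sketch.
$search = array(
    '@<script[^>]*?>.*?</script>@si', // Strip out javascript
    '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
);
$replace = array('', '');

// Case 1: whitespace before the closing ">" defeats the script pattern,
// so only the tags are removed and the script body is left behind.
$spacedResult = preg_replace($search, $replace, "<script>xxx</script       >");

// Case 2: removing the inner "<>" pairs reassembles valid script tags.
$nestedResult = preg_replace($search, $replace, "<<>script>evil<<>/script>");

var_dump($spacedResult); // string(3) "xxx"
var_dump($nestedResult); // string(21) "<script>evil</script>"
```

The second result appears because preg_replace does not rescan its own output, so the tags created by the removal are never matched.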


Since it's nearly impossible to parse HTML properly with regular expressions, I suggest:
* renaming the example from 'Convert HTML to text' to 'Remove HTML markup'
* adding a replacement of '<' with '&lt;'
* suggesting the use of more robust methods, such as strip_tags, nl2br, htmlspecialchars or the DOM interface.
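As a minimal sketch of the escaping approach suggested above, assuming the goal is to display a user's comment as plain text inside an HTML page, htmlspecialchars can be applied instead of trying to remove tags. The flags and the UTF-8 charset are assumptions for this example.

```php
<?php
// Hypothetical user comment containing the bypass string from this report.
$comment = "<<>script>evil<<>/script>";

// Escape every special character rather than trying to strip tags.
// ENT_QUOTES also escapes single quotes; UTF-8 input is assumed here.
$escaped = htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $escaped, "\n";
// &lt;&lt;&gt;script&gt;evil&lt;&lt;&gt;/script&gt;
```

Because nothing is removed, there is no leftover text that can recombine into a tag; the browser shows the comment literally.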



 [2006-03-12 17:06 UTC] colder@php.net
There are a lot of inconsistencies in this example:

1) About @<script[^>]*?>.*?</script>@si :
   a) the first ? is useless.

2) About @<[\/\!]*?[^<>]*?>@si :
   a) / and ! don't have to be escaped. 
   b) [\/\!]*? is useless, as it's already matched by [^<>]*?. 
   c) the ? of [^<>]*? is useless.
   d) the PCRE_DOTALL modifier is useless, there is no dot.
   e) the PCRE_CASELESS modifier is useless.
   f) what is the point of excluding "<" inside a tag?

3) About @([\r\n])[\s]+@ :
   a) no need to put \s in a char class.
   b) every \r\n will be changed to \r, as \s matches \n.
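Points 2) and 3b) can be checked with a short sketch. The sample inputs, the simplified pattern and the '\1' replacement string are assumptions made for illustration; only the original patterns come from the example under discussion.

```php
<?php
// Point 2: the tag pattern with the redundant parts removed.
$original   = '@<[\/\!]*?[^<>]*?>@si';
$simplified = '@<[^<>]*>@';

// Illustrative inputs (not from the manual).
$samples = array('<b>bold</b>', '<!-- note -->text', '</p >end', '<<>script>');

$allSame = true;
foreach ($samples as $sample) {
    $a = preg_replace($original, '', $sample);
    $b = preg_replace($simplified, '', $sample);
    $allSame = $allSame && ($a === $b);
}
var_dump($allSame); // bool(true)

// Point 3b: "\r" is captured, then \s+ consumes the "\n" that follows,
// so a Windows line ending collapses to a bare carriage return.
$collapsed = preg_replace('@([\r\n])[\s]+@', '\1', "a\r\nb");
var_dump($collapsed === "a\rb"); // bool(true)
```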

I think the whole example has to be reconsidered, because there are already functions that do some of this job, such as strip_tags() and html_entity_decode().
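A rough sketch of how those two functions might be combined for a 'Convert HTML to text' task follows. The input string and the flags are assumptions, and this is not the example that was in the manual.

```php
<?php
// Hypothetical HTML fragment.
$html = '<p>Fish &amp; chips</p>';

// Remove the tags first, then turn entities back into plain characters.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

var_dump($text); // string(12) "Fish & chips"
```

The decoded text may itself contain "<" or "&", so it would still need htmlspecialchars before being placed back into an HTML page.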
 [2006-06-16 13:44 UTC] colder@php.net
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.

I simply removed the example for now.
 [2020-02-07 06:11 UTC] phpdocbot@php.net
Automatic comment on behalf of colder
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=b287313c6b706c02e4b97f36f29d3e4e2e813165
Log: Fix #36112 (Bad example removed)
 