Request #2028 strip_tags state engine inappropriate for single line of html.
Submitted: 1999-08-10 23:57 UTC
Modified: 2000-05-30 19:18 UTC
From: cdi at thewebmasters dot net
Assigned:
Status: Closed
Package: Feature/Change Request
PHP Version: 4.0 Beta 2
OS: RHLinux 5.1 2.0.35
Private report: No
CVE-ID: None
 [1999-08-10 23:57 UTC] cdi at thewebmasters dot net
Demo script:

<?php
Header("Content-type: text/plain");
$data = 'HREF="blah.blah">test</A> inside <A HREF="brackets.com">brackets</A>. What\'s it gonna do?';
$data = strip_tags($data);
echo "$data\n";
?>

Output:

HREF="blah.blah"test inside brackets. What's it gonna do?

Config: ./configure --prefix=/www --with-apache=../apache_1.3.3 --with-mysql --with-imap --with-zlib --with-config-file-path --enable-debug=yes --enable-track-vars=yes --enable-magic-quotes=yes --enable-memory-limit=yes

php.ini not relevant.


When stripping one line at a time, the state engine simply removes any extraneous > signs and leaves the tag fragment in front of them in place, as the output above shows. When I wrote a similar function to handle individual lines of HTML (no multi-line processing), it set a boolean as soon as it saw a < sign. If it saw a > before it had seen any <, the function logic assumed that everything leading up to that > was the tail of a tag and removed it. Worked like a champ.
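
A minimal sketch of that single-line approach, written in C since that is what the state engine is written in. This is not the code of strip_tags() and not the function mentioned above; the name strip_line and its two-buffer signature are hypothetical, and quoted attribute values containing > are deliberately not handled.

```c
#include <string.h>

/* Hypothetical helper, not part of PHP: strips tags from a single line.
 * If a '>' shows up before any '<' has been seen, everything kept so far
 * is assumed to be the tail of a tag opened on an earlier line and is
 * discarded. out must have room for strlen(in) + 1 bytes. */
void strip_line(const char *in, char *out)
{
    int in_tag = 0;    /* currently between '<' and '>' */
    int seen_lt = 0;   /* set once any '<' (or the first stray '>') is seen */
    char *w = out;     /* write position in out */

    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '<') {
            in_tag = 1;
            seen_lt = 1;
        } else if (*p == '>') {
            if (in_tag) {
                in_tag = 0;          /* normal end of a tag */
            } else if (!seen_lt) {
                w = out;             /* drop the leading tag fragment */
                seen_lt = 1;         /* only the first stray '>' counts */
            }
            /* any other stray '>' is simply dropped */
        } else if (!in_tag) {
            *w++ = *p;               /* ordinary text: keep it */
        }
    }
    *w = '\0';
}
```

Fed the $data string from the demo script, this sketch yields "test inside brackets. What's it gonna do?" with the HREF fragment gone.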

Something else, although this is purely aesthetic: when the state engine sees a > and goes back to state zero, it should put a space into the spot vacated by the removed HTML, unless the next character is whitespace or a less-than sign (<). Otherwise this little test program:

<?php
Header("Content-type: text/plain");
$data = '<TABLE BORDER=0><TR><TD>Hi there</TD></TR><TD>Ooops</TD></TR></TABLE>';
$data = strip_tags($data);
echo "$data\n";
?>

Results in this:

Hi thereOoops

Something like this should fix that (I think)..

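case '>':
	if (state == 1) {
		/* End of an HTML tag: leave a space where the tag was, unless
		   another tag or whitespace follows immediately. */
		if (*(p+1) != '<') {
			if (*(p+1) != ' ' && *(p+1) != '\t') {
				*(rp++) = ' ';
			}
		}
		lc = '>';
		state = 0;
	} else if (state == 2) {
		/* State 2 ends only on "?>" when no bracket is open and lc is
		   not a double quote. */
		if (!br && lc != '\"' && *(p-1) == '?') {
			state = 0;
		}
	}
	break;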
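
To try the idea outside the PHP source tree, here is a self-contained sketch of the same space-insertion rule. It is only an approximation of the fragment above: strip_tags_spaced is a hypothetical name, there is no PHP-block state, and two guards are added that the fragment does not have (no space at the very start of the output, at the end of the input, or directly after existing whitespace).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-alone stripper, not PHP's real state engine.
 * Removes <...> tags and writes one space in place of a removed tag when
 * the text would otherwise run together. The output never exceeds the
 * input, so out needs strlen(in) + 1 bytes. */
void strip_tags_spaced(const char *in, char *out)
{
    int in_tag = 0;   /* currently between '<' and '>' */
    char *w = out;    /* write position in out */

    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '<') {
            in_tag = 1;
        } else if (*p == '>') {
            if (in_tag) {
                unsigned char next = (unsigned char)p[1];
                in_tag = 0;
                /* Insert a space only between two pieces of real text. */
                if (next != '\0' && next != '<' && !isspace(next)
                    && w != out && !isspace((unsigned char)w[-1])) {
                    *w++ = ' ';
                }
            }
            /* a '>' outside any tag is dropped */
        } else if (!in_tag) {
            *w++ = *p;
        }
    }
    *w = '\0';
}
```

With the TABLE string from the test program this gives "Hi there Ooops" instead of "Hi thereOoops".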

History

 [1999-11-14 03:47 UTC] joey at cvs dot php dot net
Moving to change request
 [2000-05-30 19:18 UTC] rasmus at cvs dot php dot net
You want to strip incomplete tags because you are doing it on a line-by-line basis and the tag might have been started on a previous line?  Wouldn't it be easier to just concatenate your lines and do the strip_tags() once for the whole thing?  Stripping incomplete tags seems like a bad idea to me and there is no way to ever get it right anyway since a tag that starts on line 1, continues on line 2 and ends on line 3 will be impossible to handle correctly.
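
A small illustration of that point, again as a C sketch rather than PHP's actual code: strip_simple is a simplified stand-in for a tag stripper and strip_joined is a hypothetical helper that concatenates the lines before stripping once. The three sample lines in the note below are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a tag stripper (not PHP's real state engine).
 * Works in place too, because the write position never passes the read
 * position. */
void strip_simple(const char *in, char *out)
{
    int in_tag = 0;
    char *w = out;
    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '<')      in_tag = 1;
        else if (*p == '>') in_tag = 0;
        else if (!in_tag)   *w++ = *p;
    }
    *w = '\0';
}

/* Hypothetical helper: joins n lines with '\n' into buf (capacity cap)
 * and strips tags once over the whole text.
 * Returns 0 on success, -1 if buf is too small. */
int strip_joined(const char *lines[], size_t n, char *buf, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(lines[i]);
        if (used + len + 2 > cap)     /* room for '\n' and '\0' */
            return -1;
        memcpy(buf + used, lines[i], len);
        used += len;
        if (i + 1 < n)
            buf[used++] = '\n';
    }
    buf[used] = '\0';
    strip_simple(buf, buf);           /* one pass over the joined text */
    return 0;
}
```

Take the three lines `Click <A`, `HREF="x.html"` and `TARGET="_top">here</A> now`. Stripping the middle line by itself leaves `HREF="x.html"` untouched, because nothing on that line marks it as part of a tag. Joining the lines first and stripping once gives "Click here now".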
 