Request #2028 strip_tags state engine inappropriate for single line of html.
Submitted: 1999-08-10 23:57 UTC Modified: 2000-05-30 19:18 UTC
From: cdi at thewebmasters dot net Assigned:
Status: Closed Package: Feature/Change Request
PHP Version: 4.0 Beta 2 OS: RHLinux 5.1 2.0.35
Private report: No CVE-ID: None
 [1999-08-10 23:57 UTC] cdi at thewebmasters dot net
Demo script:

<?php
Header("Content-type: text/plain");
$data = 'HREF="blah.blah">test</A> inside <A HREF="brackets.com">brackets</A>. What\'s it gonna do?';
$data = strip_tags($data);
echo "$data\n";
?>

Output:

HREF="blah.blah"test inside brackets. What's it gonna do?

Config: ./configure --prefix=/www --with-apache=../apache_1.3.3 --with-mysql --with-imap --with-zlib --with-config-file-path --enable-debug=yes --enable-track-vars=yes --enable-magic-quotes=yes --enable-memory-limit=yes

php.ini not relevant.


When stripping "one line at a time", the state engine simply removes any extraneous > signs and leaves the rest of the partial tag in place. When I wrote a similar function to handle individual lines of HTML (no multi-line processing), it set a boolean as soon as it saw a < sign. If it saw a > before it had ever seen a <, the function logic "assumed" that everything leading up to the > was HTML and removed it. Worked like a champ.
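
A minimal sketch of the single-line approach described above, for illustration only. It is not the reporter's original function and not PHP's implementation; the name `strip_line` and the `at_start` flag are hypothetical, and quoted `>` characters inside attribute values are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: strips tags from a single line, in place.
 * A '>' seen before any '<' is assumed to close a tag that was opened
 * on an earlier line, so everything kept so far is discarded. */
void strip_line(char *line)
{
    char *rp = line;   /* write position; never runs ahead of p */
    int in_tag = 0;    /* 1 while between '<' and '>' */
    int at_start = 1;  /* 1 until the first '<' or '>' has been seen */

    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '<') {
            in_tag = 1;
            at_start = 0;
        } else if (*p == '>') {
            if (in_tag) {
                in_tag = 0;   /* normal end of a tag */
            } else if (at_start) {
                rp = line;    /* drop the dangling tag fragment */
            }
            /* any other stray '>' is simply dropped */
            at_start = 0;
        } else if (!in_tag) {
            *rp++ = *p;       /* ordinary text: keep it */
        }
    }
    *rp = '\0';
}
```

Applied to the demo line above, this sketch would leave `test inside brackets. What's it gonna do?` in the buffer.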

Something else, although this is purely aesthetic. When the state engine hits a > and goes back to zero, it should plunk a "space" into the spot vacated by the removed HTML if the next character is not a whitespace character or a less-than sign (<). Otherwise this little test program:

<?php
Header("Content-type: text/plain");
$data = '<TABLE BORDER=0><TR><TD>Hi there</TD></TR><TD>Ooops</TD></TR></TABLE>';
$data = strip_tags($data);
echo "$data\n";
?>

Results in this:

Hi thereOoops

Something like this should fix that (I think)..

case '>':
	if (state == 1) {
		if( *(p+1)!='<' ) {
			if ((*(p+1) != ' ') && (*(p+1) != '\t')) {
				*(rp++) = ' ';
			}
		}
		lc = '>';
		state = 0;
	} else if (state == 2) {
		if (!br && lc != '\"' && *(p-1)=='?') {
			state = 0;
		}
	}
	break;
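
The fragment above depends on variables from the surrounding function (`p`, `rp`, `state`, `lc`, `br`). As a rough, self-contained sketch of the same space-insertion rule, assuming a simplified two-state stripper rather than PHP's actual engine, and with the hypothetical name `strip_tags_spaced`:

```c
#include <stddef.h>

/* Hypothetical sketch: copies `in` to `out` without tags and writes one
 * space where a tag ended, unless the next character is '<', a space,
 * a tab, or the end of the string.
 * `out` must hold at least strlen(in) + 1 bytes; every inserted space
 * replaces a tag of at least two characters, so that is always enough. */
void strip_tags_spaced(const char *in, char *out)
{
    char *rp = out;
    int state = 0;  /* 0 = plain text, 1 = inside a tag */

    for (const char *p = in; *p != '\0'; p++) {
        switch (*p) {
        case '<':
            state = 1;
            break;
        case '>':
            if (state == 1) {
                char next = *(p + 1);
                if (next != '\0' && next != '<' && next != ' ' && next != '\t') {
                    *rp++ = ' ';
                }
                state = 0;
            }
            /* a stray '>' outside a tag is dropped */
            break;
        default:
            if (state == 0) {
                *rp++ = *p;
            }
            break;
        }
    }
    *rp = '\0';
}
```

For the table example this yields ` Hi there Ooops`. The rule as stated also puts a space in front of the first text run, because that `<TD>` is followed directly by a letter. The end-of-string check is an addition in this sketch and is not part of the fragment above.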

 [1999-11-14 03:47 UTC] joey at cvs dot php dot net
Moving to change request
 [2000-05-30 19:18 UTC] rasmus at cvs dot php dot net
You want to strip incomplete tags because you are doing it on a line-by-line basis and the tag might have been started on a previous line?  Wouldn't it be easier to just concatenate your lines and do the strip_tags() once for the whole thing?  Stripping incomplete tags seems like a bad idea to me and there is no way to ever get it right anyway since a tag that starts on line 1, continues on line 2 and ends on line 3 will be impossible to handle correctly.
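
To illustrate the point about tags that span lines, here is a small sketch, assuming a simplified stand-in for the tag stripper rather than strip_tags() itself. The names `strip_simple` and `strip_joined` are hypothetical. It joins the lines first and strips once, so a tag opened on one line and closed two lines later is removed as a whole.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a tag stripper (not PHP's): works in place. */
void strip_simple(char *s)
{
    char *rp = s;
    int in_tag = 0;

    for (const char *p = s; *p != '\0'; p++) {
        if (*p == '<') {
            in_tag = 1;
        } else if (*p == '>') {
            in_tag = 0;  /* also drops a stray '>' */
        } else if (!in_tag) {
            *rp++ = *p;
        }
    }
    *rp = '\0';
}

/* Joins `count` lines with '\n' into `out`, then strips the result once.
 * Returns 0 on success, -1 if `out` is too small. */
int strip_joined(const char *const lines[], size_t count,
                 char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(lines[i]);
        /* room for the line plus a separator or the final NUL */
        if (used + len + 1 > out_size) {
            return -1;
        }
        memcpy(out + used, lines[i], len);
        used += len;
        if (i + 1 < count) {
            out[used++] = '\n';
        }
    }
    out[used] = '\0';
    strip_simple(out);
    return 0;
}
```

With the three lines `Click <A`, `HREF="x.html"` and `>here</A> now`, the joined call gives `Click here now`. Passing the middle line to `strip_simple` on its own returns it unchanged, since nothing on that line marks it as the inside of a tag.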
 