Bug #22820 script kicks out to command prompt.
Submitted: 2003-03-21 23:35 UTC Modified: 2003-05-09 07:32 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: nick at axelis dot com Assigned:
Status: No Feedback Package: Reproducible crash
PHP Version: 4.3.2-RC OS: Windows 2000 sp3
Private report: No CVE-ID: None

 [2003-03-21 23:35 UTC] nick at axelis dot com
I've tried running this in a browser and end up with a "document contains no data" error. The script is intended to run from the command prompt. I'm running it in two environments: (1) Red Hat 8.0, PHP 4.2.2, Apache 2.0.40, and (2) win2k sp3, PHP 4.3.1, Apache 2.0.44. On the Linux box it runs like a champ. It's fast, it's furious. On Windows it starts out fine, but then at a certain point it just starts hammering the hard drive and leaves me at a command prompt. It doesn't seem to happen at a specific place in the script. It seems more like a memory allocation problem. It does not return any errors. I've found nothing in any of the system logs, the Apache log, or the PHP error log. I did once get an error that said: "erealloc(), failed to allocate 11 bytes." That happened only once, though; all of the other times it just dies. The script is a search engine spider. If I run it on a site with 20 or 30 pages to index it works great. If I hit a site that's bigger, it dies, but in a different place depending on the site. I've tested on at least 10 different sites with over 200 pages. The timing is consistent within a particular site: it always dies at the same place. I've done enough testing to ensure that the sites themselves are not the problem. Here's the script:

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);

echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Start the big loop
do {

// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search WHERE flag=0");
while($rslt = mysql_fetch_array($sql)){
	$flag = $rslt["flag"];
	$url = $rslt["url"];
	$crc = $rslt["checksum"];
	$date = $rslt["date"];

// Don't make them wait
	echo "\n\nWorking . . .\n";

// Don't go there if you don't have to
	if($flag == 1){
		continue;
	}

// Set the user agent to be sent
	ini_set('user_agent',$spiderhost);

// Open URL for parsing
	$open = @fopen("$url", "r");
	if($open){
		$read = fread($open, 100000);
		fclose($open);
	}
	else{
		$kill = mysql_query("DELETE FROM search WHERE url='$url'");
		continue;
	}

// Set date and checksum info
	$today = date("Y-m-d");
	$checksum = crc32($read);
	$chkyr = strftime(date("Y"));
	$chkmo = strftime(date("m"));
	$chkdy = strftime(date("d"));
	$chkdy = $chkdy - $spiderday;
	$daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));

// Get meta tags and use get_meta_tags to check if the file is actually there
	$meta = @get_meta_tags($url);
	if(!$meta){
		$kill = mysql_query("DELETE FROM search WHERE url='$url'");
		continue;
	}
	$robots = $meta["robots"];
	$keywords = $meta["keywords"];
	$description = $meta["description"];

// Check robots meta tags
	$metarobots = "noindex";
	if(checkmetarobots($metarobots)){
		echo "Indexing disallowed by robots meta tag: $url\n";
		continue;
	}
	$metarobots = "none";
	if(checkmetarobots($metarobots)){
		echo "Indexing disallowed by robots meta tag: $url\n";
		continue;
	}


// Get the page title
	$temp = spliti("title>",$read,3);
	$title = substr($temp[1],0,-2);

// Get the page body
	$body = str_replace("'","`",trim(strip_tags($read)));

// Make an announcement
	echo "Now Processing: $url\n";

// "Put the stuff in the search database\n";
	if($crc != $checksum){
		echo "Updating for CRC: $title\n$url\n";
		$renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
		if(!$renew){
			echo "NOT UPDATED: $url: " . mysql_error() . "\n";
			$kill = mysql_query("DELETE FROM search WHERE url='$url'");
			continue;		
		}
	}
	elseif($date <= $daycheck){
		echo "Updating for date: $title\n$url\n";
		$renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
		if(!$renew){
			echo "NOT UPDATED: $url: " . mysql_error() . "\n";
			$kill = mysql_query("DELETE FROM search WHERE url='$url'");
			continue;		
		}

	}
	else{
		$renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
		if(!$renew){
			echo "NOT UPDATED: $url" . mysql_error() . "\n";
			$kill = mysql_query("DELETE FROM search WHERE url='$url'");
		}
		continue;
	}

// Check robots meta tags
	$metarobots = "nofollow";
	if(checkmetarobots($metarobots)){
		echo "Following disallowed by robots meta tag: $url\n";
		continue;
	}
	$metarobots = "none";
	if(checkmetarobots($metarobots)){
		echo "Following disallowed by robots meta tag: $url\n";
		continue;
	}

// "Parse the main URL\n";
	$top = parse_url($url);
	$tschm = $top["scheme"];
	$thost = $top["host"];
	$tpath = $top["path"];
	$tqury = $top["query"];
	$tfrag = $top["fragment"];

$currentdomain = $tschm . "://" . $thost;

// Parse all the links on the page
	$rtemp = stristr($read,"href");	
	$temp = stristr($rtemp,">");
	while($rtemp){
	//"Parse the href out of the string\n";
		$rtemp = stristr($temp,"href");	
		$lpos = strlen($rtemp) - strlen($temp);
		$temp = stristr($rtemp,">");
		$lend = strlen($rtemp) - strlen($temp);
		$alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6, ($lend)))));
		$blink = stristr($alink," ");
		$alen = strlen($alink) - strlen($blink);
		$link = substr($alink, 0, $alen);

	// Kill any trailing slashes
		if(substr($link,(strlen($link)-1)) == "/"){
			$link = substr($link,0,(strlen($link)-1));
		}

		if(checkforgarbage()){
			continue;
		}

	// Parse the current link
		$bot = @parse_url($link);
		if(!$bot){
			continue;
		}
		$bschm = $bot["scheme"];
		$bhost = $bot["host"];
		$bpath = $bot["path"];
		$bqury = $bot["query"];
		$bfrag = $bot["fragment"];

	// Execute robots exclusion standard via robots.txt
		if(checkrobotstxt()){
			echo "Disallowed by robots.txt: $link\n";
			continue;
		}

	// Kill off any fragment based URLs
		if(strlen($bfrag) > 0){
			continue;
		}

	// Get rid of outside links
		if($bhost != "" && $bhost != $thost){
			continue;
		}

	// Kill off any dot dots ../../ 
		$ddotcheck = substr_count($bpath,"../");
		if($ddotcheck != ""){
			$lpos = strrpos($bpath,"..");
			$bpath = substr($bpath,$lpos);
		}

	// Comparative analysis
		if($bpath != "" && substr($bpath,0,1) != "/"){
			if(strrpos($tpath,".") === false){
				$bpath = $tpath . "/" . $bpath;
			}
			if(strrpos($tpath,".")){
				$ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
				$bpath = $ttmp . $bpath;
				if(substr($bpath,0,1) != "/"){
					$bpath = "/" . $bpath;
				}
			}
		}

	// Check to see if the scheme and domain are in the url
		if($bhost == ""){
			$link = $tschm . "://" . $thost . $bpath;
		}

	// Kill any trailing slashes
		if(substr($link,(strlen($link)-1)) == "/"){
			$link = substr($link,0,(strlen($link)-1));
		}

	// If there is a query string put it back on
		if($bqury != ""){
			$link = $link . "?" . $bqury;
		}

	// Don't be overly recursive
		if($link == $currentdomain){
			continue;
		}

	// If it's a useless link, kill it
		if($link == ""){
			continue;
		}

		if(!checkandupdatetoindexer()){
			continue;
		}
	}

// Take the new URLs and put them in the search database, or finish if there are no more
$movem = mysql_query("SELECT url FROM indexer");
while($mvrslt = mysql_fetch_array($movem)){
	$murl = $mvrslt["url"];
	$putem = mysql_query("INSERT INTO search SET url='$murl'");
}
$kill = mysql_query("DELETE FROM indexer");
}
$preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count FROM search WHERE checksum='0'"));
$loopcount = $preloop[0];
} while($loopcount > 0);

$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");

echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";


//////  Spider Functions   //////

function checkandupdatetoindexer(){
	global $link;
	// "Put the new URL in the search database\n";
		$chk = @mysql_query("SELECT url FROM search");
		while($curec = mysql_fetch_array($chk)){
			$curchk = $curec["url"];
			if($curchk == $link){
				return FALSE;
			}
		}
		echo "Adding: $link\n";
		$putup = mysql_query("INSERT INTO indexer SET url='$link'");
		return TRUE;
}

function checkforgarbage(){
		global $link;
		// "Get rid of any garbage and most binary files in the link\n";
		if(substr_count(strtolower($link),"&?") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"@") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"javascript") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"mailto") != 0){
			return TRUE;
		}
		
		if(substr_count(strtolower($link),"jpg") != 0){
			return TRUE;
		}
		
		if(substr_count(strtolower($link),"gif") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"pdf") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"pnf") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"mpg") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"mpeg") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"avi") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"mp3") != 0){
			return TRUE;
		}

		if(substr_count(strtolower($link),"wav") != 0){
			return TRUE;
		}
		
		return FALSE;
}

function checkmetarobots(){
	global $robots, $metarobots;
	if(substr_count($robots,$metarobots) > 0){
		return TRUE;
	}
	return FALSE;
}

function checkrobotstxt(){
	global $currentdomain, $bpath, $spiderhost;

	$getbot = $currentdomain . "/robots.txt";
	$robotay = @file($getbot);
		if(!$robotay){
			return FALSE;
		}
	$robotaycount = count($robotay);
	$dospider = 0;
	$roop = 0;
	while($roop < $robotaycount){
		$curele = trim($robotay[$roop]);
		++$roop;
		$thecolon = strpos($curele,":");
		// Skip blank lines and lines without a "field: value" pair
		if($curele == "" || $thecolon === false){
			continue;
		}
		if(substr($curele,0,$thecolon) == "User-agent"){
			$robgent = trim(substr($curele,$thecolon+1));
			if($robgent == "*" || $robgent == $spiderhost){
				$dospider = 1;
			}
			else{
				$dospider = 0;
			}
		}
		if(substr($curele,0,$thecolon) == "Disallow"){
			$robdis = trim(substr($curele,$thecolon+1));
			echo "$robdis\n";
			$roblen = strlen($robdis);
			if($roblen > 0 && substr($bpath,0,$roblen) == $robdis && $dospider == 1){
				return TRUE;
			}
		}
	}
	return FALSE;
}


?>

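One way to start narrowing a script like this down is to pull individual steps out into functions that take plain arguments instead of globals, so each can be run on its own without the network or the database. The following is only a sketch of the robots.txt check in that style; the function name `robots_disallows` and its parameters are hypothetical and not part of the script above, and it deliberately ignores finer points of the exclusion standard such as several User-agent lines sharing one group.

```php
<?php
// Sketch only: robots_disallows() and its parameter names are hypothetical.
// Returns true when $path is disallowed for $agent by the given robots.txt lines.
function robots_disallows($lines, $path, $agent)
{
    $applies = false; // does the current User-agent group apply to us?
    foreach ($lines as $line) {
        $line = trim($line);
        $colon = strpos($line, ':');
        if ($line === '' || $line[0] === '#' || $colon === false) {
            continue; // blank line, comment, or not a "field: value" line
        }
        $field = strtolower(trim(substr($line, 0, $colon)));
        $value = trim(substr($line, $colon + 1));
        if ($field === 'user-agent') {
            $applies = ($value === '*' || $value === $agent);
        } elseif ($field === 'disallow' && $applies && $value !== '') {
            // A rule matches when the path starts with the Disallow value
            if (strncmp($path, $value, strlen($value)) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Example input in the shape file() returns: one array entry per line.
$lines = array(
    "User-agent: *\n",
    "Disallow: /private\n",
    "\n",
    "User-agent: otherbot\n",
    "Disallow: /\n",
);
var_dump(robots_disallows($lines, '/private/page.html', 'myspider'));
var_dump(robots_disallows($lines, '/public/index.html', 'myspider'));
```

Because the function touches neither the network nor MySQL, it can be ruled in or out as a crash site independently of the rest of the spider.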
 [2003-03-23 19:35 UTC] nick at axelis dot com
Ok. I got the latest snapshot and applied it. The results were not what I expected. With the new snapshot I can't use the SAPI module for Apache 2; Apache won't load with it. I've now got it configured to use the CGI, and that works. The problem, however, still remains; there is no change.
 [2003-03-24 03:52 UTC] sniper@php.net
About the Apache2 sapi, you need Apache 2.0.44 installed.

About the cli problem, please provide a _SHORT_ example
script which we can use to test this. And I mean a script
that is max. 15-20 lines long and runs as-is.
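
A reduced test case of the kind requested might look roughly like the sketch below. It is an assumption about where to start, not a confirmed reproducer: it keeps only the `fopen()`/`fread()`/`strip_tags()` step from the original script and drops MySQL entirely. The function name `fetch_loop`, the iteration count and the reporting interval are all made up for illustration, and `memory_get_usage()` may not be available in every 4.3.x build.

```php
<?php
// Hypothetical reduced test case; fetch_loop() and its parameters are
// illustrative names, not taken from the original script.
// It repeats only the fetch-and-strip step, with no MySQL involved.
function fetch_loop($url, $iterations)
{
    $pages = array();
    for ($i = 0; $i < $iterations; $i++) {
        $fp = @fopen($url, 'r');
        if (!$fp) {
            echo "fopen failed on iteration $i\n";
            break;
        }
        $read = fread($fp, 100000);
        fclose($fp);
        $pages[] = strip_tags($read); // keep results so memory use grows
        if ($i % 50 == 0) {
            echo "$i: " . memory_get_usage() . " bytes\n";
        }
    }
    return count($pages);
}

// Run from the command prompt with a URL as the first argument.
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    set_time_limit(0);
    echo fetch_loop($argv[1], 500) . " pages fetched\n";
}
```

Saved under any name (say `repro.php`, a placeholder), it would be run as `php repro.php http://example.com/`. If this loop survives on Windows, the next step would be to add the MySQL calls back one at a time.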

 [2003-03-26 19:16 UTC] nick at axelis dot com
Ok, let's try this again:

As stated in my original message, I already have Apache 2.0.44 installed. As far as sending a _SHORT_ chunk of code that will reproduce this problem, I wouldn't know where to begin. The problem happens at various locations in the code where different, unrelated stuff is happening. If I could isolate this to a specific subset of the code I would have already fixed it. I'll try to put together something that might reproduce this, but I can't be sure what will happen.
 [2003-03-27 07:49 UTC] edink@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php4-STABLE-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php4-win32-STABLE-latest.zip

Apache2 problems are fixed now.
 [2003-03-30 23:30 UTC] nick at axelis dot com
Ok, I got that snapshot and you're right, the Apache problem is fixed. In fact it works great. I've been trying to whittle this down to 15-20 lines like you asked but I just can't pinpoint the problem. Given the way this script works, it's impossible to make it do anything except what it does. What I really need to know is: is this a problem with PHP 4.3.x, or with Windows? I was going to downgrade to 4.2.3 but I can't do that with this version of Apache, and I'd rather not go through the trouble of downgrading that as well on a production server.
 [2003-03-31 14:52 UTC] nick at axelis dot com
I think I've got something here that will work. I've reinstalled 4.2.3 and configured to run as CGI. This works. At this point I think it's safe to assume that this is a problem with 4.3.X. I don't have a linux/unix box with 4.3.x on it so I can't report anything there, but I think this really narrows things down to where we should be able to isolate the problem pretty effectively. I'm going to start doing that and I'll let you know what, if anything, I come up with.
 [2003-03-31 18:30 UTC] sniper@php.net
Does 4.3.2-RC work if you configure it as CGI? Or run with the CLI binary?


 [2003-03-31 20:25 UTC] nick at axelis dot com
No, I tried that with 4.3.1 and 4.3.2-RC. I got the same results. Remember, I'm running this from the command line so it's using CGI or CLI anyway.
 [2003-04-01 06:20 UTC] sniper@php.net
You really need to provide a shorter (and complete!!) example
script. Otherwise this report is as good as nothing.

 [2003-04-06 06:58 UTC] sniper@php.net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Open". Thank you.


 [2003-04-28 07:35 UTC] info at paradigmdirect dot com
I have a bit of code that does the exact same thing. Unfortunately it is too big to post, as Nick pointed out with his. The common thing is fopen on a URL and a connection to MySQL. The errors occur all over the place according to Dr. Watson. Here are some examples:

function: efree
FAULT ->100b8fb4 8b4608           mov     eax,[esi+0x8]          ds:00a7d5c2=????????

...another...
function: zend_hash_index_update_or_next_insert
FAULT ->100adc49 892cb1           mov     [ecx+esi*4],ebp        ds:0000000b=????????

...another (whole)...
function: zend_hash_rehash
        100add42 c1e902           shr     ecx,0x2
        100add45 f3ab             rep     stosd                  es:00ce8c60=011931f0
        100add47 8bce             mov     ecx,esi
        100add49 33f6             xor     esi,esi
        100add4b 83e103           and     ecx,0x3
        100add4e f3aa             rep     stosb                        es:00ce8c60=f0
        100add50 8b4214           mov     eax,[edx+0x14]         ds:01766eb2=????????
        100add53 3bc6             cmp     eax,esi
        100add55 7427             jz      do_bind_function_or_class+0x2b2e (100b687e)
        100add57 8b4a04           mov     ecx,[edx+0x4]          ds:01766eb2=????????
FAULT ->100add5a 8b38             mov     edi,[eax]              ds:65332d36=????????
        100add5c 23cf             and     ecx,edi
        100add5e 8b7a1c           mov     edi,[edx+0x1c]         ds:01766eb2=????????
        100add61 8b3c8f           mov     edi,[edi+ecx*4]        ds:0000000f=????????
        100add64 89701c           mov     [eax+0x1c],esi         ds:65db0308=????????
        100add67 3bfe             cmp     edi,esi
        100add69 897818           mov     [eax+0x18],edi         ds:65db0308=????????
        100add6c 7403             jz      do_bind_function_or_class+0x2921 (100b6671)
        100add6e 89471c           mov     [edi+0x1c],eax         ds:01766232=????????
        100add71 8b7a1c           mov     edi,[edx+0x1c]         ds:01766eb2=????????
        100add74 89048f           mov     [edi+ecx*4],eax        ds:0000000f=????????
        100add77 8b4010           mov     eax,[eax+0x10]         ds:65db0308=????????

Hope this helps a little.

J
 [2003-04-28 09:46 UTC] wez@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php4-STABLE-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php4-win32-STABLE-latest.zip

Please try the *next* CVS snapshot; I've just fixed a bug related to fopen() of http URLs.

Next STABLE Win32 snapshot in: 1 hour(s) and 28 minute(s)

 [2003-05-09 07:32 UTC] sniper@php.net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Open". Thank you.


 [2003-05-10 03:58 UTC] php-bugs at webfreezer dot com
I experience the same problem with PHP 4.3.1 CLI on Windows XP.
And as "info at paradigmdirect dot com" points out it really happens in a script that does a FOPEN on a url and uses a MySQL database.
The error happens from time to time and does not seem to be reproducible. The script is too long to post here but please notice that this error occurs in the Windows CLI Version of PHP 4.3.1.
 
Last updated: Sat Apr 27 23:01:30 2024 UTC