[2003-03-21 23:35 UTC] nick at axelis dot com
I've tried running this in a browser and end up with a "document contains no data" error, but the script is intended to run from the command prompt. I'm running it in two environments:

1. Red Hat 8.0, PHP 4.2.2, Apache 2.0.40
2. Windows 2000 SP3, PHP 4.3.1, Apache 2.0.44

On the Linux box it runs like a champ. It's fast, it's furious. On Windows it starts out fine, but then at a certain point it just starts hammering the hard drive and leaves me at a command prompt. It doesn't seem to happen at a specific place in the script; it seems more like a memory allocation problem. It does not return any errors. I've found nothing in any of the system logs, the Apache log, or the PHP error log. I did once get an error that said: "erealloc(), failed to allocate 11 bytes." That happened only once, though; all of the other times it just dies.

The script is a search engine spider. If I run it on a site with 20 or 30 pages to index, it works great. If I hit a bigger site, it dies, but in a different place depending on the site. I've tested on at least 10 different sites with over 200 pages. The timing is consistent within a particular site: it always dies at the same place. I've done enough testing to ensure that the sites themselves are not the problem. Here's the script below:
<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);
echo "##### The Spider is Running, Do Not Close This Console #####\n\n";
// Start the big loop
do {
// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search WHERE flag=0");
while($rslt = mysql_fetch_array($sql)){
$flag = $rslt["flag"];
$url = $rslt["url"];
$crc = $rslt["checksum"];
$date = $rslt["date"];
// Don't make them wait
echo "\n\nWorking . . .\n";
// Don't go there if you don't have to
if($flag == 1){
continue;
}
// Set the user agent to be sent
ini_set('user_agent',$spiderhost);
// Open URL for parsing
$open = @fopen("$url", "r");
if($open){
$read = fread($open, 100000);
fclose($open);
}
else{
$kill = mysql_query("DELETE FROM search WHERE url='$url'");
continue;
}
// Set date and checksum info
$today = date("Y-m-d");
$checksum = crc32($read);
$chkyr = strftime(date("Y"));
$chkmo = strftime(date("m"));
$chkdy = strftime(date("d"));
$chkdy = $chkdy - $spiderday;
$daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));
// Get meta tags and use get_meta_tags to check if the file is actually there
$meta = @get_meta_tags($url);
if(!$meta){
$kill = mysql_query("DELETE FROM search WHERE url='$url'");
continue;
}
$robots = $meta["robots"];
$keywords = $meta["keywords"];
$description = $meta["description"];
// Check robots meta tags
$metarobots = "noindex";
if(checkmetarobots($metarobots)){
echo "Indexing disallowed by robots meta tag: $url\n";
continue;
}
$metarobots = "none";
if(checkmetarobots($metarobots)){
echo "Indexing disallowed by robots meta tag: $url\n";
continue;
}
// Get the page title
$temp = spliti("title>",$read,3);
$title = substr($temp[1],0,-2);
// Get the page body
$body = str_replace("'","`",trim(strip_tags($read)));
// Make an announcement
echo "Now Processing: $url\n";
// "Put the stuff in the search database\n";
if($crc != $checksum){
echo "Updating for CRC: $title\n$url\n";
$renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
if(!$renew){
echo "NOT UPDATED: $url " . mysql_error() . "\n";
$kill = mysql_query("DELETE FROM search WHERE url='$url'");
continue;
}
}
elseif($date <= $daycheck){
echo "Updating for date: $title\n$url\n";
$renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
if(!$renew){
echo "NOT UPDATED: $url " . mysql_error() . "\n";
$kill = mysql_query("DELETE FROM search WHERE url='$url'");
continue;
}
}
else{
$renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
if(!$renew){
echo "NOT UPDATED: $url" . mysql_error() . "\n";
$kill = mysql_query("DELETE FROM search WHERE url='$url'");
}
continue;
}
// Check robots meta tags
$metarobots = "nofollow";
if(checkmetarobots($metarobots)){
echo "Following disallowed by robots meta tag: $url\n";
continue;
}
$metarobots = "none";
if(checkmetarobots($metarobots)){
echo "Following disallowed by robots meta tag: $url\n";
continue;
}
// "Parse the main URL\n";
$top = parse_url($url);
$tschm = $top["scheme"];
$thost = $top["host"];
$tpath = $top["path"];
$tqury = $top["query"];
$tfrag = $top["fragment"];
$currentdomain = $tschm . "://" . $thost;
// Parse all the links on the page
$rtemp = stristr($read,"href");
$temp = stristr($rtemp,">");
while($rtemp){
//"Parse the href out of the string\n";
$rtemp = stristr($temp,"href");
$lpos = strlen($rtemp) - strlen($temp);
$temp = stristr($rtemp,">");
$lend = strlen($rtemp) - strlen($temp);
$alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6, ($lend)))));
$blink = stristr($alink," ");
$alen = strlen($alink) - strlen($blink);
$link = substr($alink, 0, $alen);
// Kill any trailing slashes
if(substr($link,(strlen($link)-1)) == "/"){
$link = substr($link,0,(strlen($link)-1));
}
if(checkforgarbage()){
continue;
}
// Parse the current link
$bot = @parse_url($link);
if(!$bot){
continue;
}
$bschm = $bot["scheme"];
$bhost = $bot["host"];
$bpath = $bot["path"];
$bqury = $bot["query"];
$bfrag = $bot["fragment"];
// Execute robots exclusion standard via robots.txt
if(checkrobotstxt()){
echo "Disallowed by robots.txt: $link\n";
continue;
}
// Kill off any fragment based URLs
if(strlen($bfrag) > 0){
continue;
}
// Get rid of outside links
if($bhost != "" && $bhost != $thost){
continue;
}
// Kill off any dot dots ../../
$ddotcheck = substr_count($bpath,"../");
if($ddotcheck != ""){
$lpos = strrpos($bpath,"..");
$bpath = substr($bpath,$lpos);
}
// Comparative analysis
if($bpath != "" && substr($bpath,0,1) != "/"){
if(strrpos($tpath,".") === false){
$bpath = $tpath . "/" . $bpath;
}
if(strrpos($tpath,".")){
$ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
$bpath = $ttmp . $bpath;
if(substr($bpath,0,1) != "/"){
$bpath = "/" . $bpath;
}
}
}
// Check to see if the scheme and domain are in the url
if($bhost == ""){
$link = $tschm . "://" . $thost . $bpath;
}
// Kill any trailing slashes
if(substr($link,(strlen($link)-1)) == "/"){
$link = substr($link,0,(strlen($link)-1));
}
// If there is a query string put it back on
if($bqury != ""){
$link = $link . "?" . $bqury;
}
// Don't be overly recursive
if($link == $currentdomain){
continue;
}
// If it's a useless link, kill it
if($link == ""){
continue;
}
if(!checkandupdatetoindexer()){
continue;
}
}
// Take the new URLs and put them in the search database, or finish if there are no more
$movem = mysql_query("SELECT url FROM indexer");
while($mvrslt = mysql_fetch_array($movem)){
$murl = $mvrslt["url"];
$putem = mysql_query("INSERT INTO search SET url='$murl'");
}
$kill = mysql_query("DELETE FROM indexer");
}
$preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count FROM search WHERE checksum='0'"));
$loopcount = $preloop[0];
} while($loopcount > 0);
$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");
echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";
////// Spider Functions //////
function checkandupdatetoindexer(){
global $link;
// "Put the new URL in the search database\n";
$chk = @mysql_query("SELECT url FROM search");
while($curec = mysql_fetch_array($chk)){
$curchk = $curec["url"];
if($curchk == $link){
return FALSE;
}
}
echo "Adding: $link\n";
$putup = mysql_query("INSERT INTO indexer SET url='$link'");
return TRUE;
}
function checkforgarbage(){
global $link;
// "Get rid of any garbage and most binary files in the link\n";
if(substr_count(strtolower($link),"&?") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"@") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"javascript") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"mailto") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"jpg") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"gif") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"pdf") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"png") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"mpg") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"mpeg") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"avi") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"mp3") != 0){
return TRUE;
}
if(substr_count(strtolower($link),"wav") != 0){
return TRUE;
}
return FALSE;
}
function checkmetarobots($metarobots){
// The directive to look for is passed in by the caller
global $robots;
if(substr_count($robots,$metarobots) > 0){
return TRUE;
}
return FALSE;
}
function checkrobotstxt(){
global $currentdomain, $bpath, $spiderhost;
$getbot = $currentdomain . "/robots.txt";
$robotay = @file($getbot);
if(!$robotay){
return FALSE;
}
$robotaycount = count($robotay);
$dospider = 0;
$roop = 0;
while($roop < $robotaycount){
$curele = trim($robotay[$roop]);
// Advance the counter first so that "continue" cannot loop forever
++$roop;
$thecolon = strpos($curele,":");
if($curele == "" || $thecolon === false){
continue;
}
// Field name without the colon, compared case-insensitively
$field = strtolower(substr($curele,0,$thecolon));
if($field == "user-agent"){
$robgent = trim(substr($curele,$thecolon+1));
if($robgent == "*" || $robgent == $spiderhost){
$dospider = 1;
}
else{
$dospider = 0;
}
}
if($field == "disallow"){
$robdis = trim(substr($curele,$thecolon+1));
echo "$robdis\n";
$roblen = strlen($robdis);
// An empty Disallow value means nothing is disallowed
if($roblen > 0 && substr($bpath,0,$roblen) == $robdis && $dospider == 1){
return TRUE;
}
}
}
return FALSE;
}
?>
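The robots.txt check in checkrobotstxt() can also be written as a pure function that takes the file's lines as input, which makes it testable without any network access. The following is only a minimal sketch for current PHP versions, not the script's actual code: the name is_path_disallowed and its parameters are hypothetical, and it simplifies the exclusion standard by treating every User-agent line as the start of a new group.

```php
<?php
// Hypothetical helper, not part of the original script: reports whether
// $path is disallowed for $agent, given the lines of a robots.txt file.
function is_path_disallowed(array $lines, string $path, string $agent): bool
{
    $applies = false; // does the current User-agent group cover $agent?
    foreach ($lines as $line) {
        $line = trim($line);
        $colon = strpos($line, ':');
        if ($line === '' || $line[0] === '#' || $colon === false) {
            continue; // blank line, comment, or not a "field: value" line
        }
        $field = strtolower(trim(substr($line, 0, $colon)));
        $value = trim(substr($line, $colon + 1));
        if ($field === 'user-agent') {
            // Simplification: each User-agent line replaces the previous one
            $applies = ($value === '*' || strcasecmp($value, $agent) === 0);
        } elseif ($field === 'disallow' && $applies && $value !== '') {
            // An empty Disallow value means "allow everything", so skip it
            if (strncmp($path, $value, strlen($value)) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

It could be fed from the same source the script uses, for example is_path_disallowed(file($getbot), $bpath, $spiderhost), after checking that file() did not fail.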
I have a bit of code that does the exact same thing. It is unfortunately too big, as Nick pointed out with his. The common thing is fopen on a URL and a connection to MySQL. The errors occur all over the place according to Dr Watson. Here are some examples:

function: efree
FAULT ->100b8fb4 8b4608 mov eax,[esi+0x8] ds:00a7d5c2=????????

...another...

function: zend_hash_index_update_or_next_insert
FAULT ->100adc49 892cb1 mov [ecx+esi*4],ebp ds:0000000b=????????

...another (whole)...

function: zend_hash_rehash
100add42 c1e902 shr ecx,0x2
100add45 f3ab rep stosd es:00ce8c60=011931f0
100add47 8bce mov ecx,esi
100add49 33f6 xor esi,esi
100add4b 83e103 and ecx,0x3
100add4e f3aa rep stosb es:00ce8c60=f0
100add50 8b4214 mov eax,[edx+0x14] ds:01766eb2=????????
100add53 3bc6 cmp eax,esi
100add55 7427 jz do_bind_function_or_class+0x2b2e (100b687e)
100add57 8b4a04 mov ecx,[edx+0x4] ds:01766eb2=????????
FAULT ->100add5a 8b38 mov edi,[eax] ds:65332d36=????????
100add5c 23cf and ecx,edi
100add5e 8b7a1c mov edi,[edx+0x1c] ds:01766eb2=????????
100add61 8b3c8f mov edi,[edi+ecx*4] ds:0000000f=????????
100add64 89701c mov [eax+0x1c],esi ds:65db0308=????????
100add67 3bfe cmp edi,esi
100add69 897818 mov [eax+0x18],edi ds:65db0308=????????
100add6c 7403 jz do_bind_function_or_class+0x2921 (100b6671)
100add6e 89471c mov [edi+0x1c],eax ds:01766232=????????
100add71 8b7a1c mov edi,[edx+0x1c] ds:01766eb2=????????
100add74 89048f mov [edi+ecx*4],eax ds:0000000f=????????
100add77 8b4010 mov eax,[eax+0x10] ds:65db0308=????????

Hope this helps a little.
J
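Since both scripts are too big to serve as a test case, one way to narrow things down might be a small harness that performs only the fetch step and records memory use after each URL. This is a rough sketch under several assumptions: trace_memory and the stub fetcher are hypothetical names, the code is written for current PHP versions, and memory_get_usage() may not be available on the PHP 4.x builds mentioned in this report.

```php
<?php
// Hypothetical harness: calls $fetch for each URL and records memory use
// after every iteration, so that steady growth becomes visible.
function trace_memory(array $urls, callable $fetch): array
{
    $samples = [];
    foreach ($urls as $i => $url) {
        $data = $fetch($url);
        $samples[] = [
            'index'  => $i,
            'url'    => $url,
            'bytes'  => is_string($data) ? strlen($data) : 0,
            'memory' => memory_get_usage(),
        ];
        unset($data); // release the page before the next fetch
    }
    return $samples;
}

// Stub fetcher used for illustration only; a real run would swap in
// something like fopen()/fread() on the URL, as both scripts do.
$samples = trace_memory(
    ['http://example.com/a', 'http://example.com/b'],
    function (string $url): string {
        return str_repeat('x', 1000);
    }
);

foreach ($samples as $s) {
    echo "{$s['index']} {$s['url']}: {$s['bytes']} bytes, memory {$s['memory']}\n";
}
```

If the crash reproduces with only the fetch step, the MySQL connection can be ruled out, and the reverse check works the same way with a stub fetcher and real queries.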