curl - How to minimize memory consumption from a PHP spider when using curl_multi_getcontent()?
I hope someone can help me out with this. I'm writing a spider function in PHP that recursively crawls across a website (via the links it finds on the site's pages) until it reaches a pre-specified depth.
So far the spider works up to 2 levels of depth. The problem is when the depth is 3 or more levels down, especially on larger websites. I get a fatal memory error, which I think has to do with the recursive multi-processing using cURL (and because 3 levels down on some sites can mean thousands of URLs being processed).
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 366030 bytes) in c:\xampp\htdocs\crawler.php on line 105
My question is what I might be doing wrong (or what I ought to be doing) to minimize the memory consumption.
Here's what the code looks like. The important areas related to memory usage are left intact, and the more complex processing sections are replaced with pseudo-code/comments (to make it simpler to read). Thanks!
<?php
function crawler( $urlarray, $visitedurlarray, $depth ){

    /* Recursion check
       --------------- */
    if( empty( $urlarray ) || ( $depth < 1 ) ){
        return $visitedurlarray;
    }

    /* Set up the multi-handler
       ------------------------ */
    $multicurlhandler = curl_multi_init();
    $curlhandlearray  = array();

    foreach( $urlarray as $url ){
        $curlhandlearray[$url] = curl_init();
        curl_setopt( $curlhandlearray[$url], CURLOPT_URL, $url );
        curl_setopt( $curlhandlearray[$url], CURLOPT_HEADER, 0 );
        curl_setopt( $curlhandlearray[$url], CURLOPT_TIMEOUT, 1000 );
        curl_setopt( $curlhandlearray[$url], CURLOPT_RETURNTRANSFER, 1 );
        curl_multi_add_handle( $multicurlhandler, $curlhandlearray[$url] );
    }

    /* Run multi-exec
       -------------- */
    $running = null;
    do {
        curl_multi_exec( $multicurlhandler, $running );
    } while ( $running > 0 );

    /* Process the URL pages to find links to traverse
       ------------------------------------------------ */
    foreach( $curlhandlearray as $key => $curlhandle ){

        /* Grab the content from the handle and close it
           ---------------------------------------------- */
        $urlcontent = curl_multi_getcontent( $curlhandle );
        curl_multi_remove_handle( $multicurlhandler, $curlhandle );
        curl_close( $curlhandle );

        /* Place the content in a DOMDocument for easy link processing
           ------------------------------------------------------------ */
        $domdoc  = new DOMDocument( '1.0' );
        $success = @$domdoc->loadHTML( $urlcontent );

        /* Array to hold the URLs to pass recursively
           ------------------------------------------- */
        $recursionurlsarray = array();

        /* Grab the links from the DOMDocument and add them to the new URL array
           ---------------------------------------------------------------------- */
        $anchors = $domdoc->getElementsByTagName( 'a' );
        foreach( $anchors as $element ){
            // ---clean the link
            // ---check if the link is in $visitedurlarray
            // ---if so, continue;
            // ---if not, add it to $recursionurlsarray and $visitedurlarray
        }

        /* Call the function recursively with the newly parsed URLs
           --------------------------------------------------------- */
        $visitedurlarray = crawler( $recursionurlsarray, $visitedurlarray, $depth - 1 );
    }

    /* Close and unset variables
       ------------------------- */
    curl_multi_close( $multicurlhandler );
    unset( $multicurlhandler );
    unset( $curlhandlearray );

    return $visitedurlarray;
}
?>
This is your problem:
"I'm writing a spider function in PHP that recursively crawls across a website"
Don't do that. You are going to get into an infinite loop and cause a denial of service. Your real problem is not running out of memory. Your real problem is that you are going to take down the sites you are crawling.
Real webspiders do not attack a website and hit every page boom, boom, boom the way you are doing. The way you are doing it is more like an attack than a legitimate webcrawler. They are called "crawlers" because they "crawl", as in "go slow". Plus, a legitimate webcrawler reads the robots.txt file and does not read pages that are off limits according to that file.
You should do something more like this:
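As a rough illustration of the robots.txt point, here is a minimal check you could run before fetching a path. The helper name and the simplistic prefix-matching of Disallow rules are my own assumptions; a real crawler should also honour per-user-agent groups, wildcards, and Crawl-delay.

<?php
// Hypothetical helper: fetch robots.txt once and check whether a path
// is disallowed. Very simplified; only looks at Disallow prefixes.
function is_allowed_by_robots( $baseurl, $path ){
    $robots = @file_get_contents( rtrim( $baseurl, '/' ) . '/robots.txt' );
    if( $robots === false ){
        return true; // no robots.txt found, assume allowed
    }
    foreach( explode( "\n", $robots ) as $line ){
        $line = trim( $line );
        if( stripos( $line, 'Disallow:' ) === 0 ){
            $rule = trim( substr( $line, strlen( 'Disallow:' ) ) );
            if( $rule !== '' && strpos( $path, $rule ) === 0 ){
                return false; // path matches a disallowed prefix
            }
        }
    }
    return true;
}
?>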
1. Read one page and save its links to a database where the URL column has a unique constraint, so the same URL can't end up in there more than once. The table should also have a status field that shows whether the URL has been read yet.
2. Grab a URL from the database whose status field shows it is unread. Read it, save the URLs it links to into the database, and then update its status field to show it has been read.
3. Repeat #2 as needed, but at the pace of a crawl (see the sketch after this list).
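Here is a minimal sketch of that loop. It assumes a hypothetical MySQL table called crawl_queue (shown in the comment) accessed through PDO; the table, column, and connection details are placeholders I made up, not something from your code.

<?php
// Minimal sketch, assuming a hypothetical table:
//   CREATE TABLE crawl_queue (
//       url    VARCHAR(255) NOT NULL,
//       status ENUM('unread','read') NOT NULL DEFAULT 'unread',
//       UNIQUE KEY (url)
//   );
$pdo = new PDO( 'mysql:host=localhost;dbname=crawler', 'user', 'pass' );

// Seed the queue with the start page; the unique key silently drops duplicates.
$insert = $pdo->prepare( 'INSERT IGNORE INTO crawl_queue (url) VALUES (?)' );
$insert->execute( array( 'http://example.com/' ) );

$update = $pdo->prepare( "UPDATE crawl_queue SET status = 'read' WHERE url = ?" );

while( true ){
    // 2. Grab one URL from the database whose status shows it is unread.
    $row = $pdo->query( "SELECT url FROM crawl_queue WHERE status = 'unread' LIMIT 1" )
               ->fetch( PDO::FETCH_ASSOC );
    if( !$row ){
        break; // nothing left to crawl
    }

    // Read the page with a single curl handle.
    $ch = curl_init( $row['url'] );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 30 );
    $html = curl_exec( $ch );
    curl_close( $ch );

    // Save the links it contains; duplicates are ignored by the unique key.
    if( $html !== false ){
        $domdoc = new DOMDocument( '1.0' );
        @$domdoc->loadHTML( $html );
        foreach( $domdoc->getElementsByTagName( 'a' ) as $anchor ){
            $link = $anchor->getAttribute( 'href' );
            // ---clean/absolutize the link here, as in your original code
            $insert->execute( array( $link ) );
        }
    }

    // Mark the URL as read so it is never fetched again.
    $update->execute( array( $row['url'] ) );

    // 3. Crawl at a crawl's pace instead of hammering the site.
    sleep( 20 );
}
?>

Because only one page's HTML is in memory at a time and the frontier lives in the database rather than on the call stack, memory use stays flat no matter how deep the crawl goes.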
From http://en.wikipedia.org/wiki/web_crawler#politeness_policy :
Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes.
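If you want to mimic that interval, a randomized delay between fetches (my suggestion, not something the quote prescribes) could replace the fixed sleep in the sketch above:

// Hypothetical pacing: wait between 20 seconds and 4 minutes before the next fetch.
sleep( rand( 20, 240 ) );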