1.phpQuery
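<?php
require("phpQuery-onefile.php");
// Load the page; it becomes the default document used by pq().
phpQuery::newDocumentFile('https://www.haiyun.me/archives.html');
$hrefs = array();
//foreach (pq('body .main li')->find('a') as $a) {
foreach (pq('body .main li a') as $a) {
    // $a is a plain DOMElement; wrap it with pq() to use phpQuery methods.
    // The URL and the link text are appended in turn, so entries alternate.
    $hrefs[] = pq($a)->attr('href');
    $hrefs[] = pq($a)->text();
}
print_r($hrefs);
?>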
Remember to free the memory afterwards:
phpQuery::$documents = array();
phpQuery::unloadDocuments();
2.HtmlParserModel, built on tidy, can parse malformed HTML pages:
yum install php-tidy
git clone https://github.com/bupt1987/HtmlParserModel.git
<?php
include_once "HtmlParserModel.php";
$html = file_get_contents('http://www.amazon.com/s/node=3564986011');
if ($html === false) {
    die("Failed to download the page\n");
}
$html_dom = new HtmlParserModel($html);
// Select every <a> element with the class "title".
$p_array = $html_dom->find('a.title');
foreach ($p_array as $p) {
    echo $p->getPlainText();
}
?>
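For comparison, here is a minimal sketch of the same "select links, read href and text" idea using only PHP's built-in DOM extension instead of either library above. It is not part of phpQuery or HtmlParserModel. The sample markup, the `$links` array, and the XPath expression (a rough counterpart of the `.main li a` selector) are assumptions made for illustration, and the DOM extension is assumed to be enabled.

```php
<?php
// Hypothetical sample markup, deliberately sloppy: the <li> tags are never closed.
$html = '<div class="main"><ul>'
      . '<li><a href="/a.html">First post</a>'
      . '<li><a href="/b.html">Second post</a>'
      . '</ul></div>';

// Collect parser warnings internally instead of printing them for malformed input.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// XPath approximation of the CSS selector ".main li a".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " main ")]//li//a'
);

$links = array();
foreach ($nodes as $a) {
    // Keep each href together with its link text rather than in a flat list.
    $links[] = array(
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    );
}
print_r($links);
```

Storing each link as an href/text pair avoids having to remember which positions in a flat array hold URLs and which hold titles.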