PHP crawler for wiki getting error
Problem description
In the code below I am trying to extract content from a website using PHP. It works fine when I call getElementByIdAsString('www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'synopsis');
But it does not work when I use the same code to extract content from Wikipedia: getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
Below is my code and the exception I get when I use the latter call. Can someone correct my code so that it extracts Wikipedia content based on the id?
Thanks.
<?php
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    // var_dump($doc->loadHTMLFile($url)); die;
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Return the string representation of the element
    return $doc->saveXML($element);
}
// Here I am displaying the output
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
?>
Exception
Fatal error: Uncaught exception 'Exception' with message 'Failed to load http://en.wikipedia.org/wiki/A_Brief_History_of_Time' in C:\xampp\htdocs\example2.php:18 Stack trace: #0 C:\xampp\htdocs\example2.php(40): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 18
Your help would be greatly appreciated :-)
Recommended answer
Try adding:
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
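The generic "Failed to load" message hides the reason cURL gave up, so it can help to surface curl_error() before changing any options. The following is only a minimal sketch under stated assumptions: fetchHtml() and its $caBundle parameter are hypothetical names that do not appear in the original code, and the user agent string is a placeholder. It follows redirects, optionally points cURL at a CA bundle via CURLOPT_CAINFO so that certificate verification can stay enabled, and includes the cURL error text in the exception.

```php
<?php
// Sketch only: fetchHtml() and $caBundle are hypothetical, not from the original post.
function fetchHtml(string $url, ?string $caBundle = null): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects (e.g. http to https)
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; example-crawler)'); // placeholder UA
    if ($caBundle !== null) {
        // Assumption: $caBundle is the path to a CA certificate bundle on this machine.
        curl_setopt($ch, CURLOPT_CAINFO, $caBundle);
    }

    $result = curl_exec($ch);
    if ($result === false) {
        // Report what cURL actually complained about (SSL, DNS, protocol, ...).
        throw new RuntimeException(
            "Failed to load $url: " . curl_error($ch) . ' (cURL error ' . curl_errno($ch) . ')'
        );
    }
    return $result;
}
```

If the reported error turns out to be a certificate problem, passing a CA bundle path as the second argument may be an alternative to switching verification off; whether that applies depends on the local PHP and cURL setup.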
Update after the discussion in the comments:
<?php
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Skip SSL certificate checks and follow redirects
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }
    if($pretty) {
        $doc->formatOutput = true;
    }
    // Start from the element's parent (the heading) and walk its following
    // siblings, collecting paragraph text until the next h2 heading
    $output = '';
    $node = $element->parentNode;
    while(true) {
        $node = $node->nextSibling;
        if(!$node) {
            break;
        }
        if($node->nodeName == 'p') {
            $output .= $node->nodeValue;
        }
        if($node->nodeName == 'h2') {
            break;
        }
    }
    return $output;
}
// Dump the text extracted from the Summary section
var_dump(getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary'));
You could probably also use XPath, or take the whole response and cut out whatever you want with a regex.
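As a rough illustration of the XPath route, the sibling walk can be expressed as a query. This is a sketch, not a definitive implementation: extractSectionText() is a hypothetical helper, it takes already-downloaded HTML rather than a URL, and it assumes the id sits on or inside an h2 whose following siblings are the section's paragraphs. The page's actual markup should be checked, since it may not match this layout.

```php
<?php
// Sketch only: extractSectionText() is a hypothetical helper, not from the original post.
// Assumes $id contains no double quote and identifies an <h2> or an element inside one.
function extractSectionText(string $html, string $id): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $heading = $xpath->query('//*[@id="' . $id . '"]/ancestor-or-self::h2')->item(0);
    if ($heading === null) {
        throw new RuntimeException("No h2 section with id $id was found");
    }

    // Count the h2 siblings up to and including this heading. A following <p>
    // with the same number of preceding h2 siblings lies before the next h2.
    $n = (int) $xpath->evaluate('count(preceding-sibling::h2)', $heading) + 1;
    $paragraphs = $xpath->query(
        "following-sibling::p[count(preceding-sibling::h2) = $n]",
        $heading
    );

    $parts = [];
    foreach ($paragraphs as $p) {
        $parts[] = trim($p->textContent);
    }
    return implode("\n", $parts);
}
```

The function could be fed the string returned by curl_exec() in the code above, for example extractSectionText($result, 'Summary').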