Crawling a website with PHP, but the website runs JS to generate markup


Question

I have been doing web crawling for the last couple of weeks. Using a PHP library (PHP Simple DOM), I run a PHP script from the terminal to fetch some URLs and export some of their data as JSON. This has been working very well so far.

Recently I wanted to expand the crawling to a specific site and encountered the following problem:

Unlike any other site so far, this one only outputs bare-bones markup server-side and instead relies on a single JS script to build the relevant markup on load.

Obviously my PHP script can't handle that: it does not execute the JS, so the page stays mostly blank from what I can tell, and I can't crawl the site because the content has not been created yet.

I'm unsure how to proceed. Is it actually possible to make my current PHP script "compatible" with that site, or do I need to change gears and incorporate a browser, i.e. pick a completely different route?

I'm currently thinking I would need to create an HTML/JS page that opens the URL in an iframe, so that I could run a JS function manually via the console to extract the data. However, I'm hoping there is a more feasible way.

Thanks,

Recommended answer

When I need to scrape a website I normally:

1 - Navigate the target website in a normal browser (Firefox, Chrome, etc.) while monitoring/logging any POST/GET requests containing pertinent info via Developer Tools -> Network tab.
Pay special attention to XHR requests, as they normally contain JSON-encoded data.
Here's a small video I've made exemplifying this:

https://www.youtube.com/watch?v=JbiZBGt8cos

You can mimic the request headers captured previously (explained in the video) and use them in a curl request, e.g.:

<?php
// Request headers copied from the browser's Network tab.
$headers = [
    "Connection: keep-alive",
    "Accept: application/json, text/javascript, */*; q=0.01",
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "DNT: 1",
    "Accept-Language: pt,en-US;q=0.9,en;q=0.8,pt-PT;q=0.7,pt-BR;q=0.6",
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://s1te.com/json_rand.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$server_output = curl_exec($ch);
if ($server_output === false) {
    // curl_exec() returns false on failure; report why and stop.
    fwrite(STDERR, "Request failed: " . curl_error($ch) . PHP_EOL);
    exit(1);
}
curl_close($ch);
print $server_output;
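
Once such an endpoint returns JSON, the data can be decoded directly instead of being parsed out of HTML. Below is a minimal sketch in Python (the language used for the Selenium example further down), assuming a hypothetical payload shaped like `{"items": [{"title": ..., "url": ...}]}`. The function name and the field names are illustrative and not taken from any real site:

```python
import json


def extract_items(payload: str) -> list[dict]:
    """Decode a JSON response body and return its list of records.

    Assumes a hypothetical top-level "items" key holding a list of objects.
    """
    data = json.loads(payload)
    items = data.get("items", [])
    # Keep only well-formed records that carry both (assumed) fields.
    return [
        {"title": item["title"], "url": item["url"]}
        for item in items
        if isinstance(item, dict) and "title" in item and "url" in item
    ]
```

The real key names have to be read off the response shown in the Network tab; the filtering step simply skips records that lack the expected fields.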

2 - In some cases, it's impossible to crawl certain URLs without a JavaScript-enabled client. When this happens, I normally use Selenium with Chrome or Firefox. You can also use PhantomJS, a headless browser. Recent versions of GeckoDriver (used by Selenium) also support headless browsing.

I'm aware the question is about PHP, but if the OP needs to use Selenium, I'd say Python is far more intuitive. With that in mind, here's a Selenium example in Python:

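from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element_by_name() is gone from recent Selenium 4 releases; use By.NAME.
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)  # submit the search
assert "No results found." not in driver.page_source
driver.quit()  # end the session and close the browser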

Example source
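
For crawling, the same idea can run without a visible window and wait until the script has built the markup before reading it. Here is a minimal sketch, assuming Selenium 4 with Firefox and GeckoDriver installed. `render_page`, `LinkCollector` and `extract_links` are hypothetical names chosen for illustration, and the CSS selector to wait for has to be picked per site. The rendered HTML is then parsed with Python's standard-library `HTMLParser`, much as the PHP script would parse a static page:

```python
from html.parser import HTMLParser


def render_page(url: str, wait_selector: str, timeout: float = 10.0) -> str:
    """Load url in headless Firefox and return the HTML once wait_selector exists."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Block until the JS-generated element is present, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even on errors


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

A call would look like `extract_links(render_page("https://example.com/", "#content"))`, where both the URL and the `#content` selector are placeholders. Waiting on a specific element rather than sleeping for a fixed time keeps the crawl from reading the page before the script has finished building it.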
