How to parse DOM (REACT)

Views: 141 · Published: 2017/6/24 23:28:47 · Tags: javascript html reactjs dom web-scraping

Question

I am trying to scrape data from a website. The website uses Facebook's React. As a result, the source code that I can parse using Jaunt is completely different from the code I see when inspecting the elements using Chrome's inspector.

I know very little about all of this, but having done some research I think this is something to do with DOM rather than the source code. I need a way to be able to get my hands on this DOM code as the original source contains nothing I want, but I don't have the foggiest idea where to begin (even having read many answers on here).

Here is an example of one of the pages I want to scrape. For example, to scrape the description I'd want to grab what is between the tags:

<span class="light-font extended-card-description list-group-item">Example description....</span>

But as you can see this element only appears when you "Inspect Element", and not when I just view the page's source.

My question to you geniuses on here is, how can I grab this DOM Code and start scraping the elements I actually want to?

Forgive me if my terminology is completely off but as I say this is a completely new area for me, and I've done the research that I can.

Thank you very much in advance!

Answer

ReactJS, like many other Javascript libraries / frameworks, uses client-side code (Javascript) to render the final HTML. This means that when you, Jaunt, or your browser fetch the HTML source code from the server, it doesn't yet contain the final code the user will see. The browser needs to run the Javascript program(s) contained in the page, in order to generate the final content you wish to scrape.
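To make the difference between the served source and the rendered DOM concrete, here is a minimal sketch in plain JavaScript. It does not use React at all: the markup, `renderGreeting`, and `runClientSideRender` are hypothetical stand-ins, assumed only for illustration, for what the server sends and what the page's script adds afterwards.

```javascript
// What the server sends: an empty mount point, no visible text yet.
// (Hypothetical markup, loosely modeled on the example discussed here.)
const servedHtml =
    '<div id="helloExample"><div class="playgroundPreview"></div></div>';

// Stand-in for the page's client-side script: builds the markup that a
// library such as React would insert at runtime.
function renderGreeting(name) {
    return '<div>Hello ' + name + '</div>';
}

// Stand-in for the browser running that script against the served HTML.
function runClientSideRender(html, name) {
    return html.replace(
        '<div class="playgroundPreview"></div>',
        '<div class="playgroundPreview">' + renderGreeting(name) + '</div>'
    );
}

const renderedHtml = runClientSideRender(servedHtml, 'John');

console.log(servedHtml.includes('Hello John'));   // false
console.log(renderedHtml.includes('Hello John')); // true
```

A scraper that only downloads the page sees the equivalent of `servedHtml`; a tool that also executes the page's scripts sees the equivalent of `renderedHtml`.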

My favorite tool for this kind of job is CasperJS.

It (or rather the PhantomJS tool that CasperJS uses) is a headless browser, meaning it's a version of WebKit (the engine behind browsers like Chrome or Safari) that has been stripped of all the GUI (windows, buttons, menus). What's left is a tool that you can run from a terminal or from your Java program. It won't show any window on the screen, but it will fetch the webpages you ask it to, run any Javascript they contain, and then respond to your commands, such as "click on this link", "give me that text", "capture a screenshot", and so on.

Let's start with a simple ReactJS example:

We want to scrape the "Hello John" text, but if you look at the plain HTML source (Ctrl+U or Alt+Ctrl+U) you won't see it. On the other hand, if you open the console in your browser and use the following selector, you will get the text:

> document.querySelector('#helloExample .playgroundPreview').textContent
"Hello John"

Here is a simple CasperJS script to do the same thing:

var casper = require("casper").create();

casper.start("http://facebook.github.io/react/index.html", function() {
    this.echo(this.fetchText("#helloExample .playgroundPreview"));
});

casper.run();

You can save it as hello.js and execute it with casperjs hello.js from a terminal, or launch the same command from Java with Runtime.getRuntime().exec(...).

Here is a better script, that avoids loading images and third-party resources (such as Facebook button, Twitter button, Google Analytics, and such) cutting the loading time by half. It also adds a waitForSelector step, so that we don't risk trying to fetch the text before ReactJS has had a chance to create it.

var casper = require("casper").create({
    pageSettings: {
        loadImages: false
    }
});

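// Abort every request whose URL does not start with the site's own prefix
// (third-party buttons, analytics scripts, and so on).
casper.on('resource.requested', function(requestData, request) {
    if (requestData.url.indexOf("http://facebook.github.io/") !== 0) {
        request.abort();
    }
});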

casper.start("http://facebook.github.io/react/index.html", function() {
    this.waitForSelector("#helloExample .playgroundPreview", function() {
        this.echo(this.fetchText("#helloExample .playgroundPreview"));
    });
});

casper.run();
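The prefix check in the resource.requested handler is easy to get subtly wrong, so it can help to pull it out into a small function and try it on a few URLs before running a full scrape. The sketch below is plain JavaScript with no CasperJS dependency; `isThirdParty` and the sample Google Analytics URL are hypothetical names chosen for illustration, not part of CasperJS.

```javascript
// Returns true when a request should be aborted because its URL does not
// start with the prefix of the site being scraped.
function isThirdParty(url, allowedPrefix) {
    return url.indexOf(allowedPrefix) !== 0;
}

var allowed = "http://facebook.github.io/";

// A page on the scraped site itself is kept.
console.log(isThirdParty("http://facebook.github.io/react/index.html", allowed)); // false

// An analytics script hosted elsewhere would be aborted.
console.log(isThirdParty("http://www.google-analytics.com/analytics.js", allowed)); // true
```

Note that the comparison is against position 0, so a URL that merely contains the prefix somewhere later (for example inside a query string) still counts as third-party.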

How to install CasperJS

I have had some trouble scraping ReactJS and other modern Javascript pages with the older versions of PhantomJS and CasperJS, so I recommend installing PhantomJS 2.0 and the latest CasperJS from GitHub.

For PhantomJS you can just download the official 2.0 package.

For CasperJS, since it's a Python script, you should be able to check out the latest commit from GitHub and link bin/casperjs onto your PATH. Here's a script for Linux or Mac OS X:

> git clone git://github.com/n1k0/casperjs.git
> cd casperjs
> ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs

You may also want to comment out the line printing Warning PhantomJS v2.0 ... from your bin/bootstrap.js file.
