How to get a JavaScript object from JavaScript code?


Question

I want a parseParameter function that extracts the data from crawled code, as in the following snippet. someCrawledJSCode is the crawled JavaScript code.

const data = parseParameter(someCrawledJSCode);
console.log(data);  // data1: {...}

Problem

I'm crawling some JavaScript code with puppeteer and I want to extract a JSON object from it, but I don't know how to parse the given JavaScript code.

Example of the crawled JavaScript code:

const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{
      prop1: 'hi',
      prop2: 'hello',
    },
    {
      prop1: 'foo',
      prop2: 'bar',
    }]
  }
});

From this code, I want to get the prices array (or data1).

I tried parsing the code as JSON, but that does not work. I then searched for parsing tools and found Esprima, but I don't think it helps with this problem.

Recommended answer

Short answer: Don't (re)build a parser in Node.js, use the browser instead

I strongly advise against evaluating or parsing crawled data in Node.js if you are already using puppeteer for crawling. When you use puppeteer, you already have a browser with a great sandbox for JavaScript code, running in another process. Why give up that isolation and "rebuild" a parser in your Node.js script? If the parsing breaks inside Node.js, your whole script fails. In the worst case, you might even expose your machine to serious risks when you run untrusted code inside your main thread.

Instead, try to do as much parsing as possible inside the context of the page. You can even do an evil eval call there. The worst that could happen? Your browser hangs or crashes.

Imagine the following HTML page (very much simplified). You are trying to read the text which is pushed into an array. The only information you have is that there is an additional attribute id which is set to target-data.

<html>
<body>
  <!--- ... -->
  <script>
    var arr = [];
    // some complex code...
    arr.push({
      id: 'not-interesting-data',
      data: 'some data you do not want to crawl',
    });
    // more complex code here...
    arr.push({
      id: 'target-data',
      data: 'THIS IS THE DATA YOU WANT TO CRAWL', // <---- You want to get this text
    });
    // more code...
    arr.push({
      id: 'some-irrelevant-data',
      data: 'again, you do not want to crawl this',
    });
  </script>
  <!--- ... -->
</body>
</html>

Bad code

Here is a simple example of what your code might look like right now:

await page.goto('http://...');
const crawledJsCode = await page.evaluate(() => document.querySelector('script').innerHTML);

In this example, the script extracts the JavaScript code from the page. Now we have the JavaScript code from the page and we "only" need to parse it, right? Well, this is the wrong approach. Don't try to rebuild a parser inside Node.js. Just use the browser. There are basically two approaches you can take to do that in your case.

  1. Inject proxy functions into the page and fake some built-in functions (recommended)
  2. Parse the data on the client-side (!) by using JSON.parse, a regex or eval (eval only if really necessary)

Option 1: Inject proxy functions into the page

In this approach you are replacing native browser functions with your own "fake functions". Example:

const originalPush = Array.prototype.push;
Array.prototype.push = function (item) {
    if (item && item.id === 'target-data') {
        const data = item.data; // This is the data we are trying to crawl
        window.exposedDataFoundFunction(data); // send this data back to Node.js
    }
    // forward all arguments and keep the native return value (the new length)
    return originalPush.apply(this, arguments);
};

This code replaces the original Array.prototype.push function with our own function. Everything works as normal, but when an item with our target id is pushed into an array, a special condition is triggered. To inject this function into the page, you could use page.evaluateOnNewDocument. To receive the data in Node.js, you have to expose a function to the browser via page.exposeFunction:

// called via window.exposedDataFoundFunction from within the fake Array.prototype.push function
await page.exposeFunction('exposedDataFoundFunction', data => {
    // handle the data in Node.js
});

Now it doesn't really matter how complex the code of the page is, whether the push happens inside some asynchronous handler, or whether the page changes the surrounding code. As long as the page pushes the target data into an array, we will get it.
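The proxy idea can be tried without a browser. The following is a minimal sketch, assuming a hypothetical helper installPushProxy and a report callback that stands in for the function exposed via page.exposeFunction. The nested addLater call mimics page code that pushes the target item from somewhere deeper in the script.

```javascript
// Hypothetical helper (not part of puppeteer): patches Array.prototype.push
// and reports the data of every pushed item whose id equals targetId.
function installPushProxy(targetId, report) {
  const originalPush = Array.prototype.push;
  Array.prototype.push = function (...items) {
    for (const item of items) {
      if (item && item.id === targetId) {
        report(item.data);
      }
    }
    // Behave exactly like the native push, including its return value.
    return originalPush.apply(this, items);
  };
  // Return a function that undoes the patch.
  return function restore() {
    Array.prototype.push = originalPush;
  };
}

// Stand-in for the function exposed to the page via page.exposeFunction.
const found = [];
const restore = installPushProxy('target-data', (data) => found.push(data));

// Simulated page script, modeled on the simplified HTML page above.
function pageScript() {
  const arr = [];
  arr.push({ id: 'not-interesting-data', data: 'some data you do not want to crawl' });
  // The target push happens inside a nested function.
  const addLater = () => arr.push({ id: 'target-data', data: 'THIS IS THE DATA YOU WANT TO CRAWL' });
  addLater();
  return arr.length;
}

const finalLength = pageScript();
restore();
console.log(found, finalLength);
```

With puppeteer, the patched function would be injected through page.evaluateOnNewDocument, and report would be the exposed window function instead of a local callback.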

You can use this approach for a lot of crawling tasks. Check how the page processes the data and replace the low-level functions involved with your own proxy versions.

Option 2: Parse the data on the client-side

Let's assume the first approach does not work for some reason. The data is in some script tag, but you are not able to get it by using fake functions.

Then you should parse the data, but not inside your Node.js environment. Do it inside the page context. You could run a regular expression or use JSON.parse, but do it before returning the data to Node.js. The benefit of this approach is that if your code crashes its environment for some reason, it is only the browser that crashes, not your main script.

As an example, instead of running the code from the original "bad code" sample, we change it to this:

const crawledJsCode = await page.evaluate(() => {
    const code = document.querySelector('script').innerHTML; // instead of returning this
    const match = code.match(/some tricky regex which extracts the data you want/); // we run our regex in the browser
    return match; // and only return the results
});

This will only return the parts of the code we need, which can then be further processed from within Node.js.
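To make the placeholder regex more concrete, here is a minimal sketch for the simplified page from above. The helper name extractTargetData and the regex itself are assumptions tailored to that page, not a general-purpose parser. The sample string stands in for the script's innerHTML.

```javascript
// Hypothetical helper: returns the data string of the item whose id is
// 'target-data', or null if the pattern is not found. The regex assumes the
// id and data properties appear in exactly this order, with single quotes.
function extractTargetData(code) {
  const match = code.match(/id:\s*'target-data',\s*data:\s*'([^']*)'/);
  return match ? match[1] : null;
}

// Stand-in for document.querySelector('script').innerHTML.
const sampleScript = `
  var arr = [];
  arr.push({
    id: 'not-interesting-data',
    data: 'some data you do not want to crawl',
  });
  arr.push({
    id: 'target-data',
    data: 'THIS IS THE DATA YOU WANT TO CRAWL', // <---- You want to get this text
  });
`;

const extracted = extractTargetData(sampleScript);
console.log(extracted);
```

Functions passed to page.evaluate are serialized and executed in the page, so in a real crawl the helper would have to be defined inside that callback instead of in the Node.js scope.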

Whichever approach you choose, both are much better and more secure than running unknown code inside your main thread. If you absolutely have to process the data in your Node.js environment, use a regular expression for it, as shown in the answer from trincot. You should never use eval to run untrusted code.
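For the code from the question, a regex-only parseParameter could look roughly like the sketch below. This is an illustration under strong assumptions, not the code from trincot's answer: it assumes the object is the only argument of arr.push, that strings use single quotes, and that no string contains quotes, colons, commas, or braces. Anything more complex will break it.

```javascript
// Sample input, copied from the question.
const someCrawledJSCode = `const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{
      prop1: 'hi',
      prop2: 'hello',
    },
    {
      prop1: 'foo',
      prop2: 'bar',
    }]
  }
});`;

// Rewrites the object literal passed to arr.push into JSON text and parses
// it, without ever executing the crawled code.
function parseParameter(code) {
  const match = code.match(/arr\.push\(([\s\S]*?)\);/);
  if (!match) return null;
  const jsonText = match[1]
    .replace(/'/g, '"') // single quotes to double quotes
    .replace(/([{,]\s*)([A-Za-z_$][\w$]*)\s*:/g, '$1"$2":') // quote property names
    .replace(/,(\s*[}\]])/g, '$1'); // drop trailing commas
  return JSON.parse(jsonText);
}

const data = parseParameter(someCrawledJSCode);
console.log(data.data1.prices);
```

Because the result goes through JSON.parse, malformed input raises an error instead of running as code.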
