Continue on Null Value of Result (Nodejs, Puppeteer)


Question

I'm just starting to play around with Puppeteer (Headless Chrome) and Nodejs. I'm scraping some test sites, and things work great when all the values are present, but if a value is missing I get an error like:

Cannot read property 'src' of null (so in the code below, the first two passes might have all values, but on the third pass there is no picture, so it just errors out).

Before, I was using if(!picture) continue; but I think it's not working now because of the for loop.

Any help would be greatly appreciated, thanks!

for (let i = 1; i <= 3; i++) {
//...Getting to correct page and scraping it three times
  const result = await page.evaluate(() => {
      let title = document.querySelector('h1').innerText;
      let article = document.querySelector('.c-entry-content').innerText;
      let picture = document.querySelector('.c-picture img').src;

      if (!document.querySelector('.c-picture img').src) {
        let picture = 'No Link';     }  //throws error

      let source = "The Verge";
      let categories = "Tech";

      if (!picture)
                continue;  //throws error

      return {
        title,
        article,
        picture,
        source,
        categories
      }
    });
}

Recommended answer

let picture = document.querySelector('.c-picture img').src;

if (!document.querySelector('.c-picture img').src) {
    let picture = 'No Link';     }  //throws error

If there is no picture, then document.querySelector() returns null, which does not have a src property. You need to check that your query found an element before trying to read the src property.

Moving the null-check to the top of the function has the added benefit of saving unnecessary calculations when you are just going to bail out anyway.

async function scrape3() {
  // ... 
  for (let i = 1; i <= 3; i++) {
  //...Getting to correct page and scraping it three times
    const result = await page.evaluate(() => {
        const pictureElement = document.querySelector('.c-picture img');

        if (!pictureElement) return null;

        const picture = pictureElement.src;
        const title = document.querySelector('h1').innerText;
        const article = document.querySelector('.c-entry-content').innerText;

        const source = "The Verge";
        const categories = "Tech";

        return {
          title,
          article,
          picture,
          source,
          categories
        }
    });

    if (!result) continue;

    // ... do stuff with result
  }
}

Answering comment question: "Is there a way just to skip anything blank, and return the rest?"

Yes. You just need to check the existence of each element that could be missing before trying to read a property off of it. In this case we can omit the early return since you're always interested in all the results.

async function scrape3() {
  // ...
  for (let i = 1; i <= 3; i++) {
    const result = await page.evaluate(() => {
        const img = document.querySelector('.c-picture img');
        const h1 = document.querySelector('h1');
        const content = document.querySelector('.c-entry-content');

        const picture = img ? img.src : '';
        const title = h1 ? h1.innerText : '';
        const article = content ? content.innerText : '';
        const source = "The Verge";
        const categories = "Tech";

        return {
          title,
          article,
          picture,
          source,
          categories
        }
    });
    // ... 
  }
}
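
The same per-element checks can be written more compactly with optional chaining (?.) and nullish coalescing (??), assuming a runtime that supports ES2020. The following is only a sketch: extractFields and fakeDoc are hypothetical names introduced here so the snippet can run outside a browser, and are not part of the original answer.

```javascript
// Hypothetical helper: accepts any object exposing querySelector,
// such as the browser's `document`.
function extractFields(doc) {
  return {
    // `?.` yields undefined when querySelector returns null,
    // and `??` then substitutes the empty string.
    title: doc.querySelector('h1')?.innerText ?? '',
    article: doc.querySelector('.c-entry-content')?.innerText ?? '',
    picture: doc.querySelector('.c-picture img')?.src ?? '',
    source: 'The Verge',
    categories: 'Tech',
  };
}

// Stand-in for a page that has a title and an article but no picture.
const fakeDoc = {
  querySelector(selector) {
    const elements = {
      'h1': { innerText: 'Some title' },
      '.c-entry-content': { innerText: 'Some text' },
    };
    return elements[selector] ?? null;
  },
};

const fields = extractFields(fakeDoc);
console.log(fields);
```

Inside page.evaluate() the same expressions would be written directly in the callback against document, because that callback runs in the browser page and cannot see functions defined in the Node script.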


Further thoughts

Since I'm still on this question, let me take this one step further and refactor it a bit with some higher-level techniques you might be interested in. Not sure if this is exactly what you are after, but it should give you some ideas about writing more maintainable code.

async function scrape3() {
  // ...

  // The _ parameter name acknowledges that we expect an
  // argument to be passed in but plan to ignore it.
  //
  const evaluate = _ => page.evaluate(() => {

    // page.evaluate() runs this callback inside the browser
    // page, where variables from the Node script are not in
    // scope, so the helpers are defined in here.

    // Generic reusable helper to return an object property
    // if the object exists and has the property, else a
    // default value. The `in` operator is used because DOM
    // properties such as src are defined on the prototype.
    //
    // This is a curried function accepting one argument at a
    // time and capturing each parameter in a closure.
    //
    const maybeGetProp = defaultValue => key => object =>
      (object && key in object) ? object[key] : defaultValue

    // Pass in empty string as the default value
    //
    const getPropOrEmptyString = maybeGetProp('')

    // Apply the second parameter, the property name, making 2
    // slightly different functions which have a default value
    // and a property name pre-loaded. Both functions only need
    // an object passed in to return either the property if it
    // exists or an empty string.
    //
    const maybeText = getPropOrEmptyString('innerText')
    const maybeSrc = getPropOrEmptyString('src')

    // Attempt to retrieve the desired elements
    //
    const img = document.querySelector('.c-picture img')
    const h1 = document.querySelector('h1')
    const content = document.querySelector('.c-entry-content')

    // Return the results, with empty string in
    // place of any missing properties.
    //
    return {
      title: maybeText(h1),
      article: maybeText(content),
      picture: maybeSrc(img),
      source: 'The Verge',
      categories: 'Tech'
    }
  })

  // Start with an empty array of length 3
  //
  const evaluations = Array(3).fill()

    // Then map over that array ignoring the undefined
    // input and return a promise for a page evaluation
    //
    .map(evaluate)

  // All 3 scrapes are occurring concurrently. We'll
  // wait for all of them to finish.
  //
  const results = await Promise.all(evaluations)

  // Now we have an array of results, so we can
  // continue using array methods to iterate over them
  // or otherwise manipulate or transform them
  //
  results
    .filter(result => result.title && result.picture)
    .forEach(result => {
      //
      // Do something with each result
      //
    })
}
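
The curried helper can be tried out in plain Node without a browser. This is a minimal sketch under stated assumptions: FakeImage is a hypothetical stand-in for a DOM element, and defaultValue is just an illustrative parameter name. It also shows why an `in` check may be preferable to hasOwnProperty for this purpose: DOM properties such as src are accessors on the element's prototype, so they are not own properties of the element itself.

```javascript
// Curried helper: default value first, then property name, then object.
const maybeGetProp = defaultValue => key => object =>
  (object != null && key in object) ? object[key] : defaultValue;

const maybeSrc = maybeGetProp('')('src');

// Hypothetical stand-in for an <img> element: like a real DOM
// element, it exposes src through a getter on its prototype.
class FakeImage {
  get src() { return 'https://example.com/picture.png'; }
}
const img = new FakeImage();

// The getter lives on FakeImage.prototype, so it is not an own property.
const isOwn = Object.prototype.hasOwnProperty.call(img, 'src');
const isReachable = 'src' in img;

console.log(isOwn);          // false
console.log(isReachable);    // true
console.log(maybeSrc(img));  // https://example.com/picture.png
console.log(maybeSrc(null) === ''); // true
```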
