How to scrape multi-level links using Puppeteer?

Question

I am scraping table rows from a web page using Puppeteer. I have code that scrapes the content and builds an object for each row in the table. Each table row also contains a link that I need to open in a new page (with Puppeteer) and scrape for a particular element. That value should then be assigned to the same object, so that the whole object is returned with the new keys. How can I do that with Puppeteer?

async function run() {
    const browser = await puppeteer.launch({
        headless: false
    })
    const page = await browser.newPage()

    await page.goto('https://tokenmarket.net/blockchain/', {waitUntil: 'networkidle0'})
    await page.waitFor(5000)
    var onlink = ''
    var result = await page.$$eval('table > tbody tr .col-actions a:first-child', (els) => Array.from(els).map(function(el) {

        //running ajax requests to load the inner page links.
     $.get(el.children[0].href, function(response) {
            onlink = $(response).find('#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2)').text()
        })



        return {
            icoImgUrl: el.children[0].children[0].children[0].currentSrc,
            icoDate: el.children[2].innerText.split('\n').shift() === 'To be announced' ? null : new Date( el.children[2].innerText.split('\n').shift() ).toISOString(),
            icoName:el.children[1].children[0].innerText,
            link:el.children[1].children[0].children[0].href,
            description:el.children[3].innerText,
            assets :onlink
        }

    }))

    console.log(result)

    UpcomingIco.insertMany(result, function(error, docs) {})


    browser.close()
}

run()

Recommended answer

If you try opening a new tab for each ICO page in parallel you might end up with 100+ pages loading at the same time.

So the best thing you could do is to first collect the URLs and then visit them one by one in a loop.

This also allows keeping the code simple and readable.

For example (see the comments in the code):

const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto('https://tokenmarket.net/blockchain/');

  // Gather the assets page URLs for all the blockchains.
  const assetUrls = await page.$$eval(
    '.table-assets > tbody > tr .col-actions a:first-child',
    assetLinks => assetLinks.map(link => link.href)
  );

  const results = [];

  // Visit each assets page one by one.
  for (const assetsUrl of assetUrls) {
    await page.goto(assetsUrl);

    // Now collect all the ICO URLs.
    const icoUrls = await page.$$eval(
      '#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2) a',
      links => links.map(link => link.href)
    );

    // Visit each ICO one by one and collect the data.
    for (const icoUrl of icoUrls) {
      await page.goto(icoUrl);

      const icoImgUrl = await page.$eval('#asset-logo-wrapper img', img => img.src);
      const icoName = await page.$eval('h1', h1 => h1.innerText.trim());
      // TODO: Gather all the needed info like description etc. here.

      // Push one plain object per ICO so that results is a flat array of records.
      results.push({
        icoName,
        icoUrl,
        icoImgUrl
      });
    }
  }

  // Results are ready.
  console.log(results);

  await browser.close();
}

run();
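
The question also asks how to put the value scraped from the linked page back onto the same row object. The same "collect first, then visit one by one" idea covers that case too. Below is a minimal sketch, not a definitive implementation: the helper name addAssets, the link property on each row, and the assets key are assumptions made for illustration, and the selector is whatever matches the element you need on the detail page.

```javascript
// Hypothetical helper (not part of the Puppeteer API): visits the link of
// each previously scraped row, reads one element from the detail page and
// returns a copy of the row with the value stored under the `assets` key.
async function addAssets(page, rows, detailSelector) {
  const enriched = [];

  // Visit the detail pages strictly one after another, reusing a single tab.
  for (const row of rows) {
    await page.goto(row.link, { waitUntil: 'domcontentloaded' });

    let assets = null;
    try {
      // page.$eval throws when nothing matches the selector.
      assets = await page.$eval(detailSelector, el => el.textContent.trim());
    } catch (err) {
      // Assumption: a failed lookup is not fatal, so the row keeps `assets: null`.
    }

    // Copy the row and add the new key; the input rows stay untouched.
    enriched.push({ ...row, assets });
  }

  return enriched;
}

// Possible usage inside an async function, assuming `rows` was produced by an
// earlier page.$$eval call that returned plain objects with a `link` property:
//
//   const result = await addAssets(page, rows, 'table > tbody > tr > td:nth-child(2)');
//   console.log(result);
```

Doing the navigation in Node, rather than firing AJAX requests inside the $$eval callback as in the question, avoids returning the objects before the asynchronous requests have finished.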
