Pagination when there is no "next page" button but a bunch of "page number" pages
I was happy doing my scraping with R but ran into its limits. Trying to scrape the case summaries of Argentina's Supreme Court, I hit a problem I cannot find an answer to. It is likely the outcome of learning by doing, so please do point out where my code works but follows rather bad practice. Anyway, I managed to:
- Access the search page.
- Enter a relevant taxonomy term (e.g. 'DECRETO DE NECESIDAD Y URGENCIA') in `#voces`, click search and scrape `.datosSumarios`, where the information I need lies (case name, date, reporter, and so on). The code is below:
```javascript
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
    // wait until element ready
    await Promise.all([
        page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
        page.waitForSelector('.ui-menu-item')
    ]);
    await page.click('.ui-menu-item');
    await Promise.all([
        page.click('.glyphicon-search'),
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
    ]);
    // Here we are in the place we want to be, and then capture what we need:
    const result = await page.evaluate(() => {
        let data = []; // Create an empty array that will store our data
        let elements = document.querySelectorAll('.row'); // Select all result rows
        for (var element of elements) { // Loop through each row
            // Scope the query to the row itself; document.querySelector here
            // would return the same first summary on every iteration
            let summary = element.querySelector('.datosSumario');
            if (summary) data.push({title: summary.innerText}); // Push an object with the data onto our array
        }
        return data; // Return our data array
    });
    // review ->
    await page.click('#paginate_button2');
    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value); // Success!
});
```
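One subtlety worth flagging in the `evaluate` loop: per-row queries must be scoped to each row element, since a `document.querySelector` call inside the loop returns the same first match on every iteration. A sketch with mock objects standing in for DOM nodes (no browser required; the `Fallo` strings are invented sample data):

```javascript
// Mock "rows", each with its own querySelector returning its own summary node.
const rows = [
    { querySelector: () => ({ innerText: 'Fallo 1' }) },
    { querySelector: () => ({ innerText: 'Fallo 2' }) },
];
// Mock document: always returns the first summary on the page.
const document = { querySelector: () => ({ innerText: 'Fallo 1' }) };

// Querying from document inside the loop repeats the first match...
const wrong = rows.map(() => document.querySelector('.datosSumario').innerText);
// ...while querying from each row element yields that row's own text.
const right = rows.map(row => row.querySelector('.datosSumario').innerText);

console.log(JSON.stringify(wrong)); // ["Fallo 1","Fallo 1"]
console.log(JSON.stringify(right)); // ["Fallo 1","Fallo 2"]
```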
What I can't seem to do is go through the different pages. If you visit the page you'll see that the pagination is rather strange: there is no "next page" button, just a bunch of "page number" buttons, which I can press but cannot iterate the scraping section of the code above over. I've tried a loop (but did not manage to make it work). I've reviewed a few pagination tutorials but could not find one that addresses this particular kind of problem.
# Update
I was able to solve the pagination, but currently I can't seem to make the function that actually scrapes the text work within the pagination loop (it works outside it, on a single page). Sharing in case someone can point out the obvious mistake I am probably making.
```javascript
const puppeteer = require('puppeteer');
const fs = require('fs');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
    // wait until element ready
    await Promise.all([
        page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
        page.waitForSelector('.ui-menu-item')
    ]);
    await page.click('.ui-menu-item');
    await Promise.all([
        page.click('.glyphicon-search'),
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
    ]);
    var results = []; // variable to hold the "sumarios" I need
    var lastPageNumber = 2; // 2 is for testing; any number works (in this case I need to scrape 31 pages)
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 5 seconds for the page to load
        await page.waitFor(5000);
        // call and await the extraction, concatenating results every iteration.
        // You could use results.push, but you would get a collection of collections at the end
        results = results.concat(await MyFunction); // I call my function but it does not work, see below
        if (index != lastPageNumber - 1) {
            await page.click('li.paginate_button.active + li a[onclick]'); // This does the trick
            await page.waitFor(5000);
        }
    }
    browser.close();
    return results;
};

async function MyFunction() {
    // This bit works outside of the async function environment:
    // run on a single page, I get the text I need
    const data = await page.evaluate(() =>
        Array.from(
            document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
            element => element.textContent)
    );
}

scrape().then((results) => {
    console.log(results); // Success!
});
```
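As an aside, the concat-versus-push remark in the comments above can be verified in plain Node, independent of puppeteer:

```javascript
// concat keeps a flat array across iterations; push nests each
// page's array inside results as a "collection of collections".
let flat = [];
let nested = [];
const pages = [['a', 'b'], ['c']]; // pretend each inner array is one scraped page

for (const pageData of pages) {
    flat = flat.concat(pageData);
    nested.push(pageData);
}

console.log(JSON.stringify(flat));   // ["a","b","c"]
console.log(JSON.stringify(nested)); // [["a","b"],["c"]]
```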
You can try `document.querySelector('li.paginate_button.active + li a[onclick]')` as a "next page" button equivalent. After clicking it, you can wait for a response whose URL starts with 'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex='.
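That URL check can be factored into a small predicate that is testable without a browser; the helper name below is made up for illustration:

```javascript
// Hypothetical helper: recognizes the pagination response by its URL prefix.
const PAGINATION_URL_PREFIX =
    'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex=';

function isPaginationResponse(url) {
    return url.startsWith(PAGINATION_URL_PREFIX);
}

console.log(isPaginationResponse(PAGINATION_URL_PREFIX + '30'));      // true
console.log(isPaginationResponse('https://sjconsulta.csjn.gov.ar/')); // false
```

In puppeteer it would be passed to `page.waitForResponse`, roughly as `await page.waitForResponse(response => isPaginationResponse(response.url()))`, fired alongside the click.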
# For update
At first glance, there are two issues:

- `MyFunction` is not called: you need `await MyFunction()` instead of `await MyFunction`.
- You need to pass `page` into the `MyFunction()` scope:
```javascript
results = results.concat(await MyFunction(page));
// ...
async function MyFunction(page) {
    const data = await page.evaluate(() =>
        Array.from(
            document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
            element => element.textContent)
    );
    return data; // remember to return the data as well, or the call resolves to undefined
}
```
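The first point is easy to confirm in plain Node, with no puppeteer involved: awaiting a function reference resolves to the function object itself and never runs its body.

```javascript
// `await fn` resolves to the function itself (a non-thenable value);
// `await fn()` actually runs it and resolves to its return value.
async function getData() {
    return ['sumario 1', 'sumario 2']; // stand-in for the scraped texts
}

const check = (async () => {
    const notCalled = await getData;  // the function object, never executed
    const called = await getData();   // the resolved array
    return { notCalledType: typeof notCalled, called };
})();

check.then(({ notCalledType, called }) => {
    console.log(notCalledType);  // "function"
    console.log(called.length);  // 2
});
```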