How to scrape content rendered in a popup window with javascript: links using scrapy
Problem description
I'm trying to use scrapy to get content that is rendered only after a javascript: link is clicked. As the links don't appear to follow a systematic numbering scheme, I don't know how to:
1 - activate a javascript: link to expand a collapsed panel
2 - activate a (now visible) javascript: link so that the popup is rendered and its content (the abstract) can be scraped
The site https://b-com.mci-group.com/EventProgramme/EHA19.aspx contains links to abstracts that will be presented at a conference I plan to attend. The site's export to PDF is buggy, in that it duplicates a lot of data at PDF generation time. Rather than dealing with the bug, I turned to scrapy, only to realize that I'm in over my head. I've read:
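Can scrapy be used to scrape dynamic content from websites that are using AJAX?
and
How to scrape coupon code of coupon site (coupon code comes on clicking button)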
But I don't think I'm able to connect the dots. I've also seen mentions of Selenium, but I'm not sure that I must resort to that.
I have made little progress, and wonder if I can get a push in the right direction, with the following information in hand:
To make the POST request that expands the collapsed panel (item 1 above), I traced the on-page JS: javascript:ShowCollapsiblePanel(116114,1695,44,191); results in a POST request to TARGETURLOFWEBSITE/EventSessionAjaxService/GetSessionDetailsHtml with the payload:
{"eventSessionID":116114,"eventSessionWebSiteSetupViewID":191}
The values for eventSessionID and eventSessionWebSiteSetupViewID are clearly the first and fourth arguments in the javascript:ShowCollapsiblePanel text.
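The mapping described above can be sketched as a small helper that turns one javascript:ShowCollapsiblePanel href into the pieces of the POST request. This is only a sketch under stated assumptions: panel_request_kwargs is a hypothetical name, TARGETURLOFWEBSITE is the placeholder from the question rather than a real host, and the Content-Type header is a guess based on the JSON payload. The returned dict uses the keyword names url, method, headers and body, so it could be passed on as scrapy.Request(callback=..., **kwargs).

```python
import json
import re

# Placeholder copied from the question; the real host is not given there.
BASE_URL = "TARGETURLOFWEBSITE"
ENDPOINT = BASE_URL + "/EventSessionAjaxService/GetSessionDetailsHtml"

# Matches the four numeric arguments of ShowCollapsiblePanel(...).
PANEL_RE = re.compile(
    r"ShowCollapsiblePanel\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)"
)


def panel_request_kwargs(href):
    """Return request keyword arguments for one panel link, or None if
    the href is not a ShowCollapsiblePanel link (hypothetical helper)."""
    match = PANEL_RE.search(href)
    if match is None:
        return None
    first, _second, _third, fourth = (int(g) for g in match.groups())
    # Per the traced request: 1st argument -> eventSessionID,
    # 4th argument -> eventSessionWebSiteSetupViewID.
    payload = {
        "eventSessionID": first,
        "eventSessionWebSiteSetupViewID": fourth,
    }
    return {
        "url": ENDPOINT,
        "method": "POST",
        # Assumption: the service expects a JSON content type.
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


kwargs = panel_request_kwargs(
    "javascript:ShowCollapsiblePanel(116114,1695,44,191);"
)
print(kwargs["body"])
```

Keeping the parsing separate from the request object makes it easy to check the payload against what the browser's network tab shows before sending anything.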
How do I use scrapy to iterate over all of the links of the form javascript:ShowCollapsiblePanel? I tried to use SgmlLinkExtractor, but that didn't return any of the javascript:ShowCollapsiblePanel() links. I suspect that they don't meet the criteria for "links".
Update
Making progress: I've found that SgmlLinkExtractor is not the right way to go, and that the much simpler:
sel.xpath('//a[contains(@href, "javascript:ShowCollapsiblePanel")]').re(r'\((\d+),(\d+),(\d+),(\d+)')
in the scrapy console returns all of the numeric parameters for each javascript:ShowCollapsiblePanel() (of course, right now they are all in one long string, but I'm just messing around in the console).
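If the console output is a flat sequence of matched groups (which, as far as I know, is how the selector's .re() behaves when the pattern has several groups), it can be regrouped into one tuple of four per link. A minimal sketch, where group_panel_args is a hypothetical helper and the second set of numbers is made up for illustration:

```python
def group_panel_args(flat, size=4):
    """Split a flat list of numeric strings into tuples of `size` ints,
    one tuple per ShowCollapsiblePanel(...) call."""
    if len(flat) % size != 0:
        raise ValueError("expected a multiple of %d values, got %d"
                         % (size, len(flat)))
    return [
        tuple(int(value) for value in flat[i:i + size])
        for i in range(0, len(flat), size)
    ]


# First four values come from the question; the rest are hypothetical.
flat = ["116114", "1695", "44", "191", "116115", "1695", "44", "191"]
for args in group_panel_args(flat):
    print(args)
```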
The next step will be to take the first javascript:ShowCollapsiblePanel(), generate the POST request, and analyze the response to see whether it contains what I see when I click the link in the browser.
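For that comparison, a small helper can normalize the response body before searching it for text seen in the browser. The response format is not shown in the question, so everything here is an assumption: the body might be raw HTML, a JSON string, or a JSON object that wraps the HTML under a "d" key (a wrapper some ASP.NET services use). unwrap_html and contains_snippet are hypothetical names.

```python
import json


def unwrap_html(body_text):
    """Return the HTML carried by a response body.

    Assumption: the body is raw HTML, a JSON-encoded string, or a JSON
    object with the HTML under a "d" key. Anything else is returned
    unchanged.
    """
    try:
        data = json.loads(body_text)
    except ValueError:
        return body_text  # not JSON, treat as raw HTML
    if isinstance(data, dict) and isinstance(data.get("d"), str):
        return data["d"]
    if isinstance(data, str):
        return data
    return body_text


def contains_snippet(body_text, snippet):
    """Check whether text copied from the browser appears in the body."""
    return snippet in unwrap_html(body_text)


# Made-up bodies, for illustration only.
print(contains_snippet('{"d": "<div>Abstract text</div>"}', "Abstract text"))
print(contains_snippet("<div>Abstract text</div>", "Abstract text"))
```

In a scrapy callback, response.text would be the natural value to pass as body_text.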
Recommended answer
I fought with a similar problem, and after much hair-pulling I got the data set I needed with import.io. It has a visual scraper that can run with JavaScript enabled, which did just what I needed, and it's free. Last night I also saw a scrapy-based project on GitHub that looked just like the import.io scraper. It's called Portia, but I don't know if it'll do what you want: https://codeload.github.com/scrapinghub/portia/zip/master