Scraping pages that do not seem to have URLs


Problem Description

I'm trying to scrape these job listings and give them more exposure on a site that belongs to a client of mine. The issue is that I need to be able to link to each specific listing so that the job seeker can apply. This is the page I'm trying to save listing links from.

It would be ideal if I could save an address the job seeker can click to see the original listing and then apply.

  1. What is the site doing to serve these pages without displaying a URL for them?
  2. Is it possible to get a listing-specific address?
  3. If it is possible, how do I generate that address?

If I can't get a specific address, I think I could have the user click a link that triggers an internal script on my client's site, which takes the listing ID, searches the site I found that listing on, and then redirects the user to that specific listing.

The downside to this is that the user will have to wait a little while, depending on how far back the listing sits in the directory. I could show some kind of progress bar with a pleasant "Searching for your listing! Thanks for being patient" message.
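That fallback search could be sketched roughly as follows. This is a minimal illustration only: the helper name, the ID marker, and the page structure are all hypothetical stand-ins for real Mechanize/Nokogiri calls against the directory.

```ruby
# Hypothetical sketch of the fallback: walk the directory page by page
# until the listing ID turns up, then redirect the user to that page.
# The `pages` argument and the "listing-<id>" marker are stand-ins for
# real directory pages fetched with Mechanize and parsed with Nokogiri.
def find_listing_page(listing_id, pages)
  # Each element of `pages` represents the HTML of one directory page.
  pages.each_with_index do |html, index|
    return index if html.include?("listing-#{listing_id}")
  end
  nil  # listing not found on any directory page
end

# With real pages, this is where the script would redirect the job seeker,
# e.g. to the directory page (and anchor) where the listing was found.
```

Because the search is linear over the directory, the wait grows with how deep the listing sits, which is exactly the delay the progress bar would have to cover.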

If I can avoid having to do this, though, that'd be great!

I'm using Nokogiri and Mechanize.

Recommended Answer

The page you refer to appears to be generated by an Oracle product, so one would think they'd construct a web form properly (and with some regard for accessibility concerns). They haven't, so it occurs to me that either their engineer was having a bad day, or they are deliberately making the site (slightly) harder to scrape.

The reason your browser shows no href when you hover over those links is that there isn't one. Instead, the page uses JavaScript to capture the click event, populate a POST form with some hidden values, and call the submit method programmatically. This can cause problems for screen readers and other accessibility devices, as well as for the way the back button has to re-submit the page.

The good news is that constructions of this kind can usually be scraped by creating the form yourself, either using a real form on a third-party page or via a crawler library. If you POST the right values, reverse-engineered from examining the page's script, to the target URI, the resulting document should be the "linked" page you expect.
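As a concrete sketch of replaying such a form with Mechanize: the target URI and hidden-field names below are pure assumptions, and would have to be read out of the page's actual click-handler script.

```ruby
# Sketch only: TARGET_URI and the field names are hypothetical --
# recover the real ones from the JavaScript click handler that
# populates and submits the hidden POST form on the listings page.
TARGET_URI = 'https://example.com/jobs/viewListing'

# Build the POST body the page's script would have submitted for a listing.
def listing_post_params(listing_id)
  {
    'p_listing_id' => listing_id,  # assumed hidden field carrying the listing ID
    'p_action'     => 'VIEW'       # assumed hidden action field
  }
end

# With Mechanize (a live network call, so shown commented out):
#   require 'mechanize'
#   agent = Mechanize.new
#   page  = agent.post(TARGET_URI, listing_post_params('12345'))
#   # `page` should now be the "linked" listing the job seeker needs.
```

Once the right field names are known, the same `agent.post` result can be parsed with Nokogiri like any other fetched page.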

