Need to scrape information from a webpage with a "show more" button, any recommendations?

Question

I am currently developing a "crawler" for educational purposes.

Everything is working fine: I can extract URLs and information and save them to a JSON file. Everything is fine and dandy, except for one thing.

The page has a "load more" button that I need to interact with in order for the crawler to continue looking for more URLs.

This is where I could use your help!

Any recommendations on how to do this?

I would like to interact with the "load more" button and pass the resulting HTML back to my crawler.

I would really appreciate any help!

Website: http://virali.se/photo/gallery/

Part of the example code for finding business names:

import re

import requests
from bs4 import BeautifulSoup

def base_spider(self, max_pages, max_CIDS):
    url = "http://virali.se/photo/gallery/photog/"  # Input URL

    for pages in range(0, max_pages):
        source_code = requests.get(url)  # gets the source code from the URL
        plain_text = source_code.text  # response body as text for BeautifulSoup
        soup = BeautifulSoup(plain_text, "html.parser")  # parse plain_text with the built-in HTML parser
        for article in soup.find_all("article"):
            business_name_pattern = re.compile(r"<h1>(.*?)</?h1>")
            business_name_raw = str(re.findall(business_name_pattern, str(article)))
            # strip the brackets and quotes left over from str() of the list
            business_name_clean = re.sub(r"[\[\]'\"]", "", business_name_raw)
            self.myprint(business_name_clean)  # custom print function for weird chars

This code only looks for the business names, but of course it will run out of business names if the "show more results" button on the page is never interacted with.

Recommended answer

If you look at the site with a developer tool (I used Chrome), you can see that an XHR POST request is fired when you click the "Show more results" button.

In this case you can emulate this request to gather the data:

import requests

with requests.Session() as session:
    response = session.post("http://virali.se/photo/gallery/search", data={'start': 0})
    print(response.content)

The "magic" is in the data parameter of session.post: start is the required argument that tells the server which offset to load images from. In the example above, 0 is the first batch of images you see by default on the site.
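As a rough sketch of how the offset could be advanced to emulate repeated clicks on the button, the loop below requests start = 0, step, 2*step, and so on. The names iter_pages, fetch_with_requests and the step value are hypothetical; the number of items per request and the assumption that an empty body means "no more results" are guesses that should be checked against the XHR traffic in the developer tools.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int], str], step: int, max_pages: int) -> Iterator[str]:
    """Yield one response body per offset: 0, step, 2*step, ...

    Stops after max_pages requests, or earlier when a response is empty
    (assumption: the server returns an empty body once results run out).
    """
    for page in range(max_pages):
        html = fetch_page(page * step)
        if not html.strip():
            break
        yield html


def fetch_with_requests(start: int) -> str:
    """POST the offset to the search endpoint named in the answer."""
    import requests  # imported here so iter_pages can be used without it

    response = requests.post(
        "http://virali.se/photo/gallery/search",
        data={"start": start},
        timeout=10,
    )
    response.raise_for_status()
    return response.text
```

It could then be used as `for html in iter_pages(fetch_with_requests, step=12, max_pages=5): ...`, where 12 is only a placeholder for the real batch size. Passing the fetch function in as an argument keeps the paging logic easy to try out with a fake function before hitting the site.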

You can then parse response.content with BeautifulSoup.
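A minimal sketch of that parsing step, assuming the response contains the same article and h1 markup that the question's code looks for (this is not confirmed for the search endpoint, and the sample bytes below are made up for illustration). The helper name extract_business_names is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_business_names(html):
    """Return the text of the first <h1> inside each <article> element."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for article in soup.find_all("article"):
        heading = article.find("h1")
        if heading is not None:  # skip articles without a heading
            names.append(heading.get_text(strip=True))
    return names


# Made-up sample standing in for response.content (bytes are accepted too).
sample = (
    b"<article><h1> Studio One </h1></article>"
    b"<article><h1>Photo &amp; Co</h1></article>"
    b"<article><p>no heading</p></article>"
)
print(extract_business_names(sample))
```

Reading the heading text through the parsed tree avoids the regular expression and the bracket and quote cleanup used in the question's code.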

I hope this helps you get started. The example uses Python 3, but the same approach works in Python 2 too (if needed, without the with construct).
