Getting a JavaScript variable value while scraping with Python

Question

I know this has been asked before, but I am new to scraping and Python. Any help would be very useful on my learning path.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page that I am scraping (only the script part):

<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>

From the above, I want to get the value of min_news_id in Python. I also need the value of the same variable after it is updated at line 2.

Here is my approach:

    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)

But I never get the value of the variable. Is the problem with my regular expression or something else? Please help. Thank you in advance.

Recommended answer

You can't monitor JavaScript variable changes with BeautifulSoup. Here is how to get the next page of news using a while loop, re, and json:

from bs4 import BeautifulSoup
import requests, re

page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'

htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...

# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1

customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....

    # new min_news_id; .get() returns None and ends the loop if the key is missing
    min_news_id = ajax_response.get("min_news_id")

    # remove this break to loop over all pages (possibly thousands)
    break
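
As for why the approach in the question never prints "matched": the most likely cause is that re.match only tries to match at the very start of the string, while the script text begins with a newline and indentation, so the pattern never lines up. re.search scans the whole string instead. Note also that str.replace returns a new string rather than modifying scriptString in place. The following is a minimal sketch using only the standard library; SCRIPT_TEXT is a shortened, assumed stand-in for the script body shown in the question, and extract_min_news_id is a hypothetical helper name, not part of any library.

```python
import json
import re

# Shortened stand-in for the inline script from the question (an assumption,
# not the live page). Note that it starts with a newline and spaces.
SCRIPT_TEXT = '''
    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      min_news_id = data.min_news_id||min_news_id; // line 2
    }
'''

# Same pattern as in the question.
PATTERN = re.compile(r'var min_news_id = (.+?);')


def extract_min_news_id(script_text):
    """Return the initial min_news_id from the script text, or None if absent."""
    found = PATTERN.search(script_text)  # search scans the whole string
    if found is None:
        return None
    # The captured group is a double-quoted JS string literal,
    # which json.loads can parse into a Python str.
    return json.loads(found.group(1))


# re.match only tries position 0, where the text starts with whitespace.
print(PATTERN.match(SCRIPT_TEXT))        # None
print(extract_min_news_id(SCRIPT_TEXT))  # d7zlgjdu-1
```

This only recovers the initial value written into the page source. The value assigned at line 2 exists only after the browser runs the Ajax call, which is why the answer above reads it from the Ajax response instead.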
