Python Selenium, scraping webpage javascript table


Problem description


I want to scrape the JavaScript tables at the link below: http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml

import codecs
import lxml.html as lh
from lxml import etree
import requests
from selenium import webdriver
import urllib2
from bs4 import BeautifulSoup

URL = 'http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml'

# open the page in Firefox so the JavaScript-rendered tables are available
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.max-connections', 30)
profile.update_preferences()
browser = webdriver.Firefox(profile)
browser.get(URL)

# parse the rendered page source
content = browser.page_source
soup = BeautifulSoup(content)

Once I have the contents of the webpage, I need to know the number of rounds of soccer matches in that particular league.
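
For example (a minimal sketch, assuming each round tab is rendered as a <td class="lsm2"> cell, as in the snippet further down), the number of rounds could be counted like this:

round_cells = soup.findAll('td', attrs={'class': 'lsm2'})  # one cell per round tab (assumption)
print 'rounds found:', len(round_cells)                    # expected to print 38 for a full season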

The code below only finds a single table; how can I get the tables for all 38 rounds of soccer matches? Thank you.

# scrape the rounds of soccer matches
soup.findAll('td', attrs={'class': 'lsm2'})

# print the soccer match results of the default round, but there are 38 rounds (ids from s1 to s38)
print soup.find("div", {"id": "Match_Table"}).prettify()

Solution

# ============================================================
import codecs
import lxml.html as lh
from lxml import etree
import requests
from selenium import webdriver
import urllib2
from bs4 import BeautifulSoup
from pandas import DataFrame, Series
import html5lib

URL = 'http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml'
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.max-connections', 30)
profile.update_preferences()
browser = webdriver.Firefox(profile)
browser.get(URL)

content = browser.page_source
soup = BeautifulSoup(content)
# num = soup.findAll('td', attrs={'class': 'lsm2'})
# num = soup.findAll('table')[2].findAll('td')[37].text
# soup.findAll('table', attrs={'class': 'e_run_tb'})

# the third table on the page holds the round-selector cells (s1 to s38)
num1 = soup.findAll('table')[2].findAll('tr')
for i in range(1, len(num1) + 1):
    for j in range(1, len(num1[i-1]) + 1):
        # click the button for this round on the website
        clickme = browser.find_element_by_xpath('//*[@id="e_run_tb"]/tbody/tr[' + str(i) + ']/td[' + str(j) + ']')
        clickme.click()

        # re-parse the page source after the click so the newly shown round is picked up
        content = browser.page_source
        soup = BeautifulSoup(content)

        table = soup.find('div', attrs={'class': 'e_matches'})
        rows = table.findAll('tr')
        for tr in rows[5:16]:  # rows 5 to 16 hold the match results
            cols = tr.findAll('td')
            for td in cols:
                text = td.find(text=True)
                print text,
            print
        print
