Python Selenium, scraping webpage JavaScript table
Question
I want to scrape the JavaScript tables at the link below.
http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml
import codecs
import lxml.html as lh
from lxml import etree
import requests
from selenium import webdriver
import urllib2
from bs4 import BeautifulSoup
URL = 'http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml'
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.max-connections', 30)
profile.update_preferences()
browser = webdriver.Firefox(profile)
browser.get(URL)
content = browser.page_source
soup = BeautifulSoup(''.join(content))
When I get the contents of the webpage, I need to know the number of rounds of soccer matches in that particular league.
The code below only finds a single table. How can I get the tables for all 38 rounds? Thank you.
# scrape the rounds of soccer matches
soup.findAll('td', attrs={'class': 'lsm2'})
# prints the match results of the default round, but there are 38 rounds (ids s1 to s38)
print soup.find("div", {"id": "Match_Table"}).prettify()
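The round count asked about here can be read straight from the already-fetched page source: every round button is a `td` with class `lsm2`. A minimal sketch, using a static HTML fragment (the markup below is hypothetical, modeled on the ids and classes quoted in this question) so it runs without a browser:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for browser.page_source; the
# names "e_run_tb", "lsm2", and "e_matches" come from the question.
html = """
<table id="e_run_tb">
  <tr>
    <td class="lsm2" id="s1">1</td>
    <td class="lsm2" id="s2">2</td>
    <td class="lsm2" id="s3">3</td>
  </tr>
</table>
<div class="e_matches"><table><tr><td>placeholder</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each round button carries class "lsm2", so counting those cells
# gives the number of rounds in the league.
round_cells = soup.find_all("td", attrs={"class": "lsm2"})
round_ids = [td.get("id") for td in round_cells]
print(len(round_cells))  # number of rounds found in this fragment
print(round_ids)
```

On the real page the same two lines run against `BeautifulSoup(browser.page_source, "html.parser")` and the count should come out at 38.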
Solution
# ============================================================
import codecs
import lxml.html as lh
from lxml import etree
import requests
from selenium import webdriver
import urllib2
from bs4 import BeautifulSoup
from pandas import DataFrame, Series
import html5lib
URL = 'http://data2.7m.cn/history_Matches_Data/2009-2010/92/en/index.shtml'
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.max-connections', 30)
profile.update_preferences()
browser = webdriver.Firefox(profile)
browser.get(URL)
content = browser.page_source
soup = BeautifulSoup(''.join(content))
# num = soup.findAll('td', attrs={'class': 'lsm2'})
# num = soup.findAll('table')[2].findAll('td')[37].text
# soup.findAll('table',attrs={'class':'e_run_tb'})
num1 = soup.findAll('table')[2].findAll('tr')
for i in range(1, len(num1) + 1):
    for j in range(1, len(num1[i-1]) + 1):
        # click the round button on the website
        clickme = browser.find_element_by_xpath('//*[@id="e_run_tb"]/tbody/tr[' + str(i) + ']/td[' + str(j) + ']')
        clickme.click()
        content = browser.page_source
        soup = BeautifulSoup(''.join(content))
        table = soup.find('div', attrs={'class': 'e_matches'})
        rows = table.findAll('tr')
        # for tr in rows:
        #     cols = tr.findAll('td')
        #     for td in cols:
        #         text = td.find(text=True)
        #         print text,
        #     print
        for tr in rows[5:16]:  # rows 5 to 16
            cols = tr.findAll('td')
            for td in cols:
                text = td.find(text=True)
                print text,
            print
        print
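The solution imports `DataFrame` and `html5lib` but never uses them; the cell-by-cell printing can be replaced by `pandas.read_html`, which parses every `<table>` in the page source into a `DataFrame`. A sketch under the assumption that the rendered match table is ordinary HTML (the fragment below is hypothetical; against the real site you would pass `browser.page_source` after each click):

```python
from io import StringIO
import pandas as pd

# Hypothetical rendered fragment standing in for browser.page_source;
# the real page would be fetched with Selenium as in the solution above.
page_source = """
<div class="e_matches">
  <table>
    <tr><th>Date</th><th>Home</th><th>Score</th><th>Away</th></tr>
    <tr><td>2009-08-15</td><td>TeamA</td><td>2-1</td><td>TeamB</td></tr>
    <tr><td>2009-08-16</td><td>TeamC</td><td>0-0</td><td>TeamD</td></tr>
  </table>
</div>
"""

# read_html parses every <table> it finds and returns a list of DataFrames.
# Wrapping the string in StringIO avoids the deprecation of passing
# literal HTML directly (pandas >= 2.1).
tables = pd.read_html(StringIO(page_source))
matches = tables[0]
print(matches.shape)  # two match rows, four columns
```

Collecting one such `DataFrame` per clicked round and concatenating them at the end gives all 38 rounds in a single table.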