Scraping a table from a page using BeautifulSoup: table is not found
Question
I've been trying to scrape the table from the page whose URL appears in the code below, but it seems that BeautifulSoup doesn't find any table.
I wrote:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv
url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r=requests.get(url)
data=r.text
soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table #prints nothing..
Based on other similar questions, I assume that the HTML is broken in some way, but I'm not an expert. I couldn't find an answer in these: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).
Thanks a bunch!
Recommended answer
You are parsing HTML, but you used the XML parser.
You should use soup = BeautifulSoup(data, "html.parser")
The data you need is in a script tag; there is actually no table tag in the page. So you need to search the text inside the script tags.
N.B.: "html.parser" is the name BeautifulSoup gives to Python's built-in HTML parser, so pass that string as-is. In Python 2 the underlying module was called HTMLParser, but the argument to BeautifulSoup is still "html.parser".
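A quick way to confirm this diagnosis is to count the tags BeautifulSoup actually sees and check which script tag carries the data. The snippet below is a minimal sketch that runs on a small, made-up HTML string (`sample_html` is hypothetical, not the real page), assuming the real page embeds its rows in a script tag in a similar way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: it has no <table>,
# and the rows are embedded as text inside a <script> tag.
sample_html = """
<html><body>
<div id="report"></div>
<script>var rows = [{"School Name":"Example College","Rank":"1"}];</script>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tables = soup.find_all("table")
scripts = soup.find_all("script")

print("tables found:", len(tables))
for index, script in enumerate(scripts):
    # .string is None for a script tag that has no inline text (e.g. only a src attribute).
    text = script.string or ""
    print(index, len(text), '"School Name"' in text)
```

On the real page, the same loop over `scripts` should show which script tag is long enough, and contains the expected keys, to be worth parsing.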
Here is the code.
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []
list_to_write.append(["Rank","School Name","School Type","Early Career Median Pay","Mid-Career Median Pay","% High Job Meaning","% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    # Only the large script holds the table data; skip the small ones.
    if(len(text) > 10000):
        # Walk through the text one record at a time.
        # This assumes the keys of each record appear in the order searched below.
        while(start > -1):
            start = text.find('"School Name":"',start)
            if(start == -1):
                break
            start += len('"School Name":"')
            end = text.find('"',start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"',start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"',start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"',start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"',start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"',start)
            start += len('"Rank":"')
            end = text.find('"',start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"',start)
            start += len('"% High Job Meaning":"')
            end = text.find('"',start)
            high_job = text[start:end]

            start = text.find('"School Type":"',start)
            start += len('"School Type":"')
            end = text.find('"',start)
            school_type = text[start:end]

            start = text.find('"% STEM":"',start)
            start += len('"% STEM":"')
            end = text.find('"',start)
            stem = text[start:end]

            list_to_write.append([rank,school_name,school_type,early_pay,mid_pay,high_job,stem])

# Write the header and all collected rows, then close the file.
writer.writerows(list_to_write)
file_name.close()
This will write the table you need to a CSV file. Don't forget to close the file when you are done.
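If the records in the script turn out to be flat JSON objects, the repeated string searches could probably be replaced by the standard json module. The following is only a sketch under that assumption: `extract_records`, `write_csv`, and `sample_script` are hypothetical names, and the sample text is made up rather than taken from the real page. It uses only the standard library, so it can be tried without downloading anything.

```python
import csv
import io
import json
import re

# Column order used in the CSV header.
COLUMNS = ["Rank", "School Name", "School Type", "Early Career Median Pay",
           "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]

# Assumption: each record is a flat JSON object, i.e. it contains no nested braces.
FLAT_OBJECT = re.compile(r"\{[^{}]*\}")


def extract_records(script_text):
    """Return the flat JSON objects in script_text that have a "School Name" key."""
    records = []
    for match in FLAT_OBJECT.finditer(script_text):
        try:
            obj = json.loads(match.group(0))
        except ValueError:
            continue  # brace-delimited text that is not valid JSON (e.g. JavaScript code)
        if isinstance(obj, dict) and "School Name" in obj:
            records.append(obj)
    return records


def write_csv(records, out_file):
    """Write a header row plus one row per record to an already open text file."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    writer.writerows(records)


# Hypothetical sample standing in for the text of the large <script> tag.
sample_script = (
    'var data = [{"School Name":"Example College","Early Career Median Pay":"$40,000",'
    '"Mid-Career Median Pay":"$70,000","Rank":"1","% High Job Meaning":"55%",'
    '"School Type":"Private not-for-profit","% STEM":"10%"}];'
)

records = extract_records(sample_script)
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, `write_csv` can be called inside `with open("table.csv", "w", newline="") as f:`, which closes the file automatically even if an error occurs. Unlike the positional search, this approach does not depend on the order of the keys within each record.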