How to clean up the data from this webscraping script?
Question
Here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")
tables = soup.find_all('table')
print(tables)
I had to make a POST request because it's an ASP page, and I had to send the right form data to get the correct results. I'm looking for all the tables for the College of Business in a specific semester. The problem is the output:
<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA 4721 </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
I expected BeautifulSoup to parse the text and return it neatly as a dataframe with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file. But I have no idea how to get rid of all of these CSS selectors and tags. I tried the code below, and it removed the attributes I specified, but td and tr didn't work:
for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]
Then I tried a package called bleach, but when I passed 'tables' to it, it complained that the input must be text. So apparently I can't put my table into it. This is ideally what I would like to see with my output.
So I'm truly at a loss as to how to format this properly. Any help is much appreciated.
Recommended answer
Give this a try. I suppose this is what you expected. By the way, if there is more than one table on that page and you want a different one, tweak the index, as in soup.select('table')[n]. Thanks.
import requests
from bs4 import BeautifulSoup
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")
tables = soup.select('table')[0]
list_items = [[items.text.replace("\xa0","") for items in list_item.select("td")]
              for list_item in tables.select("tr")]
for data in list_items:
    print(' '.join(data))
Partial output:
Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree Department: SCHACCOUNT
Course: ACG 2021 Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1 Completed Forms: 36
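Since the question asks for separate columns and a CSV file rather than printed lines, here is a rough sketch of one way to get there from rows shaped like list_items. It uses only the standard library. The cell boundaries in sample_rows are an assumption based on the HTML shown in the question, and the helper names header_cells_to_record and records_to_csv are made up for illustration; they are not part of BeautifulSoup or the original answer.

```python
import csv
import io

# Sample rows in the same shape as `list_items` (one inner list per <tr>,
# one string per <td>). The values come from the partial output above; the
# split into cells is ASSUMED from the HTML sample in the question.
sample_rows = [
    ["Term: 1175 - Summer 2017"],
    ["Instructor Name: Elias, Desiree", " Department: SCHACCOUNT"],
    ["Course: ACG 2021 ", "Section: RVCC-1", "Title: ACC Decisions"],
    ["Enrolled: 118", "Ref#: 51914 -1", " Completed Forms: 36"],
]


def header_cells_to_record(rows):
    """Turn 'Label: value' cells into a {label: value} dict.

    Cells without a colon (for example percentage cells) are skipped.
    """
    record = {}
    for row in rows:
        for cell in row:
            # Split on the first colon only, so values may contain colons.
            label, sep, value = cell.strip().partition(":")
            if sep:
                record[label.strip()] = value.strip()
    return record


def records_to_csv(records):
    """Serialize a list of dicts to CSV text, using the first record's keys as columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = header_cells_to_record(sample_rows)
csv_text = records_to_csv([record])
print(csv_text)
```

To write a real file, the same DictWriter can be pointed at open("evals.csv", "w", newline="") instead of the in-memory buffer (the filename is just an example). If pandas is installed, a list of such dicts can also be passed straight to pandas.DataFrame(...).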