How to clean up the data from this webscraping script?

Problem description

This is my code:

import requests
from bs4 import BeautifulSoup
import lxml

r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")

tables = soup.find_all('table')
print(tables)

Because it's an ASP page, I had to send a POST request to get the correct data. I'm looking for all the tables for the College of Business in a specific semester. The problem is the output:

<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>

I expected BeautifulSoup to parse the text and return it neatly in a dataframe, with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file. But I have no idea how to get rid of all of these CSS selectors and tags. I tried the following code, and it removed the attributes specified, but td and tr didn't work:

for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]

Then I tried a package called bleach, but when I passed 'tables' to it, it complained that the input must be text, so apparently I can't put my table into it. This is ideally what I would like to see with my output.

So I'm truly at a loss as to how to format this properly. Any help is much appreciated.

Recommended answer

Give this a try. I suppose this is what you expected. By the way, if there is more than one table on that page and you want a different one, tweak the index, as in soup.select('table')[n]. Thanks.

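import requests
from bs4 import BeautifulSoup

# The results page is an ASP form, so the term and college are sent as POST data.
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")

# Take the first table on the page; change the index to pick another one.
table = soup.select('table')[0]
# One inner list per <tr>, holding the text of each <td> with non-breaking spaces removed.
list_items = [[cell.text.replace("\xa0", "") for cell in row.select("td")]
              for row in table.select("tr")]

for data in list_items:
    print(' '.join(data))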

Partial results:

Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree   Department: SCHACCOUNT
Course: ACG   2021   Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1  Completed Forms: 36
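
A note on the loop in the question: `del tag[attribute]` only removes attributes such as `class` or `style`. `td` and `tr` are tag names, not attributes, so that loop leaves them in place. Reading the text of each cell, as above, avoids the problem. If the goal is a CSV file, the same idea can be sketched as below. This is only a rough illustration: the sample HTML is copied from the output shown in the question rather than fetched live, `table_to_rows` and `rows_to_csv` are hypothetical helper names, and the built-in "html.parser" backend is used in place of lxml so that the snippet needs only bs4.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup copied from the output in the question (not fetched live).
SAMPLE_HTML = """
<table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr>
<tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>
</table>
"""


def table_to_rows(table):
    """Return one list of cleaned cell strings per <tr> in the table (hypothetical helper)."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text() drops the markup; split() + join() collapses whitespace runs,
        # including non-breaking spaces.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; the csv module quotes cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))
print(rows_to_csv(rows))
```

To write a file instead of printing, the same `csv.writer` call can be pointed at `open("output.csv", "w", newline="")`, where the file name is just a placeholder. The rows have different lengths because the page uses `colspan`, so some further reshaping would probably be needed before they fit a tidy dataframe.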
