How to clean up the data from this webscraping script?

Views: 98 | Posted: 2020/9/20 8:09:30 | Tags: python css python-3.x web-scraping beautifulsoup


Problem description

Here is my code:

import requests
from bs4 import BeautifulSoup
import lxml

r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")

tables = soup.find_all('table')
print(tables)

Because it's an ASP page, I had to send a POST request to get the correct data. I'm looking for all the tables for the College of Business from a specific semester. The problem is the output:

<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>

I expected BeautifulSoup to parse the text and return it nice and neat, ready for a dataframe with each column separated. I would like to put it into a dataframe afterwards, or perhaps save it to a CSV file. But I have no idea how to get rid of all of these CSS selectors and tags. I tried the code below, and it removed the attributes specified, but td and tr didn't work:

for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]

Then I tried a package called bleach, but when I passed 'tables' to it, it complained that the input must be text, so apparently I can't put my table into it. This is ideally what I would like to see in my output.

So I'm truly at a loss as to how to format this properly. Any help is much appreciated.

Recommended answer

Give this a try. I suppose this is what you expected. By the way, if there is more than one table on that page and you want a different one, tweak the index, as in soup.select('table')[n]. Thanks.

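import requests
from bs4 import BeautifulSoup

# The ASP page only returns results in response to a POST with the form fields.
res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")

# Take the first table; change the index to pick another one.
table = soup.select('table')[0]
# One inner list per <tr>, holding the text of each <td> with non-breaking spaces removed.
list_items = [[cell.text.replace("\xa0", "") for cell in row.select("td")]
              for row in table.select("tr")]

for data in list_items:
    print(' '.join(data))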

Partial result:

Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree   Department: SCHACCOUNT
Course: ACG   2021   Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1  Completed Forms: 36
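Since td and tr are tag names rather than attributes, deleting them with del tag[...] has no effect; the answer drops the markup by reading each cell's text instead. The question also asks for a dataframe or CSV file, which the answer stops short of. The following is only a minimal sketch of that last step, assuming every cell has the "Label: value" shape seen in the partial result. The helper names cells_to_record and records_to_csv and the sample_rows list are hypothetical, introduced here for illustration; sample_rows stands in for the list_items built by the answer so the example runs without a network call.

```python
import csv
import io
import re


def cells_to_record(rows):
    """Turn rows of 'Label: value' cells into one flat dict (hypothetical helper)."""
    record = {}
    for row in rows:
        for cell in row:
            label, sep, value = cell.partition(":")
            if not sep:
                continue  # skip cells that do not look like 'Label: value'
            # Collapse runs of whitespace (including \xa0) left over from the HTML.
            record[label.strip()] = re.sub(r"\s+", " ", value).strip()
    return record


def records_to_csv(records):
    """Serialize a list of dicts to CSV text; the first record's keys form the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Assumed sample data, shaped like list_items for one header table in the answer.
sample_rows = [
    ["Term: 1175 - Summer 2017"],
    ["Instructor Name: Elias, Desiree", " Department: SCHACCOUNT"],
    ["Course: ACG   2021  ", "Section: RVCC-1", "Title: ACC Decisions"],
    ["Enrolled: 118", "Ref#: 51914 -1", " Completed Forms: 36"],
]

record = cells_to_record(sample_rows)
csv_text = records_to_csv([record])
print(csv_text)
```

To cover the whole page, one record could be built per header table and the list passed to records_to_csv, or to pandas.DataFrame if a dataframe is preferred. Tables whose rows hold percentages rather than "Label: value" pairs would need their own handling.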
