How to preserve links when scraping a table with Beautiful Soup and pandas
Problem description
I'm scraping a web page with Beautiful Soup and pandas to get a table. One of the columns has some URLs. When I pass the HTML to pandas, the href attributes are lost.
Is there any way of preserving the URL link just for that column?
Example data (edited to better suit the real case):
<html>
<body>
<table>
<tr>
<td>customer</td>
<td>country</td>
<td>area</td>
<td>website link</td>
</tr>
<tr>
<td>IBM</td>
<td>USA</td>
<td>EMEA</td>
<td><a href="http://www.ibm.com">IBM site</a></td>
</tr>
<tr>
<td>CISCO</td>
<td>USA</td>
<td>EMEA</td>
<td><a href="http://www.cisco.com">cisco site</a></td>
</tr>
<tr>
<td>unknown company</td>
<td>USA</td>
<td>EMEA</td>
<td></td>
</tr>
</table>
</body>
</html>
My Python code:
from bs4 import BeautifulSoup
import pandas as pd

file = open(url, "r")
soup = BeautifulSoup(file, 'lxml')
parsed_table = soup.find_all('table')[1]
df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
df
Output (exported to CSV):
customer;country;area;website
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;
The df output is OK, but the links are lost. I need to preserve the links, or at least the URLs. Any hints?
Recommended answer
Just check whether the tag exists, like this:
import bs4 as bs
import numpy as np
import pandas as pd

with open(url, "r") as f:
    sp = bs.BeautifulSoup(f, 'lxml')
tb = sp.find_all('table')[56]
df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]
# For each <a> tag found in the table, keep its href if present.
df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link")
              for tag in tb.find_all('a')]
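One thing to watch for with the example data above: the list comprehension collects only the <a> tags that actually exist, so a row with an empty cell (like "unknown company") leaves the href list one entry short of the DataFrame's row count. A minimal self-contained sketch, iterating per row so the alignment holds (it uses the example HTML from the question and the stdlib html.parser as an assumption, in place of lxml):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

html = """<table>
<tr><td>customer</td><td>country</td><td>area</td><td>website link</td></tr>
<tr><td>IBM</td><td>USA</td><td>EMEA</td><td><a href="http://www.ibm.com">IBM site</a></td></tr>
<tr><td>CISCO</td><td>USA</td><td>EMEA</td><td><a href="http://www.cisco.com">cisco site</a></td></tr>
<tr><td>unknown company</td><td>USA</td><td>EMEA</td><td></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

df = pd.read_html(StringIO(str(table)), header=0)[0]

# One href per data row: take the first <a> in the row, or "no link"
# when the cell is empty, so the column always matches the row count.
hrefs = []
for row in table.find_all("tr")[1:]:   # skip the header row
    a = row.find("a")
    hrefs.append(a["href"] if a is not None else "no link")
df["href"] = hrefs

print(df)
```

This keeps the "no link" fallback from the answer, but ties each href to its own row instead of relying on the number of <a> tags matching the number of rows.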