How to preserve links when scraping a table with Beautiful Soup and pandas


Problem description

Scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns contains some URLs. When I pass the HTML to pandas, the hrefs are lost.

Is there any way of preserving the URL link just for that column?

Example data (edited to better suit the real case):

    <html>
      <body>
        <table>
          <tr>
            <td>customer</td>
            <td>country</td>
            <td>area</td>
            <td>website link</td>
          </tr>
          <tr>
            <td>IBM</td>
            <td>USA</td>
            <td>EMEA</td>
            <td><a href="http://www.ibm.com">IBM site</a></td>
          </tr>
          <tr>
            <td>CISCO</td>
            <td>USA</td>
            <td>EMEA</td>
            <td><a href="http://www.cisco.com">cisco site</a></td>
          </tr>
          <tr>
            <td>unknown company</td>
            <td>USA</td>
            <td>EMEA</td>
            <td></td>
          </tr>
        </table>
      </body>
    </html>

My Python code:

    from bs4 import BeautifulSoup
    import pandas as pd

    file = open(url, "r")                  # url holds the path to the local HTML file
    soup = BeautifulSoup(file, 'lxml')
    parsed_table = soup.find_all('table')[1]
    df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
    df

Output (exported to CSV):

    customer;country;area;website
    IBM;USA;EMEA;IBM site
    CISCO;USA;EMEA;cisco site
    unknown company;USA;EMEA;

The df output is OK, but the link is lost. I need to preserve the link, or at least the URL.

Any hints?

Recommended answer

Just check whether the tag has an href attribute, this way:

    import bs4 as bs
    import numpy as np
    import pandas as pd

    with open(url, "r") as f:
        sp = bs.BeautifulSoup(f, 'lxml')
        tb = sp.find_all('table')[56]      # index of the target table on the real page
        df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]
        # One entry per <a> tag: its href if present, otherwise a placeholder
        df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link")
                      for tag in tb.find_all('a')]
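
Note that the list comprehension above collects one entry per a tag, so it assumes every data row contains a link; in the sample data the "unknown company" row has none, which would leave the list one element short of the frame and make the column assignment fail. A minimal per-row sketch (assuming, as above, that url holds the path to the saved sample HTML) that pads missing links:

    import bs4 as bs
    import pandas as pd

    with open(url, "r") as f:
        sp = bs.BeautifulSoup(f, "lxml")
        tb = sp.find_all("table")[0]           # the sample page has a single table
        df = pd.read_html(str(tb), encoding="utf-8", header=0)[0]
        # Walk the data rows in table order, so row i of the frame gets link i
        links = []
        for row in tb.find_all("tr")[1:]:      # skip the header row
            a = row.find("a")
            links.append(a["href"] if a and a.has_attr("href") else "no link")
        df["href"] = links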

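If upgrading pandas is an option, version 1.5 added an extract_links argument to read_html that keeps the URLs without any BeautifulSoup post-processing; each cell in the chosen section becomes a (text, link) tuple, with link set to None where the cell has no a tag. A minimal sketch, again assuming url holds the local file path:

    import pandas as pd

    with open(url, "r") as f:
        # extract_links="body" applies to table body cells; use "all" for every section
        df = pd.read_html(f.read(), header=0, extract_links="body")[0]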
