Find or select elements to scrape with Python and BeautifulSoup

Question

I am not sure how to select the items below inside the table with class="table-info".

Using Python and BeautifulSoup, I want to extract the following:

  1. phone
  2. email
  3. website
  4. main activity (the li element's text without the nested div): "Computer consultancy activities"

 <table class="table-info">
 <tbody>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Business name</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">Company XYZ</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Register code:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">112233558</div>
         </td>
     </tr>


     <tr>
         <td class="col-1">
             <div class="col-1-text">Operating address:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                     class="link-location">Some location strt. 233</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Legal address</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                     location
                 </a>
             </div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">VAT No:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                     liability</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Age:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">1 year&nbsp;3 months</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Founded:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">20/09/2019</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Capital:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">2000 USD</div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Phone:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">123456789</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">E-mail:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Representatives:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <div class="box-message">
                     <p class="desc">To access information, please</p>
                     <p>
                         <a href="#" onclick="return loginClicked(this, '#');"
                             class="btn btn-small btn-purple link-login">Log in</a>
                     </p>
                 </div>
             </div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">
                 Main activity:
                 <span class="tip info" title=""
                     data-original-title="Activities are classified according to EMTAK 2008"></span>
             </div>
         </td>
         <td class="col-2">
             <div class="col-2-text" id="activity_top5ffe2eab23d13">
                 <ul>
                     <li>
                         Computer consultancy activities
                         <div class="main_activities_top_link_wrapper">
                             <a href="https://www.somesite.com/" target="_blank"
                                 onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                 class="btn btn-simple btn-open-graph">
                                 <span>Open TOP 20</span> </a>
                         </div>
                     </li>
                 </ul>

             </div>
         </td>
     </tr>


 </tbody>
 </table>

Note: The HTML above is one example query result. Some results (companies) have no email or no website, so it is important that the code does not raise an error when it cannot find the HTML it is looking for. I think it is better to follow class names or ids than to count how deep the table/div nesting goes (as with XPath).

Here is my current code, which is not working well at the moment:

import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
     
  

Recommended answer

Have you considered using CSS selectors that count the table's children? If your table always mirrors the example code, it might be easier to use the nth-child pseudo-class.

  • Phone: tr:nth-child(10) .col-2-text
  • Email: tr:nth-child(11) a
  • Website: span
  • Main activity: li
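As a rough illustration of how such positional selectors can be passed to BeautifulSoup's `select_one`, here is a minimal sketch. The HTML is a shortened stand-in for the table in the question, so the phone and email rows are rows 2 and 3 here rather than 10 and 11. The helper `text_or_none` is a hypothetical name introduced for this example; it returns `None` instead of raising when a selector matches nothing. Note the space between `tr:nth-child(n)` and the next part of the selector: it means "a descendant of that row".

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's table (assumption: same class names,
# but only three rows, so the row positions differ from the full page).
html = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Business name</div></td>
    <td class="col-2"><div class="col-2-text">Company XYZ</div></td></tr>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text">
        <a href="mailto:some@one.com">some@one.com</a></div></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# Row 2 holds the phone number and row 3 the email link in this short sample.
phone = text_or_none(".table-info tr:nth-child(2) .col-2-text")
email = text_or_none(".table-info tr:nth-child(3) a")
# A row that does not exist yields None rather than an AttributeError.
missing = text_or_none(".table-info tr:nth-child(9) a")

print(phone, email, missing)
```

Positional selectors only stay correct while every result page has the same rows in the same order, so a page with an extra or missing row would shift the indices.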

I used SelectorGadget to grab these selectors. You might want to run it on your page directly to see whether there are others that are easier to implement.
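Since the question notes that some results lack an email or website, an alternative to positional selectors is to find each row by its label text. The following is only a sketch under assumptions: the function names `find_value_cell`, `own_text`, and `extract_company` are hypothetical, and the "Website" row is assumed to follow the same col-1-text / col-2-text pattern with a link inside, since that row does not appear in the sample HTML. Every lookup falls back to `None` when the element is absent.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened stand-in for the question's table; it has no Website row on purpose.
SAMPLE_HTML = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td colspan="2" class="sep"></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text">
        <a href="mailto:some@one.com">some@one.com</a></div></td></tr>
<tr><td class="col-1"><div class="col-1-text">Main activity:
        <span class="tip info" title=""></span></div></td>
    <td class="col-2"><div class="col-2-text"><ul><li>
        Computer consultancy activities
        <div class="main_activities_top_link_wrapper">
            <a href="https://www.somesite.com/"><span>Open TOP 20</span></a>
        </div></li></ul></div></td></tr>
</tbody></table>
"""


def find_value_cell(table, label):
    """Return the .col-2-text div of the row whose label matches, else None."""
    if table is None:
        return None
    for row in table.select("tr"):
        label_div = row.select_one(".col-1-text")
        value_div = row.select_one(".col-2-text")
        if label_div is None or value_div is None:
            continue  # separator rows have neither div
        # Ignore the trailing colon, which the page uses inconsistently.
        if label_div.get_text(strip=True).rstrip(":").lower() == label.lower():
            return value_div
    return None


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    parts = [c.strip() for c in tag.children if isinstance(c, NavigableString)]
    return " ".join(p for p in parts if p)


def extract_company(html):
    """Collect the four fields, using None for anything the page lacks."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.table-info")

    phone_cell = find_value_cell(table, "Phone")
    email_link = table.select_one('a[href^="mailto:"]') if table is not None else None
    website_cell = find_value_cell(table, "Website")  # assumed label
    website_link = website_cell.select_one("a[href]") if website_cell is not None else None
    activity_cell = find_value_cell(table, "Main activity")
    activity_item = activity_cell.select_one("li") if activity_cell is not None else None

    return {
        "phone": phone_cell.get_text(strip=True) if phone_cell is not None else None,
        "email": email_link["href"].split(":", 1)[1] if email_link is not None else None,
        "website": website_link["href"] if website_link is not None else None,
        "main_activity": own_text(activity_item) if activity_item is not None else None,
    }


result = extract_company(SAMPLE_HTML)
print(result)
```

Reading only the direct text children of the li is what drops the "Open TOP 20" link text from the main activity value.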
