Find or select elements from Python to scrape with BeautifulSoup

Views: 48 | Published: 2021/4/15 19:20:54 | Tags: python, web-scraping, beautifulsoup, scrape


Question

I am not sure how to select the items listed below inside the table with class="table-info".

Using Python and BeautifulSoup, I want to extract the following:

Phone

E-mail

Website

Main activity (the li element's own text, without the nested div): "Computer consultancy activities".

 <table class="table-info">
 <tbody>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Business name</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">Company XYZ</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Register code:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">112233558</div>
         </td>
     </tr>


     <tr>
         <td class="col-1">
             <div class="col-1-text">Operating address:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                     class="link-location">Some location strt. 233</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Legal address</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                     location
                 </a>
             </div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">VAT No:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                     liability</a></div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Age:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">1 year&nbsp;3 months</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Founded:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">20/09/2019</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Capital:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">2000 USD</div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Phone:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">123456789</div>
         </td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">E-mail:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">Representatives:</div>
         </td>
         <td class="col-2">
             <div class="col-2-text">
                 <div class="box-message">
                     <p class="desc">To access information, please</p>
                     <p>
                         <a href="#" onclick="return loginClicked(this, '#');"
                             class="btn btn-small btn-purple link-login">Log in</a>
                     </p>
                 </div>
             </div>
         </td>
     </tr>
     <tr>
         <td colspan="2" class="sep"></td>
     </tr>
     <tr>
         <td class="col-1">
             <div class="col-1-text">
                 Main activity:
                 <span class="tip info" title=""
                     data-original-title="Activities are classified according to EMTAK 2008"></span>
             </div>
         </td>
         <td class="col-2">
             <div class="col-2-text" id="activity_top5ffe2eab23d13">
                 <ul>
                     <li>
                         Computer consultancy activities
                         <div class="main_activities_top_link_wrapper">
                             <a href="https://www.somesite.com/" target="_blank"
                                 onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                 class="btn btn-simple btn-open-graph">
                                 <span>Open TOP 20</span> </a>
                         </div>
                     </li>
                 </ul>

             </div>
         </td>
     </tr>


 </tbody>
 </table>
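For the main activity, one possible approach is to read only the text nodes that are direct children of the li, which skips the nested div and its "Open TOP 20" link. The snippet below is a minimal sketch, not a definitive solution: it uses a trimmed copy of the row above, the built-in "html.parser" instead of lxml, and a hypothetical helper named main_activity. It also assumes that the main activity row is the only one in the table containing a ul.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Main activity" row from the example table.
html = """
<table class="table-info"><tbody><tr>
  <td class="col-1"><div class="col-1-text">Main activity:</div></td>
  <td class="col-2"><div class="col-2-text"><ul><li>
      Computer consultancy activities
      <div class="main_activities_top_link_wrapper">
        <a href="https://www.somesite.com/"><span>Open TOP 20</span></a>
      </div>
  </li></ul></div></td>
</tr></tbody></table>
"""


def main_activity(soup):
    """Return the li's own text (nested tags ignored), or None if absent.

    Hypothetical helper; assumes only the main activity row holds a <ul>.
    """
    li = soup.select_one(".table-info .col-2-text ul li")
    if li is None:
        return None
    # recursive=False keeps only text nodes directly inside the <li>,
    # so the text of the nested <div> is not included.
    parts = [s.strip() for s in li.find_all(string=True, recursive=False)]
    text = " ".join(p for p in parts if p)
    return text or None


soup = BeautifulSoup(html, "html.parser")
activity = main_activity(soup)
print(activity)
```

Returning None instead of raising keeps the caller free to decide what to store when a company has no activity listed.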

Note: The HTML table above is one example query result. Some results (companies) have no e-mail or no website, so it is important that the code does not raise an error when it cannot find the HTML content it is looking for. I also find it better to select by class name or id rather than by counting how deep the table/div nesting goes (as with XPath).
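One way to meet both requirements (select by class name, tolerate missing rows) is to build a mapping from each row's label to its value cell and look fields up by label. This is a sketch under assumptions: the helper names row_values, text_of and href_of are hypothetical, "html.parser" is used instead of lxml, and the "Website" label is a guess based on the code in this question, since the example table has no such row.

```python
from bs4 import BeautifulSoup

# Two rows copied from the example table; there is no "Website" row here.
html = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td colspan="2" class="sep"></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text">
      <a href="mailto:some@one.com">some@one.com</a></div></td></tr>
</tbody></table>
"""


def row_values(soup):
    """Map each row label (trailing colon removed) to its .col-2-text tag."""
    values = {}
    for tr in soup.select(".table-info tr"):
        label = tr.select_one(".col-1-text")
        value = tr.select_one(".col-2-text")
        if label is None or value is None:
            continue  # separator rows contain neither cell
        key = label.get_text(" ", strip=True).rstrip(":")
        values[key] = value
    return values


def text_of(values, label):
    """Return the visible text for a label, or None if the row is missing."""
    tag = values.get(label)
    return tag.get_text(" ", strip=True) if tag is not None else None


def href_of(values, label):
    """Return the first link target for a label, or None if missing."""
    tag = values.get(label)
    link = tag.find("a", href=True) if tag is not None else None
    return link["href"] if link is not None else None


values = row_values(BeautifulSoup(html, "html.parser"))
phone = text_of(values, "Phone")
email = text_of(values, "E-mail")
website = href_of(values, "Website")  # assumed label; absent in this sample
print(phone, email, website)
```

Because every lookup goes through dict.get and an explicit None check, a missing row yields None rather than an AttributeError.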

Here is my current code, which is not working well:

import csv
import requests
import datetime
import time
 
from requests import get
from bs4 import BeautifulSoup
 
 
with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)
 
    count = 0
     
    for row in reader:
         
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
         
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")
 
        table_info = soup.select_one('.table-info')
 
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
 
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
         
        collected_data = row[1], mail_clean, website, timestamp
 
        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)
 
        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
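Two things in the loop above are likely to cause trouble. Calling .get('href') on the result of select_one raises AttributeError whenever no element matched, because select_one returns None in that case. Also, 'extracted.csv' is reopened in 'w' mode on every iteration, so each pass overwrites the previous rows. The sketch below shows one possible way to handle the output side: open the file once, write the header once, and write missing values as empty fields. The function name write_results and the sample records are invented for illustration, and an in-memory buffer stands in for the real file.

```python
import csv
import io

HEADER = ["Regcode", "Email", "Website", "Timestamp"]


def write_results(records, out_file):
    """Write the header once, then one line per record; None becomes ''."""
    writer = csv.writer(out_file, delimiter=";")
    writer.writerow(HEADER)
    for record in records:
        writer.writerow(["" if value is None else value for value in record])


# Invented sample records: (regcode, email, website, timestamp).
records = [
    ("112233558", "some@one.com", None, "2021-04-15 19:20:54"),
    ("998877665", None, None, "2021-04-15 19:20:57"),
]

buffer = io.StringIO()  # stands in for the real output file
write_results(records, buffer)
output = buffer.getvalue()
print(output)
```

In the real script the records would be collected inside the loop, and the buffer would be replaced by a single `with open('extracted.csv', 'w', newline='') as file:` block placed outside the loop.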
