Scraping text in h3 and div tags using BeautifulSoup, Python

Views: 26 Published: 2021/12/23 20:43:59 Tags: python html selenium beautifulsoup web-crawler


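Problem description

I have no experience with Python, BeautifulSoup, Selenium, etc., but I want to scrape data from a website and store it as a CSV file. A single sample of the data I need (one row of data) is coded as follows.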


    <div class="row">
      <div class="col-lg-10">
        <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
        <div class="space">&nbsp;</div>
        <div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i>&nbsp;more info</a></div>
      </div>
      <div class="col-lg-2">
      </div>
    </div>

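The output I need is HEADING,NAME,MOBILE,NUMBER,XYZ_ADDRESS

This data has no id or class of its own; it simply appears on the site as plain text. I tried BeautifulSoup and Python Selenium separately, but I got stuck on the extraction with both, because none of the tutorials I found showed how to extract text from these <h3> and <div> tags.

My code using BeautifulSoup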

    import urllib2
    from bs4 import BeautifulSoup
    import requests
    import csv

    MAX = 2
    '''with open("lg.csv", "a") as f:
        w=csv.writer(f)'''
    ##for i in range(1,MAX+1)
    url="http://www.example_site.com"
    page=requests.get(url)
    soup = BeautifulSoup(page.content,"html.parser")
    for h in soup.find_all('h3'):
        print(h.get('h3'))

My Selenium code

    import csv
    from selenium import webdriver

    MAX_PAGE_NUM = 2
    driver = webdriver.Firefox()
    for i in range(1, MAX_PAGE_NUM+1):
        url = "http://www.example_site.com"
        driver.get(url)
        name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
        #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
        # phone =
        # mobile =
        # address =
        # print(len(buyers))
        # num_page_items = len(buyers)
        # with open('res.csv','a') as f:
        #     for i in range(num_page_items):
        #         f.write(buyers[i].text + "," + prices[i].text + "\n")
    print (name)
    driver.close()

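Solution

You can use CSS selectors to find the data you need. In your case, div > h3 ~ div will find all div elements that are directly inside a div element and are preceded by an h3 element.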
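To see how the two combinators in that selector behave, here is a minimal sketch run against a made-up fragment. The markup and its texts are assumptions for the demo, not the asker's page. The `>` requires the h3 to be a direct child of a div, and `~` matches only the div siblings that come after that h3, so the div placed before the heading and the p element are skipped.

```python
import bs4

# Hypothetical fragment, invented for this demo only
html = """
<div id="outer">
  <div>before</div>
  <h3>HEADING</h3>
  <div>first</div>
  <p>paragraph</p>
  <div>second</div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Only div siblings that follow the h3 are selected
matched = [d.get_text(strip=True) for d in soup.select("div > h3 ~ div")]
print(matched)
```

Note that the h3 itself is not part of the result, since the selector targets only the div elements that follow it.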

    import bs4

    page = """
    <div class="box effect">
    <div class="row">
    <div class="col-lg-10">
        <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    </div>
    </div>
    </div>
    """

    soup = bs4.BeautifulSoup(page, 'lxml')
    # find all div elements that are directly inside a div element
    # and are preceded by an h3 element
    selector = 'div > h3 ~ div'
    # find elements that contain the data we want
    found = soup.select(selector)
    # extract data from the found elements; &nbsp; is parsed as U+00A0,
    # which strip=True removes along with ordinary whitespace
    data = [x.get_text(strip=True) for x in found]
    for x in data:
        print(x)

The selector above skips the h3 itself. To capture the heading as well, select every direct child of the col-lg-10 div instead:

    soup = bs4.BeautifulSoup(page, 'lxml')
    # find all elements directly inside a div element of class col-lg-10
    selector = 'div.col-lg-10 > *'
    # find elements that contain the data we want
    found = soup.select(selector)
    # extract data from the found elements
    data = [x.get_text(strip=True) for x in found]
    for x in data:
        print(x)
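Since the goal in the question is a CSV row of the form HEADING,NAME,MOBILE,NUMBER,XYZ_ADDRESS, the extraction can be combined with the csv module. The following is only a sketch under some assumptions: `extract_rows` is a hypothetical helper name, the HTML is the sample from the question, and the rule "keep only the divs that start with an `<i>` icon" is a guess at how to skip the spacer and the "more info" button, which may not hold on the real site.

```python
import csv
import io

import bs4

# Sample markup from the question (one record)
page = """
<div class="row"><div class="col-lg-10"><h3>HEADING</h3>
<div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
<div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
<div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
<div class="space">&nbsp;</div>
<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i>&nbsp;more info</a></div>
</div><div class="col-lg-2"></div></div>
"""


def extract_rows(html):
    """Return one list per col-lg-10 block: heading first, then the fields."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.col-lg-10"):
        heading = block.find("h3")
        if heading is None:
            continue  # not a record block
        # Assumption: data divs have an <i> icon as a direct child,
        # while the spacer and the button wrapper do not.
        fields = [
            div.get_text(strip=True)
            for div in heading.find_next_siblings("div")
            if div.find("i", recursive=False) is not None
        ]
        rows.append([heading.get_text(strip=True)] + fields)
    return rows


rows = extract_rows(page)

# Written to an in-memory buffer here; a real run would instead use
# open("lg.csv", "w", newline="") as in the question's attempt.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

For a live page, the `page` string would come from `requests.get(url).text`, as in the question's own code, and a page with several records yields one CSV row per `col-lg-10` block.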
