How to print paragraphs and headings simultaneously while scraping in Python?


Question

I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

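import urllib.request
import bs4 as bs

url = ''  # my_url
source = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
text = ''  # separate accumulator, so the URL and the built-in str are left alone
for paragraph in match.find_all('p'):
    text += paragraph.text + "\n"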

My tag structure:

<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>


I am getting output like this (I am only able to extract the paragraphs):

 efkl
 efkl
 efkl
 efkl

The output I want (headings as well as paragraphs):

 dr
 efkl
 dr
 efkl
 dr
 efkl
 dr
 efkl     

I want my output to contain the headings along with the paragraphs. How can I modify the code so that each heading appears before its paragraph, as in the original HTML?

Recommended answer

There is more than one way to do this. Here are a few of them:

Using .find_next():

from bs4 import BeautifulSoup

content="""
<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>         
</article>
"""
soup = BeautifulSoup(content,"lxml")

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.text,item.find_next("p").text]) for item in items.find_all("h4")])
    print(data)
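
One thing to keep in mind with .find_next() is that it searches everything that comes later in the document, not just the element directly after the heading. The sketch below uses hypothetical markup (not from the question) in which one heading has no paragraph of its own, and compares .find_next("p") with a stricter pairing based on .find_next_sibling(). The helper name heading_paragraph_pairs is made up for illustration, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (not from the question): "second" has no paragraph.
html = """
<article class="xyz">
<h4>first</h4>
<p>one</p>
<h4>second</h4>
<h4>third</h4>
<p>three</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="xyz")

# find_next("p") looks ahead through the rest of the document, so "second"
# ends up paired with the paragraph that belongs to "third".
naive = [(h.text, h.find_next("p").text) for h in article.find_all("h4")]


def heading_paragraph_pairs(article):
    """Pair each h4 with the p directly after it, or None if there is none."""
    pairs = []
    for heading in article.find_all("h4"):
        # Next sibling that is either a heading or a paragraph.
        nxt = heading.find_next_sibling(["h4", "p"])
        has_paragraph = nxt is not None and nxt.name == "p"
        pairs.append((heading.text, nxt.text if has_paragraph else None))
    return pairs


strict = heading_paragraph_pairs(article)
print(naive)
print(strict)
```

For the markup in the question, where every heading is followed by exactly one paragraph, both variants give the same pairs.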

Using .find_previous_sibling():

for items in soup.find_all(class_="xyz"):
    data = '\n'.join(['\n'.join([item.find_previous_sibling("h4").text,item.text]) for item in items.find_all("p")])
    print(data)
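
.find_previous_sibling() returns None when no matching sibling exists, so the one-liner above would raise an AttributeError if a paragraph appeared before the first heading. A minimal sketch of a guarded version follows, assuming hypothetical markup with a leading paragraph; "html.parser" is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first paragraph has no heading before it.
html = """
<article class="xyz">
<p>intro</p>
<h4>dr</h4>
<p>efkl</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

lines = []
for paragraph in soup.find(class_="xyz").find_all("p"):
    heading = paragraph.find_previous_sibling("h4")
    if heading is not None:  # None means no h4 precedes this paragraph
        lines.append(heading.text)
    lines.append(paragraph.text)

result = "\n".join(lines)
print(result)
```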

The commonly used approach: pass a list of tag names to find_all():

for items in soup.find_all(class_="xyz"):
    data = '\n'.join([item.text for item in items.find_all(["h4","p"])])
    print(data)
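
Because find_all() with a list returns the matching tags in document order, each item's .name attribute can be used to tell headings from paragraphs. The sketch below groups paragraphs under their heading; it assumes hypothetical markup in which a heading may have more than one paragraph, and the helper name group_sections is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first heading has two paragraphs.
html = """
<article class="xyz">
<h4>dr</h4>
<p>efkl</p>
<p>more</p>
<h4>dr2</h4>
<p>efkl2</p>
</article>
"""
# "html.parser" is used here only to avoid the lxml dependency.
soup = BeautifulSoup(html, "html.parser")


def group_sections(article):
    """Return a list of (heading_text, [paragraph_texts]) in document order."""
    sections = []
    for item in article.find_all(["h4", "p"]):
        text = item.get_text(strip=True)
        if item.name == "h4":
            sections.append((text, []))      # start a new section
        elif sections:
            sections[-1][1].append(text)     # paragraph under the latest heading
        else:
            sections.append((None, [text]))  # paragraph before any heading
    return sections


sections = group_sections(soup.find("article", class_="xyz"))
for heading, paragraphs in sections:
    print(heading, paragraphs)
```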

All three approaches produce the same result:

dr
efkl
dr
efkl
dr
efkl
dr
efkl
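
To connect this back to the code in the question, the list-based approach can be wrapped in a small function. This is only a sketch: the function name article_text and its parser argument are assumptions made for illustration, and the class name xyz comes from the question's markup.

```python
from bs4 import BeautifulSoup


def article_text(markup, parser="html.parser"):
    """Return headings and paragraphs of <article class="xyz">, one per line."""
    soup = BeautifulSoup(markup, parser)
    match = soup.find("article", class_="xyz")
    if match is None:  # the page has no such article
        return ""
    tags = match.find_all(["h4", "p"])  # document order is preserved
    return "\n".join(tag.get_text(strip=True) for tag in tags)


sample = """<article class="xyz" >
<h4>dr</h4>
<p>efkl</p>
<h4>dr</h4>
<p>efkl</p>
</article>"""

text = article_text(sample)
print(text)
```

BeautifulSoup accepts either a string or an open file-like object, so the response returned by urllib.request.urlopen() in the question could be passed as markup in place of sample.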
