BeautifulSoup: Extracting HTML bullet points but not the navigation bar


Problem description

I am using BeautifulSoup4 to do some HTML scraping. I am trying to extract important information such as the title, metadata, paragraphs and listed information.

My issue is that I can extract the paragraphs like so:

import urllib.request

from bs4 import BeautifulSoup


def main():
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> element on the page
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)

    print(article)


main()

But if the page has bullet points in the body text, extracting them also pulls in the navigation bar, i.e. when I change p to li or ul.

For example, what I want to get as output is:

The Industry Day's objectives are three-fold:

The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.

What I actually get is: The Industry Day's objectives are three-fold:

The tags in the HTML source:

<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>

Recommended answer

You can use the CSS selector "or" syntax, a comma-separated list of selectors, so that the li elements are selected as well.

import requests
from bs4 import BeautifulSoup

url = 'https://ecir2019.org/industry-day/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('p, ol li')]

print(items)
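
To see why the comma-separated selector avoids the navigation bar, here is a minimal offline sketch. The SAMPLE_HTML string and the extract_text helper are hypothetical stand-ins, not the real page; the sketch assumes the navigation is built from ul lists while the body uses an ol, as in the tags shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a navigation list plus body content.
SAMPLE_HTML = """
<nav><ul><li>Home</li><li>Programme</li></ul></nav>
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>First objective.</li>
<li>Second objective.</li>
</ol>
"""


def extract_text(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]


# A bare 'li' selector also matches the navigation entries.
all_items = extract_text(SAMPLE_HTML, "li")
# 'p, ol li' matches paragraphs and ordered-list items only.
body_items = extract_text(SAMPLE_HTML, "p, ol li")

print(all_items)
print(body_items)
```

If the real navigation bar also used ol lists, this selector alone would not be enough and a container class would be needed to narrow the match.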


To select just that section:

import requests
from bs4 import BeautifulSoup

url = 'https://ecir2019.org/industry-day/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.kg-card-markdown p:nth-of-type(2), .kg-card-markdown p:nth-of-type(2) + ol li')]

print(items)
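
The selector above combines :nth-of-type(2) with the adjacent sibling combinator (+). The sketch below runs the same selector against a small hypothetical fragment; only the class name kg-card-markdown comes from the answer, the markup in SECTION_HTML is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the second paragraph and the list that
# immediately follows it should be selected.
SECTION_HTML = """
<div class="kg-card-markdown">
<p>Intro paragraph.</p>
<p>The Industry Day's objectives are three-fold:</p>
<ol><li>First objective.</li><li>Second objective.</li></ol>
<p>Closing paragraph.</p>
<ol><li>Unrelated item.</li></ol>
</div>
"""

soup = BeautifulSoup(SECTION_HTML, "html.parser")

# First part: the second <p> among its siblings.
# Second part: <li> elements inside the <ol> directly after that <p>.
selector = (
    ".kg-card-markdown p:nth-of-type(2), "
    ".kg-card-markdown p:nth-of-type(2) + ol li"
)
section_items = [el.get_text(strip=True) for el in soup.select(selector)]

print(section_items)
```

Positional selectors like this are brittle: if a paragraph is added above the target, the index has to change.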


The page appears to have changed, so I am using a cached version (this will only work until the cache is updated). You can limit the selection to the post body with an additional class selector:

import requests
from bs4 import BeautifulSoup

url = 'http://webcache.googleusercontent.com/search?q=cache:https://ecir2019.org/industry-day'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.post-body p, .post-body ol li, .post-body ul li')]

print(items)
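
An alternative way to express the same restriction is to locate the container first and then search only inside it. This is a sketch under assumptions: PAGE_HTML is a hypothetical fragment, body_text is an illustrative helper name, and post-body is taken from the selector above rather than verified against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical page: navigation outside the post body, content inside it.
PAGE_HTML = """
<nav><ul><li>Home</li><li>Contact</li></ul></nav>
<div class="post-body">
<p>Objectives:</p>
<ul><li>Bullet item.</li></ul>
<ol><li>Numbered item.</li></ol>
</div>
"""


def body_text(html, container=".post-body"):
    """Return paragraph and list-item text found inside the container only."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one(container)
    if body is None:
        # The container class is missing, e.g. because the page layout changed.
        return []
    return [el.get_text(strip=True) for el in body.select("p, li")]


post_items = body_text(PAGE_HTML)
missing = body_text(PAGE_HTML, container=".no-such-class")

print(post_items)
print(missing)
```

Returning an empty list when the container is absent makes a layout change easy to detect instead of silently scraping the whole page.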
