Get meta tag content property with BeautifulSoup and Python

Problem description

I am trying to use Python and Beautiful Soup to extract the content attribute of the tags below:

<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />

BeautifulSoup loads the page just fine and finds other things (the code also grabs the article id from the id attribute hidden in the source), but I don't know the correct way to search the HTML for these meta tags. I've tried variations of find and findAll to no avail. At present the code iterates over a list of URLs:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup

def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(i)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article") :
        id = tag.get('id')
        print id
# the hard part that doesn't work - I know this example is well off the mark!        
    title = soup.find("og:title", "content")
    print (title.get_text())
    url = soup.find("og:url", "content")
    print (url.get_text())
# end of problem

for i in range (1,100):
    get_data(i)

If anyone can help me sort out the part that finds the og:title and og:url values, that'd be fantastic!

Recommended answer

Pass the tag name, meta, as the first argument to find(). Then use keyword arguments to match specific attributes:

title = soup.find("meta", property="og:title")
url = soup.find("meta", property="og:url")

print(title["content"] if title else "No meta title given")
print(url["content"] if url else "No meta url given")

find() returns None when no tag matches, so the if/else checks guard against indexing into None. They are optional if you know that the title and url meta properties will always be present.
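To try the lookup without fetching a page, here is a minimal sketch that parses the two meta tags from the question as an inline string. The helper name meta_content, the SAMPLE_HTML constant, and the og:image lookup are assumptions added for illustration, and the built-in "html.parser" is used in place of "lxml" only so the sketch needs no extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, wrapped in a minimal page.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
</head><body></body></html>
"""


def meta_content(soup, prop, default=None):
    """Return the content attribute of <meta property=prop>, or default.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    tag = soup.find("meta", property=prop)
    # find() returns None when nothing matches; .get() also covers a
    # matching tag that has no content attribute.
    return tag.get("content", default) if tag else default


# "html.parser" ships with Python; the question used "lxml" instead.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title = meta_content(soup, "og:title", "No meta title given")
url = meta_content(soup, "og:url", "No meta url given")
# og:image is absent from the sample, so the default is returned.
image = meta_content(soup, "og:image", "No meta image given")

print(title)
print(url)
print(image)
```

Wrapping the check in a helper keeps the None handling in one place, which may be convenient when looping over many pages where some lack the tags.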
