Extracting contents from specific meta tags that are not closed using BeautifulSoup

Problem description

I'm trying to parse out the content of specific meta tags. Here's the structure of the meta tags. The first two are self-closed with a trailing slash (/>), but the rest don't have any closing tags. As soon as I reach the third meta tag, the entire contents between the <head> tags are returned. I've also tried soup.findAll(text=re.compile('keyword')), but that does not return anything, since keyword is an attribute value of the meta tag rather than text inside it.

<meta name="csrf-param" content="authenticity_token"/>
<meta name="csrf-token" content="OrpXIt/y9zdAFHWzJXY2EccDi1zNSucxcCOu8+6Mc9c="/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='c2y_K2CiLmGeet7GUQc9e3RVGp_gCOxUC4IdJg_RBVo' name='google-site-verification'>
<meta content='initial-scale=1.0,maximum-scale=1.0,width=device-width' name='viewport'>
<meta content='notranslate' name='google'>
<meta content="Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included." name='description'>

Here's the code:

import csv
import re
import sys
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req3 = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
page3 = urlopen(req3).read()
soup3 = BeautifulSoup(page3)

## This returns the entire web page since the META tags are not closed
desc = soup3.findAll(attrs={"name":"description"}) 

Solution

Although I'm not sure it will work for every page:

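from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the page (in Python 3, urlopen lives in urllib.request)
page3 = urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3)

# Look the tag up by its name attribute, then read its content attribute
desc = soup3.findAll(attrs={"name": "description"})
print(desc[0]['content'])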

Yields:

Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included.
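
To check the behaviour without hitting the network, here is a minimal sketch that parses a trimmed, hypothetical copy of the <head> from the question. The names SAMPLE_HTML and meta_content are made up for illustration, and the sketch assumes a reasonably recent bs4 release. It names the parser explicitly, because how unclosed tags are handled can depend on which parser Beautiful Soup picks. It also shows why the text= search found nothing: that filter looks only at text nodes, so to match on an attribute value the regex has to be passed as the attribute's value instead.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: a trimmed copy of the <head> from the question,
# mixing self-closed (<meta .../>) and unclosed (<meta ...>) tags.
SAMPLE_HTML = """
<html><head>
<meta name="csrf-param" content="authenticity_token"/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='notranslate' name='google'>
<meta content="Learn about Uber's product, founders, investors and team." name='description'>
<title>Uber</title>
</head><body><p>Hello</p></body></html>
"""

# Name the parser explicitly so the result does not depend on
# which optional parsers happen to be installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def meta_content(soup, name):
    """Return the content attribute of <meta name=...>, or None if absent."""
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content") if tag is not None else None


description = meta_content(soup, "description")
print(description)

# A string filter only searches text nodes, so it never sees attribute values.
text_hits = soup.find_all(string=re.compile("Learn"))
print(text_hits)

# To pattern-match an attribute, pass the compiled regex as the attribute value.
google_tags = soup.find_all("meta", attrs={"name": re.compile(r"^google")})
print([tag["content"] for tag in google_tags])
```

Returning None for a missing tag avoids the IndexError that desc[0] would raise on a page without a description meta tag, which is one way this approach might not work for every page.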
