Extract data between HTML tags using BeautifulSoup in Python

Views: 104 Published: 2021/4/15 19:19:44 python html beautifulsoup extraction


Problem description

I want to extract the text inside the HTML 'title' tag. From the 'meta' tag, I want to extract the value of the URL in its content attribute, and specifically the text just before the '?'.

<html lang="en" id="facebook" class="no_js">
<head>
    <meta charset="utf-8" />
    <script>
        function envFlush(a) {function b(c){for(var d in)c[d]=a[d];}if(window.requireLazy){window.requireLazy(['Env'],b);}else{window.Env=window.Env||{};b(window.Env);}}envFlush({"ajaxpipe_token":"AXjbmsNXDxPlvhrf","lhsh":"4AQFQfqrV","khsh":"0`sj`e`rm`s-0fdu^gshdoer-0gc^eurf-3gc^eurf;1;enbtldou;fduDmdldourCxO`ld-2YLMIuuqSdptdru;qsnunuxqd;rdoe"});
    </script>
    <script>CavalryLogger=false;</script>
    <noscript>
        <meta http-equiv="refresh" content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" />
    </noscript>
    <meta name="referrer" content="default" id="meta_referrer" />
    <title id="pageTitle">
        &quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;
    </title>
    <link rel="shortcut icon" href="https://fbstatic-a.akamaihd.net/rsrc.php/yl/r/H3nktOa7ZMg.ico" />

That is, CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN and 685004288208871.

I tried the following code:

>>> soup.title.contents

The output is:

[u'" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "']

In this output I don't want the characters '[]', the 'u', or the single quotes.

Also, when I run the following:

>>> soup.meta.contents

the output I get is:

[]

Please help me with this. I am new to BeautifulSoup.

Recommended answer

The .contents attribute of a Beautiful Soup tag is a list of the tag's children. In this case it has only one element, which is a Unicode string. The expression you want is therefore

>>> soup.title.contents[0]

Note that the single quotes (and the u prefix) only appear because you are asking the interactive interpreter to display the string's repr. You will find that

>>> print(soup.title.contents[0])

displays

" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "

and that is the actual content of the title tag. Note that Beautiful Soup has converted the &quot; HTML entities into double-quote characters. To drop the quotes and the adjacent spaces, you can slice the string:

soup.title.contents[0][2:-2]
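The fixed-offset slice above depends on exactly one quote and one space at each end. As a hedged alternative, the sketch below uses get_text() and str.strip() so that extra whitespace or newlines around the title do not matter. The html string is a trimmed-down stand-in for the page in the question, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumption).
html = """<html><head>
<title id="pageTitle">
    &quot; CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN &quot;
</title>
</head></html>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() returns all text inside <title>; entities are already decoded.
raw_title = soup.title.get_text()

# Strip surrounding whitespace and double quotes rather than slicing
# by fixed offsets, so stray newlines or spaces are handled too.
title = raw_title.strip(' "\n\t')

print(title)
```

Unlike the slice, strip() removes however many quote and whitespace characters surround the text, but it would also remove a quote that legitimately starts or ends a title.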

The meta tag is a little trickier. I assume that there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the search returns a list of one element. You retrieve that element like so:

>>> meta = soup.find_all("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>

Note, by the way, that meta isn't a string but a Beautiful Soup Tag object:

>>> type(meta)
<class 'bs4.element.Tag'>

You can retrieve the attributes of a tag by indexing it just like a Python dict, so you can get the value of the content attribute as follows:

>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
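For context on why soup.meta.contents was empty in the question: soup.meta is only the first <meta> tag in the document, and <meta> tags have no children, so their data lives in attributes. The following is a minimal sketch of the same lookup using find(), which returns None instead of raising IndexError when no tag matches. The html string is a shortened stand-in for the original page and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the <head> in the question (assumption).
html = """<html><head>
<meta charset="utf-8" />
<noscript>
<meta http-equiv="refresh" content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" />
</noscript>
<meta name="referrer" content="default" id="meta_referrer" />
</head></html>"""

soup = BeautifulSoup(html, "html.parser")

# soup.meta is the FIRST <meta> tag (the charset one). It has no
# children, so .contents is an empty list; its data is in attributes.
first_meta = soup.meta
print(first_meta.contents, first_meta.attrs)

# find() returns the first matching tag or None, so a missing
# refresh tag can be handled without an IndexError.
refresh = soup.find("meta", attrs={"http-equiv": "refresh"})
content = refresh.get("content") if refresh is not None else None
print(content)
```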

To extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer a more disciplined approach: split at the semicolon, then split the right-hand element of that result on the first equals sign only.

>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
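The question also asked for the text just before the '?', that is, the numeric ID at the end of the path. One possible way to finish the job, sketched here with the standard library only, is to let urllib.parse.urlsplit separate the path from the query string and then take the last path segment. The variable names path and note_id are illustrative, and the content string is copied from the output above.

```python
from urllib.parse import urlsplit

# Value of the meta tag's content attribute, as shown above.
content = "0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1"

# Same idea as the answer's split: drop the delay, keep what follows "URL=".
url = content.split(";", 1)[1].split("=", 1)[1].strip()

# urlsplit separates the path from the query string (the part after "?").
path = urlsplit(url).path

# The ID is the last segment of the path.
note_id = path.rstrip("/").rsplit("/", 1)[-1]

print(note_id)  # 685004288208871
```

Parsing the URL instead of calling url.split("?") keeps the code working when the URL has no query string at all.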
