regex not working in bs4


Question

I am trying to extract links for a specific filehoster from the watchseriesfree.to website. In the following case I want the rapidvideo links, so I use a regex to select the tr tags whose text contains rapidvideo:

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the above code returns an empty list. What am I doing wrong?

Recommended answer

The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

When BeautifulSoup applies your text regex pattern, it tests it against the .string attribute of each matched tr element. Now, .string has an important caveat: when an element has multiple children, .string is None. As the documentation puts it:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.

That is why your search returns no results.
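The caveat is easy to check on a small, made-up table. The HTML below is an assumption for illustration only, not the markup of the real site: the first row has a single cell, the second has two.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags so that
# the only children of each <tr> are its <td> cells.
html = (
    "<table>"
    "<tr><td>rapidvideo.com</td></tr>"
    "<tr><td>rapidvideo.com</td><td>Watch</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
single_cell_row, two_cell_row = soup.find_all("tr")

# One child: .string follows the single child down to its text.
print(single_cell_row.string)

# Two children: .string is ambiguous, so it is None.
print(two_cell_row.string)

# .get_text() concatenates all descendant strings regardless.
print(two_cell_row.get_text())
```

Rows on a real listing page usually have several cells, so they behave like the second row here.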

What you can do instead is check the actual text of the tr elements by passing a search function to find_all() and calling .get_text() inside it:

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
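As a minimal sketch of the difference, the snippet below runs both searches against an inline sample document. The HTML, the `/open/N` hrefs, and the helper name `is_rapidvideo_row` are invented for illustration; the real page layout may differ. It uses `string=`, the name Beautiful Soup 4.4.0 and later give to the `text=` argument from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; every row has two cells.
SAMPLE_HTML = """
<table>
  <tr><td>rapidvideo.com</td><td><a href="/open/1">Watch</a></td></tr>
  <tr><td>openload.co</td><td><a href="/open/2">Watch</a></td></tr>
  <tr><td>rapidvideo.com</td><td><a href="/open/3">Watch</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Regex search: compared against each row's .string, which is None
# because every row has more than one child, so nothing matches.
by_string = soup.find_all("tr", string=re.compile("rapidvideo"))


def is_rapidvideo_row(tag):
    """Match <tr> tags whose combined text mentions rapidvideo."""
    return tag.name == "tr" and "rapidvideo" in tag.get_text()


# Function search: inspects the full text of each row.
rows = soup.find_all(is_rapidvideo_row)

# Pull the first link out of each matching row.
links = [row.a["href"] for row in rows]

print(by_string)
print(links)
```

A named function is used here only for readability; it is equivalent to the lambda above.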
