regex not working in bs4
Problem description
I am trying to extract some links for a specific filehoster from the watchseriesfree.to website. In the following case I want the rapidvideo links, so I use a regex to select the tags whose text contains rapidvideo:
import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html

def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"
    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))
    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr", text=re.compile('rapidvideo'))
    return firstVod

print(findLatest())
However, the above code returns an empty list. What am I doing wrong?
Recommended answer
The problem is here:
firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
When BeautifulSoup applies your text regex pattern, it uses the .string attribute values of all the matched tr elements. Now, .string has this important caveat: when an element has multiple children, .string is None:
If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.
Hence, you get no results.
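The caveat can be seen in a minimal sketch. The two rows below are hypothetical markup invented for illustration, not copied from the real page: one tr has a single child, the other has two td children.

```python
from bs4 import BeautifulSoup

# Hypothetical rows for illustration; the real page's markup may differ.
one_child = BeautifulSoup(
    "<tr><td>rapidvideo.com</td></tr>", "html.parser"
).tr
two_children = BeautifulSoup(
    "<tr><td>rapidvideo.com</td><td>Watch</td></tr>", "html.parser"
).tr

# With a single chain of only children, .string resolves to the inner text.
print(one_child.string)         # rapidvideo.com

# With two <td> children, .string is None, so a text regex has nothing to match.
print(two_children.string)      # None

# .get_text() still concatenates the text of all descendants.
print(two_children.get_text())  # rapidvideo.comWatch
```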
What you can do is check the actual text of the tr elements by using a searching function and calling .get_text():
soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
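As a rough end-to-end sketch, the same search can be run against a small inline table instead of the live site. SAMPLE_HTML, the /open/... hrefs, and the helper name has_rapidvideo_text are assumptions made up for this example. The string= keyword is the newer name Beautiful Soup (4.4.0 and later) gives to the text= argument used in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table standing in for the episode page; real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/open/1">rapidvideo.com</a></td><td>Watch</td></tr>
  <tr><td><a href="/open/2">openload.co</a></td><td>Watch</td></tr>
</table>
"""

def has_rapidvideo_text(tag):
    """Return True for <tr> tags whose combined descendant text mentions rapidvideo."""
    return tag.name == "tr" and "rapidvideo" in tag.get_text()

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Regex on the row's .string: every row has several children, so nothing matches.
by_string = soup.find_all("tr", string=re.compile("rapidvideo"))

# Searching function on .get_text(): finds the row that mentions rapidvideo.
by_function = soup.find_all(has_rapidvideo_text)

# Pull the link out of each matching row, skipping rows without an <a> tag.
links = [row.a["href"] for row in by_function if row.a is not None]

print(by_string)  # []
print(links)      # ['/open/1']
```

A named function is used here instead of the lambda only so it can carry a docstring; the matching logic is the same.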