Python: Getting text from HTML using BeautifulSoup


Question

I am trying to extract the ranking text from a Kaggle user profile page, for example the profile of the user ranked no. 1.

I am using the following code:

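import requests
from bs4 import BeautifulSoup


def get_single_item_data(item_url):
    # Download the page and parse the HTML exactly as the server returned it
    response = requests.get(item_url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Look for the <h4> element that is bound to the ranking text
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)


item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)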

The result is None. The problem is that soup.findAll('h4', {'data-bind': "text: rankingText"}) returns:

[<h4 data-bind="text: rankingText"></h4>]

but when I inspect the page's HTML in the browser, the element looks like this:

<h4 data-bind="text: rankingText">1st</h4>

It's clear that the text is missing from what BeautifulSoup returns. How can I get around that?

When I print the soup variable in the terminal, I can see that this value is present in the output.

So there should be a way to access it through soup.

Edit 2: I tried, unsuccessfully, to use the most-voted answer from this Stack Overflow question. There could be a solution around there.

Recommended answer

If you aren't going to try browser automation through selenium, as @Ali suggested, you have to parse the JavaScript that contains the desired information. There are different ways to do this. The following working code locates the script with a regular expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints the desired ranking:

import re
import json

from bs4 import BeautifulSoup
import requests


response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Find the inline <script> whose text contains the "profile: {...}," object
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", string=pattern)

# Pull out the object literal and parse it as JSON
profile_text = pattern.search(script.string).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])

Prints:

1 1st
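
The greedy pattern above captures everything up to the last "}," in the script, so it could capture too much if further objects follow the profile. As a possible alternative, here is a minimal sketch that uses only the standard library and lets json.JSONDecoder.raw_decode stop at the end of the first complete JSON value. SAMPLE_SCRIPT and extract_profile are hypothetical names made up for illustration, and the sample text is an assumption about the script's shape, not a copy of the real page. The sketch also assumes that the object after "profile:" is valid JSON.

```python
import json
import re

# Hypothetical sample standing in for the inline <script> text on the page.
SAMPLE_SCRIPT = """
var app = init({
    profile: {"userName": "example", "ranking": 1, "rankingText": "1st", "tags": ["a, b", "c"]},
    other: {"x": 1},
});
"""


def extract_profile(script_text):
    """Return the object that follows 'profile:' as a dict, or None if absent."""
    match = re.search(r"profile:\s*", script_text)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever text follows it.
    profile, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return profile


profile = extract_profile(SAMPLE_SCRIPT)
print(profile["ranking"], profile["rankingText"])
```

With a real page, script_text would be the text of the script tag found by BeautifulSoup, as in the code above.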
