Python: Getting text from HTML using BeautifulSoup
Question
I am trying to extract the ranking text from a Kaggle user profile page, for example the "1st" shown for the user ranked no. 1. It is clearer in an image:
I am using the following code:
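import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    # Download the page and parse the raw HTML
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")
    # Look for the <h4> element that should hold the ranking text
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)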
The result is None. The problem is that soup.findAll('h4', {'data-bind': "text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but when inspecting the page's HTML in the browser, the element looks like this:
<h4 data-bind="text: rankingText">1st</h4>
It can be seen in the image:
It's clear that the text is missing from the downloaded HTML. How can I get around that?
Printing the soup variable in the terminal, I can see that this value exists somewhere in the page source, so there should be a way to access it through soup.
Edit 2: I tried unsuccessfully to use the most voted answer from this Stack Overflow question: http://stackoverflow.com/questions/24118337/fetch-data-of-variables-inside-script-tag-in-python-or-content-added-from-js. A solution may lie along those lines.
Recommended answer
If you aren't going to try browser automation through selenium as @Ali suggested, you would have to parse the JavaScript containing the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, then extracts the profile object, loads it with json into a Python dictionary and prints out the desired ranking:
import re
import json

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Match the "profile: {...}," assignment inside the inline JavaScript
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)

# Find the <script> element whose text contains the profile object
script = soup.find("script", string=pattern)

# Pull out the JSON text and load it into a Python dictionary
profile_text = pattern.search(script.string).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])
Prints:
1 1st
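Since the live page may change, the same idea can be tried offline. The following is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the page source (the real markup may differ), and extract_profile is a hypothetical helper name. Instead of the greedy regular expression above, it uses json.JSONDecoder().raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows it.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<h4 data-bind="text: rankingText"></h4>
<script>
  var model = {
    profile: {"userName": "example", "ranking": 1, "rankingText": "1st"},
    other: {"unused": true}
  };
</script>
</body></html>
"""


def extract_profile(html):
    """Return the profile object embedded in a <script> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    marker = re.compile(r"profile:\s*")
    for script in soup.find_all("script"):
        text = script.string or ""
        match = marker.search(text)
        if match:
            # Parse exactly one JSON value that starts right after "profile:"
            profile, _ = json.JSONDecoder().raw_decode(text, match.end())
            return profile
    return None


# The <h4> is empty in the static HTML because JavaScript fills it in later.
empty_h4 = BeautifulSoup(SAMPLE_HTML, "html.parser").find("h4").string
print(empty_h4)  # None

profile = extract_profile(SAMPLE_HTML)
print(profile["ranking"], profile["rankingText"])  # 1 1st
```

This also shows why the original attempt returned None: the h4 element has no text until the browser runs the script, while the data itself sits in the script block.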