Python: Getting text from HTML using BeautifulSoup

Views: 292 | Published: 2016/8/5 19:12:44 | Tags: python, html, beautifulsoup, html-parsing

Question

I am trying to extract the ranking text from this link (example: the Kaggle user ranked no. 1). It is clearer in an image:

I use the following code:

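import requests
from bs4 import BeautifulSoup


def get_single_item_data(item_url):
    # Download the page and parse the raw HTML returned by the server
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")
    # Look up every <h4> element bound to rankingText and print its text
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)


item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)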

The result is None. The problem is that soup.findAll('h4', {'data-bind': "text: rankingText"}) outputs:

[<h4 data-bind="text: rankingText"></h4>]

But when inspecting the HTML of the page in the browser, the element looks like this:

<h4 data-bind="text: rankingText">1st</h4>. This can be seen in the image:

It's clear that the text is missing. How can I get around that?

Edit: Printing the soup variable in the terminal, I can see that this value exists:

So there should be a way to access it through soup.

Edit 2: I tried unsuccessfully to use the most-voted answer from this Stack Overflow question (http://stackoverflow.com/questions/24118337/fetch-data-of-variables-inside-script-tag-in-python-or-content-added-from-js). There could be a solution along those lines.

Recommended answer

If you aren't going to try browser automation through selenium as @Ali suggested, you would have to parse the JavaScript containing the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, then extracts the profile object, loads it with json into a Python dictionary, and prints out the desired ranking:

import re
import json

from bs4 import BeautifulSoup
import requests


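# Fetch the raw HTML; the ranking lives in an inline script, not in the <h4>
response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Find the <script> whose contents contain the "profile: {...}," object
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", string=pattern)

# Cut out the object literal and parse it as JSON
profile_text = pattern.search(script.string).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])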

Prints:

1 1st
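
Since the live page may change, the same idea can be tried offline. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical stand-in for the downloaded page (its layout is an assumption, not the real Kaggle markup), and extract_profile is an illustrative helper name. It shows the empty h4 yielding None and the value being recovered from the inline script with the same pattern as above.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the <h4> is empty in the raw
# HTML, and the value only appears inside an inline script.
SAMPLE_HTML = """
<html><body>
<h4 data-bind="text: rankingText"></h4>
<script>
var model = {
    profile: {"ranking": 1, "rankingText": "1st"},
    other: 0
};
</script>
</body></html>
"""

# Same pattern as in the answer; it assumes the object is followed by a comma.
PROFILE_PATTERN = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)


def extract_profile(html):
    """Return the embedded profile as a dict, or None if no script matches."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", string=PROFILE_PATTERN)
    if script is None:
        return None
    profile_text = PROFILE_PATTERN.search(script.string).group(1)
    return json.loads(profile_text)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
empty_heading = soup.find("h4", {"data-bind": "text: rankingText"})
print(empty_heading.string)  # the tag has no children, so .string is None

profile = extract_profile(SAMPLE_HTML)
print(profile["ranking"], profile["rankingText"])
```

Returning None when no script matches avoids an AttributeError if the page no longer embeds the profile object. The greedy pattern is kept from the answer; it would need adjusting if a later "}," appeared after the profile object in the same script.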
