Save troublesome webpage and import back into Python
Question
I am trying to extract some information from a variety of pages and struggling a bit. This shows my challenge:
import requests
from lxml import html
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
response = requests.get(url)
print(response.content)
If you copy the output into Notepad, you cannot find the value "9.20" anywhere in the output (the Team A odds in the bottom right of the webpage). However, if you open the webpage, do a Save-As and then import it back into Python like this, you can locate and extract the 9.20 value:
with open(r'HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', "r") as f:
    page = f.read()

tree = html.fromstring(page)
# the XPath for the Team A odds (the 9.20 value)
output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')
output  # ['9.20']
Not sure why this work-around works; that is beyond me. So what I would like to do is save the webpage to my local drive, open it in Python as above, and carry on from there. But how do I replicate the Save As in Python? This does not work:
import urllib.request
response = urllib.request.urlopen(url)
webContent = response.read().decode('utf-8')
f = open('HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', 'w')
f.write(webContent)
f.flush()
f.close()
It gives me a webpage but it is a fraction of the original page...?
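This is expected: the server only sends the page skeleton, and the odds are filled in afterwards by JavaScript running in the browser. A stand-in string in place of `response.content` (the real page markup differs) illustrates why searching the raw HTML fails:

```python
# The raw HTML contains only an empty container that JavaScript
# populates after the page loads in a browser, so the odds value
# is simply not present in what requests/urllib download.
raw_html = '<tbody id="default-odds"><!-- odds injected by JavaScript --></tbody>'
print("9.20" in raw_html)  # False
```

The browser's Save As captures the page *after* JavaScript has run, which is why the saved copy does contain the value.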
Recommended answer
As @Pedro Lobito said, the page content is generated by JavaScript, so you need a module that can run JavaScript. I would choose requests_html or selenium.
Requests_html
from requests_html import HTMLSession
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
session = HTMLSession()
response = session.get(url)
response.html.render()
result = response.html.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')
print(result)
#['9.20']
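If you specifically want the Save-As workflow from the question, the rendered HTML (available as `response.html.html` after `render()`) can be written to disk and re-parsed later with lxml exactly as before. A minimal sketch, with a short stand-in string in place of the rendered page and a hypothetical filename:

```python
from lxml import html

# Stand-in for response.html.html (the JavaScript-rendered markup)
rendered = '<span id="default-odds"><span>9.20</span></span>'

# Replicate Save As: write the rendered HTML to a local file
with open("match_summary_rendered.html", "w", encoding="utf-8") as f:
    f.write(rendered)

# Re-open and parse it, just like the saved copy in the question
with open("match_summary_rendered.html", "r", encoding="utf-8") as f:
    tree = html.fromstring(f.read())

print(tree.xpath('//*[@id="default-odds"]/span/text()'))  # ['9.20']
```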
Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from lxml import html

url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
dr = webdriver.Chrome()
try:
    dr.get(url)
    ''' use this when the page has not finished loading yet
    # https://selenium-python.readthedocs.io/waits.html
    WebDriverWait(dr, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
    '''
    tree = html.fromstring(dr.page_source)
    # the XPath for the Team A odds (the 9.20 value)
    output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')
    print(output)
except Exception as e:
    raise e
finally:
    dr.close()
# ['9.20']