Format HTML code with Python


Question

I have a list of URLs in one column of a CSV file. I would like to use Python to go through all the URLs, download a specific part of the HTML code from each URL, and save it to the next column.

For example, from this URL I would like to extract the following div and write it to the next column:

<div class="info-holder" id="product_bullets_section">
<p>
VM−2N ist ein Hochleistungs−Verteilverstärker für Composite− oder SDI−Videosignale und unsymmetrisches Stereo−Audio. Das Eingangssignal wird entkoppelt und isoliert, anschließend wird das Signal an zwei identische Ausgänge verteilt.
<span id="decora_msg_container" class="visible-sm-block visible-md-block visible-xs-block visible-lg-block"></span>
</p>
<ul>
<li>
<span>Hohe Bandbreite — 400 MHz (–3 dB).</span>
</li>
<li>
<span>Desktop–Grösse — Kompakte Bauform, zwei Geräte können mithilfe des optionalen Rackadapters RK–1 in einem 19 Zoll Rack auf 1 HE nebeneinander montiert werden.</span>
</li>
</ul>
</div>

I have this code; the downloaded HTML is stored in the variable html:

import csv
import urllib.request

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

with open("results.csv", "w", newline="") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()

Currently the HTML code is pretty ugly. How can I delete everything from the variable html except the div I would like to extract?

Recommended answer

Given that all of your web pages have the same structure, you can parse the HTML with this code. It looks for the first div with the id product_bullets_section. An id in HTML should be unique, but the given website has two elements with the same id, so we take the first match by indexing and convert the parsed div back to a string containing your HTML.

import csv
import urllib.request

from bs4 import BeautifulSoup

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

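# For a quick test without the CSV file, use a single URL instead:
# items = ['https://www.kramerav.com/de/Product/VM-2N']
with open("results.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()
        # Name the parser explicitly; [0] takes the first of the duplicate ids.
        the_div = str(BeautifulSoup(html, "html.parser").select('div#product_bullets_section')[0])
        # Write the URL and the extracted div as two columns.
        csv_writer.writerow([item, the_div])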
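
If installing BeautifulSoup is not an option, the same idea (keep only the first div with a given id and discard the rest of the page) can be sketched with the standard library's html.parser module. This is only a minimal sketch and not the method used in the answer above: the names DivExtractor and extract_div are hypothetical, comments inside the div are dropped, and the HTML is assumed to have been decoded to a str already (for example with .decode("utf-8"), if that is the page's encoding).

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        # Keep entities such as &amp; untouched so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # how many <div> levels deep we are inside the target
        self.parts = []     # pieces of markup collected so far
        self.done = False   # True once the first matching div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Outside the target: only a div with the right id starts capturing.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as they appear in the source.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # ignore any later div with the same id

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def extract_div(html, target_id):
    """Return the first matching div as a string, or "" if there is none."""
    parser = DivExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Unlike indexing with [0], which raises an IndexError when a page has no matching div, this sketch returns an empty string in that case, so one unusual page does not stop the whole loop over the CSV file.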
