Parsing an HTML Document with Python


Question

I am completely new to Python. I am trying to parse an HTML document to remove the tags, keeping only the title and the body of a page from a newspaper website that I previously downloaded to my computer.

I am using the HTMLParser class I found in the documentation, but I don't know how to use it well, and I don't understand this language very well yet :(

Here is my code:

# import the HTMLParser class
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    container = ""

    def handle_data(self, data):
        if (data == '\n'):
            pass
        elif (data == " "):
            pass
        else:
            self.container += data

        return self.container

parser = HTMLCleaner()

# open a file in order to parse it
archivo = open("C://Users//jotab//OneDrive//Documentos//Git//SRI//SRI_PR0//coleccionESuja2019//es_26142.html", "r", encoding="utf8")


while True:
    line = archivo.readline()
    if line == "":
        break
    else:
        parser.feed(line)

print(parser.container)

I am doing this because after parsing I get a lot of "\n" lines and a lot of " " lines. But when I check whether a line is just whitespace, the comparison returns False, even though both values look exactly the same in the debugger.

I don't know why this happens. If someone could help me parse this, it would be great.

Recommended answer

Based on the code you provided, it looks like you are trying to open an HTML file that you have saved locally.

Instead of parsing the HTML file line by line as you are doing, just feed the parser the entire HTML file.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()

with open(r'C:\Users\...site_1.html', "r") as f:
    page = f.read()  # read the whole file into a single string

# feed() expects a str, so pass the file contents directly
parser.feed(page)

Python's HTML parser requires the argument to feed() to be a string. You could also copy and paste the entire HTML into the feed() call as a string literal. It might not be best practice, but it should read and parse the HTML:

parser.feed("THE ENTIRE HTML AS STRING HERE")

I hope this helps.

Edit: Have you tried getting the HTML into a string as you already do, and then calling str.strip() on that string to remove all leading and trailing whitespace?
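As a minimal sketch of the str.strip() idea applied inside the parser: a text chunk such as "\n    " is not equal to "\n" or " ", which is a likely reason the equality checks in the question return False, whereas data.strip() is empty for any whitespace-only chunk. The class name TextCollector, the sample string, and the decision to skip script and style content are assumptions made for illustration, not part of the original code.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text and skips whitespace-only chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # non-empty text pieces, in document order
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # strip() turns "\n", " ", "\n    " and similar into "",
        # so a single truthiness test covers every whitespace-only chunk
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


# Hypothetical sample standing in for the downloaded newspaper page
sample = (
    "<html><head><title>Headline</title></head>\n"
    "<body>\n  <p>First paragraph.</p>\n  \n</body></html>"
)

collector = TextCollector()
collector.feed(sample)
result = "\n".join(collector.chunks)
print(result)
```

With a real file, the string passed to feed() would be the result of f.read() rather than the sample literal.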

FYI, you can also use sentence.replace(" ", "") to remove all space characters from a string. Note that this also removes the spaces between words.
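To make the difference between these calls concrete, here is a small comparison on a made-up string (the variable names and the sample text are assumptions). The third variant, " ".join(raw.split()), is not mentioned in the answer; it is included only as a common way to collapse whitespace runs while keeping single spaces between words.

```python
raw = "\n   Breaking news:  markets rally  \n"

# strip(): removes leading and trailing whitespace only
trimmed = raw.strip()

# replace(" ", ""): removes every space character, including the ones
# between words, but leaves newlines untouched
no_spaces = raw.replace(" ", "")

# split() + join(): collapses every run of whitespace to a single space
collapsed = " ".join(raw.split())

print(repr(trimmed))
print(repr(no_spaces))
print(repr(collapsed))
```

For article text, strip() or the split/join variant is usually closer to what is wanted, since replace(" ", "") glues the words together.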

Hope this helps.
