HTML parsing line by line

Question

I'm working on Python code intended to parse HTML. The objective is to find strings in each line and change them as shown below:

Original: "Criar Alerta"

<li><a href="http://..." target="_blank">Criar Alerta</a></li>

Expected Result: "Create Alert"

<li><a href="http://..." target="_blank">Create Alert</a></li>

To ensure that the new HTML has the same structure as the original, I need to parse the latter line by line, identify each string, and replace it with its equivalent from a dictionary.

I saw here that BeautifulSoup can parse specific tags. I tried it, but I'm not sure about the result.

So my question is: given that BeautifulSoup works with tags, and each line contains multiple tags, is it possible to parse line by line with it?

Thanks in advance,

Tiago

Recommended answer

I believe that the following is what you are looking for.

Let's use three lines, two of which contain words from your dictionary and one of which doesn't, just to test the code:

rep = """
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Nunca</a></li>
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Criar Alerta</a></li>
    """

And use your dictionary (hint: it's never a good idea to name a dictionary dict, because that shadows the built-in type; it's just asking for trouble somewhere down the road...):
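As a small illustration of that hint, the sketch below shows what can go wrong when the name dict is reused for a variable. The function shadowing_demo is a hypothetical name introduced only for this example, and the shadowing is confined to the function so the rest of a script is not affected.

```python
def shadowing_demo():
    # Deliberately reuse the built-in name for a local variable (do not do this).
    dict = {"Ajuda": "Help"}
    try:
        # The name now refers to the dictionary above, not to the built-in type,
        # so calling it as a constructor fails.
        dict(a=1)
    except TypeError as exc:
        return str(exc)
    return None


message = shadowing_demo()
print(message)
```

Using a descriptive name such as rep_dict avoids the problem entirely.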

rep_dict = {
"Início": "Start",
"Ajuda": "Help",
"Criar Alerta": "Create Alert",
"Materiais e Estruturas": "Structures and Materials" 
}

Now for the code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(rep, 'lxml')

only_a_tags = soup.find_all('a')

# Print each <a> tag that contains a dictionary key, with the key translated
for tag in only_a_tags:
    for word in rep_dict:
        if word in str(tag):
            print(str(tag).replace(word, rep_dict[word]))

Output:

<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Start</a>
<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Create Alert</a>

The item containing "Nunca" was not printed because "Nunca" is not a key in rep_dict.
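Note that the loop above only prints the matching <a> tags. To address the original goal of keeping the line structure intact, one possible approach is to process the raw text line by line and replace only the text between tags. The following is a minimal sketch under stated assumptions, not a BeautifulSoup-based solution: it uses a regular expression from the standard library instead of an HTML parser, the helper names translate_line and translate_html are hypothetical, and it assumes that attribute values contain no < or > characters and that a text node, once surrounding whitespace is stripped, matches a dictionary key exactly.

```python
import re

# Same dictionary as in the answer above.
rep_dict = {
    "Início": "Start",
    "Ajuda": "Help",
    "Criar Alerta": "Create Alert",
    "Materiais e Estruturas": "Structures and Materials",
}

# Matches text that sits between a '>' and the next '<', i.e. a text node.
# Assumption: no '<' or '>' appears inside attribute values.
TEXT_NODE = re.compile(r"(?<=>)([^<>]+)(?=<)")


def translate_line(line, mapping):
    """Replace text nodes whose stripped content is a key in `mapping`."""

    def swap(match):
        text = match.group(1)
        key = text.strip()
        if key in mapping:
            # Swap only the key so surrounding whitespace is preserved.
            return text.replace(key, mapping[key])
        return text  # unknown text is left untouched

    return TEXT_NODE.sub(swap, line)


def translate_html(html, mapping):
    """Apply translate_line to every line, keeping line breaks intact."""
    lines = html.splitlines(keepends=True)
    return "".join(translate_line(line, mapping) for line in lines)


line = '<li><a href="http://..." target="_blank">Criar Alerta</a></li>'
print(translate_line(line, rep_dict))
# <li><a href="http://..." target="_blank">Create Alert</a></li>
```

Because every line is rewritten in place and unmatched lines pass through unchanged, the output keeps the same number of lines and the same tags as the input. A regular expression is not a general HTML parser, so for markup with comments, scripts, or unusual attribute values a real parser would be the safer choice.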
