Should I use regex or just DOM/string manipulation?

Question

Sometimes I am not sure when to use one or the other. I usually parse all sorts of things with Python, but I would like to focus this question on HTML parsing.

Personally, I find DOM manipulation really useful when I have to parse more than two regular elements (for example, the title and body of each item in a list of news).

However, I sometimes find myself in situations where it is not clear to me whether to build a regex or to get the desired value by simply manipulating strings. A fictional example: I have to get the total number of photos in an album, and the only way to get it is to parse it out of a counter like this:

(1 of 190)

So I have to get the '190' out of the whole HTML document. I could write a regex for that, although regex is not exactly the best tool for parsing HTML, or at least that is what I have always understood. On the other hand, using the DOM seems like overkill to me, as it is just one simple element. String manipulation seems to be the best way, but I am not really sure whether I should proceed like that in cases like this.

Can you tell me how you would parse this kind of single element from an HTML document using Python (or any other language)?

Answer

It's a subjective question (with subjective answers), but in general I'd try to avoid using regex for parsing HTML/XML, as has been previously discussed on SO. I would use a regex only if the input string with the markup is small and unlikely to get more complex, and the pattern being searched for is unambiguous and easily described as a regex. It's a matter of balancing the right tool for the job against the need to be practical.
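For comparison, here is a rough sketch of what the parser route could look like with only Python's standard library. It assumes the counter sits in a `<span class="photo-count">` element; that class name, `CounterParser` and `total_photos_dom` are all made up for illustration, since the question does not show the real markup.

```python
from html.parser import HTMLParser


class CounterParser(HTMLParser):
    """Collect the text of every <span class="photo-count"> element.

    The class name "photo-count" is an assumption for this sketch.
    Nested <span> elements inside the counter are not handled.
    """

    def __init__(self):
        super().__init__()
        self._inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "photo-count" in classes:
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._inside:
            self.texts[-1] += data


def total_photos_dom(html):
    """Return the total from the first counter element, or None."""
    parser = CounterParser()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        return None
    # "(1 of 190)" -> ["1", "of", "190"]
    words = parser.texts[0].strip().strip("()").split()
    if not words or not words[-1].isdigit():
        return None
    return int(words[-1])


sample = '<div><span class="photo-count">(1 of 190)</span></div>'
print(total_photos_dom(sample))
```

This is clearly more code than a one-line pattern, which is the trade-off described above: it pays off mainly once several values have to be pulled from the same document.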

For your concrete example, I think it'd be OK to start with a regex. But if you find yourself extracting additional information from the input and/or the regex starts to get cumbersome, switch to a parser.
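A minimal sketch of the regex approach for the "(1 of 190)" case, assuming the counter always has the shape "(N of M)" and that the first such match in the document is the one you want. The names `COUNTER_RE` and `total_photos` and the sample markup are invented for illustration.

```python
import re

# Matches a counter such as "(1 of 190)" and captures both numbers.
COUNTER_RE = re.compile(r"\(\s*(\d+)\s+of\s+(\d+)\s*\)")


def total_photos(html):
    """Return the total from the first "(N of M)" counter, or None if absent."""
    match = COUNTER_RE.search(html)
    return int(match.group(2)) if match else None


# Hypothetical page fragment, made up for illustration.
sample_page = '<div id="nav"><span>(1 of 190)</span></div>'
print(total_photos(sample_page))
```

Note that the pattern matches only the text of the counter and never touches the tags around it, which is what keeps it manageable. Once the pattern has to describe markup as well, that is a good sign it is time to switch to a parser.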
