Scraping information within HTML tags in unix with curl and cut
Problem description
I'd like to figure out the simplest way to grab content between HTML/XML tags from a remote resource in unix. Here's how I thought it would work.
curl https://www.google.com | grep "<title>" | cut -d '<title>' -f2 | cut -d '</title>' -f1
So first I curl the resource, then grep out the line with the tag I want (which sometimes means the whole HTML, because many websites are minified these days).
Then I cut out everything after the opening <title> and finally I cut out anything after the closing </title>.
This causes an error though: curl: (23) Failed writing body (0 != 8192). The error seems to happen because cut does not support delimiters that are anything but a single character.
I liked how my approach was a very minimalist way to grab the data I wanted, and I'm looking for another simple way to do it.
Additionally, even if it did work, my solution would only work for a single tag in the HTML. What about a tag that appears many times?
Recommended answer
This set of tools by W3 seems to fit your use case: http://www.w3.org/Tools/HTML-XML-utils/README
For something small enough you can get away with something like this. (I'd use Python+lxml for anything else.)
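A sketch of the Python+lxml route the answer mentions, run from the shell (this assumes python3 with the lxml package is available; the printf stands in for a curl fetch of a real page). Unlike the cut pipeline, an XPath query returns every match, which also covers the case of a tag that appears many times:

```shell
printf '<html><body><h2>One</h2><h2>Two</h2></body></html>' | python3 -c '
import sys
from lxml import html

doc = html.fromstring(sys.stdin.read())
# xpath() returns a list of all matches, one entry per occurrence of the tag
for text in doc.xpath("//h2/text()"):
    print(text)
'
```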