Scraping information within HTML tags in Unix with curl and cut

Question

I'd like to figure out the simplest way to grab content between HTML/XML tags from a remote resource in Unix. Here's how I thought it would work:

curl https://www.google.com | grep "<title>" | cut -d '<title>' -f2 | cut -d '</title>' -f1

First I curl the resource, then grep out the line containing the tag I want (which sometimes means the whole HTML document, because many websites are minified these days).

Then I keep everything after the opening <title>, and finally I cut off everything after the closing </title>.

This causes an error, though: curl: (23) Failed writing body (0 != 8192). The error seems to happen because cut only accepts a single character as its delimiter, so it exits immediately and the pipe that curl is writing into gets closed.

I liked that my approach was a very minimalist way to grab the data I wanted, and I am looking for another simple way to do it.

Additionally, even if it did work, my solution would only handle a single tag in the HTML. What about a tag that appears many times?

Recommended answer

This set of tools by the W3C seems to fit your use case: http://www.w3.org/Tools/HTML-XML-utils/README

For something small enough, you can get away with a quick one-liner. (I'd use Python+lxml for anything else.)
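
As a minimal sketch of that small-case approach (not the snippet from the original answer), the pipeline below swaps cut for grep -o plus sed. It assumes a grep that supports the -o option, which GNU and BSD grep do but POSIX does not require, and an element that opens and closes on one line with no nested markup. The function name extract_tag is hypothetical.

```shell
#!/bin/sh
# extract_tag TAG: read HTML on stdin and print the text of every
# <TAG>...</TAG> element, one per line.
# Hypothetical helper; assumes grep supports -o and that each element
# opens and closes on the same line without nested tags.
extract_tag() {
    tag=$1
    # -o prints each match on its own line, so repeated tags are handled.
    # [^<]* stops at the next '<', so a match never spans two elements.
    grep -oE "<$tag( [^>]*)?>[^<]*</$tag>" |
        # Every match starts with the opening tag and ends with the
        # closing tag; strip both and keep the text in between.
        sed -e 's/^<[^>]*>//' -e 's/<[^>]*>$//'
}

# Example with inline input instead of a network request:
printf '<title>Example</title><ul><li class="x">a</li><li>b</li></ul>\n' |
    extract_tag li
```

Against a live page the same function would sit at the end of the original pipeline, for example curl -s https://www.google.com | extract_tag title, where -s silences curl's progress meter. Because grep -o emits one line per match, the same call covers a tag that appears many times.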

Remember: you can't parse [X]HTML with regex.
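
To illustrate that caveat, here is a small sketch using made-up sample markup and, again, a grep that supports -o. A line-based pattern misses an element that is split across lines. Joining the lines first rescues this one case, but it does not turn the pattern into a parser.

```shell
#!/bin/sh
# Made-up sample: a title split across two lines, as pretty-printed HTML often is.
html='<head>
<title>Split
title</title>
</head>'

# Line-based matching finds nothing, because no single line holds both tags.
printf '%s\n' "$html" | grep -o '<title>[^<]*</title>' || echo 'no match'

# Joining the lines first makes this particular case work, but nested markup,
# comments or CDATA sections would still defeat the pattern.
printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<title>[^<]*</title>'
```

This is why a real parser, such as the HTML-XML-utils tools or lxml, is the safer choice for anything beyond a quick one-off.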
