Regex to extract all Starred Items URLs from Google Reader JSON file

Published: 2018/6/21 17:24:43 · Tags: html, regex, json, url


Problem description

Sadly, it was announced that Google Reader will be shut down in the middle of the year. Since I have a large number of starred items in Google Reader, I'd like to back them up. This is possible via Google Reader takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this file, which is several MB in size.

At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches only the article URLs. That would avoid extracting other URLs that are not needed.

Here is a short example of what parts of the JSON file look like:
"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
I just need the URLs found here:
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
Perhaps someone can say what a regex would have to look like to extract all of these URLs?

The benefit would be a quick and dirty way to extract the starred item URLs from Google Reader so that, once processed, they can be imported into services like Pocket or Evernote.
Solution
I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.

I would start with grep, rather than a regex. The -A1 parameter says "return the line that matches, and one after":
grep -A1 "canonical" <file>
This will return lines like this:
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Then, I'd grep again for the href:
grep -A1 "canonical" <file> | grep "href"
giving
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Now I can use awk to get just the URL:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }'
which strips the opening quote from the URL but leaves the closing one:
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
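To see why one quote is left over, it may help to look at how awk splits a single line on the `" : "` separator. The snippet below is only an illustrative sketch: the `line` value is a made-up example with a placeholder URL, not data from the real export.

```shell
#!/bin/sh
# A made-up "href" line in the same shape as the export (placeholder URL).
line='  "href" : "http://example.com/article/"'

# With -F'" : "' the separator consumes the quote that closes the key and
# the quote that opens the value, so only the value's closing quote survives.
first=$(printf '%s\n' "$line" | awk -F'" : "' '{ print $1 }')
second=$(printf '%s\n' "$line" | awk -F'" : "' '{ print $2 }')

printf 'field 1: %s\n' "$first"
printf 'field 2: %s\n' "$second"
```

Field 1 holds the indentation plus the opening quote and key name, and field 2 holds the URL followed by its closing quote.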
Now I just need to get rid of the extra quote:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'
That's it!
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
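
The whole pipeline can be tried end to end on a small sample. The script below is a sketch under a few assumptions: the sample file is a copy of the excerpt from the question, `extract_canonical_urls` is a made-up helper name, grep supports -A (GNU and BSD grep do), and the export keeps "canonical" and its "href" on consecutive lines.

```shell
#!/bin/sh
# Write a sample that mirrors the excerpt from the question to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
EOF

# Hypothetical helper wrapping the pipeline from the answer:
#   1. keep each "canonical" line plus the line after it
#   2. keep only the "href" lines
#   3. split on '" : "' and print the URL part
#   4. delete the remaining double quote
extract_canonical_urls() {
    grep -A1 '"canonical"' "$1" \
        | grep '"href"' \
        | awk -F'" : "' '{ print $2 }' \
        | tr -d '"'
}

urls=$(extract_canonical_urls "$sample")
printf '%s\n' "$urls"

rm -f "$sample"
```

Only the canonical URL is printed; the feed URL under "alternate" is skipped because its line does not follow a "canonical" line. If the export were written as a single line with no line breaks, this line-based approach would not apply as is.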
