Regex to extract all Starred Items URLs from Google Reader JSON file

Problem description

Sadly, it was announced that Google Reader will be shut down in the middle of the year. Since I have a large number of starred items in Google Reader, I'd like to back them up. This is possible via Google Reader Takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this file, which is several MB in size.

At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches only the article URLs. That avoids extracting other URLs that are not needed.

Here is a short example of what part of the JSON file looks like:

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I just need the URLs found here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

Could anyone say what a regex would have to look like to extract all of these URLs?

The benefit would be a quick and dirty way to extract starred-item URLs from Google Reader so that, once processed, they can be imported into services like Pocket or Evernote.

Solution

I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.

I would start with grep, rather than a regex. The -A1 parameter says "return the line that matches, and one after":

grep -A1 "canonical" <file>

This will return lines like this:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
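To see this step on more than one entry, here is a minimal sketch. The sample file and its contents are made up for illustration: the first entry copies the excerpt from the question, and the second URL (`http://example.com/second-article/`) is a placeholder. With GNU grep, groups of matches that are not adjacent are separated by a `--` line, which the next grep step filters out anyway.

```shell
# Build a tiny, hypothetical sample that mimics the layout of the export.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
"canonical" : [ {
  "href" : "http://example.com/second-article/"
} ],
EOF

# Each matching line is printed together with the one line that follows it.
# The "alternate" block is not matched, so its href never shows up.
matches=$(grep -A1 '"canonical"' "$sample")
printf '%s\n' "$matches"
rm -f "$sample"
```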

Then, I'd grep again for the href:

grep -A1 "canonical" <file> | grep "href"

giving

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' 

This strips out the opening quote of the URL but leaves the closing one:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

That's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
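The whole pipeline can be wrapped in a small shell function so it is easy to rerun. This is only a sketch: the function name `extract_canonical_urls` and the sample data are hypothetical, and it assumes, as in the excerpt from the question, that each "href" line directly follows its "canonical" line and that keys and values are separated by `" : "`.

```shell
# Hypothetical helper: print the canonical URL of every entry in the given file.
# Assumes the "href" line directly follows the "canonical" line, as in the excerpt.
extract_canonical_urls() {
  grep -A1 '"canonical"' "$1" \
    | grep '"href"' \
    | awk -F'" : "' '{ print $2 }' \
    | tr -d '"'
}

# Made-up sample data in the same layout as the export.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
"canonical" : [ {
  "href" : "http://example.com/second-article/"
} ],
EOF

# Collect one URL per line; the "alternate" feed URL is skipped.
urls=$(extract_canonical_urls "$sample")
printf '%s\n' "$urls"
rm -f "$sample"
```

On a real export the call would look something like `extract_canonical_urls starred.json > urls.txt`, where both file names are assumptions. If the export is formatted differently (for example, minified onto a single line), this line-based approach would not apply as written.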
