Regex to extract all Starred Items URLs from Google Reader JSON file
Problem description
Sadly, it was announced that Google Reader will be shut down in the middle of the year.
Since I have a large number of starred items in Google Reader, I'd like to back them up.
This is possible via Google Reader takeout, which produces a file in JSON format.
Now I would like to extract all of the article URLs from this file, which is several MB in size.
At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches only the article URLs. That would avoid extracting other URLs that are not needed.
Here is a short example of what parts of the JSON file look like:
"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
"href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
"type" : "text/html"
} ],
I just need the URLs found here:
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
Perhaps someone could show what a regex would have to look like to extract all these URLs?
The benefit would be a quick and dirty way to extract the starred items' URLs from Google Reader so that, once processed, they can be imported into services like Pocket or Evernote.
I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.
I would start with grep, rather than a regex. The -A1 parameter says "return the line that matches, and one after":
grep -A1 "canonical" <file>
This will return lines like this:
"canonical" : [ {
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
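To try this step without the real export, here is a small sketch. The file name starred_sample.json and the example.com URLs are made up for illustration, and the layout assumes the same pretty-printed formatting as the excerpt in the question. When the matches are not adjacent, GNU grep also prints a "--" separator line between the groups; filtering for the href line afterwards drops those separators.

```shell
#!/bin/sh
# Hypothetical sample file that mimics the pretty-printed layout of the export.
sample=starred_sample.json
cat > "$sample" <<'EOF'
{
  "items" : [ {
    "published" : 1359723602,
    "canonical" : [ {
      "href" : "http://example.com/first-article/"
    } ],
    "alternate" : [ {
      "href" : "http://feeds.example.com/~r/first/",
      "type" : "text/html"
    } ]
  }, {
    "published" : 1359723700,
    "canonical" : [ {
      "href" : "http://example.com/second-article/"
    } ],
    "alternate" : [ {
      "href" : "http://feeds.example.com/~r/second/",
      "type" : "text/html"
    } ]
  } ]
}
EOF

# Print every line containing "canonical" plus the one line that follows it.
grep -A1 '"canonical"' "$sample"
```

The "alternate" hrefs never appear in the output because they are more than one line away from a "canonical" line.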
Then, I'd grep again for the href:
grep -A1 "canonical" <file> | grep "href"
giving
"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Now I can use awk to get just the URL:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }'
which strips out everything up to and including the opening quote of the URL, but leaves the closing quote:
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
Now I just need to get rid of the extra quote:
grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'
That's it!
http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
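The finished pipeline can be wrapped in a small shell function so it is easy to rerun on the real export. This is only a sketch: the function names extract_canonical_urls and extract_canonical_urls_awk and the file name starred_sample.json are made up here, and both variants assume the pretty-printed layout shown in the question, with the href on the line directly after "canonical". The second function is an alternative, not part of the steps above: it does the same job in a single awk pass and only accepts an href on the line immediately following a "canonical" key.

```shell
#!/bin/sh
# The grep/awk/tr pipeline from the steps above (function name is hypothetical).
extract_canonical_urls() {
    grep -A1 '"canonical"' "$1" | grep '"href"' | awk -F'" : "' '{ print $2 }' | tr -d '"'
}

# Alternative sketch: a single awk pass (function name is hypothetical).
extract_canonical_urls_awk() {
    awk '
        # Remember that the previous line opened a "canonical" entry.
        /"canonical"[ ]*:/ { want = 1; next }
        want && /"href"/ {
            line = $0
            sub(/^[^:]*:[ ]*"/, "", line)   # drop the key, the colon and the opening quote
            sub(/"[ ]*,?[ ]*$/, "", line)   # drop the closing quote and any trailing comma
            print line
        }
        # Any other line ends the one-line window after "canonical".
        { want = 0 }
    ' "$1"
}

# Hypothetical sample file that mimics the pretty-printed layout of the export.
sample=starred_sample.json
cat > "$sample" <<'EOF'
{
  "items" : [ {
    "canonical" : [ {
      "href" : "http://example.com/first-article/"
    } ],
    "alternate" : [ {
      "href" : "http://feeds.example.com/~r/first/",
      "type" : "text/html"
    } ]
  }, {
    "canonical" : [ {
      "href" : "http://example.com/second-article/"
    } ],
    "alternate" : [ {
      "href" : "http://feeds.example.com/~r/second/",
      "type" : "text/html"
    } ]
  } ]
}
EOF

extract_canonical_urls "$sample"
extract_canonical_urls_awk "$sample"
```

Both approaches depend on line layout rather than on JSON structure, so they would miss entries if the export were minified onto a single line.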