How to use a regular expression to extract JSON fields?


Question

Beginner regex question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract from each line if it has them, ignoring everything else. How would I use a regex (in EditPad or anywhere else) to do this?

Example:

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract URL, TITLE and TAGS.

Recommended answer

/"(url|title|tags)":"((\\"|[^"])*)"/i

I think this is what you're asking for; an explanation follows. This regular expression (delimited by / - you probably won't have to put those in EditPad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title" or "tags". In regular expressions, parentheses create groups by default, and the pipe character alternates between options, like a logical "or". To match these characters literally, you'd have to escape them.

":"

Another literal string.

(

The beginning of another group (group 2).

    (

Another group (group 3).

        \\"

The literal string \" - you have to escape the backslash, because otherwise it would be interpreted as escaping the next character (in most flavors that would match just a bare quote).

        |

or...

        [^"]

Any single character except a double quote. The brackets denote a character class (or set): a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.

    )

The end of group 3...

    *

The asterisk causes the previous expression (here, group 3) to be repeated zero or more times, so the matcher matches anything that could be inside the double quotes of a JSON string.

)"

The end of group 2, and a literal ".
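The breakdown above can also be written out as a commented pattern. This is a minimal sketch in Python, which is an assumption (the question only mentions EditPad); the name FIELD_RE is made up for illustration, and re.IGNORECASE stands in for the trailing i of the original expression.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE so that
# each piece can carry its own comment. FIELD_RE is an illustrative name.
FIELD_RE = re.compile(r'''
    "                   # a literal double quote
    (url|title|tags)    # group 1: one of the three field names
    ":"                 # the literal string ":"
    (                   # group 2: the value of the field
        (               # group 3: one "character" of the value, either
            \\"         #   a backslash followed by a quote
            |           #   or
            [^"]        #   any single character except a double quote
        )*              # end of group 3, repeated zero or more times
    )                   # end of group 2
    "                   # the closing double quote
''', re.VERBOSE | re.IGNORECASE)

line = '"url":"http://www.netcharles.com/orwell/essays.htm",'
match = FIELD_RE.search(line)
print(match.group(1))  # the field name
print(match.group(2))  # the field value, without the surrounding quotes
```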

I've done a few non-obvious things here that may come in handy:

  1. Group 2, when dereferenced using backreferences, will be the actual string assigned to the field. This is useful for getting the actual value.
  2. The i at the end of the expression makes it case insensitive.
  3. Group 1 contains the name of the captured field.
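To see the three points together, here is a small sketch in Python (my choice of language, not something the question specifies). The input line is a shortened, hypothetical variant of the example, with "URL" in upper case to exercise the case-insensitive flag; the names FIELD_RE and fields are illustrative.

```python
import re

# The first expression from the answer; re.IGNORECASE plays the role of /i.
FIELD_RE = re.compile(r'"(url|title|tags)":"((\\"|[^"])*)"', re.IGNORECASE)

# Hypothetical input line, shortened from the example in the question.
line = ('"URL":"http://www.netcharles.com/orwell/essays.htm",'
        '"domain":"netcharles.com",'
        '"title":"Orwell Essays & Journalism Section",'
        '"index":2931')

# Group 1 is the field name, group 2 is the string value.
fields = {m.group(1).lower(): m.group(2) for m in FIELD_RE.finditer(line)}
print(fields)
```

The "domain" and "index" fields are skipped because their names are not in the alternation.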

So I see that the tags are an array. I'll update the regular expression here once I've had a chance to think about it.

Your new regular expression is:

/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i

All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:

\[(S(,S)*)?\]

This matches a literal opening bracket (hence the backslash), optionally followed by a comma-separated list of strings, and then a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly described as "making the previous expression optional", it can also be thought of as matching zero or one times.
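The S shorthand translates directly into code by building the array pattern out of the string pattern. The following is a sketch in Python under that assumption; S and ARRAY_RE are illustrative names, and the test inputs are made up.

```python
import re

# S is the string pattern from the answer: a quote, any run of escaped
# quotes or non-quote characters, and a closing quote.
S = r'"(\\"|[^"])*"'

# \[(S(,S)*)?\] : an opening bracket, optionally a comma-separated list
# of strings, and a closing bracket.
ARRAY_RE = re.compile(r'\[(' + S + r'(,' + S + r')*)?\]')

empty = ARRAY_RE.fullmatch('[]')                        # the ? allows zero strings
single = ARRAY_RE.fullmatch('["orwell"]')
several = ARRAY_RE.fullmatch('["orwell","writing","essays"]')
trailing_comma = ARRAY_RE.fullmatch('["orwell",]')      # every comma needs a string after it
with_spaces = ARRAY_RE.fullmatch('["orwell", "writing"]')  # whitespace is not allowed for

print(bool(empty), bool(single), bool(several))
print(bool(trailing_comma), bool(with_spaces))
```

Note that the pattern as written does not allow whitespace between elements, so it only fits compact JSON like the example in the question.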

With the same S notation, here's the whole dirty regular expression:

/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
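As a rough end-to-end sketch, the full expression can be assembled from S and run over the example from the question. Python is an assumption here, as are the names FIELD_RE, sample and extracted. Decoding group 2 with json.loads is my addition rather than part of the answer: in this expression group 2 still includes the surrounding quotes or brackets, so it is itself a small JSON value.

```python
import json
import re

S = r'"(\\"|[^"])*"'  # the string pattern from the answer

# "(url|title|tags)":(S|\[(S(,S)*)?\]) with the trailing i flag
FIELD_RE = re.compile(
    r'"(url|title|tags)":(' + S + r'|\[(' + S + r'(,' + S + r')*)?\])',
    re.IGNORECASE,
)

# The example lines from the question.
sample = '''"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24'''

extracted = {}
for m in FIELD_RE.finditer(sample):
    # Group 1 is the field name; group 2 is the raw value, quotes or
    # brackets included, which json.loads turns into a str or a list.
    extracted[m.group(1).lower()] = json.loads(m.group(2))

for name, value in extracted.items():
    print(name, value)
```

Like the expression itself, this assumes there is no whitespace around the colon or between array elements.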

If it helps, here's a view of it in action.
