Using JSON or regex when processing tweets

Question

Which method is faster for obtaining the relevant data: a JSON parser (Python 2.6) or regex? Since the amount of data is huge, I presume there will be a considerable difference in time between the two methods.

Recommended answer

Assuming I understand what you are asking...

I believe you're asking whether it's faster to obtain information from a serialized JSON string by deserializing it or by searching for the relevant match via regex.

In my unofficial experience with looking for a single key-value pair in an activity streams object (tweet, retweet or quote) in serialized JSON, using regex scales better than parsing the entire JSON object.

This is because tweets are pretty big. When you're working with hundreds of thousands of them, deserializing the entire JSON string and then randomly accessing the resulting object for a single key-value pair is like using a sledgehammer to crack a nut.
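As a rough illustration of the two approaches being compared, here is a minimal sketch. The tweet payload, the id_str field, and the pattern are assumptions made for the example and are not taken from the answer; real tweets carry many more fields.

```python
import json
import re

# Hypothetical, heavily trimmed tweet used only for illustration.
raw = json.dumps({
    "id_str": "1234567890",
    "text": "hello world",
    "user": {"screen_name": "example_user", "followers_count": 42},
})

# Illustrative pattern: the quoted key, optional whitespace around the colon,
# then a JSON string value that may contain backslash escapes.
ID_PATTERN = re.compile(r'"id_str"\s*:\s*"((?:[^"\\]|\\.)*)"')


def id_via_json(serialized):
    """Deserialize the whole object, then read a single key."""
    return json.loads(serialized)["id_str"]


def id_via_regex(serialized):
    """Scan the serialized text for the first occurrence of the key."""
    match = ID_PATTERN.search(serialized)
    return match.group(1) if match else None
```

Note that the regex version returns the value exactly as it appears in the serialized text, so any JSON escape sequences in it are left undecoded, whereas json.loads decodes them.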

A problem arises, however, when keys are repeated at different levels of nesting.

For example, quotes have a root-level attribute called twitter_quoted_status, which contains a copy of the tweet that the quote object refers to.

That means any attribute name shared by both tweets and quotes would return at least 2 matches if you searched a serialized quote object with regex.

Since you cannot and should not rely on the order of attributes within a JSON object (dictionary keys are supposed to be unordered!), you can't even rely on the match you want being the first or second (or whatever) match, even if that is the pattern you have observed so far.
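The ambiguity can be reproduced with a small sketch. The two strings below are hypothetical serializations of the same quote object with its keys in a different order; the body field name and its values are assumptions for illustration, and only twitter_quoted_status comes from the answer.

```python
import json
import re

# Two hand-written serializations of the same hypothetical quote object.
# They differ only in the order of the root-level keys.
quote_a = ('{"body": "my comment", '
           '"twitter_quoted_status": {"body": "the original tweet"}}')
quote_b = ('{"twitter_quoted_status": {"body": "the original tweet"}, '
           '"body": "my comment"}')

BODY_PATTERN = re.compile(r'"body"\s*:\s*"((?:[^"\\]|\\.)*)"')


def bodies_via_regex(serialized):
    """Return every value found for the key, in textual order."""
    return BODY_PATTERN.findall(serialized)


def root_body_via_json(serialized):
    """Return the root-level value only, regardless of key order."""
    return json.loads(serialized)["body"]
```

With these inputs the regex finds two values in each string, and which one comes first depends on the serialization, while the deserialized lookup returns the root-level value in both cases.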

The only evidence I can share with you at the moment is that, to retrieve a single key-value pair from 100,000 original tweet objects (no quotes or retweets), my desktop tended to take 8-14 seconds when using the deserialization method and 0-2 seconds when using regex.

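The numbers are approximate and from memory. Sorry, this is only a quick answer; I don't have the tools available right now to test this and post the results.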
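For anyone who wants to measure this on their own data, a timing harness along the following lines may be a starting point. It is a sketch under stated assumptions: the synthetic corpus, the id_str field, and the helper names are all hypothetical, and the synthetic objects are far smaller than real tweets, so the timings it produces say nothing about the figures quoted above.

```python
import json
import re
import timeit

ID_PATTERN = re.compile(r'"id_str"\s*:\s*"((?:[^"\\]|\\.)*)"')


def make_corpus(n):
    """Build n small synthetic tweet-like JSON strings (hypothetical fields)."""
    return [
        json.dumps({
            "id_str": str(i),
            "text": "tweet number %d" % i,
            "user": {"screen_name": "user%d" % i},
            "entities": {"hashtags": [], "urls": []},
        })
        for i in range(n)
    ]


def extract_with_json(corpus):
    """Deserialize every string and read one key."""
    return [json.loads(s)["id_str"] for s in corpus]


def extract_with_regex(corpus):
    """Search every string for the first occurrence of the key."""
    return [ID_PATTERN.search(s).group(1) for s in corpus]


def compare(n=1000, repeat=3):
    """Return (json_seconds, regex_seconds), best of `repeat` runs each."""
    corpus = make_corpus(n)
    json_time = min(timeit.repeat(lambda: extract_with_json(corpus),
                                  number=1, repeat=repeat))
    regex_time = min(timeit.repeat(lambda: extract_with_regex(corpus),
                                   number=1, repeat=repeat))
    return json_time, regex_time
```

Calling compare(n=100000) would approximate the scale mentioned in the answer. Swapping make_corpus for a loader that reads real serialized tweets would give more representative results.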
