Best way to extract/format data in JSON format using Python?

Question

I am trying to do some data analysis on bulk patent data (data is usually found here but is currently down - https://ped.uspto.gov/peds/).

Here is the first entry in the JSON file:

{
  "PatentBulkData":[
    {
      "patentCaseMetadata":{
        "applicationNumberText":{
          "value":"15733015",
          "electronicText":"15733015"
        },
        "filingDate":"2020-01-01",
        "applicationTypeCategory":"Utility",
        "partyBag":{
          "applicantBagOrInventorBagOrOwnerBag":[
            {
              "applicant":[
                {
                  "contactOrPublicationContact":[
                    {
                      "name":{
                        "personNameOrOrganizationNameOrEntityName":[
                          {
                            "personStructuredName":{
                              "firstName":"Birol",
                              "middleName":"",
                              "lastName":"Cimen"
                            }
                          }
                        ]
                      },
                      "cityName":"Hengelo",
                      "geographicRegionName":{
                        "value":"",
                        "geographicRegionCategory":"STATE"
                      },
                      "countryCode":"NL"
                    }
                  ]
                }
              ]
            },
            {
              "partyIdentifierOrContact":[
                {
                  "name":{
                    "personNameOrOrganizationNameOrEntityName":[
                      {
                        "personStructuredName":{
                          "lastName":"Oppedahl Patent Law Firm LLC (Mink)"
                        }
                      }
                    ]
                  },
                  "postalAddressBag":{
                    "postalAddress":[
                      {
                        "postalStructuredAddress":{
                          "addressLineText":[
                            {
                              "value":"P O Box 351240"
                            }
                          ],
                          "cityName":"Westminster",
                          "geographicRegionName":[
                            {
                              "value":"CO"
                            }
                          ],
                          "countryCode":"US",
                          "postalCode":"80035"
                        }
                      }
                    ]
                  }
                },
                {
                  "value":"133517"
                }
              ]
            }
          ]
        },
        "groupArtUnitNumber":{
          "value":"3771",
          "electronicText":"3771"
        },
        "applicationConfirmationNumber":"7897",
        "applicantFileReference":"FP01.P035 SST02US",
        "priorityClaimBag":{
          "priorityClaim":[
            {
              "ipOfficeName":"NETHERLANDS",
              "applicationNumber":{
                "applicationNumberText":"2019179"
              },
              "filingDate":"2017-07-05",
              "sequenceNumber":"1"
            }
          ]
        },
        "patentClassificationBag":{
          "cpcClassificationBagOrIPCClassificationOrECLAClassificationBag":[
            {
              "ipOfficeCode":"US",
              "mainNationalClassification":{
                "nationalClass":"606",
                "nationalSubclass":"133000"
              }
            }
          ]
        },
        "businessEntityStatusCategory":"SMALL",
        "firstInventorToFileIndicator":"true",
        "inventionTitle":{
          "content":[
            "Hair removal device for removing body hair on a body surface"
          ]
        },
        "applicationStatusCategory":"Application Dispatched from Preexam, Not Yet Docketed",
        "applicationStatusDate":"2020-05-08",
        "officialFileLocationCategory":"ELECTRONIC",
        "patentPublicationIdentification":{
          "publicationNumber":"US20200170371A1",
          "publicationDate":"2020-06-04"
        },
        "relatedDocumentData":{
          "parentDocumentDataOrChildDocumentData":[
            {
              "descriptionText":"This application is National Stage Entry of",
              "applicationNumberText":"PCT/NL2018/050434",
              "filingDate":"2018-07-04",
              "parentDocumentStatusCode":"Published",
              "patentNumber":""
            }
          ]
        }
      },
      "prosecutionHistoryDataBag":{
        "prosecutionHistoryData":[
          {
            "eventDate":"2020-06-05",
            "eventCode":"PG-ISSUE",
            "eventDescriptionText":"PG-Pub Issue Notification"
          },
          {
            "eventDate":"2020-05-11",
            "eventCode":"M903",
            "eventDescriptionText":"Notice of DO/EO Acceptance Mailed"
          },
          {
            "eventDate":"2020-05-11",
            "eventCode":"FLRCPT.U",
            "eventDescriptionText":"Filing Receipt - Updated"
          },
          {
            "eventDate":"2020-05-11",
            "eventCode":"MPEN",
            "eventDescriptionText":"Mail Pre-Exam Notice"
          },
          {
            "eventDate":"2020-02-26",
            "eventCode":"EML_NTR",
            "eventDescriptionText":"Email Notification"
          },
          {
            "eventDate":"2020-02-26",
            "eventCode":"EML_NTR",
            "eventDescriptionText":"Email Notification"
          },
          {
            "eventDate":"2020-02-26",
            "eventCode":"CCRDY",
            "eventDescriptionText":"Application ready for PDX access by participating foreign offices"
          },
          {
            "eventDate":"2020-01-05",
            "eventCode":"371COMP",
            "eventDescriptionText":"371 Completion Date"
          },
          {
            "eventDate":"2020-02-25",
            "eventCode":"PGPC",
            "eventDescriptionText":"Sent to Classification Contractor"
          },
          {
            "eventDate":"2020-02-25",
            "eventCode":"FTFS",
            "eventDescriptionText":"FITF set to YES - revise initial setting"
          },
          {
            "eventDate":"2020-01-02",
            "eventCode":"PTA.RFE",
            "eventDescriptionText":"Patent Term Adjustment - Ready for Examination"
          },
          {
            "eventDate":"2020-02-26",
            "eventCode":"FLRCPT.O",
            "eventDescriptionText":"Filing Receipt"
          },
          {
            "eventDate":"2020-02-26",
            "eventCode":"M903",
            "eventDescriptionText":"Notice of DO/EO Acceptance Mailed"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"SREXR141",
            "eventDescriptionText":"PTO/SB/69-Authorize EPO Access to Search Results"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"APPERMS",
            "eventDescriptionText":"Applicants have given acceptable permission for participating foreign "
          },
          {
            "eventDate":"2020-02-25",
            "eventCode":"SMAL",
            "eventDescriptionText":"Applicant Has Filed a Verified Statement of Small Entity Status in Compliance with 37 CFR 1.27"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"L194",
            "eventDescriptionText":"Cleared by OIPE CSR"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"WIDS",
            "eventDescriptionText":"Information Disclosure Statement (IDS) Filed"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"WIDS",
            "eventDescriptionText":"Information Disclosure Statement (IDS) Filed"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"BIG.",
            "eventDescriptionText":"ENTITY STATUS SET TO UNDISCOUNTED (INITIAL DEFAULT SETTING OR STATUS CHANGE)"
          },
          {
            "eventDate":"2019-12-31",
            "eventCode":"IEXX",
            "eventDescriptionText":"Initial Exam Team nn"
          }
        ]
      },
      "st96Version":"V3_1",
      "ipoVersion":"US_V8_0"
    },

I import the JSON data as a dictionary. However, what is the best way to obtain the information I would like to retrieve? Should I use pandas' json_normalize to flatten it and convert it to a DataFrame?

I would like to specifically retrieve information in the "prosecutionHistoryData". For example, with other patent applications, this would provide specific information regarding how many office actions have been issued.

Eventually I would like to cross-reference this office action data by Patent Examiner (which would be found in the "applicantBagOrInventorBagOrOwnerBag" when assigned to an Examiner).

Are there any good resources that explain how to clean JSON data so that I can break this information into separate columns?

Thank you for the information! Here is an example with an Examiner:

   {
         "patentCaseMetadata":{
            "applicationNumberText":{
               "value":"16732312",
               "electronicText":"16732312"
            },
            "filingDate":"2020-01-01",
            "applicationTypeCategory":"Utility",
            "partyBag":{
               "applicantBagOrInventorBagOrOwnerBag":[
                  {
                     "primaryExaminerOrAssistantExaminerOrAuthorizedOfficer":[
                        {
                           "name":{
                              "personNameOrOrganizationNameOrEntityName":[
                                 {
                                    "personFullName":"ORGAD, EDAN"
                                 }
                              ]
                           }
                        }
                     ]
                  },
                  {
                     "applicant":[
                        {
                           "contactOrPublicationContact":[
                              {
                                 "name":{
                                    "personNameOrOrganizationNameOrEntityName":[
                                       {
                                          "organizationStandardName":{
                                             "content":[
                                                "Communication Systems LLC"
                                             ]
                                          }
                                       }
                                    ]
                                 },
                                 "cityName":"Santa Fe",
                                 "geographicRegionName":{
                                    "value":"NM",
                                    "geographicRegionCategory":"STATE"
                                 },
                                 "countryCode":""
                              }
                           ]
                        }
                     ]
                  }
               ]
            },
            "groupArtUnitNumber":{
               "value":"2414",
               "electronicText":"2414"
            },
            "applicationConfirmationNumber":"8996",
            "applicantFileReference":"CS1003US03",
            "patentClassificationBag":{
               "cpcClassificationBagOrIPCClassificationOrECLAClassificationBag":[
                  {
                     "ipOfficeCode":"US",
                     "mainNationalClassification":{
                        "nationalClass":"370",
                        "nationalSubclass":"329000"
                     }
                  }
               ]
            },
            "businessEntityStatusCategory":"SMALL",
            "firstInventorToFileIndicator":"true",
            "inventionTitle":{
               "content":[
                  "APPARATUSES, METHODS, AND COMPUTER-READABLE MEDIUM FOR COMMUNICATION IN A WIRELESS LOCAL AREA NETWORK"
               ]
            },
            "applicationStatusCategory":"Docketed New Case - Ready for Examination",
            "applicationStatusDate":"2020-02-07",
            "officialFileLocationCategory":"ELECTRONIC",
            "patentPublicationIdentification":{
               "publicationNumber":"US20200154403A1",
               "publicationDate":"2020-05-14"
            }
         },
         "prosecutionHistoryDataBag":{
            "prosecutionHistoryData":[
               {
                  "eventDate":"2020-05-19",
                  "eventCode":"PG-ISSUE",
                  "eventDescriptionText":"PG-Pub Issue Notification"
               }
            ]
         },
         "assignmentDataBag":{
            "assignmentData":[
               {
                  "reelNumber":"52436",
                  "frameNumber":"295",
                  "documentReceivedDate":"2020-04-20",
                  "recordedDate":"2020-04-20",
                  "mailDate":"2020-04-21",
                  "pageTotalQuantity":3,
                  "conveyanceText":"ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).",
                  "assignorBag":{
                     "assignor":[
                        {
                           "executionDate":"2016-07-14",
                           "contactOrPublicationContact":[
                              {
                                 "name":{
                                    "personNameOrOrganizationNameOrEntityName":[
                                       {
                                          "value":"ATEFI, ALI"
                                       }
                                    ]
                                 }
                              }
                           ]
                        }
                     ]
                  },
                  "assigneeBag":{
                     "assignee":[
                        {
                           "contactOrPublicationContact":[
                              {
                                 "name":{
                                    "personNameOrOrganizationNameOrEntityName":[
                                       {
                                          "value":"COMMUNICATION SYSTEMS LLC"
                                       }
                                    ]
                                 },
                                 "postalAddressBag":{
                                    "postalAddress":[
                                       {
                                          "postalAddressText":[
                                             {
                                                "sequenceNumber":"1",
                                                "value":"530-B HARKLE ROAD"
                                             },
                                             {
                                                "sequenceNumber":"2",
                                                "value":"STE. 100"
                                             },
                                             {
                                                "sequenceNumber":"3",
                                                "value":"SANTA FE NEW MEXICO 87505"
                                             }
                                          ]
                                       }
                                    ]
                                 }
                              }
                           ]
                        }
                     ]
                  },
                  "correspondenceAddress":{
                     "partyIdentifierOrContact":[
                        {
                           "name":{
                              "personNameOrOrganizationNameOrEntityName":[
                                 {
                                    "value":"ALI ATEFI"
                                 }
                              ]
                           },
                           "postalAddressBag":{
                              "postalAddress":[
                                 {
                                    "postalAddressText":[
                                       {
                                          "sequenceNumber":"1",
                                          "value":"530-B HARKLE ROAD"
                                       },
                                       {
                                          "sequenceNumber":"2",
                                          "value":"STE. 100"
                                       },
                                       {
                                          "sequenceNumber":"3",
                                          "value":"SANTA FE, NM 87505"
                                       }
                                    ]
                                 }
                              ]
                           }
                        }
                     ]
                  },
                  "sequenceNumber":"1"
               }
            ],
            "assignmentTotalQuantity":1
         },
         "st96Version":"V3_1",
         "ipoVersion":"US_V8_0"
      },

My parse will not go past the applicantBagOrInventorBagOrOwnerBag. Here is my example parse for trying to obtain the Examiner name, which returns an empty dataframe:

jsonpath_expression = parse('PatentBulkData[*].patentCaseMetadata.partyBag.applicantBagOrInventorBagOrOwnerBag.primaryExaminerOrAssistantExaminerOrAuthorizedOfficer.name.personNameOrOrganizationNameOrEntityName.personFullName[*]')

If I end at the applicantBagOrInventorBagOrOwnerBag, I return a dataframe with proper information - just with brackets and all the other JSON notation. Am I missing the key structure?

Thanks again!

Recommended answer

For parsing more or less complex JSON documents you might wanna take a look at the JSONPath "query language".

There's a nice Python implementation in jsonpath-rw. Since the data you need is nested like this

{
  "PatentBulkData": [
    {
      "prosecutionHistoryDataBag": {
        "prosecutionHistoryData": [
          {
            "eventDate": "2020-06-05",
            "eventCode": "PG-ISSUE",
            "eventDescriptionText": "PG-Pub Issue Notification"
          },

the JSONPath reads as follows: under the key PatentBulkData, get every element of the array, then the key prosecutionHistoryDataBag, then the key prosecutionHistoryData, and finally all array elements under that.

PatentBulkData[*].prosecutionHistoryDataBag.prosecutionHistoryData[*]
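Read as plain Python, that path corresponds roughly to two nested loops, one per `[*]`. The snippet below is only a sketch of that idea: the `data` dictionary is a trimmed stand-in with the same shape as the sample in the question, and the `.get(..., default)` calls are an assumption added so that records without a history bag are skipped rather than raising.

```python
# Trimmed stand-in for the parsed document (same shape as the sample above).
data = {
    "PatentBulkData": [
        {
            "prosecutionHistoryDataBag": {
                "prosecutionHistoryData": [
                    {"eventDate": "2020-06-05", "eventCode": "PG-ISSUE",
                     "eventDescriptionText": "PG-Pub Issue Notification"},
                    {"eventDate": "2020-05-11", "eventCode": "M903",
                     "eventDescriptionText": "Notice of DO/EO Acceptance Mailed"},
                ]
            }
        }
    ]
}

events = []
# PatentBulkData[*]: visit every application record
for application in data["PatentBulkData"]:
    # .prosecutionHistoryDataBag: step into the object (empty dict if absent)
    bag = application.get("prosecutionHistoryDataBag", {})
    # .prosecutionHistoryData[*]: visit every event in the array
    for event in bag.get("prosecutionHistoryData", []):
        events.append(event)

event_codes = [event["eventCode"] for event in events]
print(event_codes)
```

Each `[*]` in the path becomes one `for` loop, which is why leaving one out on an array makes the query match nothing.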

This is what you'd do in Python

import json

import pandas as pd
from jsonpath_rw import parse

# Parse the string containing the whole JSON document.
# your_json_string is a placeholder for your own JSON text.
data = json.loads(your_json_string)

jsonpath_expr = parse('PatentBulkData[*].prosecutionHistoryDataBag.prosecutionHistoryData[*]')

# Extract the raw value from each matching element,
# i.e. every element of the JSON array
matches = [match.value for match in jsonpath_expr.find(data)]

# Create dataframe from the list of dictionaries
df = pd.DataFrame.from_records(matches)

Result:

| eventDate   | eventCode   | eventDescriptionText              |
|-------------|:------------|:----------------------------------|
| 2020-06-05  | PG-ISSUE    | PG-Pub Issue Notification         |
| 2020-05-11  | M903        | Notice of DO/EO Acceptance Mailed |
| 2020-05-11  | FLRCPT.U    | Filing Receipt - Updated          |
| 2020-05-11  | MPEN        | Mail Pre-Exam Notice              |
| 2020-02-26  | EML_NTR     | Email Notification                |

EDIT

For the examiner query, you need to look out for nested arrays. Every time you get to an array in the tree, you need to either get one ([0], [1], etc.) or all the elements in the array ([*]):

examiner_expr = parse(
    "PatentBulkData[*].patentCaseMetadata.partyBag"
    ".applicantBagOrInventorBagOrOwnerBag[*]"
    ".primaryExaminerOrAssistantExaminerOrAuthorizedOfficer[*]"
    ".name.personNameOrOrganizationNameOrEntityName[*]"
    ".personFullName"
)
[match.value for match in examiner_expr.find(data)]
# ['ORGAD, EDAN']
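To cross-reference events by examiner, one possible approach is to walk each application once and pair its examiner name(s) with its event list. The sketch below uses only the standard library. The `data` dictionary is a trimmed stand-in for the real file, `examiner_names` is a hypothetical helper, and the `"(unassigned)"` label for applications without an examiner is an assumption made for illustration.

```python
from collections import Counter

# Trimmed stand-in: the first application has no examiner yet, the second has one.
data = {
    "PatentBulkData": [
        {
            "patentCaseMetadata": {
                "partyBag": {"applicantBagOrInventorBagOrOwnerBag": [{"applicant": []}]}
            },
            "prosecutionHistoryDataBag": {
                "prosecutionHistoryData": [
                    {"eventDate": "2020-06-05", "eventCode": "PG-ISSUE"},
                    {"eventDate": "2020-05-11", "eventCode": "M903"},
                ]
            },
        },
        {
            "patentCaseMetadata": {
                "partyBag": {
                    "applicantBagOrInventorBagOrOwnerBag": [
                        {
                            "primaryExaminerOrAssistantExaminerOrAuthorizedOfficer": [
                                {"name": {"personNameOrOrganizationNameOrEntityName": [
                                    {"personFullName": "ORGAD, EDAN"}
                                ]}}
                            ]
                        }
                    ]
                }
            },
            "prosecutionHistoryDataBag": {
                "prosecutionHistoryData": [
                    {"eventDate": "2020-05-19", "eventCode": "PG-ISSUE"},
                ]
            },
        },
    ]
}

EXAMINER_KEY = "primaryExaminerOrAssistantExaminerOrAuthorizedOfficer"
NAME_KEY = "personNameOrOrganizationNameOrEntityName"


def examiner_names(application):
    """Hypothetical helper: examiner names for one application (may be empty)."""
    parties = (
        application.get("patentCaseMetadata", {})
        .get("partyBag", {})
        .get("applicantBagOrInventorBagOrOwnerBag", [])
    )
    names = []
    for party in parties:                          # ...OwnerBag[*]
        for examiner in party.get(EXAMINER_KEY, []):   # examiner array [*]
            for name in examiner.get("name", {}).get(NAME_KEY, []):  # name array [*]
                if "personFullName" in name:
                    names.append(name["personFullName"])
    return names


events_per_examiner = Counter()
for application in data["PatentBulkData"]:
    events = application.get("prosecutionHistoryDataBag", {}).get(
        "prosecutionHistoryData", []
    )
    # "(unassigned)" is an illustrative label, not part of the data.
    for examiner in examiner_names(application) or ["(unassigned)"]:
        events_per_examiner[examiner] += len(events)

print(dict(events_per_examiner))
```

To count only office actions rather than all events, you could filter `events` on `eventCode` before counting; which codes qualify depends on the USPTO event code list and is not shown in the sample data.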
