How do I remove double quotes from within retrieved JSON data

Problem description

I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.

I fix bugs with regex as they come along, but this particular issue has me stuck. When scraping a job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded in the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into a string, clean out the HTML leftovers, then parse the string as JSON. However, I've hit a snag due to text within the job listing using quotes. Since the actual data is large, I'll just use a substitute.

example_string = '{"Category_A" : "Words typed describing stuff",
                   "Category_B" : "Other words speaking more irrelevant stuff",
                   "Category_X" : "Here is where the "PROBLEM" lies"}'

Now the above won't run in Python, but the string I have that has been extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035

I'm not at all sure how to address this issue.

EDIT: Here's the actual code leading to the error:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re

uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()

listing_soup = BeautifulSoup(page_html, "lxml")

json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings

extracted_json_str = ''.join(json_script)

## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+|  |&nbsp;|amp;|\u2013|</?.{,6}>", # last is to get rid of </p> and </strong>
                                repl='', 
                                string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",
                                repl = r"'",
                                string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
                                repl=r" -",
                                string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
                                repl="",
                                string = extracted_json_str_CLEAN3)

## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)

I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control" ). The way I've been going about extracting information from these job listings, a simple instance of someone using quotes causes my whole approach to blow up. Surely there's got to be a better way to build this script without such liabilities like this (as well as having to use regex to fix each breakdown as they arise).

Thanks!

Solution

# When extracting this, I think you should check for double quotes.
# example:
if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
print(extraction)

In this case you convert each " in the extraction to '. Some conversion is needed, because Python lets you use both kinds of quotes: if you want to use " inside a string, you either delimit the string with the other quote symbol or escape the inner quotes:

example:

"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
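As a small illustration of the third form, Python's json module applies that backslash escaping for you when it serializes a string. The snippet below is only a sketch that reuses the Category_X text from the question; the variable names are made up for the example.

```python
import json

# Text with embedded double quotes, taken from the question's placeholder example.
text = 'Here is where the "PROBLEM" lies'

# json.dumps escapes the inner quotes as \" so the result is valid JSON.
encoded = json.dumps({"Category_X": text})
print(encoded)

# json.loads reverses the escaping and returns the original text.
decoded = json.loads(encoded)
print(decoded["Category_X"])
```

So a string that still carries its \" escapes can be parsed as-is, without replacing the quotes first.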

# in case the condition is met
if "\"" in item:
    # use this
    item = item.replace("\"", "\'")
    # or use this
    item = item.replace("\"", "\\\"")
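One caveat, offered as a sketch rather than a diagnosis of the actual page: if either replacement is run over the whole JSON text, it also changes the quotes that delimit keys and values. It may be safer to parse first and clean the individual values afterwards. It is also possible that the quotes in the listing were already escaped as \" in the page source, in which case the question's last re.sub (the one that deletes every backslash) would be what breaks them. The raw string below is a hypothetical stand-in for the script tag's contents, and strip_backslashes, clean_value, and parse_then_clean are illustrative names, not part of any library.

```python
import json
import re

# Hypothetical stand-in for the text inside <script type="application/ld+json">.
# The inner quotes are escaped as \" which is valid JSON.
raw = r'{"title": "Program Manager", "description": "<p>Perform \"quality control\" checks</p>"}'


def strip_backslashes(text):
    """Mimic the question's last clean-up step, which deletes every backslash."""
    return re.sub(r"\\", "", text)


def clean_value(value):
    """Remove HTML tags and collapse whitespace in an already-parsed string."""
    no_tags = re.sub(r"</?[^>]+>", "", value)
    return re.sub(r"\s+", " ", no_tags).strip()


def parse_then_clean(text):
    """Parse the JSON first, then clean only the top-level string values."""
    # strict=False lets json.loads accept raw control characters
    # (such as newlines or tabs) inside strings.
    data = json.loads(text, strict=False)
    return {k: clean_value(v) if isinstance(v, str) else v for k, v in data.items()}


# Deleting the backslashes turns the escaped quotes into bare ones,
# so the parser can no longer tell where the string ends.
try:
    json.loads(strip_backslashes(raw))
    broke = False
except json.JSONDecodeError:
    broke = True

listing = parse_then_clean(raw)
print(broke)
print(listing["description"])
```

With this ordering the quotes inside the description survive, and the clean-up regexes only ever see plain Python strings instead of JSON syntax.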
