UTF-8 issue with CoreNLP server
Problem description
I run a Stanford CoreNLP server with the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
I try to parse the sentence "Who was Darth Vader’s son?". Note that the apostrophe after "Vader" is not an ASCII character.
The online demo parses the sentence successfully, whereas the server I run on localhost fails.
I also tried to perform the query using Python.
import requests
url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
r=requests.post(url, params={'properties' : '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'}, data=sentence.encode('utf8'))
tree = r.json()
The last command raises an exception:
ValueError: Invalid control character at: line 1 column 1172 (char 1171)
However, I noticed occurrences of the character \x00 in the response text (i.e. r.text). If I remove them, the JSON parsing succeeds:
import json
tree = json.loads(r.text.replace('\x00', ''))
Finally, r.encoding is ISO-8859-1, even though I did not use the option -strict to run the server. Note that manually setting it to UTF-8 does not change anything.
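To make the ISO-8859-1 versus UTF-8 distinction concrete, the following minimal sketch shows what happens to the apostrophe from the example sentence when its UTF-8 bytes are decoded with the wrong charset. It only illustrates a charset mismatch in general; it is not a reproduction of the server bug and does not by itself explain the \x00 characters. The variable names are illustrative.

```python
# The non-ASCII apostrophe from the example sentence (U+2019).
apostrophe = "\u2019"

# In UTF-8 this single character occupies three bytes.
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes)

# Decoding those bytes as ISO-8859-1 maps each byte to its own character,
# so one apostrophe turns into three unrelated characters.
misdecoded = utf8_bytes.decode("iso-8859-1")
print(repr(misdecoded), len(misdecoded))

# The damage is reversible as long as the wrong charset is known:
# re-encode with ISO-8859-1, then decode as UTF-8.
restored = misdecoded.encode("iso-8859-1").decode("utf-8")
print(restored == apostrophe)
```

A mismatch of this kind between what the client sends and what the server assumes is one plausible reason a non-ASCII sentence is handled differently from an ASCII one.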
If I run the same code with url = 'http://localhost:9000/' replaced by url = 'http://corenlp.run/', everything succeeds: the call r.json() returns a dict, r.encoding is indeed UTF-8, and the text contains no \x00 characters.
What is wrong with the CoreNLP server I run?
Recommended answer
This is a known bug in the 3.6.0 release. If you build the server from GitHub, it should work properly with UTF-8 characters. Setting the appropriate Content-Type header in the request will also fix this issue (see https://github.com/stanfordnlp/CoreNLP/issues/125).
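As a rough sketch of the header workaround, the request can declare its charset explicitly. The exact header value is an assumption here (the answer does not spell it out; see the linked issue for details), and build_corenlp_request is a hypothetical helper name. The sketch uses only the standard library and builds the request without sending it.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(sentence, base_url="http://localhost:9000/"):
    """Build a POST request for a CoreNLP server with an explicit charset."""
    properties = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return urllib.request.Request(
        base_url + "?" + query,
        data=sentence.encode("utf-8"),
        # Assumed header value: declares that the body is UTF-8 encoded text.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_corenlp_request("Who was Darth Vader\u2019s son?")
print(request.get_method(), request.full_url)

# Sending it requires a running server, so this part is left commented out:
# with urllib.request.urlopen(request) as response:
#     tree = json.loads(response.read().decode("utf-8"))
```

With the requests library used in the question, the same dictionary can be passed through the headers argument of requests.post.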