How to work around the 100K character limit for the StanfordNLP server?

Question

I am trying to parse book-length blocks of text with StanfordNLP. The HTTP requests work great, but the text length is capped at 100K characters by MAX_CHAR_LENGTH in StanfordCoreNLPServer.java, which does not appear to be configurable.

For now, I am chopping up the text before I send it to the server, but even when I split on sentence and paragraph boundaries, some useful coreference information is lost between chunks. Presumably I could parse chunks with a large overlap and link them together, but that seems (1) inelegant and (2) like quite a bit of maintenance.
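The overlapping-chunk workaround mentioned above could be sketched roughly as follows. This is only an illustration: the function name `chunk_paragraphs`, the blank-line paragraph delimiter, and the default overlap of two paragraphs are assumptions, not something from the question, and linking coreference chains across the overlapping regions is left out entirely.

```python
def chunk_paragraphs(text, max_chars=100_000, overlap_paragraphs=2):
    """Split text into chunks of whole paragraphs, each at most max_chars long.

    Consecutive chunks share up to `overlap_paragraphs` paragraphs so that
    some context survives across chunk boundaries. Paragraphs are assumed
    to be separated by blank lines. A single paragraph longer than
    max_chars is still emitted as its own (oversized) chunk.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks = []
    start = 0
    while start < len(paragraphs):
        end = start
        size = 0
        while end < len(paragraphs):
            # Account for the "\n\n" separator between joined paragraphs.
            added = len(paragraphs[end]) + (2 if end > start else 0)
            if end > start and size + added > max_chars:
                break
            size += added
            end += 1
        chunks.append("\n\n".join(paragraphs[start:end]))
        if end >= len(paragraphs):
            break
        # Step back to create the overlap, but always move forward
        # by at least one paragraph so the loop terminates.
        start = max(end - overlap_paragraphs, start + 1)
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join(f"Paragraph number {i}." for i in range(10))
    for i, chunk in enumerate(chunk_paragraphs(sample, max_chars=60, overlap_paragraphs=1)):
        print(i, repr(chunk))
```

Each chunk can then be sent to the server separately. The overlap only preserves context near the boundaries; mentions that refer back further than the overlap are still lost.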

Is there a better way to configure the server or the requests, either to remove the need for manual chunking or to preserve the information across chunks?

BTW, I am POSTing using the Python requests module, but I doubt that makes a difference unless a CoreNLP Python wrapper deals with this problem somehow.
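For context, a request of the kind described might look something like the sketch below. It uses the standard-library `urllib` rather than `requests` so that it has no third-party dependency. The host, port, and annotator list are assumptions, and the helper names `build_request` and `annotate` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(text, base_url="http://localhost:9000",
                  annotators="tokenize,ssplit,pos"):
    """Build a POST request for a CoreNLP server.

    The pipeline properties go in the `properties` URL parameter as JSON,
    and the raw text goes in the request body. The base URL and annotator
    list are illustrative defaults only.
    """
    props = {"annotators": annotators, "outputFormat": "json"}
    url = base_url + "/?" + urlencode({"properties": json.dumps(props)})
    return Request(url, data=text.encode("utf-8"), method="POST")


def annotate(text, **kwargs):
    """Send the text to the server and return the parsed JSON response.

    Requires a CoreNLP server to be running at the chosen base URL.
    """
    with urlopen(build_request(text, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `annotate(chunk)` would be made once per chunk. The server enforces its length limit on its own side, so the client code has no influence on it.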

Answer

You should be able to start the server with the flag -maxCharLength -1, which gets rid of the character length limit. Note that this is inadvisable in production: arbitrarily large documents can consume arbitrarily large amounts of memory (and time), especially with things like coref.
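As a rough illustration, the server launch command with this flag could be assembled from Python as shown below. Only `-maxCharLength -1` comes from the answer. The memory setting, classpath, and port are assumptions about a typical setup, and `build_server_command` is a hypothetical helper.

```python
import shlex


def build_server_command(max_char_length=-1, port=9000,
                         memory="4g", classpath="*"):
    """Return the argument list for launching StanfordCoreNLPServer.

    max_char_length=-1 disables the length limit, per the answer above.
    The memory, classpath and port values are illustrative assumptions;
    a classpath of "*" expects to be run from the CoreNLP directory.
    """
    return [
        "java", f"-mx{memory}", "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-port", str(port),
        "-maxCharLength", str(max_char_length),
    ]


if __name__ == "__main__":
    # Print the command; to actually launch it, pass the list to
    # subprocess.Popen with cwd set to the CoreNLP install directory.
    print(shlex.join(build_server_command()))
```

Given the memory warning above, a large finite value for `max_char_length` may be a safer choice than -1 on a shared machine.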

The list of server options should be accessible by calling the server with -help, and the options are also documented in the server's source code.
