How do I separate text (like names and topics) from punctuation marks


Question



Using Python, I want to separate names (along with initials) and titles from text such as this:

-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.
-Miyashita, A.; Yasuda, A.; Takaya, H.; Toriumi, K.; Ito, T.; Souchi, T.; Noyori, R. Synthesis of 2,2'-bis(diphenylphosphino)-1,1'-binaphthyl (BINAP), an atropisomeric chiral bis(triaryl)phosphine, and its use in the rhodium(I)-catalyzed asymmetric hydrogenation of α-(acylamino)acrylic acids. J. Am. Chem. Soc. 1980, 102, 7932-7934.

What I have tried:

I have worked through various tutorials on machine learning and scikit-learn for more than ten days, and I have also visited various websites. Most of the material was either theoretical or focused on numerical data. I don't want to explore topics that might not be related to my work (I'm a beginner). I was unable to find a proper starting point for solving this problem.

Solution

There are two problems here:


  1. Identify the format.
  2. Define the parsing method.


The main problem is the first one, because there are several formats in which some parts may be absent and there may be multiple authors. To learn about commonly used formats, search the web for "scientific publication reference format".

Once the format has been identified, regular expression operations can be used to split the string.
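As a minimal sketch of that splitting step, the snippet below separates the author list from the rest of a reference with Python's `re` module. The author shape ("Surname, I." entries joined by "; ") is an assumption read off the two examples in the question, and the names `AUTHOR`, `REFERENCE` and `split_authors` are made up for illustration.

```python
import re

# Assumed author shape: capitalised surname, comma, one or more initials,
# e.g. "Kuhn, R." (taken from the examples in the question).
AUTHOR = r"[A-Z][\w'-]*, (?:[A-Z]\.)+"

# Optional leading "-", then authors separated by "; ", then everything else.
REFERENCE = re.compile(
    rf"^-?\s*(?P<authors>{AUTHOR}(?:; {AUTHOR})*)\s+(?P<rest>.+)$"
)


def split_authors(ref):
    """Return (list of authors, remaining text), or None if nothing matches."""
    match = REFERENCE.match(ref.strip())
    if match is None:
        return None
    return match.group("authors").split("; "), match.group("rest")


if __name__ == "__main__":
    print(split_authors("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."))
```

The remaining text still contains the title together with the year, journal and page data. Where the title ends depends on the reference format, which is why the format has to be identified first.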

Using some kind of self-learning here might be a difficult task. A more practical solution would use predefined formats and check whether the input string matches one of them. When matching fails, the failure can be reported and analysed so that a new format can be added.

To automate the task, the format checkers can use some kind of tokens for specific elements that are then translated to corresponding regular expressions.
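One way this token idea could look in Python is sketched below. The token names, the two templates and the helper names (`TOKENS`, `compile_format`, `FORMATS`, `parse_reference`) are all hypothetical. They are only tuned to the two example references in the question, not to any official citation style.

```python
import re

# Hypothetical token vocabulary: each token expands to a named regex group.
_AUTHOR = r"[A-Z][\w'-]*, (?:[A-Z]\.)+"
TOKENS = {
    "AUTHORS": rf"(?P<authors>{_AUTHOR}(?:; {_AUTHOR})*)",
    "TITLE": r"(?P<title>.+?)",
    "JOURNAL": r"(?P<journal>(?:[A-Z][a-z]*\. ?)+)",
    "YEAR": r"(?P<year>\d{4})",
    "VOLUME": r"(?P<volume>\d+)",
    "PAGES": r"(?P<pages>\d+(?:-\d+)?)",
}


def compile_format(template):
    """Translate a template such as "{AUTHORS} {TITLE}." into a compiled regex."""
    # re.split with a capture group alternates literal text (even indexes)
    # and token names (odd indexes).
    pieces = re.split(r"\{(\w+)\}", template)
    body = "".join(
        TOKENS[piece] if index % 2 else re.escape(piece)
        for index, piece in enumerate(pieces)
    )
    return re.compile(r"^-?\s*" + body + "$")


# Assumed formats, derived only from the two examples in the question.
FORMATS = {
    "title_year_page": compile_format("{AUTHORS} {TITLE}, {YEAR}, {PAGES}."),
    "journal_article": compile_format(
        "{AUTHORS} {TITLE}. {JOURNAL} {YEAR}, {VOLUME}, {PAGES}."
    ),
}


def parse_reference(ref):
    """Return (format name, fields) for the first matching format, else None."""
    for name, pattern in FORMATS.items():
        match = pattern.match(ref.strip())
        if match:
            return name, match.groupdict()
    return None  # unmatched: report it and consider adding a new format


if __name__ == "__main__":
    print(parse_reference("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."))
```

With this layout, supporting a new reference style means adding one template line rather than writing a new regular expression by hand. Inputs that return `None` are the ones to collect and analyse.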


I don't have a complete solution for you, but here are two suggestions that I think could be helpful:

1) As Jochen suggested, you could check different formats. You could implement those checks in a way that they don't produce a plain true/false but a certainty of a match (e.g. a float value between 0 and 1). That way, even if there is no 100% match, you could still choose the format that yielded the highest certainty.

2) One element of determining the certainty of a format match could be an automated web search for the extracted title. If you find the title in the search results surrounded by different characters than in your input, that would increase the certainty of the match.
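The first suggestion could be sketched roughly as follows: each format gets a handful of partial checks, and the certainty is the fraction of checks that pass. The check patterns and the names `FORMAT_CHECKS`, `certainty` and `best_format` are assumptions made up for this illustration. A real scorer would probably weight the checks differently.

```python
import re

# Hypothetical partial checks per format; each one tests a single feature.
FORMAT_CHECKS = {
    "title_year_page": [
        r"^-?\s*[A-Z][\w'-]*, (?:[A-Z]\.)+",   # starts with "Surname, I."
        r", \d{4}, \d+\.?$",                    # ends with ", year, page."
    ],
    "journal_article": [
        r"^-?\s*[A-Z][\w'-]*, (?:[A-Z]\.)+",   # starts with "Surname, I."
        r"(?:[A-Z][a-z]*\. ?){2,}\d{4}",        # abbreviated journal, then year
        r"\d{4}, \d+, \d+(?:-\d+)?\.?$",        # ends with "year, volume, pages."
    ],
}


def certainty(ref, checks):
    """Fraction of checks the reference passes, between 0.0 and 1.0."""
    passed = sum(1 for pattern in checks if re.search(pattern, ref))
    return passed / len(checks)


def best_format(ref):
    """Return (format name, certainty) for the highest-scoring format."""
    scores = {
        name: certainty(ref.strip(), checks)
        for name, checks in FORMAT_CHECKS.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    print(best_format("-Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803."))
```

A reference that is missing one part, such as a journal article without volume and pages, still gets a partial score for the closest format instead of being rejected outright.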

