Tokenizing issue

Question

I am trying to tokenize a sentence as follows.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)

When I tokenize using tidytext and the code below,

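library(dplyr)    # %>%, mutate(), select()
library(stringr)  # str_extract_all(), str_locate_all()
library(purrr)    # map()
library(tidyr)    # unnest()

AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(c(tokens, locations))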

it gives me a result set in which the punctuation stays attached to the preceding word (the original post showed this in an image).

How do I get the comma and the period as independent tokens, rather than as part of 'occurs,' and 'infusion.' respectively, using tidytext? My tokens should be:

If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.

Recommended answer

Insert a space in front of each punctuation mark you want to keep, then split the sentence at spaces. The replacement is done with fixed = TRUE so that "." is matched literally rather than as a regex wildcard.

include = c(".", ",") #The symbols that should be included

mystr = Section  # copy data
for (mypattern in include){
    mystr = gsub(pattern = mypattern,
                 replacement = paste0(" ", mypattern),
                 x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
#      Tokens
#1         If
#2         an
#3   infusion
#4   reaction
#5     occurs
#6          ,
#7  interrupt
#8        the
#9   infusion
#10         .
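
As an alternative to the replace-then-split loop above, the same tokens can probably be obtained in one pass with a regular expression that treats a token as either a run of letters/digits or a single punctuation character. This is only a sketch using base R; the pattern and the variable names (pattern, matches, result) are illustrative assumptions, not part of the original answer. It also recovers the start and end positions that the question computes with str_locate_all().

```r
# Sketch: split words and punctuation with one regular expression (base R only).
Section <- c("If an infusion reaction occurs, interrupt the infusion.")

# Assumed token definition: a run of letters/digits, or one punctuation character.
pattern <- "[[:alnum:]]+|[[:punct:]]"

# gregexpr() finds every match; regmatches() extracts the matched text.
matches <- gregexpr(pattern, Section)
tokens  <- regmatches(Section, matches)[[1]]

# gregexpr() also reports where each match starts and how long it is,
# which gives the start/end character positions of every token.
start <- as.integer(matches[[1]])
end   <- start + attr(matches[[1]], "match.length") - 1L

result <- data.frame(tokens, start, end, stringsAsFactors = FALSE)
print(result)
```

Note that this pattern makes every punctuation character its own token, so a word such as "don't" or "well-known" would be split into three tokens. If that is not wanted, the loop above, which only separates the symbols listed in include, may be the safer choice.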
