Check for multiple words in string match for text search in R


Question

Presently I have code that works for a single-word search. Can we search for multiple words and write the matched words to a data frame? (For clarification, please refer to this post.) This is akrun's solution, which works for one word. Here is the code:

 library(pdftools)
 library(tesseract)
 library(stringi)   # provides stri_detect_fixed(), used below

 All_files <- Sys.glob("*.pdf")
 v1     <- numeric(length(All_files))
 word   <- "school"
 df     <- data.frame()
 Status <- "Present"

for (i in seq_along(All_files)){
  file_name <- All_files[i]

  cnt <- pdf_info(All_files[i])$pages
  print(cnt)
  for(j in seq_len(cnt)){
      img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
      text     <- ocr(img_file)
      ocr_text <- capture.output(cat(text))
      check    <- sapply(ocr_text, paste, collapse="")
      junk     <- dir(path= paste0(path, "/tiff"), pattern="tiff")
      file.remove(junk)
      br <-if(length(which(stri_detect_fixed(tolower(check),tolower(word)))) <= 0) "Not Present"  
              else "Present" 
      print(br)       
      if(br=="Present") {
         v1[i] <- j
         break}
    }

    Status <- if(v1[i] == 0) "Not Present" else "Present"
    pages  <- if(v1[i] == 0) "-" else 
      paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
    words  <- if(v1[i] == 0) "-" else word
    df     <- rbind(df, cbind(file_name = basename(file_name),
                    Status, pages = pages, words = words))
}

Here we are searching for only one word, i.e. school. Can we search for multiple words such as school, gym, and swimming pool?

Expected output

fileName   Status        Page             Words                    TEXT
test.pdf   Present     test_1             gym            I go gym regularly  
test.pdf   Present     test_3             school     Here is the next school
test1.pdf  Present     test1_4            swimming pool  In swimming pool
test1.pdf  Present     test1_7            gym         next to Gold gym
test2.pdf  Not Present    -               -

fileName = the name of the file

Status = "Present" if any word is found, otherwise "Not Present"

Page = the suffixes "_1", "_3" give the page number on which the word was found; e.g. the word "gym" was found on page "test_1" and the word "school" was found on page "test_3".

Words = the words that were found; e.g. only "gym" and "school" were found, on pages 1 and 3 of test.pdf, and only "swimming pool" and "gym" were found, on pages 4 and 7 of test1.pdf.

TEXT = the text in which the word was found

Any suggestions would be helpful.

Thanks

Recommended answer

The outer loop goes through every PDF in your directory. The inner loop then goes through all pages of that PDF and extracts the text. For every document, you want to check whether at least one page contains school, gym, or swimming pool. The values you want returned are:

  1. A vector whose length equals the number of PDF documents, containing either Present or Not present.
  2. Three vectors of strings, containing information on which word occurs where and when.

Is that right?
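The core check can be tried without any PDFs or OCR. The sketch below is only an illustration: it assumes the page texts are already available as a character vector with one element per page (the `page_text` values are invented), and it uses base R's `grepl()` with `fixed = TRUE` so that the search terms are matched literally rather than as regular expressions.

```r
# Hypothetical page texts standing in for OCR output (one element per page)
page_text <- c("I go gym regularly",
               "Nothing relevant here",
               "Here is the next school")
strings <- c("school", "gym", "swimming pool")

# hits[p, w] is TRUE when word w occurs on page p (lowercased, literal match)
hits <- vapply(strings,
               function(w) grepl(w, tolower(page_text), fixed = TRUE),
               logical(length(page_text)))
# Keep a matrix shape even when there is only a single page
hits <- matrix(hits, nrow = length(page_text), dimnames = list(NULL, strings))

found_words <- strings[colSums(hits) > 0]   # words found anywhere in the document
found_pages <- which(rowSums(hits) > 0)     # pages containing at least one word
status      <- if (length(found_words) > 0) "Present" else "Not Present"
```

Because this is plain substring matching, a term such as gym would also match inside a longer word like gymnasium; whether that is acceptable depends on the documents being searched.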

You can skip a couple of steps in your loop, especially when converting PDFs to TIFFs and reading the text from them with ocr:

library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()

all_files <- Sys.glob("*.pdf")
strings   <- c("school", "gym", "swimming pool")

# Read text from pdfs
texts <- lapply(all_files, function(x){
                img_file <- pdf_convert(x, format="tiff", dpi=400)
                return( tolower(ocr(img_file)) )
                })

# Check for the presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    intermed   <- grep(strings[w], texts[[d]], fixed = TRUE)  # literal match, returns page indices
    words[[d]] <- c(words[[d]], 
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))

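# Label each matching page as "<file>_<page>"; files with no match get "-".
# (Testing for a comma instead would also blank out files with a single matching page.)
Page <- unname(mapply(function(f, p) {
  if (length(p) == 0) "-" else paste0(f, "_", p, collapse = ", ")
}, fileName, pages))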

Words    <- sapply(words, paste0, collapse=", ")
Status   <- ifelse(sapply(Words, nchar) > 0, "Present", "Not present")

data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)        
#       Status                                   Page                      Words
# pdf1 Present                         pdf1_1, pdf1_2         gym, swimming pool
# pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool

It's not as readable as I'd like it to be, probably because small requirements regarding the output call for minor intermediate steps that make the code look a bit chaotic. It works well, though.
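The question also asked for one row per match, including a TEXT column, which the per-file summary does not produce. The following is a rough sketch of that layout and is not part of the original answer. `find_matches()` is a hypothetical helper, the sample `texts` list stands in for the OCR result (one character vector of page texts per file), and TEXT is taken to be the first line on the page that contains the word, which is an assumption about what was wanted.

```r
# Hypothetical helper: one row per (file, page, word) match plus the matching line
find_matches <- function(texts, file_names, strings) {
  rows <- list()
  for (d in seq_along(texts)) {
    stem     <- tools::file_path_sans_ext(file_names[d])
    n_before <- length(rows)
    for (p in seq_along(texts[[d]])) {
      # Split the page into lines so the line containing the word can be reported
      page_lines <- strsplit(texts[[d]][p], "\n", fixed = TRUE)[[1]]
      for (w in strings) {
        hit <- grep(w, tolower(page_lines), fixed = TRUE)
        if (length(hit) > 0) {
          rows[[length(rows) + 1]] <- data.frame(
            fileName = file_names[d], Status = "Present",
            Page = paste0(stem, "_", p), Words = w,
            TEXT = trimws(page_lines[hit[1]]), stringsAsFactors = FALSE)
        }
      }
    }
    if (length(rows) == n_before) {  # no word found anywhere in this file
      rows[[length(rows) + 1]] <- data.frame(
        fileName = file_names[d], Status = "Not Present",
        Page = "-", Words = "-", TEXT = "", stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, rows)
}

# Invented sample data in place of real OCR output
texts <- list(c("I go gym regularly", "nothing here", "Intro\nHere is the next school"),
              c("no match on this page"))
result <- find_matches(texts, c("test.pdf", "test2.pdf"),
                       c("school", "gym", "swimming pool"))
print(result)
```

Unlike the original loop in the question, this sketch does not stop at the first matching page, so a file contributes one row for every page and word that match.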
