Which function should I use to read an unstructured text file into R?


Question

This is my first question here, and I'm new to R and trying to figure out my first steps in data processing, so please keep it simple :)

I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no newline characters in it.

Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?

Thanks in advance.

PN

P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. This is just an example and I'm not concerned about this flaw, but for educational purposes I'd still be curious how you'd get around this problem.

Recommended answer

read.delim() reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.

To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. In effect, a line of text is not defined by the width of your software window but can run over many visual rows: a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:

> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[2] "No answer."                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[3] "\"TOM!\""                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[4] "No answer."                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[5] "\"What's gone with that boy,  I wonder? You TOM!\""                                                                                                                                                                                                                                                                                                                                                                                                                             
[6] "No answer."                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""

Note that on Stack Overflow you can scroll long text horizontally. The seventh line is wider than this column.

As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, the printed output shows a backslash in front of each quotation mark. Since R displays the individual lines in quotation marks, it needs to distinguish these from the ones that are part of the original text. Therefore, it "escapes" the original quotation marks. The backslashes are only part of how the strings are displayed, not part of the text itself. Read about escaping on Wikipedia.
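A minimal sketch of the behaviour described above. The file is a throwaway one created with tempfile(), and the two sample lines are only an illustration, not the actual book file:

```r
# Write a small sample file to a temporary location (illustrative content only)
path <- tempfile(fileext = ".txt")
writeLines(c("\"TOM!\"", "No answer."), path)

lines <- readLines(path)

length(lines)      # one element per line of the file
print(lines)       # displays the strings with escaped quotation marks
writeLines(lines)  # displays the raw text, without backslashes
nchar(lines[1])    # the backslashes are not counted as characters
```

print() shows the escaped form, while writeLines() and cat() show the text as it is stored.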

readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to: it is not an error, and suppressing the warning does nothing but hide the warning message.
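As a small sketch of this, assuming a temporary file written with cat() so that it deliberately lacks a trailing newline (the variable names are made up for the example):

```r
# Create a file whose last line does not end with a newline
path <- tempfile(fileext = ".txt")
cat("first line\nsecond line", file = path)

# Record whether readLines() raises a warning, without printing it
warned <- FALSE
text1 <- withCallingHandlers(
  readLines(path),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# Same call with the warning switched off
text2 <- readLines(path, warn = FALSE)

warned                   # a warning was raised by the first call
identical(text1, text2)  # the text that was read is the same either way
```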

If you don't want to just output your text to the R console but want to process it further, create an object that holds the output of readLines():

mytext <- readLines("textfile.txt")

Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about the many different ways to read files into R.
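A rough sketch of two such alternatives, again on a made-up temporary file. readChar() is used here as one possible way to get the whole file as a single string, which may suit a book without newlines; it is not mentioned in the answer itself:

```r
path <- tempfile(fileext = ".txt")
writeLines(c("First paragraph.", "Second paragraph."), path)

# The whole file as one single string, newlines included
whole <- readChar(path, nchars = file.size(path))

# One element per whitespace-separated word
words <- scan(path, what = character(), quiet = TRUE)

# One element per line, similar to readLines()
lines <- scan(path, what = character(), sep = "\n", quiet = TRUE)

length(whole)
words
lines
```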

I would strongly advise you to write your text in a .txt file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or in print, and R will read that extra content too. You can try it and see what you get, but for good results you should either save your file as a .txt file from Word or compose it in a text editor.

You can also copy and paste your text into R from a text file open in any other software, or compose your text in the R console:

> myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."

Note that pressing Return does not cause R to execute the command until I have closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plus signs. Try it. Note also that the newlines are now part of your string of text; R displays each of them as \n.
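If you later want one element per line instead of embedded newlines, one option (a sketch, not the only way) is to split the string at "\n":

```r
myothertext <- "What did you do?\nI wrote some text.\nAh, interesting."

# cat() prints the text with real line breaks instead of \n
cat(myothertext, "\n")

# Split the single string into one element per line
lines <- strsplit(myothertext, "\n", fixed = TRUE)[[1]]
lines
```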

If you input your text manually, I would load the whole text as one string into a vector:

x <- c("The text of your book.")

You could load different chapters into different elements of this vector:

y <- c("Chapter 1", "Chapter 2")

For easier reference, you can name the elements:

z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
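As a brief illustration of what the names buy you (the chapter texts are shortened placeholders):

```r
z <- c(ch1 = "This is the first chapter.", ch2 = "This is the second chapter.")

names(z)     # the element names
z[["ch2"]]   # look up a chapter by name instead of by position
nchar(z)     # number of characters in each chapter
```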

Now you can split the elements of any of these vectors:

sentences <- strsplit(z, "[.!?] *")

Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit() to split the elements of the vector at any of the three punctuation marks followed by optional spaces (if you don't allow for a space here, each resulting "sentence" after the first will start with a space).
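To see the effect of the optional space, compare the two patterns on a shortened version of the first chapter:

```r
z <- c(ch1 = "It is not long! Why was the author so lazy?")

with_space    <- strsplit(z, "[.!?] *")  # punctuation plus any following spaces
without_space <- strsplit(z, "[.!?]")    # punctuation only

with_space$ch1
without_space$ch1  # the second element keeps its leading space
```

If leading spaces do end up in the result, trimws() can remove them afterwards.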

sentences now contains:

> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"                       
[3] "Why was the author so lazy"           

$ch2
[1] "This is the text of the second chapter" "It is even shorter"

You can access the individual sentences by indexing:

> sentences$ch1[2]
[1] "It is not long"
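A few more things you can do with the resulting list, sketched with the same example chapters:

```r
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?",
       ch2 = "This is the text of the second chapter. It is even shorter.")
sentences <- strsplit(z, "[.!?] *")

sentences[["ch1"]][2]   # same as sentences$ch1[2]
lengths(sentences)      # number of sentences per chapter

# Flatten the list into one character vector of all sentences
all_sentences <- unlist(sentences, use.names = FALSE)
length(all_sentences)
```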

R has no way of knowing that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
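For the curious, here is one possible sketch of such an exception, using Perl-style negative lookbehinds. The list of abbreviations (Mr, Mrs, Dr) and the sample text are assumptions made for the example, and real text will need a longer list or a dedicated sentence tokenizer:

```r
x <- "Mr. Smith went home. He slept! Did Dr. Jones call?"

# Split at ., ! or ? plus following spaces,
# unless the punctuation mark comes directly after Mr, Mrs or Dr
pattern <- "(?<!Mr)(?<!Mrs)(?<!Dr)[.!?] *"
sentences <- strsplit(x, pattern, perl = TRUE)[[1]]
sentences
```

The perl = TRUE argument is needed because lookbehind is not part of R's default regular expression syntax.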

How you would tell R to recognize subjects or objects, I have no idea.
