Split function when writing an opened file in Python
Question
I have a program in which I am supposed to read an external file in Python and then separate each word and each punctuation mark, including commas, apostrophes and full stops. Then I am supposed to save the file as the integer positions at which each word and punctuation mark occurs in the text.
For example: I like to code, because to code is fun. A computer's skeleton.
In my program, I have to save this as:
1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14
(Help for those who do not understand) 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton
So this displays the position of each word; if a word repeats, it shows the position of that word's first occurrence.
Sorry for the long explanation, but here is my actual question. I have done this so far:
file = open('newfiles.txt', 'r')
with open('newfiles.txt','r') as file:
    for line in file:
        for word in line.split():
            print(word)
This is the result:
They
say
it's
a
dog's
life,.....
Unfortunately, this way of splitting a file does not separate words from punctuation, and it does not print horizontally. The .split method does not work on a file. Does anyone know a more effective way to split the file so that words are separated from punctuation, and then store the separated words and punctuation together in a list?
Recommended answer
The built-in string method .split can only work with simple delimiters. Without an argument, it simply splits on whitespace. For more complex splitting behavior, the easiest thing is to use a regular expression:
>>> s = "I like to code, because to code is fun. A computer's skeleton."
>>> import re
>>> delim = re.compile(r"""\s|([,.;':"])""")
>>> tokens = filter(None, delim.split(s))
>>> idx = {}
>>> result = []
>>> i = 1
>>> for token in tokens:
...     if token in idx:
...         result.append(idx[token])
...     else:
...         result.append(i)
...         idx[token] = i
...         i += 1
...
>>> result
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
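It may help to see why the filter(None, ...) step is there. Because the pattern contains a capturing group, re.split keeps the matched punctuation in its output, but it also yields None wherever the whitespace branch matched and an empty string between two adjacent delimiters. A small sketch on a shorter sentence (the names raw and tokens are only for illustration):

```python
import re

# Same pattern as above: split on whitespace, or on (and keep) punctuation.
delim = re.compile(r"""\s|([,.;':"])""")

raw = delim.split("to code, is fun.")
# raw still holds None (whitespace matches) and '' (between adjacent delimiters)
print(raw)

# Dropping the falsy entries leaves only the words and punctuation marks.
tokens = [t for t in raw if t]
print(tokens)  # ['to', 'code', ',', 'is', 'fun', '.']
```

The list comprehension does the same job as filter(None, ...) but gives a list directly, which is convenient if the tokens need to be reused or printed.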
Also, I don't think you need to iterate over the file line by line, as per your specifications. You should just do something like:
with open('my file.txt') as f:
    s = f.read()
This puts the entire file into s as a single string. Note that I never call open before the with statement. Doing so serves no purpose, because the with statement opens the file itself and closes it when the block ends.
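Putting the pieces together, one possible end-to-end version might look like the sketch below. The function names token_positions and save_positions, and the idea of writing the result as a comma-separated line, are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Same delimiter pattern as in the answer above.
DELIM = re.compile(r"""\s|([,.;':"])""")

def token_positions(text):
    """Return, for each token, the 1-based index of its first occurrence."""
    first_seen = {}
    result = []
    for token in filter(None, DELIM.split(text)):
        # setdefault stores a new index only the first time a token appears
        result.append(first_seen.setdefault(token, len(first_seen) + 1))
    return result

def save_positions(in_path, out_path):
    """Read in_path and write its comma-separated positions to out_path."""
    with open(in_path) as f:
        text = f.read()
    with open(out_path, "w") as f:
        f.write(",".join(map(str, token_positions(text))))
```

Something like save_positions('newfiles.txt', 'positions.txt') would then write the positions for the whole file, where positions.txt is a hypothetical output name. For the example sentence from the question, token_positions returns the same list as the interactive session above.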