How to read the contents of txt files in different directories and rename other files accordingly

Problem description

I just started with Python 3 and ran into the following problem:

I downloaded a good number of PDFs from different journals for my thesis, but they are all named after their DOI rather than in the format "Author (Year) - Title". The documents are saved in different directories according to the journal's name and volume, e.g.:

/Journal 1/
    /Vol. 1/
        file1.pdf
        file1.txt
        file2.pdf
        file2.txt
        filen.pdf
        filen.txt
    /Vol. 2/
        file1.pdf
        file1.txt
/Journal 2/
    ...

Because I have no idea how to read the contents of a PDF with Python, I wrote a very short bash script that converted the PDFs to plain TXT files. Each PDF and its TXT file share the same name and differ only in their file extension.

I would like to rename all of the PDF files. Luckily, the running text of each file contains a string that I could use. This variable string lies between two static strings:

"Cite this article as: " AUTHOR/YEAR/TITLE ", Journal name". 

How do I make Python go into each directory, read the contents of the TXT/PDF, extract the variable string between the two fixed strings and then rename the appropriate PDF file?

If anyone knows how to do this with Python 3, I would be very thankful.

Recommended answer

Finally got it working:

#__author__ = 'Telefonmann'
# -*- coding: utf-8 -*-

import os, re, ntpath, shutil

for root, dirs, files in os.walk(os.getcwd()):
    for file in files: # loops through directories and files
        if file.endswith('.txt'): # only processes txt files
            full_path = ntpath.splitdrive(ntpath.join(root, file))[1]
            # builds correct path under Win 7 (and probably other NT systems)

            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read().replace('\n', '') # remove newline

                r = re.compile(r'To\s*cite\s*this\s*article:\s*(.*?),\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,')
                m = r.search(content)
                # finds substring inbetween "To cite this article: " and "Journal of Quantitative Linguistics,"
                # also finds typos like "Journal ofQuantitative ..."

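                if not m:
                    continue # no citation line in this file, so skip it instead of reusing an old or undefined title
                full_title = m.group(1)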

            print("full_title: {0}".format(full_title))
            full_title = (full_title.replace('<','') # removes/replaces forbidden characters in Windows file names
                .replace('>','')
                .replace(':',' -')
                .replace('"','')
                .replace('/','')
                .replace('\\','')
                .replace('|','')
                .replace('?','')
                .replace('*',''))

            pdf_name = full_path[:-len('.txt')] + '.pdf'
            # txt and pdf files differ only in their extension, so I swap the trailing .txt for .pdf;
            # a plain replace('txt','pdf') would also change "txt" anywhere else in the path

            print('File: '+ file)
            print('Full Path: ' + full_path)
            print('Full Title: ' + full_title)
            print('PDF Name: ' + pdf_name)
            print('....................................')
            # for trouble shooting

            dirname = ntpath.dirname(pdf_name)
            new_path = ntpath.join(dirname, "{0}.pdf".format(full_title))

            if ntpath.exists(pdf_name): # check the pdf that is about to be copied, not the txt that was just read
                print("all paths found")
                shutil.copy(pdf_name, new_path)
                # makes a copy of the pdf file with the new name in the respective directory
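
For comparison, the same idea can be sketched with pathlib, which avoids the Windows-specific path handling. This is only a rough sketch under a few assumptions: the marker strings are the ones from the regex above, and the helper names (extract_title, safe_filename, copy_pdfs_with_titles) are made up for illustration. Unlike the script above, it collapses line breaks into single spaces rather than deleting them, and it refuses to overwrite an existing file.

```python
import re
import shutil
from pathlib import Path

# Assumed marker strings, taken from the regex in the answer above.
CITE_RE = re.compile(
    r"To\s*cite\s*this\s*article:\s*(.*?),"
    r"\s*Journal\s*of\s*Quantitative\s*Linguistics\s*,",
    re.DOTALL,  # let the title span line breaks in the txt file
)
FORBIDDEN = '<>"/\\|?*'  # characters Windows does not allow in file names


def extract_title(text):
    """Return the text between the two fixed markers, or None if absent."""
    match = CITE_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and runs of whitespace into single spaces.
    return " ".join(match.group(1).split())


def safe_filename(title):
    """Replace ':' with ' -' and drop the other forbidden characters."""
    title = title.replace(":", " -")
    return "".join(ch for ch in title if ch not in FORBIDDEN).strip()


def copy_pdfs_with_titles(root):
    """Copy each PDF that has a matching .txt to '<title>.pdf' beside it."""
    copied = []
    # sorted() materialises the listing before any new files are created.
    for txt_path in sorted(Path(root).rglob("*.txt")):
        pdf_path = txt_path.with_suffix(".pdf")
        if not pdf_path.exists():
            continue
        title = extract_title(txt_path.read_text(encoding="utf-8"))
        name = safe_filename(title) if title else ""
        if not name:
            continue  # no usable citation line; leave this file alone
        new_path = pdf_path.with_name(name + ".pdf")
        if not new_path.exists():  # never overwrite an existing file
            shutil.copy(pdf_path, new_path)
            copied.append(new_path)
    return copied
```

Calling copy_pdfs_with_titles('.') from the top-level directory would process every journal and volume folder below it and return the list of copies it created. Copying rather than renaming keeps the originals until the new names have been checked.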
