Unknown encoding of files in a resulting Beautiful Soup txt file

Problem description

I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of these files (section 1A - Risk Factors). The problem is that I can open these files in Word easily and they are perfect, while when I open them in a normal txt editor, the document appears to be HTML with tons of encoded strings at the end (I suspect this is due to the XBRL format of these files). The same happens when using BeautifulSoup.

I've tried using an online decoder, because I thought that maybe this was connected to Base64 encoding, but it seems that none of the known encodings could help me. I saw that at the beginning of some files there is something like "created with Certent Disclosure Management 6.31.0.1" and other programs, so I thought maybe this causes the encoding. Nevertheless, Word is able to open these files, so I guess there must be a known key to it. This is a sample of the encoded data:

M1G2RBE@MN)T='1,SC4,]%$$Q71T3<XU#[AHMB9@*E1=E_U5CKG&(77/*(LY9
ME$N9MY/U9DC,- ZY:4Z0EWF95RMQY#J!ZIB8:9RWF;\"S+1%Z*;VZPV#(MO
MUCHFYAJ'V#6O8*[R9L<VI8[I8KYQB7WSC#DMFGR[E6+;7=2R)N)1Q\24XQ(K
MYQDS$>UJ65%MV4+(KBRHJ3HFIAR76#G/F$%=*9FOU*DM-6TSTC$Q\[C$YC$/

And a sample file from the 13 000 that I downloaded.

Below I insert the BeautifulSoup code that I use to extract the text. It does its job, but I need to find a clue to this encoded string and somehow decode it in the Python code below.

from bs4 import BeautifulSoup

with open("98752-TOROTEL INC-10-K-2019-07-23", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')
text = soup.get_text()
print(text)

with open("extracted_test.txt", "w", encoding="utf-8") as out:
    out.write(text)  # the "with" block closes the file automatically
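If the goal is simply to discard the trailing gibberish before parsing, one option is to filter it out first. This is a sketch under the assumption that the block is a uuencoded attachment delimited by `begin ...` and `end` lines, as in EDGAR complete-submission files; the function name and marker handling are illustrative, not from the original answer:

```python
def strip_uuencoded(text):
    """Drop every line between a "begin ..." marker and its closing "end"."""
    kept, skipping = [], False
    for line in text.splitlines():
        if not skipping and line.startswith("begin "):
            skipping = True           # entering a uuencoded block
        elif skipping and line.strip() == "end":
            skipping = False          # closing marker; drop it too
        elif not skipping:
            kept.append(line)
    return "\n".join(kept)
```

Calling `contents = strip_uuencoded(contents)` before constructing the soup would keep the encoded block out of the extracted text.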

What I want to achieve is decoding this dummy string at the end of the file.

Solution

Ok, this is going to be somewhat messy, but it will get you close enough to what you are looking for, without using regex (which is notoriously problematic with HTML). The fundamental problem you'll face is that EDGAR filings are VERY inconsistent in their formatting, so what works for one 10-Q (or 10-K or 8-K) filing may not work with a similar filing (even from the same filer...). For example, the word 'item' may appear in lowercase, uppercase, or mixed case, hence the use of the string.lower() method, etc. So there's going to be some cleanup, under all circumstances.

Having said that, the code below should get you the RISK FACTORS sections from both filings (including the one which has none):

import requests
from bs4 import BeautifulSoup as bs

url = [one of these two]

response = requests.get(url)
soup = bs(response.content, 'html.parser')

# Find an anchor tag whose attributes mention both "item" and "1a", then
# collect everything that follows it until the next "item" anchor.
risks = soup.find_all('a')
for risk in risks:
    if 'item' in str(risk.attrs).lower() and '1a' in str(risk.attrs).lower():
        for i in risk.find_all_next():
            if 'item' in str(i.attrs).lower():
                break
            print(i.text.strip())

Good luck with your project!
