How to TRUNCATE and/or use wildcards with Databricks


Question

I'm trying to write a script in Databricks that selects a file based on certain characters in its name, or just on the datestamp in the file name.

For example, the file name looks like this:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

import datetime
now1 = datetime.datetime.now()
now = now1.strftime("%Y-%m-%d")

Using the code above, I tried to select the file with the following:

'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now

However, if you look closely you will notice that there is a space between the datestamp and the timestamp, i.e. between 22 and 06:

LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

It is because of this space that my code above does not work.

I don't think Databricks supports wildcards, so the following won't work:

'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now

Someone once suggested TRUNCATING the timestamp.

Can someone let me know:

A. Whether TRUNCATING will solve this problem?
B. Whether there is a way for my code 'LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv' % now to select the whole file?

Bear in mind that I definitely need to select based on the current date. I just want to be able to use my code to select the file.

Recommended answer

You can read the file names with dbutils and check whether the pattern matches in an if-statement: if now in filename. So instead of reading files with a specific pattern directly, you get a list of files and then copy the concrete files that match your required pattern.
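The same list-then-filter idea also accommodates wildcard-style matching. Below is a minimal sketch, assuming the file names have already been collected into a plain Python list (for example from the name field of the entries returned by dbutils.fs.ls). The helper select_by_date and the sample names are hypothetical; the pattern matching itself is done by the standard-library fnmatch module.

```python
import datetime
import fnmatch


def select_by_date(names, prefix, date_str, suffix=".csv"):
    """Return the names matching '<prefix><date_str>*<suffix>'."""
    # The '*' absorbs whatever follows the date, including the space
    # and the time portion (' 06-07-31'). fnmatchcase is case-sensitive.
    pattern = f"{prefix}{date_str}*{suffix}"
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


# Hypothetical listing; in a notebook this could be built from dbutils.fs.ls.
names = [
    "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31.csv",
    "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-21 06-07-31.csv",
]
prefix = "LCMS_MRD_Delta_LoyaltyAccount_1992_"

# Selecting on the current date, as the question requires.
today = datetime.datetime.now().strftime("%Y-%m-%d")
todays_files = select_by_date(names, prefix, today)

# Selecting on a fixed date, to show the match against the sample names.
matches = select_by_date(names, prefix, "2018-12-22")
print(matches)
```

The filter runs on names that were listed first, so it does not depend on whether the listing call itself accepts wildcards.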

The following code works in a Databricks Python notebook:

1. Writing three small test files to the file location:

data = """
{"a":1, "b":2, "c":3}
{"a":{, b:3} 
{"a":5, "b":6, "c":7}

"""

dbutils.fs.put("/mnt/adls2/demo/files/file1-2018-12-22 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file2-2018-02-03 06-07-31.json", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file3-2019-01-03 06-07-31.json", data, True)

2. Reading the file names as a list:

files = dbutils.fs.ls("/mnt/adls2/demo/files/")

3. Getting today's date as a string:

import datetime

now = datetime.datetime.now().strftime("%Y-%m-%d")
print(now)

Output: 2019-01-03

4. Copying the files whose names contain today's date:

# Copy each file whose name contains today's date to the target folder
for f in files:
  if now in f.name:
    dbutils.fs.cp(f.path, '/mnt/adls2/demo/target/' + f.name)
    print('copied     ' + f.name)
  else:
    print('not copied ' + f.name)

Output:

not copied file1-2018-12-22 06-07-31.json
not copied file2-2018-02-03 06-07-31.json
copied     file3-2019-01-03 06-07-31.json
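On the truncation question: cutting the stamp down to its date part should work as well, because the comparison then no longer depends on the time. Below is a minimal sketch, assuming every name ends with a stamp in the YYYY-MM-DD HH-MM-SS layout shown in the question, optionally followed by an extension. extract_date is a hypothetical helper, not a Databricks function.

```python
import datetime
import re

# Date part captured in group 1; the time part and an optional
# extension such as '.csv' are matched but discarded (truncated).
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2}) \d{2}-\d{2}-\d{2}(?:\.\w+)?$")


def extract_date(name):
    """Return the date embedded at the end of a file name, or None."""
    match = STAMP.search(name)
    if match is None:
        return None
    try:
        return datetime.datetime.strptime(match.group(1), "%Y-%m-%d").date()
    except ValueError:
        # Digits in the right layout but not a real calendar date.
        return None


name = "LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31"
print(extract_date(name))  # 2018-12-22

# True only when the file was stamped today.
is_today = extract_date(name) == datetime.date.today()
```

Comparing date objects rather than substrings also makes it straightforward to select a range of days, should that ever be needed.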
