pyspark read csv file with regular expression


Problem description


I'm trying to read CSV files from a directory using a particular pattern. I want to match all files whose names contain the string "logs_455DD_33"; it should match anything like:

machine_logs_455DD_33.csv

logs_455DD_33_2018.csv

machine_logs_455DD_33_2018.csv


I've tried the following regex, but it doesn't match files with the above formats:

file = "hdfs://data/logs/{*}logs_455DD_33{*}.csv"
df = spark.read.csv(file)

Recommended answer


You could use a subprocess to list the files in HDFS and grep for the matching ones:

import subprocess

# Define path and pattern to match
dir_in = "data/logs"
your_pattern = "logs_455DD_33"

# List the directory on HDFS, keep the path column, and filter by the pattern
args = "hdfs dfs -ls " + dir_in + " | awk '{print $8}' | grep " + your_pattern
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

# Get the output; communicate() returns bytes, so decode before splitting
s_output, s_err = proc.communicate()
l_file = s_output.decode("utf-8").split("\n")

# Read each matched file, skipping the empty trailing entry left by split()
for file in l_file:
    if file:
        df = spark.read.csv(file)
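If you prefer to stay in Python instead of shelling out to grep, the same filtering step can be done with the `re` module on the decoded listing. A minimal sketch, using the file names from the question plus one hypothetical non-matching name as stand-in input:

```python
import re

# Hypothetical stand-in for the `hdfs dfs -ls | awk` output: the names
# from the question plus one file that should NOT match.
listing = [
    "machine_logs_455DD_33.csv",
    "logs_455DD_33_2018.csv",
    "machine_logs_455DD_33_2018.csv",
    "other_logs_999.csv",
]

# The same filtering that grep performs in the answer, in pure Python
pattern = re.compile(r"logs_455DD_33")
matched = [f for f in listing if pattern.search(f)]
print(matched)  # the three matching names, in their original order
```

Note also that `spark.read.csv` accepts Hadoop glob patterns in the path, so a glob like `hdfs://data/logs/*logs_455DD_33*.csv` (with `*`, not `{*}`) should read all matching files in a single call.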

