Output data appears in a random order when uploaded to Google Cloud Storage
Problem description
I've been using the google-dataflow-sdk to upload CSV files to Google Cloud Storage. When I upload a file to a Google Cloud project, the data in the resulting file appears in a random order. Each line of the CSV is correct, but the rows are all over the place.
The header of the CSV (i.e. attribute, attribute, attribute) always ends up on some other line and never at the top where it should be. I stress again, the data in each column is fine; it is only the rows that are randomly positioned.
Here is the code that reads the data initially:
PCollection<String> csvData = pipeline.apply(TextIO.Read.named("ReadItems")
.from(filename));
And this is the code that writes to the Google Cloud project:
csvData.apply(TextIO.Write.named("WriteToCloud")
.to("gs://dbm-poc/"+partnerId+"/"+dateOfReport+modifiedFileName)
.withSuffix(".csv"));
Thanks for your help.
Recommended answer
Whilst I agree that the answer provided by Graham Polley is correct, I managed to find a much simpler way to get the data written in order.
Instead, I used the Google Cloud Storage library to store the files I needed in the cloud, like so:
public static String writeFile(byte[] content, String filename, String partnerId, String dateOfReport) {
    Storage storage = StorageOptions.defaultInstance().service();
    BlobId blobId = BlobId.of("dbm-poc", partnerId + "/" + dateOfReport + "-" + filename + ".csv");
    BlobInfo blobInfo = BlobInfo.builder(blobId).contentType("binary/octet-stream").build();
    storage.create(blobInfo, content);
    return filename;
}

public static byte[] readFile(String filename) throws IOException {
    return Files.readAllBytes(Paths.get(filename));
}
Using these two methods together, I was able to upload the files to the bucket I wanted without losing the ordering of their contents. I was also able to change the content type of the uploaded files from text to binary/octet-stream, which means they can be accessed and downloaded.
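To show how the two methods fit together, here is a minimal sketch of the read-then-upload flow. It is an illustration only: `OrderedUploadSketch`, `BlobSink`, `objectName` and `upload` are hypothetical names that are not part of the code above, and `BlobSink` stands in for the `storage.create(blobInfo, content)` call so that the example runs without the storage library or credentials. The sample file contents and the partner and date values are made up. The point is that the file travels as a single byte array, so the rows (header included) stay in the order they had on disk.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class OrderedUploadSketch {

    /** Hypothetical stand-in for storage.create(blobInfo, content). */
    interface BlobSink {
        void create(String objectName, byte[] content);
    }

    /** Builds the object name with the same layout as writeFile above. */
    static String objectName(String partnerId, String dateOfReport, String filename) {
        return partnerId + "/" + dateOfReport + "-" + filename + ".csv";
    }

    /** Reads the whole file as one byte array, exactly as readFile above does. */
    static byte[] readFile(String filename) throws IOException {
        return Files.readAllBytes(Paths.get(filename));
    }

    /** Reads the local file and hands its bytes to the sink in one call. */
    static String upload(BlobSink sink, String localPath, String filename,
                         String partnerId, String dateOfReport) throws IOException {
        String name = objectName(partnerId, dateOfReport, filename);
        sink.create(name, readFile(localPath));
        return name;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: a header row followed by two data rows.
        Path tmp = Files.createTempFile("report", ".csv");
        Files.write(tmp, Arrays.asList("id,name,value", "1,a,10", "2,b,20"),
                StandardCharsets.UTF_8);

        // This sink only prints what it receives; a real one would call the storage library.
        upload((name, content) -> {
            System.out.println("object: " + name);
            System.out.print(new String(content, StandardCharsets.UTF_8));
        }, tmp.toString(), "report", "partner-1", "2016-01-01");

        Files.delete(tmp);
    }
}
```

In real use, the sink would wrap the `Storage` client from `writeFile`. Because nothing splits or redistributes the lines on the way, there is no step at which the rows could be reordered.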
This approach also seems to remove the need for a pipeline to upload the data.