Parsing different types of data format from a CSV file

Problem description

I am still a beginner with Java programming so I apologise in advance if I am over-complicating my problem.

What is my program? I am building a GUI based program. The goal of the program is to load a CSV, XML or JSON file and for the program to then store the data into an Array. The data will then be displayed in a text box. Ultimately, the program will have the ability to plot data to a graph.

GUI details:

  • 3 radio buttons - let the user select CSV, XML or JSON
  • Load file button
  • Display button - displays the data in the textArea
  • Display graph button
  • Text area

Problem: I am having trouble storing the data into an Array. I believe this is because of the format of the data. So for example, this is the first 3 lines of the CSV file:

millis,stamp,datetime,light,temp,vcc
1000, 1273010254, 2010/5/4 21:57:34, 333, 78.32, 3.54
2000, 1273010255, 2010/5/4 21:57:35, 333, 78.32, 3.92
3000, 1273010256, 2010/5/4 21:57:36, 344, 78.32, 3.95

(Note - there are 52789000 lines of data in the CSV/XML/JSON files)

The CSV-Reader Class contains the method for reading through the data, storing it into an array and then storing it to a dataList.

As you can see from the above example, some of the data types are much different. I am having particular trouble with splitting/parsing the time and date variables.

Here is what my CSV-Reader Class code looks like at the moment (Again, I apologise for noob code).

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class CSVReader {

//create a class that will hold arraylist which will have objects representing all lines of the file

private List<Data> dataList = new ArrayList<Data>();
private String path;

public List<Data> getDataList() {
    return dataList;
}

public String getPath() {
    return path;
}
public void setPath(String path) {
    this.path = path;
}

//Create a method to read through the csv stored in the path
//Create the list of data and store in the dataList

public void readCSV() throws IOException{

    //the formatter is the same for every row, so build it once
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy/M/d HH:mm:ss");

    //i will create connection with the file, in the path
    //try-with-resources closes the file even if a row fails to parse
    try (BufferedReader in = new BufferedReader(new FileReader(path))) {

        String line = in.readLine(); //read and discard the header row

        while((line = in.readLine())!=null){

            //I need to split and store in the temporary variable and create an object
            //split on commas only: also splitting on whitespace would cut "2010/5/4 21:57:34" in two
            String[] splits = line.split(",");

            long millis = Long.parseLong(splits[0].trim());
            long stamp = Long.parseLong(splits[1].trim());
            System.out.println(splits[2].trim());
            LocalDateTime datetime = LocalDateTime.parse(splits[2].trim(), formatter);
            int light = Integer.parseInt(splits[3].trim());
            double temp = Double.parseDouble(splits[4].trim());
            double vcc = Double.parseDouble(splits[5].trim());

            Data d = new Data(millis,stamp,datetime,light,temp,vcc);//uses constructor

            //final job is to add this object 'd' onto the dataList
            dataList.add(d);

        }//end of while loop
    }
}

}//end of class CSVReader

Any help would be greatly appreciated!

Edit 1 - I thought that date and time were separate CSV headers. They were not. Therefore the time variable has been deleted from the program. It has been replaced with the datetime variable.

Edit 2 - My program is now reading the CSV file up until line 15 of the csv

27000, 1273010280, 2010/5/4 21:58:0, 288, 77.74, 3.88

Console error

Exception in thread "AWT-EventQueue-0" java.time.format.DateTimeParseException: Text '2010/5/4 21:58:0' could not be parsed at index 15
at java.time.format.DateTimeFormatter.parseResolved0(Unknown Source)
at java.time.format.DateTimeFormatter.parse(Unknown Source)
at java.time.LocalDateTime.parse(Unknown Source)
at CSVReader.readCSV(CSVReader.java:55)
at GUI$2.actionPerformed(GUI.java:85)
at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(Unknown Source)
at java.awt.Component.processMouseEvent(Unknown Source)
at javax.swing.JComponent.processMouseEvent(Unknown Source)
at java.awt.Component.processEvent(Unknown Source)
at java.awt.Container.processEvent(Unknown Source)
at java.awt.Component.dispatchEventImpl(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Window.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
at java.awt.EventQueue.access$500(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.awt.EventQueue$3.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.awt.EventQueue$4.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
at java.awt.EventQueue.dispatchEvent(Unknown Source)
at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.run(Unknown Source)

Recommended answer

ISO 8601

SOLVED So the program was crashing due to my CSV not following the correct date and time format (Read comments below).

When exchanging date-time values as text, use the standard ISO 8601 formats rather than inventing your own. They are wisely designed to be easy to parse by machine and easy to read by humans across cultures. So, 2010-05-04T21:57:34, not 2010/5/4 21:57:34.

The java.time classes use the ISO 8601 formats by default when parsing/generating strings.
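
As a small illustration of that default behaviour, here is a sketch (the class and method names are made up for this example) showing that ISO 8601 text parses with no DateTimeFormatter at all, while the slash-separated text from the CSV does not:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;

class Iso8601Defaults {

    // Returns true if the text can be parsed by the default (ISO 8601) parser.
    static boolean parsesByDefault(String text) {
        try {
            LocalDateTime.parse(text);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // No formatter argument: ISO 8601 is assumed.
        LocalDateTime ldt = LocalDateTime.parse("2010-05-04T21:57:34");
        // toString() generates ISO 8601 text as well.
        System.out.println(ldt);

        System.out.println(parsesByDefault("2010-05-04T21:57:34"));
        System.out.println(parsesByDefault("2010/5/4 21:57:34"));
    }
}
```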

The 2nd and 3rd columns of your data feed represent the same thing: a date with time-of-day. The first is a count of whole seconds since the epoch reference date of 1970-01-01T00:00Z (Z means UTC).

So it is silly to include both. As mentioned above, the 3rd column is in a poorly chosen format. The 2nd column approach of using a count-from-epoch is also a poor choice in my opinion, as it is not obvious, no human can decipher its meaning, and so it makes mistakes non-obvious thereby making debugging and logging difficult.

To deal with what we have, the seconds-from-epoch can be parsed as an Instant. This class represents a moment in UTC.

Instant instant = Instant.ofEpochSecond( 1_273_010_254L ) ;  // The stamp column counts whole seconds, not milliseconds.

Your 3rd column gives a date and time but omits an indicator of time zone or offset-from-UTC. Since it matches the 2nd column when parsed as seconds from first moment of 1970 in UTC, we know its value was intended for UTC. Omitting such info is bad practice, like having a monetary amount with no indicator of currency.
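
That match between the two columns can be checked in code. The following is only a sketch: the class and method names are invented for illustration, and the formatting pattern is an assumption that fits the three sample rows shown in the question.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class UtcColumnCheck {

    // Assumed pattern, chosen to fit sample rows such as "2010/5/4 21:57:34".
    static final DateTimeFormatter F = DateTimeFormatter.ofPattern("uuuu/M/d HH:mm:ss");

    // True when the datetime text, interpreted as UTC, names the same moment
    // as the seconds-since-epoch value in the stamp column.
    static boolean sameMomentInUtc(long stamp, String datetime) {
        Instant fromStamp = Instant.ofEpochSecond(stamp);
        Instant fromText = LocalDateTime.parse(datetime.trim(), F).toInstant(ZoneOffset.UTC);
        return fromStamp.equals(fromText);
    }

    public static void main(String[] args) {
        // First data row of the sample file.
        System.out.println(sameMomentInUtc(1273010254L, "2010/5/4 21:57:34"));
    }
}
```

If the text had been recorded in some other zone or offset, the two values would differ by that offset and the check would return false.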

Ideally both columns should be replaced by a string in ISO 8601 format, for example 2010-05-04T21:57:34Z including the Z to indicate UTC.

If we had to parse the 3rd column without knowing it was intended for UTC, we would parse as a LocalDateTime, a date with time-of-day but lacking a time zone or offset. We need to define a formatting pattern to match your input.

DateTimeFormatter f = DateTimeFormatter.ofPattern( "uuuu/M/d HH:mm:ss" );
LocalDateTime localDateTime = LocalDateTime.parse( "2010/5/4 21:57:34" , f );
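
Note that the error reported in the question's Edit 2 is for the text '2010/5/4 21:58:0', and index 15 is its single-digit seconds field. Doubled pattern letters such as ss require exactly two digits, so a pattern ending in HH:mm:ss rejects that row. One possible workaround, sketched below under the assumption that the file mixes one- and two-digit time fields, is to use single pattern letters, which accept either width when parsing. The class and method names are illustrative only.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class LenientDateTimeParsing {

    // Single letters (M, d, H, m, s) accept one or two digits when parsing,
    // whereas doubled letters such as "ss" require exactly two.
    static final DateTimeFormatter LENIENT =
            DateTimeFormatter.ofPattern("uuuu/M/d H:m:s");

    static LocalDateTime parse(String text) {
        // trim() removes the spaces that surround the field in the CSV.
        return LocalDateTime.parse(text.trim(), LENIENT);
    }

    public static void main(String[] args) {
        System.out.println(parse("2010/5/4 21:57:34"));
        System.out.println(parse(" 2010/5/4 21:58:0 "));
    }
}
```

Fixing the program that writes the file, so that it emits ISO 8601 text, remains the cleaner solution when that is an option.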

BigDecimal

Your decimal fraction numbers should be represented as BigDecimal objects for accuracy. Never use double/Double or float/Float where you care about accuracy. These types use floating-point technology which trades away accuracy for speed of execution. In contrast, BigDecimal is slow but accurate.

Parse a BigDecimal from a string.

new BigDecimal ( "78.32" ) 
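
To make the trade-off concrete, here is a minimal sketch (class and method names are invented for this example) comparing double arithmetic with BigDecimal arithmetic on the same decimal values:

```java
import java.math.BigDecimal;

class DecimalAccuracy {

    // Adds two decimal numbers given as text, with no binary rounding error.
    static BigDecimal exactSum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // Binary floating point cannot represent 0.1 or 0.2 exactly.
        System.out.println(0.1 + 0.2);              // prints 0.30000000000000004

        // BigDecimal built from strings keeps the decimal digits as written.
        System.out.println(exactSum("0.1", "0.2")); // prints 0.3

        // Values from the sample file are handled the same way.
        System.out.println(exactSum("78.32", "3.54"));
    }
}
```

Construct the BigDecimal from the string read out of the CSV rather than from a double, since a double has already lost the exact decimal value by the time it exists.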

Apache Commons CSV

Do not write code when well-tested code already exists. Libraries for reading CSV files have already been written.

I use Apache Commons CSV for such work. There are several variations of these formats, all handled by this library.

Here is example code. First define a class to hold your data, here named Reading.

package com.basilbourque.example;

import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDateTime;

public class Reading {
    private Integer millis;
    private Instant instant;
    private LocalDateTime localDateTime;
    private Integer light;
    private BigDecimal temp;
    private BigDecimal vcc;

    public Reading ( Integer millis , Instant instant , LocalDateTime localDateTime , Integer light , BigDecimal temp , BigDecimal vcc ) {
        // TODO: Add checks for null arguments: Objects.requireNonNull( … ).
        this.millis = millis;
        this.instant = instant;
        this.localDateTime = localDateTime;
        this.light = light;
        this.temp = temp;
        this.vcc = vcc;
    }

    @Override
    public String toString ( ) {
        return "com.basilbourque.example.Reading{" +
                "millis=" + millis +
                ", instant=" + instant +
                ", localDateTime=" + localDateTime +
                ", light=" + light +
                ", temp=" + temp +
                ", vcc=" + vcc +
                '}';
    }
}

Example data file:

millis,stamp,datetime,light,temp,vcc
1000, 1273010254, 2010/5/4 21:57:34, 333, 78.32, 3.54
2000, 1273010255, 2010/5/4 21:57:35, 333, 78.32, 3.92
3000, 1273010256, 2010/5/4 21:57:36, 344, 78.32, 3.95

And now call upon Commons CSV to parse that data, instantiate Reading objects, and collect them.

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

DateTimeFormatter f = DateTimeFormatter.ofPattern( "uuuu/M/d HH:mm:ss" );

List < Reading > readings = new ArrayList <>( 3 );
// try-with-resources closes the reader when the block exits, even on failure.
try ( Reader reader = new FileReader( "/Users/basilbourque/data.csv" ) ) {
    // The first record is treated as the header; spaces around each value are ignored.
    Iterable < CSVRecord > records = CSVFormat.RFC4180.withIgnoreSurroundingSpaces( true ).withHeader().parse( reader );
    for ( CSVRecord record : records ) {
        // Grab inputs
        String millisInput = record.get( "millis" );
        String stampInput = record.get( "stamp" );
        String datetimeInput = record.get( "datetime" );
        String lightInput = record.get( "light" );
        String tempInput = record.get( "temp" );
        String vccInput = record.get( "vcc" );

        // Parse inputs
        Integer millis = Integer.valueOf( millisInput );
        // Parse the stamp as a long: seconds-since-epoch outgrows a 32-bit int in 2038.
        Instant instant = Instant.ofEpochSecond( Long.parseLong( stampInput ) );
        LocalDateTime localDateTime = LocalDateTime.parse( datetimeInput , f );
        Integer light = Integer.valueOf( lightInput );
        BigDecimal temp = new BigDecimal( tempInput );
        BigDecimal vcc = new BigDecimal( vccInput );

        // Construct object
        Reading r = new Reading( millis , instant , localDateTime , light , temp , vcc );

        // Collect object
        readings.add( r );
    }
} catch ( IOException e ) {
    // FileNotFoundException is a subclass of IOException, so one catch covers both.
    e.printStackTrace();
}

System.out.println( readings );

[com.basilbourque.example.Reading{millis=1000, instant=2010-05-04T21:57:34Z, localDateTime=2010-05-04T21:57:34, light=333, temp=78.32, vcc=3.54}, com.basilbourque.example.Reading{millis=2000, instant=2010-05-04T21:57:35Z, localDateTime=2010-05-04T21:57:35, light=333, temp=78.32, vcc=3.92}, com.basilbourque.example.Reading{millis=3000, instant=2010-05-04T21:57:36Z, localDateTime=2010-05-04T21:57:36, light=344, temp=78.32, vcc=3.95}]

Regarding your mention of:

store the data into an Array

You are using an ArrayList in your code, not an array. See the Oracle Tutorials for lists and for arrays to understand the difference. Generally best to use the Java Collections framework. Where size and speed really matter, we may choose an array.
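
A short sketch of the difference, using the light values from the sample data (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

class ArrayVersusList {

    // Collects values into a List, which grows as elements are added.
    static List<Integer> collect(int... values) {
        List<Integer> list = new ArrayList<>();
        for (int v : values) {
            list.add(v);
        }
        return list;
    }

    public static void main(String[] args) {
        // An array has a fixed length that must be chosen up front.
        int[] lightArray = new int[3];
        lightArray[0] = 333;
        lightArray[1] = 333;
        lightArray[2] = 344;
        // lightArray[3] = 288; would throw ArrayIndexOutOfBoundsException

        // A List needs no size in advance, which suits a file of unknown length.
        List<Integer> lightList = collect(333, 333, 344);
        lightList.add(288);

        System.out.println(lightArray.length);
        System.out.println(lightList.size());
    }
}
```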

The java.time framework is built into Java 8 and later. These classes supplant the troublesome old legacy date-time classes such as java.util.Date, Calendar, & SimpleDateFormat.

The Joda-Time project, now in maintenance mode, advises migration to the java.time classes.

To learn more, see the Oracle Tutorial. And search Stack Overflow for many examples and explanations. Specification is JSR 310.

You may exchange java.time objects directly with your database. Use a JDBC driver compliant with JDBC 4.2 or later. No need for strings, no need for java.sql.* classes.

Where to obtain the java.time classes?
