SAS清理杂乱的数据集 [英] SAS cleaning messy datasets

查看:26
本文介绍了SAS清理杂乱的数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这个凌乱的数据集,我想导入 SAS

I have this messy dataset which I wanted to import into SAS

Datelisted, Shop, Region, Product, Color, Price, Discount, Contacts 
"[{"date":"12/03/20","listed by":"Josh"}]", MelB, "New York, Houston, Boston, Seattle", "Pental Fountain, 5""5, metallic", "Red, Blue, Black, Green", 12.30, No, "[{"name":"Josh","contact":"+12717111"}, {"name":"Marine","contact":"+13681811"}]"
"[{"date":"22/05/21","listed by":"Marrie"}]", Shop Japan, "Tokyo, Los Angeles, Melbourne", "Faser Kean, gold, 4""", "Black", 14.50, Yes, "[{"name":"Moseaih","contact":"+1235199191"}, {"name":"James","contact":"+11717711"}, {"name":"Francis","contact":"+1172271"}]"
"[{"date":"01/01/21","listed by":"user"}]", , "New York, Boston, Washington D. C.", Kirnawa, PS, pattern, 5""", ,23.12, , "[{"name":"Moseaih","contact":"+1235199191"}]"
"[{"date":"17/03/25","listed by":"user"}]", Trailer Price, "Sydney, New York, Chicago", "Ink Bladez, 4""", "Blue, Green, Yellow, Black, Pink", 34.23, , , 

我正在尝试使用 SAS 程序而不是手动编辑数据集来清理数据,知道怎么可能吗?

Im trying to use SAS program instead of manually edit the datasets to clean the data,know how is it possible?

推荐答案

下面的程序会读取这个类似 CSV 的 JSON 格式的文件.这个想法是将每一行读取为一个大行,然后使用 scan、循环和识别数据中的模式的组合对其进行解析.请注意,您在Kirnawa"之前的第三行缺少引号.必须添加它才能成功读取该行.

The program below will read this CSV-like JSON-esque file. The idea is to read each line as one large line, then parse it using combinations of scan, loops, and identifying patterns in the data. Note that you are missing a quotation mark on the third line before "Kirnawa." This had to be added to read the line successfully.

data want;
    infile datalines firstobs=2 length=len;
    input line$varying32767. len;

    length date 8.
           listed_by $15.
           shop      
           region
           product
           color     $50.
           price     8.
           discount  $10.;
    ;

    format date  ddmmyys8.
           price dollar32.2
    ;
        

    array contact[5]        $100.;              /* Change max number of contacts here */
    array contact_line[100] $100. _TEMPORARY_;  /* Holds temporary strings of interest for contacts */

    date      = input(substr(line, 12, 8), ddmmyy10.);  /* Date is always 8 chars long and starting on the 12th column */
    listed_by = dequote(scan(line, 3, ':}' ));          /* The listed-by individual is always the third word in the string after ':' and '}' */

    /* Treat the rest like a CSV file. Ignore commas within quotes. */
    shop      = dequote(scan(line, 2, ',', 'QR') );         
    region    = dequote(scan(line, 3, ',', 'QR') );
    product   = dequote(scan(line, 4, ',', 'QR') );
    color     = dequote(scan(line, 5, ',', 'QR') );
    price     = input(scan(line, 6, ',', 'QR'), 8.);
    discount  = dequote(scan(line, 7, ',', 'QR') );

    /* All contacts in the JSON-like format */
    contacts   = scan(line, 8, ',', 'Q');

    /* Count the number of contacts and loop through each one */
    n_contacts = count(scan(line, 8, ',', 'Q'), '"name":');

    /* Contacts always follow this format:
        "[{"name":"NAME1","contact":"+12345678"}, {"name":"NAME2","contact":"+1234567"}, {"NAMR3":"Francis","contact":"+1234567"}]"
        
        Grab each individual contact and extract the name and phone number.
        {"name":"NAME1","contact":"+12345678"} --> name = 2nd after the ':', phone = 4th word after the ':' and ','
        {"name":"NAME2","contact":"+1234567"}  --> name = 2nd after the ':', phone = 4th word after the ':' and ','
    */
    do i = 1 to n_contacts*2;
        contact_line[i] = scan(contacts, i, '{}');

        /* When we scan through the contact line in this way, every other column contains the full contact name */
        if(mod(i, 2) = 0) then do;
            contact[i/2] = catx(', ', dequote(scan(contact_line[i], 2, ':,', 'R') ), dequote(scan(contact_line[i], 4, ':,', 'R') ) );
        end;
    end;
    
    drop contacts n_contacts i line;

    datalines;
Datelisted, Shop, Region, Product, Color, Price, Discount, Contacts 
"[{"date":"12/03/20","listed by":"Josh"}]", MelB, "New York, Houston, Boston, Seattle", "Pental Fountain, 5""5, metallic", "Red, Blue, Black, Green", 12.30, No, "[{"name":"Josh","contact":"+12717111"}, {"name":"Marine","contact":"+13681811"}]"
"[{"date":"22/05/21","listed by":"Marrie"}]", Shop Japan, "Tokyo, Los Angeles, Melbourne", "Faser Kean, gold, 4""", "Black", 14.50, Yes, "[{"name":"Moseaih","contact":"+1235199191"}, {"name":"James","contact":"+11717711"}, {"name":"Francis","contact":"+1172271"}]"
"[{"date":"01/01/21","listed by":"user"}]", , "New York, Boston, Washington D. C.", "Kirnawa, PS, pattern, 5""", ,23.12, , "[{"name":"Moseaih","contact":"+1235199191"}]"
"[{"date":"17/03/25","listed by":"user"}]", Trailer Price, "Sydney, New York, Chicago", "Ink Bladez, 4""", "Blue, Green, Yellow, Black, Pink", 34.23, , , 
;
run;

输出:

SAS 当然可以毫无问题地处理此文件,并且在 SAS 中还有许多其他方法可以读取此文件.对于一次性文件,这没关系,但如果我收到这样的文件以纳入常规制作过程,我会要求他们发送另一种更结构化的格式.它只是有太多可能出错的地方,除非绝对必要,否则您不希望支持需要如此复杂步骤才能读取的文件.

SAS certainly can handle this file without a problem, and there are many other ways to read this file within SAS. For a one-off file this will be okay, but if I received a file like this to be incorporated in a regular production process, I'd ask them to send it another more structured format. It simply has too many ways it can go wrong, and you don't want to be supporting a file that requires such complex steps to read unless absolutely necessary.

这篇关于SAS清理杂乱的数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆