SAS cleaning messy datasets
Question
I have this messy dataset which I want to import into SAS:
Datelisted, Shop, Region, Product, Color, Price, Discount, Contacts
"[{"date":"12/03/20","listed by":"Josh"}]", MelB, "New York, Houston, Boston, Seattle", "Pental Fountain, 5""5, metallic", "Red, Blue, Black, Green", 12.30, No, "[{"name":"Josh","contact":"+12717111"}, {"name":"Marine","contact":"+13681811"}]"
"[{"date":"22/05/21","listed by":"Marrie"}]", Shop Japan, "Tokyo, Los Angeles, Melbourne", "Faser Kean, gold, 4""", "Black", 14.50, Yes, "[{"name":"Moseaih","contact":"+1235199191"}, {"name":"James","contact":"+11717711"}, {"name":"Francis","contact":"+1172271"}]"
"[{"date":"01/01/21","listed by":"user"}]", , "New York, Boston, Washington D. C.", Kirnawa, PS, pattern, 5""", ,23.12, , "[{"name":"Moseaih","contact":"+1235199191"}]"
"[{"date":"17/03/25","listed by":"user"}]", Trailer Price, "Sydney, New York, Chicago", "Ink Bladez, 4""", "Blue, Green, Yellow, Black, Pink", 34.23, , ,
I'm trying to clean the data with a SAS program instead of editing the dataset manually. Does anyone know how this is possible?
Answer
The program below reads this CSV-like, JSON-esque file. The idea is to read each line as one long string, then parse it with a combination of scan, loops, and patterns identified in the data. Note that the third line is missing a quotation mark before "Kirnawa"; it had to be added to read that line successfully.
data want;
    infile datalines firstobs=2 length=len;
    input line $varying32767. len;

    length date      8.
           listed_by $15.
           shop
           region
           product
           color     $50.
           price     8.
           discount  $10.
    ;

    format date  ddmmyys8.
           price dollar32.2
    ;

    array contact[5] $100.;                    /* Change max number of contacts here */
    array contact_line[100] $100. _TEMPORARY_; /* Holds temporary strings of interest for contacts */

    date      = input(substr(line, 12, 8), ddmmyy10.); /* Date is always 8 chars long, starting in column 12 */
    listed_by = dequote(scan(line, 3, ':}'));          /* The listed-by individual is always the third word after ':' and '}' */

    /* Treat the rest like a CSV file. Ignore commas within quotes. */
    shop     = dequote(scan(line, 2, ',', 'QR'));
    region   = dequote(scan(line, 3, ',', 'QR'));
    product  = dequote(scan(line, 4, ',', 'QR'));
    color    = dequote(scan(line, 5, ',', 'QR'));
    price    = input(scan(line, 6, ',', 'QR'), 8.);
    discount = dequote(scan(line, 7, ',', 'QR'));

    /* All contacts in the JSON-like format */
    contacts = scan(line, 8, ',', 'Q');

    /* Count the number of contacts and loop through each one */
    n_contacts = count(scan(line, 8, ',', 'Q'), '"name":');

    /* Contacts always follow this format:
       "[{"name":"NAME1","contact":"+12345678"}, {"name":"NAME2","contact":"+1234567"}]"
       Grab each individual contact and extract the name and phone number.
       {"name":"NAME1","contact":"+12345678"} --> name = 2nd word after ':', phone = 4th word after ':' and ','
    */
    do i = 1 to n_contacts*2;
        contact_line[i] = scan(contacts, i, '{}');

        /* When we scan through the contact line this way, every other element holds a full contact */
        if(mod(i, 2) = 0) then do;
            contact[i/2] = catx(', ', dequote(scan(contact_line[i], 2, ':,', 'R')),
                                      dequote(scan(contact_line[i], 4, ':,', 'R')));
        end;
    end;

    drop contacts n_contacts i line;
    datalines;
Datelisted, Shop, Region, Product, Color, Price, Discount, Contacts
"[{"date":"12/03/20","listed by":"Josh"}]", MelB, "New York, Houston, Boston, Seattle", "Pental Fountain, 5""5, metallic", "Red, Blue, Black, Green", 12.30, No, "[{"name":"Josh","contact":"+12717111"}, {"name":"Marine","contact":"+13681811"}]"
"[{"date":"22/05/21","listed by":"Marrie"}]", Shop Japan, "Tokyo, Los Angeles, Melbourne", "Faser Kean, gold, 4""", "Black", 14.50, Yes, "[{"name":"Moseaih","contact":"+1235199191"}, {"name":"James","contact":"+11717711"}, {"name":"Francis","contact":"+1172271"}]"
"[{"date":"01/01/21","listed by":"user"}]", , "New York, Boston, Washington D. C.", "Kirnawa, PS, pattern, 5""", ,23.12, , "[{"name":"Moseaih","contact":"+1235199191"}]"
"[{"date":"17/03/25","listed by":"user"}]", Trailer Price, "Sydney, New York, Chicago", "Ink Bladez, 4""", "Blue, Green, Yellow, Black, Pink", 34.23, , ,
;
run;
Output:
SAS certainly can handle this file without a problem, and there are many other ways to read it within SAS. For a one-off file this is fine, but if I received a file like this to incorporate into a regular production process, I'd ask the sender for another, more structured format. There are simply too many ways it can go wrong, and you don't want to support a file that requires such complex steps to read unless absolutely necessary.
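For comparison, the same parsing strategy can be sketched outside SAS. The following is a hypothetical Python illustration, not part of the original answer: a small quote-aware splitter stands in for SAS's scan with the 'Q' modifier (unlike SAS scan without the 'M' modifier, it keeps empty fields), and regexes pull the names and phone numbers out of the contacts blob instead of positional scans.

```python
import re

def scan_q(s, delim=','):
    """Split s on delim, ignoring delimiters inside double quotes
    (a rough analogue of SAS scan(..., 'Q'))."""
    fields, buf, in_quotes = [], [], False
    for ch in s:
        if ch == '"':
            in_quotes = not in_quotes
            buf.append(ch)
        elif ch == delim and not in_quotes:
            fields.append(''.join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    fields.append(''.join(buf).strip())
    return fields

# First data line from the file above.
line = ('"[{"date":"12/03/20","listed by":"Josh"}]", MelB, '
        '"New York, Houston, Boston, Seattle", '
        '"Pental Fountain, 5""5, metallic", "Red, Blue, Black, Green", '
        '12.30, No, '
        '"[{"name":"Josh","contact":"+12717111"}, '
        '{"name":"Marine","contact":"+13681811"}]"')

fields = scan_q(line)
shop   = fields[1]             # second top-level field, like scan(line, 2, ',', 'QR')
price  = float(fields[5])

# The pieces inside the JSON-ish blobs come out with regexes
# rather than fixed column positions.
listed_by = re.search(r'"listed by":"([^"]+)"', line).group(1)
contacts  = re.findall(r'\{"name":"([^"]+)","contact":"([^"]+)"\}', line)
```

The doubled quotes inside fields (`5""5`) toggle the in-quotes state twice, so they stay protected just as they do for SAS scan; standard CSV parsers would choke on the bare quotes inside the JSON-like blobs, which is why a hand-rolled splitter is used here.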