Tuesday, August 28, 2012

check missing values in a data set

Before doing any analysis on data, we first need to check the data structure and basic statistics. Number of missing values for each variable is usually the first step to check for data structure and quality.  One can use PROC FREQ, PROC MENAS and  NMISS functions to check missing values. However, these procedures can process only one variable at a time. Writing codes for each variable is fine if you have a small data set, but that would be time consuming if you have many variables to check. A SAS macro is considered to make the process easier.

The D_CHECK_MISS macro will output a data contains 4 columns -- variable name, variable type (num or char), missing count and total count. The count() and nmiss() function in PROC SQL procedure will do the calculation for each variable. We use the VCOLUMN table in SASHELPE dictionary to access variable name and type information. One may find that there is not necessary to deal with numerica and character variables separately. The reason to do this is for further extension. For example, one may include mean, standard deviation, min, max for numerical variable in the final table, or number of unique values in character variables.

/*-------------------------------------------------------------------*/
/*                    D_CHK_MISS                                     */
/*             Check Missing values for a data v1.0                  */
/*                   by Elite Bistatistics                           */
/*-------------------------------------------------------------------*/
/*                                                                   */
/* This macro is provided "as is" by Elite Biostatistics. There      */
/* are no warranties, expressed or implied, as to merchantability or */
/* fitness for a particular purpose regarding the materials or code  */
/* contained herein. The company is not responsible for errors in    */
/* this material as it now exists or will exist, nor does the company*/
/* provide technical support for it.                                 */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Questions or problem reports concerning this material may be      */
/* addressed to the company by email:                                */
/* EB@elitebiostats.com  use this subject line:                      */
/*    Comments for SAS code                                          */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Date Last Updated: 28Aug2012                                      */
/*-------------------------------------------------------------------*/
/*                 Purpose                                           */
/* This macro is to identify number of missing values in each        */
/* variable. The output data has 4 columns: name (vairalbe name),    */
/* type(variable type--num, char), missing(count of missing values), */
/* and total(number of observations).                                */
/* default output data: miss_count                                   */
/*                                                                   */
/*-------------------------------------------------------------------*/
%macro D_CHK_MISS(lib=work, data=, dataout=miss_count);
%let lib=%upcase(&lib);
%let data=%upcase(&data);

proc sql noprint;
create table _columns_ as
select *
from dictionary.columns
where libname="&lib" and
memname="&data";

proc sql noprint;
select name into :names separated by ' ' from _columns_;
select type into :types separated by ' ' from _columns_;
select count(*) into :nvars from _columns_;

%let names=&names;
%let types=&types;
%let nvars=&nvars;

data &dataout;
set _columns_(keep=name type);
length missing total 8;
stop;
run;

proc sql;
select count(*) into :total from &lib..&data;
%do i=1 %to &nvars;
 %let name=%scan(&names,&i,' ');
 %let type=%scan(&types,&i,' ');
 %if &type=char %then %do;
 proc sql noprint;
 select nmiss(&name) into :missing from &lib..&data;
 %end;
 %else %do;
 proc sql noprint;
  select nmiss(&name) into :missing from &lib..&data;
%end;
 proc sql noprint;
  insert into &dataout
  set name="&name", type="&type", missing = &missing, total=&total;
 quit;
%end; 
proc datasets lib=WORK nolist;
delete _columns_;
run; 
%mend;

No comments:

Post a Comment