Thursday, August 30, 2012

Change Multiple Variable Names

Some times we want to change all variable names in a data set. For example, we may want to add a "_o" postfix to all variables in the old data before merging it to the new data. Some times we want to add a number at the end by its physical order in the data or by the alphabetic orders at the end of each variable. Today we are going to talk about how to do this using macros. We are going to develop a couple small macros today for this work. And of course, we will use the SASHELP dictionary again.

Let's first outline our steps:

  1. get the list of variable names in the data
  2. add the postfix to each variable
  3. build the rename clause
  4. use sas procedure to rename
To get the variable list, we use the proc sql procedure:


proc sql;
select name into :varnames separated by ' '
from sashelp.vcolumn
where libname="WORK" and memname="TEST";

This code generates the list in physical order. If you want to create a list in alphabetical order, you need to add "order by name" in the code.

We iterate variable names in the list and add the postfix to each name to create the new list. There are 3 parameters in this macro, list--the original list of names, pfix--the postfix and outlist--the name of the new list. If pfix is the empty, it adds variables order to the end.


%macro attach_char(list, pfix, outlist);
%let i=1;
%let item=%scan(&list, &i);
%let newlist=;
 %do %while (&item ^= );
  %if &pfix= %then %do;
   %let newlist=&newlist &item.&i;
  %end;
  %else %do;
   %let newlist=&newlist &item.&pfix;
  %end;
  %let i=%eval(&i+1);
  %let item=%scan(&list, &i);
%end;
%let &outlist=&newlist;
%put &newlist;
%mend;

After generate the new list, we want to pair the new and old list. For example, the old list --V1, V2, V3, and the new list -- V1_o, V2_o, V3_o, we want to create a paired list:
V1=V1_o V2=V2_o V3=V3_o


%macro pair(lista, listb, operator, outlist);
 %let i=1;
 %let item1=%scan(&lista, &i);
 %let item2=%scan(&listb, &i);
 %let newlist=;
 %do %while ((&item1 ^= ) and (&item2 ^= ));
  %let newlist= &newlist &item1.&operator.&item2;
  %let i=%eval(&i+1);
  %let item1=%scan(&lista,&i);
  %let item2=%scan(&listb,&i);
 %end;
 %put &newlist;
 %let &outlist=&newlist;
%mend;


Now is the final step, rename the variables:


%let nlist=;
%let rnamelist=;
%attach_char(&varnames, _o ,nlist);
%pair(&varnames, &nlist, =, rnamelist);

proc datasets library=WORK nolist;
modify TEST;
rename
&rnamelist;
run;

We created two macros for the rename process.  We do not have a single macro for this because every time we have different requirements. I prefer to keep small macros to increase the flexibility. For example, if you want to add "_o1", "_o2" etc to the end, we run twice of our attach_char macro to get the result, then run the pair macro.  Using the same technique, you may also build macros for other request, adding prefix, replace a substring to another string etc.


Tuesday, August 28, 2012

check missing values in a data set

Before doing any analysis on data, we first need to check the data structure and basic statistics. Number of missing values for each variable is usually the first step to check for data structure and quality.  One can use PROC FREQ, PROC MENAS and  NMISS functions to check missing values. However, these procedures can process only one variable at a time. Writing codes for each variable is fine if you have a small data set, but that would be time consuming if you have many variables to check. A SAS macro is considered to make the process easier.

The D_CHECK_MISS macro will output a data contains 4 columns -- variable name, variable type (num or char), missing count and total count. The count() and nmiss() function in PROC SQL procedure will do the calculation for each variable. We use the VCOLUMN table in SASHELPE dictionary to access variable name and type information. One may find that there is not necessary to deal with numerica and character variables separately. The reason to do this is for further extension. For example, one may include mean, standard deviation, min, max for numerical variable in the final table, or number of unique values in character variables.

/*-------------------------------------------------------------------*/
/*                    D_CHK_MISS                                     */
/*             Check Missing values for a data v1.0                  */
/*                   by Elite Bistatistics                           */
/*-------------------------------------------------------------------*/
/*                                                                   */
/* This macro is provided "as is" by Elite Biostatistics. There      */
/* are no warranties, expressed or implied, as to merchantability or */
/* fitness for a particular purpose regarding the materials or code  */
/* contained herein. The company is not responsible for errors in    */
/* this material as it now exists or will exist, nor does the company*/
/* provide technical support for it.                                 */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Questions or problem reports concerning this material may be      */
/* addressed to the company by email:                                */
/* EB@elitebiostats.com  use this subject line:                      */
/*    Comments for SAS code                                          */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Date Last Updated: 28Aug2012                                      */
/*-------------------------------------------------------------------*/
/*                 Purpose                                           */
/* This macro is to identify number of missing values in each        */
/* variable. The output data has 4 columns: name (vairalbe name),    */
/* type(variable type--num, char), missing(count of missing values), */
/* and total(number of observations).                                */
/* default output data: miss_count                                   */
/*                                                                   */
/*-------------------------------------------------------------------*/
%macro D_CHK_MISS(lib=work, data=, dataout=miss_count);
%let lib=%upcase(&lib);
%let data=%upcase(&data);

proc sql noprint;
create table _columns_ as
select *
from dictionary.columns
where libname="&lib" and
memname="&data";

proc sql noprint;
select name into :names separated by ' ' from _columns_;
select type into :types separated by ' ' from _columns_;
select count(*) into :nvars from _columns_;

%let names=&names;
%let types=&types;
%let nvars=&nvars;

data &dataout;
set _columns_(keep=name type);
length missing total 8;
stop;
run;

proc sql;
select count(*) into :total from &lib..&data;
%do i=1 %to &nvars;
 %let name=%scan(&names,&i,' ');
 %let type=%scan(&types,&i,' ');
 %if &type=char %then %do;
 proc sql noprint;
 select nmiss(&name) into :missing from &lib..&data;
 %end;
 %else %do;
 proc sql noprint;
  select nmiss(&name) into :missing from &lib..&data;
%end;
 proc sql noprint;
  insert into &dataout
  set name="&name", type="&type", missing = &missing, total=&total;
 quit;
%end; 
proc datasets lib=WORK nolist;
delete _columns_;
run; 
%mend;

Monday, August 27, 2012

Elite Technical Notes

Elite Team will start to post some relative techniques to consulting services based on Software, Technology, Data Management Skills, SAS Programming methods. 

We will open for ideas. 
Please contact us immediately, if any feedback/questions, at EB@EliteBiostats.com