Thursday, October 11, 2012

convert character to numeric (date time) and vice versa


To convert character to number, we use the input function. To convert number to character, we use the put function. At the beginning, we always get confuse which one is which. We can use the following way to help us remember:
  When we write something on paper, everything is character. So if we try to convert something to character, we "put" the information in paper. If we try to get a number from paper (character), we "input"(read) the character to our brain (number). 

char to num, and num to char example: 

data eb;
 chra="321.32";
 chrb="5.2E6";
 num=235;
 chra2num=input(chra,best32.);
 chrb2num=input(chrb, best32.);
 num2chr=compress(put(num,best32.),' ');
run;
The best32. format in the above code is just an example, you can specific other formats.  

char to date, and date to char example:

data eb2;
input date_char $10. mon day year;/* convert character to date */
date1=input(compress(date_char), anydtdte.);
date2=mdy(mon,day,year);
format date1 mmddyy10. date2 yymmdd10.;
cards;
11/ 1/1999 11 01 1999
 1Sep1999  1 11 1999
12/24/2003 12 24 2003 
 7/ 1/2001  7  1 2001
;;
run;

If you check the result data set eb2, you will find that the third line (line of 12/24/2003) has a missing value. 
The reason for the missing value is that the "anydtdate" format by default is equivalent "anydtdte9.", 9 digits. To correct this we use "anydtdte10." and now the third line convert to the right date.


Thursday, September 13, 2012

calculate percentile in SAS with weighted data

How to calculate the 12.5 percentile? The answer is to use PROC UNIVARIATE procedure. Doesn't the univariate procedure only include the 1st, 5th, 10th, 25th 50th 75th 90th 95th, 99th and 100th percentile? Yes, if you read the default output only. The output statement will have the univeriate procedure output customized percentiles.

Let's first create a sample data eb

data eb;
input n1 n2;
cards;
2   20
4   35
5   40
8   55
10  60
13  75
;;
run;

Example 1: calculate percentiles not automatically in the procedure. 


proc univariate data=eb noprint;
 var n1;
 output out=eb1 pctlpre=p pctlpts=10 to 20 by 3;
 run;
 proc print eb1;
 run;


    Obs    p10    p13    p16    p19
    1      2      2      2      4  


Example 2: calculate customs percentiles for multiple variables. At this time we need to specify the prefix for each variable, pctlpre=pa_ pb_. 

proc univariate data=eb noprint;
 var n1 n2;
 output out=eb2 pctlpre=n1p n2p pctlpts=12.5 20 50;
 run;
proc print data=eb2;
run;


 Obs    pa_12_5    pa_20    pa_50    pb_12_5    pb_20    pb_50
 1        2         4       6.5        20        35      47.5 



If you forget to put the pctlpre term you will get a warning message " ERROR: The PCTLPRE= option must be specified to generate additional percentiles." If you only specify pa_ in the option but no pb_, you will not get any error message, the output data set will only include the percentiles for the first variable. In this example is the "2, 4 and 6.5". 


Example 3: what if the data is a weighted data? we use the weight statement in the procedure. The weight does not need to be a interger, decimal values are allowed. We add a decimal weight wt in the data.



data eb;
input n1 n2 wt;
cards;
2   20  5.4  
4   35  8.6  
5   40  10.9 
8   55  15.5 
10  60  20.3 
13  75  25.8 
;;
run;
proc univariate data=eb noprint;
 var n1 n2;
 output out=eb1 pctlpre=pa_ pb_ pctlpts=10 20 30;
 run;
proc print data=eb1;
run;
proc univariate data=eb noprint;
weight wt;
var n1 n2;
 output out=eb1 pctlpre=pa_ pb_ pctlpts=10 20 30;
 run;
 proc print data= eb1;
 run;

The result below shows the percentiles by using wt(2nd row) compare to not using weight (1st row)

  Obs    pa_10    pa_20    pa_30    pb_10    pb_20    pb_30
  1       2        4        4        20       35       35  
  1       4        5        8        35       40       55  






Tuesday, September 11, 2012

Count missing value in a row

There are several functions in SAS that can count missing values in a row. They are nmiss(), cmiss(). The function n() counts non-missing values.  

The "n" functions only work on numeric variables, while "c" function works for both numeric and character variables. So if you are not sure your variables are numeric or character, or the variables are mixed, it is save to use the cmiss() function than to use "n" functions. 
The following is an example to show how the "n" functions failed when character variable is included. 
char num  n    nmiss cmiss cnmiss
A    1    1    1     0     2     (no missing)
B    .   0    2     1     1     (1 missing)
     3    1    1     1     1     (1 missing)
D    4    1    1     0     2     (no missing)
     .    0    2     2     0     (2 missing)
F    .    0    2     1     1     (1 missing)
                                            correct correct

The code :
data miss;
input char $1 num 3;
cards;
A 1
B .
  3
D 4
   
F  
;
run;
data miss2;
set miss;
n=n(char, num);
nmiss=nmiss(char, num);
cmiss=cmiss(char, num);
cnmiss=2-cmiss(char, num);
run;

Wednesday, September 5, 2012

Some times we have two datasets with the same structure. For example, one has collected data in different periods and saved in separate in different data sets.  To do analysis, one will first concatenate these data. This looks simple, however, some times you get error message because a variable say study_id in one dataset is numeric and in another dataset is character. This is actually good because you got an error message, so that you can change the variable to same type. However, if a date variable in one data is datetime format and another is in date format, you may get no error message when you concatenate your data. There fore, we should always check the agreement of two dataset before we concatenate them together. One simple solution is to use proc contents procedure to print the contents of two data set and compare the variables one by one. Can we let SAS automatically output a table to help us? we are now to create a macro for this purpose.

We want our macro to create a table with the first 4 variables as below:
1. varname : includes all variables in all data sets
2. in_all_data : a variable has two values, 1 means this variable in all datasets, 0 means not in all datasets. 
3.  match_main : the variable in all data set have same type and same length. 
4. match_all : the variable in all data set have same type, length and format

and then follow the columns type1, length1, format1, type2, length2, format2 etc...  for data1, data2 ...

In this macro, we require the user input the data list, and the name of the output table. 
The data list is in this format: work.classJan work.classFeb work.classMar
That is, we need to include the lib name in the data and we use empty space to separate data. 

Here are the sections for the macro:
1. after read in the data list, we create datasets to store variable's name, type, length and format for each data, and called them _columns1, _columns2 ... corresponding to the first, second, ... data in the list. 

 To merge these new created data _columns1, _columns2 into a big data, but they have same variable names : type, length and format, 
    a: change variable name to type1, length1, format1 , type2 length2 format2 ... correspond the to
                   _columns1, _columns2 ...
    b: merge data by name, and name this big data _columns_all

2. if a variable exists in all data assign in_all=1 else assign in_all=0
    To do that we check if there is missing values in type1, type2, ...  because an exist variable must have a type. 

3. if a variable exists in all data set, check if all dataset have the same type and length, assign match_main=1 if the answer is yes. 

4. if match_main=1 then we check if their format are the same, if the answer is yes assign match_all=1

5. output the final data. 

That looks easy, and the following macro only have 90 rows. 
we include 3 macros inside this macro: 
1. macro macvarlen is used to calculate the length of a name list separated by a delimiter. 
2. the pair macro is a modification from the pair macro in last post, we use this macro to create the condition used in the if clause, for example  type1=type2 and type2=type3 and type3=type4
3. macro expandname used to create name list, by given a key , start number, stop number. 
   for example, %expandanme(type, 1, 3, outputlist), the output list will be: "type1 type2 type3" without the quotation mark.

Here is the macro:

/*-------------------------------------------------------------------*/
/*                    D_CHK_DATAMTCH v1.0                            */
/*             Check if variables are in all data sets               */
/*  if variables with same name having same type, length and format  */
/*                   by Elite Bistatistics                           */
/*-------------------------------------------------------------------*/
/*                                                                   */
/* This macro is provided "as is" by Elite Biostatistics. There      */
/* are no warranties, expressed or implied, as to merchantability or */
/* fitness for a particular purpose regarding the materials or code  */
/* contained herein. The company is not responsible for errors in    */
/* this material as it now exists or will exist, nor does the company*/
/* provide technical support for it.                                 */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Questions or problem reports concerning this material may be      */
/* addressed to the company by email:                                */
/* EB@elitebiostats.com  use this subject field:                     */
/*    Comments for SAS code                                          */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Date Last Updated: 01Sep2012                                      */
/*-------------------------------------------------------------------*/
/*                 Purpose                                           */
/* This macro is to compare multiple data sets whether they have     */
/* common variables and of the common variables whether they have    */
/* same type, length, and format.                                    */
/* input :                                                           */
/*  datalist: the list of datasets with lib names separated by space */
/*       e.g work.classJan  work.classFeb work.classMar              */
/* output data: user specified                                       */
/*                                                                   */
/*-------------------------------------------------------------------*/
%macro D_CHK_DATAMTCH(datalist, outdata);
%*------------------------------------------------------------*;
%* macros used in D_CHK_DATMTCH *;
%macro macvarlen(list, delim, num);
%* calculate the length of the list by the delimeter delim *;
 %let i=1;
 %do %while(%scan(&list, &i, &delim)^=);
  %let i=%eval(&i+1);
 %end;
 %let &num=%eval(&i-1);
%mend;
%macro pair(lista, listb, operator, connectby, outlist);
%*------------------------------------------------------------*;
%* lista, listb are two lists with same number of items       *;
%* operator is used to pair items in lista and listb          *;
%* connectby is the string to connect the paires              *;
%* For Example                                                *;
%* pair(type1 type2, case1 case2, =, and, mylist)             *;
%* mylist is created as                                       *;
%* type1=case1 and type2=case2                                *;
%*------------------------------------------------------------*;
 %let i=1;
 %do %while(%scan(&lista, &i)^=);
  %let i=%eval(&i+1);
 %end;
 %let n=%eval(&i-1);
 %let newlist=;
 %do i=1 %to &n;
  %let item1=%scan(&lista,&i);
  %let item2=%scan(&listb,&i);
  %if &i=1 %then %let newlist=&item1.&operator.&item2;
  %if &i>1 and &i<%eval(&n-1) %then %let newlist=&newlist &item1.&operator.&item2 &connectby;
  %if &i>1 and &i=&n  %then %let newlist=&newlist &item1.&operator.&item2;
 %end;
 %let &outlist=&newlist;
%mend;

%macro expandname(key, num1, num2, outputlist);
%*------------------------------------------------------------*;
%* output a list with the key string follow by numbers        *;
%* expandname(type, 1, 3, mylist)                             *;
%* mylist is created as                                       *;
%* type1 type2 type3                                          *;
%*------------------------------------------------------------*;
 %let &outputlist=;
 %let nlist=;
 %do i=&num1 %to &num2;
   %let nlist=&nlist &key&i;
 %end;
 %let &outputlist=&nlist;
%mend;

%* section one, create a big table named _columns_all to      *;
%* include all variable names and their types, length, format *;
%let datalist=%upcase(&datalist);
%let columndatalist=;
%let data_n=;
%macvarlen(&datalist,' ',data_n);
%put data_n is &data_n;
%do i=1 %to &data_n;
 %let data=%scan(&datalist,&i, ' ');
 %let libname=%scan(&data,1,'.');
 %let memname=%scan(&data,2,'.');
 %put &libname &memname;
 data _columns&i;
  set sashelp.vcolumn; 
  keep name type length format;
  rename  type=type&i length=length&i format=format&i;
  where libname="&libname" and memname="&memname";
 run;
%end;

%let columndatalist=;
%expandname(_columns, 1, &data_n, columndatalist);
%put columndatalist= &columndatalist;
data _columns_all;
merge &columndatalist;
by name;
run;

%* section two, create the condition for if statements        *;
%* using macro expandname, and pair                           *;
%* assign in_all, match_main and match_all values             *;
%* in_all : 1 indicate the variable in all data               *;
%* match_main : 1 indicates all data has the same type and    *;
%*  length for the variable                                   *;
%* match_all : 1 indicates all data has the same type, length *;
%*  and format for the variable                               *;
%let type1=;
%let type2=;
%let length1=;
%let length2=;
%let format1=;
%let format2=;
%expandname(type, 1, %eval(&data_n-1), type1);
%expandname(type, 2, &data_n, type2);
%expandname(length, 1, %eval(&data_n-1), length1);
%expandname(length, 2, &data_n, length2);
%expandname(format, 1, %eval(&data_n-1), format1);
%expandname(format, 2, &data_n, format2);
%put type1=&type1;
%put type2=&type2;
%let cmptype=;
%let cmplength=;
%let cmpformat=;
%pair(&type1, &type2, =, and, cmptype);
%pair(&length1, &length2, =, and,  cmplength);
%pair(&format1, &format2, =, and, cmpformat);
data &outdata;
length in_all match_main match_all 8;
set _columns_all;
array type{&data_n} type1 - type&data_n;
in_all=1;
do i=1 to &data_n;
 if type(i)='' then in_all=0;
end;
if in_all =1 and &cmptype and &cmplength then match_main=1; else match_main=0;
if match_main=1 and &cmpformat then match_all=1; else match_all=0;
drop i;
run;
%mend;

To test the macro, we first create two empty dataset and pass them to the macro.
data test1;
length v1 v2 v3 8 v4 8 v5 8 v7 $10 v8 $12 v9 8;
format v1 v2 v3 best32. v4 mmddyy10. v5 datetime13.;
stop;
run;

data test2;
length v1 $12 v2 v3 8 v4 4 v5 8 v7 $10 v8 $12;
format v1 $12. v2 v3 best32. v4 mmddyy10. v5 mmddyy10.;
stop;
run;

%D_CHK_DATAMTCH(work.test1 work.test2, testout);


    

Thursday, August 30, 2012

Change Multiple Variable Names

Some times we want to change all variable names in a data set. For example, we may want to add a "_o" postfix to all variables in the old data before merging it to the new data. Some times we want to add a number at the end by its physical order in the data or by the alphabetic orders at the end of each variable. Today we are going to talk about how to do this using macros. We are going to develop a couple small macros today for this work. And of course, we will use the SASHELP dictionary again.

Let's first outline our steps:

  1. get the list of variable names in the data
  2. add the postfix to each variable
  3. build the rename clause
  4. use sas procedure to rename
To get the variable list, we use the proc sql procedure:


proc sql;
select name into :varnames separated by ' '
from sashelp.vcolumn
where libname="WORK" and memname="TEST";

This code generates the list in physical order. If you want to create a list in alphabetical order, you need to add "order by name" in the code.

We iterate variable names in the list and add the postfix to each name to create the new list. There are 3 parameters in this macro, list--the original list of names, pfix--the postfix and outlist--the name of the new list. If pfix is the empty, it adds variables order to the end.


%macro attach_char(list, pfix, outlist);
%let i=1;
%let item=%scan(&list, &i);
%let newlist=;
 %do %while (&item ^= );
  %if &pfix= %then %do;
   %let newlist=&newlist &item.&i;
  %end;
  %else %do;
   %let newlist=&newlist &item.&pfix;
  %end;
  %let i=%eval(&i+1);
  %let item=%scan(&list, &i);
%end;
%let &outlist=&newlist;
%put &newlist;
%mend;

After generate the new list, we want to pair the new and old list. For example, the old list --V1, V2, V3, and the new list -- V1_o, V2_o, V3_o, we want to create a paired list:
V1=V1_o V2=V2_o V3=V3_o


%macro pair(lista, listb, operator, outlist);
 %let i=1;
 %let item1=%scan(&lista, &i);
 %let item2=%scan(&listb, &i);
 %let newlist=;
 %do %while ((&item1 ^= ) and (&item2 ^= ));
  %let newlist= &newlist &item1.&operator.&item2;
  %let i=%eval(&i+1);
  %let item1=%scan(&lista,&i);
  %let item2=%scan(&listb,&i);
 %end;
 %put &newlist;
 %let &outlist=&newlist;
%mend;


Now is the final step, rename the variables:


%let nlist=;
%let rnamelist=;
%attach_char(&varnames, _o ,nlist);
%pair(&varnames, &nlist, =, rnamelist);

proc datasets library=WORK nolist;
modify TEST;
rename
&rnamelist;
run;

We created two macros for the rename process.  We do not have a single macro for this because every time we have different requirements. I prefer to keep small macros to increase the flexibility. For example, if you want to add "_o1", "_o2" etc to the end, we run twice of our attach_char macro to get the result, then run the pair macro.  Using the same technique, you may also build macros for other request, adding prefix, replace a substring to another string etc.


Tuesday, August 28, 2012

check missing values in a data set

Before doing any analysis on data, we first need to check the data structure and basic statistics. Number of missing values for each variable is usually the first step to check for data structure and quality.  One can use PROC FREQ, PROC MENAS and  NMISS functions to check missing values. However, these procedures can process only one variable at a time. Writing codes for each variable is fine if you have a small data set, but that would be time consuming if you have many variables to check. A SAS macro is considered to make the process easier.

The D_CHECK_MISS macro will output a data contains 4 columns -- variable name, variable type (num or char), missing count and total count. The count() and nmiss() function in PROC SQL procedure will do the calculation for each variable. We use the VCOLUMN table in SASHELPE dictionary to access variable name and type information. One may find that there is not necessary to deal with numerica and character variables separately. The reason to do this is for further extension. For example, one may include mean, standard deviation, min, max for numerical variable in the final table, or number of unique values in character variables.

/*-------------------------------------------------------------------*/
/*                    D_CHK_MISS                                     */
/*             Check Missing values for a data v1.0                  */
/*                   by Elite Bistatistics                           */
/*-------------------------------------------------------------------*/
/*                                                                   */
/* This macro is provided "as is" by Elite Biostatistics. There      */
/* are no warranties, expressed or implied, as to merchantability or */
/* fitness for a particular purpose regarding the materials or code  */
/* contained herein. The company is not responsible for errors in    */
/* this material as it now exists or will exist, nor does the company*/
/* provide technical support for it.                                 */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Questions or problem reports concerning this material may be      */
/* addressed to the company by email:                                */
/* EB@elitebiostats.com  use this subject line:                      */
/*    Comments for SAS code                                          */
/*                                                                   */
/*-------------------------------------------------------------------*/
/* Date Last Updated: 28Aug2012                                      */
/*-------------------------------------------------------------------*/
/*                 Purpose                                           */
/* This macro is to identify number of missing values in each        */
/* variable. The output data has 4 columns: name (vairalbe name),    */
/* type(variable type--num, char), missing(count of missing values), */
/* and total(number of observations).                                */
/* default output data: miss_count                                   */
/*                                                                   */
/*-------------------------------------------------------------------*/
%macro D_CHK_MISS(lib=work, data=, dataout=miss_count);
%let lib=%upcase(&lib);
%let data=%upcase(&data);

proc sql noprint;
create table _columns_ as
select *
from dictionary.columns
where libname="&lib" and
memname="&data";

proc sql noprint;
select name into :names separated by ' ' from _columns_;
select type into :types separated by ' ' from _columns_;
select count(*) into :nvars from _columns_;

%let names=&names;
%let types=&types;
%let nvars=&nvars;

data &dataout;
set _columns_(keep=name type);
length missing total 8;
stop;
run;

proc sql;
select count(*) into :total from &lib..&data;
%do i=1 %to &nvars;
 %let name=%scan(&names,&i,' ');
 %let type=%scan(&types,&i,' ');
 %if &type=char %then %do;
 proc sql noprint;
 select nmiss(&name) into :missing from &lib..&data;
 %end;
 %else %do;
 proc sql noprint;
  select nmiss(&name) into :missing from &lib..&data;
%end;
 proc sql noprint;
  insert into &dataout
  set name="&name", type="&type", missing = &missing, total=&total;
 quit;
%end; 
proc datasets lib=WORK nolist;
delete _columns_;
run; 
%mend;

Monday, August 27, 2012

Elite Technical Notes

Elite Team will start to post some relative techniques to consulting services based on Software, Technology, Data Management Skills, SAS Programming methods. 

We will open for ideas. 
Please contact us immediately, if any feedback/questions, at EB@EliteBiostats.com