Sometimes you may need to restructure your dataset to move data from columns into rows, and vice versa. Two methods for doing this are using the transpose procedure or creating an array.
For example the code below will create a dataset showing the number of hours worked by each employee. The structure is one record per employee per day.
input ID Employee $ Day $ 13-21 Hours;
We want to restructure this dataset to instead contain just one record per employee, with a separate variable for each day.
This could be done by using the SAS proc transpose command:
proc transpose data=workweek out=workweek2(drop=_name_);
by id employee;
This could also be achieved by using an array:
data workweek2(drop=i day hours);
array dailyhour (5) Monday Tuesday Wednesday Thursday Friday;
if first.id then i=1;
if last.id then output;
retain Monday Tuesday Wednesday Thursday Friday;
Both sets of code will create the desired dataset:
However there are some distinct advantages and disadvantages to both:From this example both methods can be used to achieve the same result. In general it will just be personal preference to which method statistical programmers will use.
All variables within an array need to be the same type, either character or numeric. Arrays cannot use a mix of types, however proc transpose can. If both are used then it will automatically convert numeric values into character.
When creating new variables, proc transpose has limited control over naming the variables. By default it will use the naming convention COL1, COL2, COL3 and so on. The id, prefix and suffix options can be used for further control. Whilst arrays allow you to simply specify the names you want for the new variables.
Proc transpose requires less lines of code. This can make it quicker to type and easier to remember.
Arrays can be used for many other purposes such as manipulating variables, performing table lookups, creating new variables etc. They provide many options for customizing your data. Whilst proc transpose can only be used to restructure data.
When you want to transpose the values from one column into multiple variables: Proc transpose will automatically create a new variable for each unique value. However when using arrays you would need to have an idea beforehand of how many variables will need to be created, as the number/names of new variables will need to be specified in the code.
In order to compare the running speeds I have run equivalent code for both methods on the same large dataset within server SAS version 9.1.3. The dataset contained 10140 observations and with 37 variables to be transposed. Code for each method was run five times and the average real times and average CPU times from SAS were calculated and recorded in the table below:
Personally I prefer using proc transpose if all that is needed is to restructure a dataset. I find it simpler to use and much easier to remember the syntax for. The table shows that both the real and CPU running times from proc transpose were faster than that of arrays, taking less than half the real time to run. This suggests that proc transpose is the more efficient method to use.
Which method do you prefer to use? Can you think of any other advantages or disadvantages of either?