Quanticate Blog

WARNING: Transposing Data using PROC SQL

Written by Clinical Programming Team | Thu, Jun 13, 2019



If you search online for “how to transpose using PROC SQL” you will be told – repeatedly – that it cannot be done. Out of curiosity I decided to ignore the advice, and my solution came to 345 lines, achieving what PROC TRANSPOSE can do in 6; proving that you can do it – you just shouldn’t. However, tackling the challenge requires us to look deeper at tree traversal, used for finding combinations of variables for the transposed columns. Within SAS, basic tree traversal methods include nested and macro do loops, and as we get more advanced we can create efficient algorithms dealing with file searching, data-structuring and much more. This blog will give an overview of my method for transposing and use it to show how tree traversal can be understood and implemented in SAS.

Whilst training as a programmer I was told that you could do almost everything in PROC SQL, except find a median and transpose. Naturally, I started to try. Median is quite simple, however transposing took quite a bit more work, eventually leading me to 345 lines of code and the writing of this paper. I do want to stress that you should never transpose in PROC SQL for any practical purposes; this is purely an exercise in what can be done, allowing us to increase our knowledge of SAS, SQL and programming in general.

In this blog I will outline the problem in depth and discuss the outline of our transposition, break off to introduce tree traversal, and finally use this new found knowledge to build a solution. I will use the dataset sasuser.sales throughout, which is available in many SAS environments by default and will allow you to test the code snippets provided. If your SAS environment does not have this example dataset, it is available for download from SAS Example Datasets.

The problem has been approached once before by Ted Conway [1], who suggested the code
structure for basic long-to-wide transposition, which I will be building on in this paper; I would recommend reading his paper as well as this blog. This problem does rely on a familiarly of what is capable using both PROC TRANSPOSE and PROC SQL. For more reading surrounding these procedures, I recommend both the SAS procedure documentation or the Listen Data guides.

The Problem

For the purposes of this blog, we will concentrate on a basic transpose, setting aside the PROC TRANSPOSE statements IDLABEL and COPY for the moment. Instead we will focus on what can be done using the VAR , ID and BY statements. We use the VAR statement to take a dataset from wide to long, while ID takes us from long to wide. The BY statement is used to group over one or more variables, and will be analogous to the GROUP BY statement in PROC SQL.

Example

Consider the sample dataset sasuser.sales, which has the structure:

sasuser.sales: LastName Month Residential Commercial
  SMITH JAN 140097.98 440356.7
  DAVIS JAN 385873 178234
  JOHNSON JAN 98654.32 339485
  SMITH FEB 225983.09 440356.7
  ... ... ... ...


Giving the residential and commercial sales figures in January, February and March for the sales
persons Smith, Davis and Johnson. We would like to transpose the data into the following:

work.output: LastName SaleType JAN FEB ..
  SMITH Residential 140097.98 225983.09  
  ... ... ... ...  


This is both a "wide-to-long" and a "long-to-wide" transpose, first doubling the number of observations to move the variables Residential and Commercial into the single variable SaleType, then reducing the number of observations to create the variables JAN, FEB and MAR.

Wide-to-Long

The first stage, "wide-to-long", is simple to achieve using PROC SQL. If we have &varn variables (2 variables in our case), we can make &varn copies of our dataset, rename each variable to the column name of our choice, and use the union join.

PROC SQL;
    CREATE TABLE work.long AS
      (SELECT LastName, Month, "Residential" as SaleType,
         Residential as Var FROM sasuser.sales)
    UNION
       (SELECT LastName, Month, "Commercial" as SaleType,
         Commercial as Var FROM sasuser.sales)
QUIT;

This can easily be moved into a macro for any given number of VAR variables, and gives us the
following long dataset:

work.long: LastName Month SaleType Var
  SMITH JAN Residential 140097.98
  SMITH JAN Commercial 440356.7
  ... ... ... ...


We can then use the dataset work.long for the remaining "long-to-wide" transposition. Notice that if we were to do a transpose which only required a "long-to-wide" step, the code above would just rename the only VAR variable.

Long-to-Wide

Difficulty arises when we move on to the "long-to-wide" transpose, creating new columns for the listed ID variables and grouping by BY variables. In principle, we would like the statement to look as follows:
PROC SQL;
    CREATE TABLE work.wide AS
      SELECT &byvars, col1, col2, col3, col4
      FROM work.long
      GROUP BY &byvars
QUIT;
where col1, col2, col3, col4 are columns containing the transposing variable for the correct combination of ID variables. Above we have used &byvars to store our list of BY variables. In a similar way, we will use the macro variables &idvar1, &idvar2... to store each ID variable, and &idvar1_1, &idvar1_2... to store each possible value of &idvar1, and so on. As suggested by Ted Conway [1] we can use MAX, SUM or any other data-type appropriate summary function to make this selection:
PROC SQL;
    CREATE TABLE work.wide AS
      SELECT &byvars,
             MAX (CASE WHEN &idvar1=&idvar1_1 and &idvar2=&idvar2_1
                  THEN &var END) as col1,
             MAX (CASE WHEN &idvar1=&idvar1_1 and &idvar2=&idvar2_2
                  THEN &var END) as col2,
             MAX (CASE WHEN &idvar1=&idvar1_2 and &idvar2=&idvar2_1
                  THEN &var END) as col3,
             MAX (CASE WHEN &idvar1=&idvar1_1 and &idvar2=&idvar2_2
                  THEN &var END) as col4
      FROM work.long
      GROUP BY &byvars;
QUIT;

By listing all the BY variables in the GROUP BY statement, we create one observation for every
combination of BY variables. So in our example, where &byvars=LastName, we have one observation per LastName. Each CASE then points us to a specific combination of ID variables; in our example these are a combination of Month and SaleType. Overall, our MAX function will always be taking the max of one observation, in place of each combination of LastName, Month and SaleType.

But how do we know how many columns to create? In our sasuser.sales example, we have 1
ID variable with 3 distinct values, so we require 3 new columns. What if 2 variables were given in
the ID statement, or more? If we have 5 ID variables, with 4, 2, 3, 2 and 5 distinct values each, we
would need 240 columns. Is it possible to make this selection? To understand how we can solve this problem, we require tree traversal.

Tree Traversal

In order to understand tree traversal, we first need to understand a tree. A tree begins with a root and branches off repeatedly to different data points, called nodes, until it reaches the final data points, known as leaves. A diagram from Tutorials Point is shown below and shows a selection of basic terms, including levels, parents and sub-trees. These nodes can be any kind of data, whether that be folders, webpages, or in our case variables.

Figure 1: A Basic Tree Structure [2]

Example

Our example dataset sasuser.sales can be thought of as a tree, LastName representing the root, Month as the first layer of branches and SaleType as the leaves. Notice that we could assign these in a different order, but the order we pick should reflect the context of our data and the output we are aiming for: the sales person Davis (root) in January (node) had residential sales totalling £385873 (leaf). Although we have multiple LastName values, we are grouping them together using the MAX - GROUP BY statements and can be considered together as one root.

In our example we have 3 levels, with 3 branches at the first and second level and 2 branches at
the third level. In general, we will be able to find out this information and place it in macro variables, however our solution should be able to adjust to any number of levels and branches.

We can see that all our sibling nodes have the same number of child nodes. This is not necessarily
always true, as we may by missing records for one sales person in one month, for example. In a
general solution one could add a temporary empty observation or treat the missing branch as a "ghost" branch, however for brevity we will assume that we always have a complete set of records moving forward. In other scenarios, such as a file path search, for example, uneven sibling nodes would be expected regularly and would require us to rethink our traversal method, which we will discuss later in more detail.

Finally, in our example we can see 3 main subtrees, belonging to each of the LastName values.
As sibling nodes have the same number of child nodes in our trees, each of these subtrees will be
identical. We also have 9 smaller subtrees in each combination of Month and LastName. These are
very small, just a root with two leaves, but they are still trees in structure. We can even consider the leaves as 18 distinct and identical sub-trees, each with one root node that is also its only leaf node.

Depth First

In order to traverse our tree, we can use many methods, including breadth first and depth first traversal. Given that we need to get to the leaves of the tree quickly, we will focus on depth first traversal. We do this by following our tree down to the end of the first leaf, then following around the perimeter of the tree. This will allow us to pass every node. We can choose to select the data either as soon as we visit the node, before moving on to its child nodes (pre-order traversal) or on the way back afterward (post-order traversal). A third method, in-order traversal, is also an option but will not be covered in this blog. The first two methods are illustrated below:

Methods of Tree Traversal

 

 



 

Pre-order Traversal                                                                   Post-order Traversal

We can apply either of these methods in SAS, as we will see in the next section. While both methods will successfully list all nodes, the order of the output will change. We are collecting parts of a SELECT statement in PROC SQL for variables, so this will affect the order of the variables we create. Similarly, we could be collecting parts of a file name to search an extensive library, or perhaps looking at possible outcomes for a patient’s treatment schedule.

The Solution

We now have the information we need to implement our transposition in SAS. We have the dataset work.long which we recognize as a tree structure, with some BY variables, ID variables and one VAR variable ready for transposition. We will store the number of ID variables in &id_N, and the number of values for each ID variable in &idvar1_N, &idvar2_N and so on.

Starting in the simplest case, let &id_N=1. With only one ID variable, we can store the distinct ID values in a list and use a macro %DO loop to call each node:

PROC SQL;
    CREATE TABLE work.wide AS
      SELECT &byvars
             %select_idvars
      FROM work.long
      GROUP BY &byvars;
QUIT;
____________________________________

%MACRO select_idvars;
    %DO i = 1 %TO &idvar1_N;
      ,MAX (CASE WHEN &idvar1=&&&idvar1_&i
                 THEN Var END)
       AS col&i
    %END;
%MEND;

When we increase to &id_N=2, we can nest our %DO statement, looking for each combination of
idvar1 and idvar2.

%MACRO select_idvars;
    %LET m = 0;
    %DO i1 = 1 %TO &idvar1_N;
       %DO i2 = 1 %TO &idvar2_N;
          %LET m = %EVAL(&m+1);
          ,SUM (CASE WHEN &idvar1=&idvar1_&i1
                     AND &idvar2=&idvar2_&i2
                     THEN var ELSE 0 END) AS col&m
       %END;
    %END;
%MEND;

We can continue using this method for any fixed number of ID variables, and we can see the tree
structure playing out as we do it, each nested %DO is a new layer of our tree diagram. This is a simple tree traversal, and applying it to our sasuser.sales example we can see how it follows the diagram:

If we were to use a DATA step at this stage, we could utilize the following form of nested do loop statements. This method does rely on the fact that all sibling nodes have the same number of child nodes, which we established was true for our particular problem. By nesting a DO loop within a %DO loop, we can control how many levels we need and the size of those levels, without needing to type out some unknown number of nested loops. It is also a helpful advancement toward a complete PROC SQL solution, as we can visualize this code on a tree structure diagram as shown below.

  DATA _NULL_;
    %DO j = 1 %TO &id_N;
      DO i&j = 1 TO &idvar&j._N;
    %END;
         /*select needed rows*/
    %DO j = 1 %TO &id_N;
       END;
     %END;
  RUN;

Recursion

Unfortunately we cannot use this method with %DO loops only, so in order to create a PROC SQL only solution we will need to add recursion to the %select_idvars macro we created earlier. Recursion is where we allow a macro to call itself, creating more complex loops. Similar to the DO WHILE and DO UNTIL statements, it runs a high risk of causing an infinite loop in SAS, which we would like to avoid. If recursion is used in code, it should have a "safety variable" to minimize issues, along with usual defensive programming techniques, and be commented clearly so future edits do not cause issues.

%MACRO select_idvars(n=1);
   %*macro function*;

   /*this error exists to make sure any future code edits do not cause
   loops*/

   %IF &n > 999 %THEN %DO;
      %PUT %STR(ER)ROR: You are trying to produce far too many columns;
      %ABORT;
   %END;
   %ELSE %select_idvars(n = %EVAL(&n+1));
%MEND;

In the above, the macro variable n is only used to count the number of calls to %select_idvars,
more than 999 calls is considered an error, and the macro will print an error and abort.

For each level of our nested %DO statement, we can now call the SAS program into a new iteration
of the macro. Ensuring that we have a clause to stop the macro, we can use this method to build loops in any form we like. In the body of our macro, we can use the following algorithm to look through our database: if the data passed is a branch of a tree then call the macro again for its subtree, if it is a leaf then output the value.

When our macro %select_idvars is called,
we set macro variable &node to the root
node at level 0 of our tree dataset. The
macro will then set a counter associated
with the node, i&node=1, which it will
use to count through the branches. Once
it has output the portion of the SELECT
statement, it checks if &node is a leaf
node. If it isn’t, we let &node equal &ith
child node, and start a new iteration of the
macro. If it is a leaf node, the macro
iteration ends, returning &node to the last
node.

We can see that this method is depth
first; the algorithm will always start by calling
%select_idvars with child nodes until it
reaches the first leaf. It is also a preorder
method as we are outputting &node
when we first reach it, before we continue
through the tree. We can easily adapt
the flow chart on the left to be post-order,
by moving points A and B in the diagram
on the left or in the code below. In the
simplified example code below, this is the same
as moving the output below the macro DO
loop.

   %LET root = 0;
   %MACRO select_idvars(n=1);
/*A*/ /*output &node*/
    %DO i&node = 1 %TO &&total_&node;
/*B*/ %IF &&level_&node < &level_max %THEN %DO;
        %LET node = child_&node;
        /*error warning*/
        %select_idvars(n = %EVAL(&n+1));
      %END;
    %END;
%MEND select_idvars;

Of course this flowchart still leaves many options open to the programmer, in particular how to
determine if a node is a leaf node or a final branch. In the example code above, I have assumed we have created macro variables storing this information earlier in the macro, which can be achieved with metadata databases (such as the dictionary library [3]) and PROC SQL summary functions earlier in the macro.

Other Applications

An understanding of tree traversal opens up many new avenues for solving complex problems, and when applied correctly can save huge amounts of time and energy for both you and for your SAS environment. The first step is to notice the tree within your problem. Once you have identified your tree, with defined root, branches and leaves, you can begin to consider the type of traversal required by considering the unknowns in your problem.

When recursion is involved in your solution, I recommend informing any relevant personnel involved with the project, including the error message suggested on page 5, and commenting in detail to explain your decision and method. When used properly, it should minimize the resources required significantly.

Conclusion

At the beginning of this blog we set out to transpose a dataset using only PROC SQL. While this
problem may not be directly applicable to our day to day SAS use, the knowledge gained is a very
powerful tool. Tree structures, traversal and recursion are all advanced programming techniques which can be used to tackle a wide range of problems.

Quanticate's statistical programming team can support you with Clinical Trial with the creations of TLFs, and any CDISC Mappings or SDTM conversions. Our team of experts would be happy to provide support and guidance for your development programme if you have a need for these types of services please submit a Request for Information (RFI) and member of our Business Development team will be in touch with you within 2 business days. 

References

[1] Ted Conway; "It’s a Bird, It’s a Plane, It’s SQL Transpose!"
      http://www2.sas.com/proceedings/forum2008/089-2008.pdf
[2] A Basic Tree Structure Diagram
      https://www.tutorialspoint.com/data_structures_algorithms/tree_data_structure.htm
[3] Information about the PROC SQL dictionary libraries.
      https://v8doc.sas.com/sashtml/proc/zsqldict.htm