
SAS Tips and Tricks -  An Overview of Good Programming Practices

By Clinical Programming Team
September 23, 2024

Good programming practices in SAS

SAS® (Statistical Analysis System) is a powerful tool widely used in statistical programming for clinical trial data analysis and reporting. Whether you’re working on clinical trials or handling complex data analyses, mastering SAS programming techniques and managing large datasets efficiently are crucial for successful outcomes. In this article we’ll explore practical SAS tips and tricks for everyday programming, strategies for handling large SAS code, and techniques for working efficiently with large SAS datasets.

By integrating these strategies, you’ll be well-equipped to optimise your SAS programming and tackle the challenges associated with big data.

 

SAS Tips and Tricks

Working in the life sciences industry provides numerous opportunities to learn valuable SAS techniques, often through training sessions. However, some of the most effective SAS tips come from collaborating with other programmers, whether seeking advice on coding challenges or reviewing programs written by colleagues. There are many simple SAS tips that can be applied to everyday programming tasks.

Tip 1 - Good Programming Practices in SAS

Whilst every programmer has their own preferred style of programming, Good Programming Practice (GPP) must be at the forefront of that style. Programmers naturally follow their own internal conventions, adhere to Standard Operating Procedures (SOPs) and code things in a certain way, often based on what is perceived as correct. However, a programmer should take some time to define their own style within the framework of compulsory/mandatory GPP, keeping it consistent at least within each individual program. It is important to avoid making significant sweeping changes in coding practices within a program, and potentially across a project.
Compulsory GPPs will include those from the programmer’s company, the client and any other mandatory SOPs that must be followed. These form the base layer of a programming style that can be built on. Beyond this base there are many other sources from which a programmer can draw ideas and inspiration whilst still conforming to GPP.
A good example of a third-party source is PHUSE’s very own Good Programming Practice Guidance Document.

Other third-party sources include programming resources such as this one. Even where a resource covers a different topic, the practices it recommends can be considered GPP and can influence how a programmer develops their programming style.

  • Base Step: mandatory practices from SOPs.

  • Third-Party Step: ideas and concepts raised by other sources, including optional GPP in SOPs.

  • Personal Style Step: reviewing the preceding steps and deciding on one’s own final style, conforming to GPP and tying personal ideas together with those already explored.

 

Tip 2 – Alternative SAS Techniques to ‘IF THEN ELSE’

Most programmers will have used IF THEN ELSE statements when working with conditional processing, assigning one value when the condition is true and another value when the condition is false. In these situations, lines of code can be reduced by using the functions IFC and IFN instead.

These functions can both create a new variable with assigned values based on whether a condition is true or false. IFC will create a character variable and IFN will create a numeric variable.

Below are standard templates for using IF THEN ELSE to create numeric/character variables, along with their counterparts using IFN and IFC:

*Standard IF THEN ELSE code (numeric);
if condition then variable=value1;
else variable=value2;

*Equivalent IFN code;
variable=ifn(condition, value1, value2);

*Standard IF THEN ELSE code (character);
if condition then variable='value1';
else variable='value2';

*Equivalent IFC code;
variable=ifc(condition, 'value1', 'value2');

 

It is evident that IFN and IFC effectively condense and rearrange the keywords of IF THEN ELSE code into just one compact line.

For example, both of the following pieces of code can be used to create the same variable:

if sex='Male' then sexcd=1;
else sexcd=0;

sexcd=ifn(sex='Male',1,0);


Either piece of code can be used here and will produce the same result. To compare processing times, equivalent code was run five times for each method and the average times recorded. These runs, on server SAS version 9.1.3, showed very little difference in average processing times. However, the IFN function has the advantage of requiring less code.

These functions are ideal when creating variables with two possible values.
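
One related feature worth knowing: both functions accept an optional fourth argument, which is returned when the logical expression itself evaluates to missing. Below is a small sketch; the flag variable is illustrative:

data example;
  * flag takes the values 1 (true), 0 (false) and missing;
  do flag = 1, 0, .;
    * the fourth argument is returned when flag is missing;
    status = ifc(flag, 'Yes', 'No', 'Unknown');
    output;
  end;
run;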

Tip 3 - Access Special Characters

There may be times when a specific character, such as a superscript or trademark symbol in a footnote, is required in outputs but is not directly accessible on the keyboard. These characters and many more are all accessible for use in SAS programs.

This table shows some of the symbols available for use in SAS, along with their corresponding byte numbers and a description.

Byte(i)    Symbol    Description
153        ™         Trademark sign
169        ©         Copyright sign
170        ª         Superscript a
174        ®         Registered trademark sign
176        °         Degree symbol
177        ±         Plus or minus sign
178        ²         Superscript 2 / squared
179        ³         Superscript 3 / cubed
185        ¹         Superscript 1
188        ¼         One quarter
189        ½         Half
190        ¾         Three quarters
247        ÷         Division sign

 

The full list of characters can be seen by running the following code. This will create a dataset with a variable containing each special character and a variable with each respective byte number in SAS.

data check;
  do i=1 to 255;
    sq=byte(i);
    output;
  end;
run;

Once the desired character is identified, %let can be used to create a macro variable for it. For example, the following code creates a macro variable for the superscript a symbol:

%let supa=%sysfunc(byte(170));

To check this has worked properly, use %put to write the value of this macro variable to the log:

%put &supa;

The macro variable can then be inserted into the program. For example, the following is standard PROC REPORT code for creating a table, with the macro variable used to create a superscript value in the footnote; note the &supa reference in the COMPUTE block.

proc report data=final nowd headline headskip split='|' ps=23 formchar(2)='_';
  column ('__' pdcdl col1 ord1 ord2);
  define pdcdl / left '  ' width=60;
  define col1 / center 'DrugA|6 mg SC';
  define ord1 / order noprint;
  define ord2 / order noprint;
  title 'Table 14-1.3 Summary of Important Protocol Deviations';
  compute after _page_;
    line @13 72*"_";
    line @2 ' ';
    line @14 "&supa Other than Treatment Compliance/Test Article Administration";
    line @14 'Note: Deviation categories are not mutually exclusive';
  endcomp;
  break after ord1 / skip;
run;

This will create the output with the superscript a appearing at the start of the footnote, where the special character was inserted.

In this example, it is a simple process to find out the byte number of the desired special character.

Tip 4 - Using the ‘Where’ Option

Issues, outliers and unexpected values in datasets may not be noticed straight away. A quick way to see all the unique values of a variable in a dataset is to use the ‘Where’ option from the Data menu of the table viewer.

Although its primary use is to subset the data, it is also ideal for getting an overview of what values a variable can take.

The following steps show an example of how to use this option:

  1. Open a dataset and select 'Where...' from the 'Data' menu

  2. Choose the variable you want to check

  3. Select the EQ operator

  4. Select <LOOKUP distinct values>

  5. List of values is now visible

Using this option will then show the unique values of that variable, sorted in ascending order. This makes for an easy way to get an overview of the data.

For example, in a demographics dataset, all the values of age can be examined. For a numeric variable like this, the minimum and maximum values can be observed, providing insight into possible outliers.

 

Another example is reviewing start dates in the adverse event dataset, as these can often be subject to issues. Examining this variable allows for the identification of incorrect or unusual values that may cause problems during programming.

 

This quick way of looking at data values saves time when checking for issues, outliers and unexpected values.
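
For programmers who prefer code to the interactive viewer, a similar overview can be produced with standard procedures. Below is a minimal sketch, assuming SDTM-style datasets named sdtm.dm and sdtm.ae (names illustrative):

proc means data=sdtm.dm n nmiss min max;
  * overview of age, highlighting possible outliers;
  var age;
run;

proc sql;
  * unique adverse event start dates, in ascending order;
  select distinct aestdtc from sdtm.ae order by aestdtc;
quit;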

  

Tip 5 - Best Practices for SAS PC Shortcuts

When working in PC SAS there are many helpful keyboard shortcuts available. For example, when commenting or un-commenting blocks of code, time can be saved by using a shortcut instead of typing the comment characters manually, and when trying to find the matching bracket in a pair, a shortcut eliminates the need to spend time searching manually.


These actions can be performed with ease by pressing a few buttons on the keyboard. The following table shows some of the shortcuts available to use:

Action                                         Keyboard Shortcut
Convert highlighted text to upper case         Ctrl Shift U
Convert highlighted text to lower case         Ctrl Shift L
Comment highlighted text                       Ctrl /
Uncomment highlighted text                     Ctrl Shift /
Collapse all sections of code                  Ctrl Alt - (on number pad)
Expand all sections of code                    Ctrl Alt + (on number pad)
Move cursor to matching bracket                Ctrl (
Move cursor to matching DO or END statement    Alt [

 

The full list of available shortcuts can be accessed through the menu:
Tools -> Options -> Enhanced Editor Keys

This will show a list of all commands that have a keyboard shortcut assigned to them, along with a brief description of what the command will do.

 

Ticking the Show all commands box will also show the commands that have not yet been assigned a shortcut. These can then be set as needed. For example, here are some of the unassigned commands which can be given a shortcut:

  • Convert highlighted text to opposite case.
  • Insert current date and time.
  • Delete line.
  • Repeat the current line.
  • Remove trailing white space from end of lines.

The Assign keys… button can be used to set a shortcut or to change an existing shortcut.

Switching text to upper/lower case and commenting/un-commenting code are especially useful and good time savers.

 

How to Handle Large Code

Working with large SAS programs usually presents complications: thorough programming results in large amounts of code and long syntax. To handle these challenges and avoid common pitfalls, planning and sectioning can be applied to SAS programs.

Sectioning Code in SAS

Sectioning work is a common approach to any complex problem, so it is no surprise that it plays such a key role in handling large code. Programmers naturally break their problems down into smaller, more manageable modules, to then tackle individually. As SAS is a procedural language, this modular approach is a natural part of programming; modularisation [1] is also a large topic in the wider world of computer science.

The first stage of planning large code should be to consider the types of sectioning that will be applied, by creating an outline of the pseudocode. Below are four of the major section types: procedural, variable, observation and input. A well-laid-out large program is likely to combine all four in some way, though certain datasets may lend themselves to one type of section more than the others. It will vary by dataset, and it is the responsibility of the author to find the right sections for their programming style and output dataset.

  • Procedural Sectioning: Procedural sectioning breaks code down by process, and often by applied SAS procedure. For example, a section dedicated to SAS PROC SORT statements, or to applying formatting. Other examples include transposing, summarising, baseline calculations, merging or macros.
  • Variable Sectioning: In many data models there are naturally grouped variables, such as timing variables or flags. These are often calculated together and will often make sense to be located together within the code. For example, SDTMIG 3.2 groups variables into Identifier, Topic, Timing, Qualifier and Rule [2]. Alternatively, code may be sectioned by variable category such as subject-level, character or numeric.
  • Observation Sectioning: In other cases, it makes sense to split code based on a vertical division of the intended dataset, rather than horizontal. A common example of this would be laboratory parameters, where it would be natural to create a section for different tests, or a questionnaire dataset. This could also appear on a more general scale, with a section for derived observations or for specific epochs. These sections are most likely to be appropriate for a repeat calling macro.
  • Input Sectioning: Depending on the structure of the input and output data, these sections may overlap with variable or observation sectioning. Here, the programmer has multiple sections in which all derivations can be performed from the datasets being read in; for example, a demography section wherein the demography dataset is read in, and all the required processing is completed for these variables.

As mentioned above, the ideal code will mix together all these styles of sectioning. A common approach is to begin and end with procedural sections in the form of dedicated sorting and merging. A vital signs dataset is likely to use a large number of observation sections, whilst a subject-level dataset will be more suited to variable sectioning. Please note that while preparing a section plan is important, it will always be subject to change: study needs can evolve, and new data can pose new challenges. Be aware that the structure of the code also needs to be adaptable. One possible outline for a sectioned program is sketched below.
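
As an illustration only, a hypothetical outline mixing these section types might look as follows; all dataset and variable names are placeholders:

*=====================================================;
* Section A: procedural - read and sort input data    ;
*=====================================================;
proc sort data=sdtm.lb out=a_001_lb;
    by usubjid lbtestcd lbdtc;
run;

*=====================================================;
* Section B: observation - albumin records            ;
*=====================================================;
data b_001_alb;
    set a_001_lb;
    where lbtestcd='ALB';
run;

*=====================================================;
* Section C: procedural - combine and finalise        ;
*=====================================================;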

 

The 6 Benefits of Naming Conventions

A good naming convention, agreed upon early, can make a big difference in late-stage debugging and updates. Below are six key rules for naming sections:

  1. Unique Dataset Names
    Having a unique dataset name ensures datasets are never overwritten. This is also preferable for debugging as the programmer only needs to run the entire code once and then look in the work library without any risk that a dataset is overwritten by a different data step or procedure. Overwriting datasets is generally bad practice.
  2. Suffixes and Prefixes

    Using suffixes and prefixes for datasets and variables produced in certain sections allows for easy tracking of workflows and sorting of the work library. Do note, however, that dataset names cannot start with a numeric character in SAS, which rules out purely numeric prefixes.

  3. Be Succinct Yet Descriptive

    Succinct section and dataset naming conventions make it easier to manage the code as it gets bigger. Comparing and matching different names is easier when names are shorter, as many library viewers may only show the first 8 characters by default. An example of a good length and dataset name might be A_BBB_X, where A is a section name and BBB is additional information.

  4.  Order Datasets Chronologically

    In the above rule, an _X is placed at the end of the dataset name. This sequence marker keeps the datasets sorted in the work library and is a very widely used technique. However, the ordering of sections should be done with caution: if a new section must be added later in production, it could result in having to rename many datasets.

  5. Stay Relevant

    At first a simple ‘A’, ‘B’ and ‘C’ sectioning might seem ideal: it is short, alphabetical and easy to remember. However, it will make updates and debugging difficult. Whilst sections are fresh in the minds of the production programmers in the first week of production, after a few rounds of quality checking such section names may start to seem meaningless.

  6. Well Documented

    Much like every aspect of code development, the choice of naming convention should be well documented in the comments, in case the code needs to be updated, adapted for new purposes or rerun by another programmer. It will also make code review easier at a later stage.

Later in this blog we'll discuss some potential solutions for automating naming conventions using macros, which will help keep section and dataset labelling consistent throughout big code.

 

How to Section and Program Headers

A section can begin with a comment, a macro call or even a bookmark. The section header is an excellent opportunity to document sectioning and naming decisions, and good documentation is a cornerstone of good programming practice. There are many designs for section headers and dividers, formed from various comment formats. For large code, more information may be stored in a section divider than just the section name, such as a description of the code within the section, the input and output datasets, or any parameters to be used and updated. It can also be used to define section-specific macros to help improve consistency throughout the section. Below are a few suggestions on ways that one could format section headers.

Figure 1: Suggestion for format of section headers

/*=====================*/
/* Main section        */
/*=====================*/
/* description:        */
/*                     */
/*=====================*/

*------------------;
* Sub-section      ;
*------------------;
* output datasets: ;
*                  ;
*------------------;

/*=====================*/
/* Main section        */
/*=====================*/
%let section_name = LAB;
%let input_ds = LAB1;
/*=====================*/

/**Main Section Macro Call**/
%macro main_section(in=, out=, prefix=);
  %* in     = first used dataset;
  %* out    = output dataset for merging;
  %* prefix = prefix applied to all datasets;
%mend;

/*==MAIN SECTION==*/
%let prefix = LAB;

*===sub-section===*;
%let subprefix = &prefix.1;

*===sub-section===*;
%let subprefix = &prefix.2;

/*==================*/
/*===MAIN SECTION===*/
/*==================*/
%let section_name = LAB;
%section_starting_macro;

 

It should be noted that these examples use “forward slash comments”, i.e. /*…*/, for main section headers, and “regular comments”, i.e. *…;, for sub-sections. This way, the programmer can easily comment out large chunks of code within sections, or comment out entire sections if needed. Alternatively, to deactivate a section, a large uncalled macro can be wrapped around it. Also, notice that the macro example uses macro comments (i.e. %*…;), which prevents the comments being written to the log.

Not only are section headers important, but also the overall program header. The best programs will start with a program header containing all relevant information: which external datasets are read in, any outputs the program creates, the concise purpose of the program, among other things. A special consideration for big SAS code is to include a small summary of how the program functions; a list of sections and their purposes can be very helpful for future navigation. Should a programmer inherit or come back to a program after an extended period, a quick glance at the header should tell them exactly where to look in the program in order to adapt, debug or continue creating the program. A sketch of such a header follows.
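
As an illustration, a program header for big code might look like the following; all names here are placeholders:

/****************************************************************
* Program  : adlb.sas
* Purpose  : Create the ADLB analysis dataset
* Inputs   : sdtm.lb, adam.adsl
* Outputs  : adam.adlb
* Sections : A - read and sort inputs
*            B - parameter-level derivations
*            C - merge, label and finalise outputs
****************************************************************/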

 

Other SAS Best Practice Considerations

Different environments for SAS will have different functionalities designed to help organise the code. The list is extensive and will be constantly evolving as new versions of SAS are released, so this blog does not go through all features available.

However, it is important to investigate the features of the SAS environment that can benefit the code layout. For example, in SAS 9.4 for Windows, by default, the combination of shift and F2 keys creates a bookmark. As many bookmarks as needed can be set, and pressing F2 alone will jump through all bookmarks, allowing for scrolling between sections easily.

 

How to Order and Sort the SAS Code

Ordering large code properly will not only improve the readability, but may also have a positive effect on the efficiency of the program. Much like working with big datasets, big code may result in a large amount of read and write processing that will slow down the run time, and in turn the programmer’s own productivity. Sorting and merging are the biggest culprits of this, and whilst they are a cornerstone of most SAS programs, they can be minimised by ordering and sectioning.

Starting big code with a procedural section focused on sorting is a very popular approach, known as data pre-processing. From a planning and sectioning perspective, it establishes naming conventions and indexes for the incoming raw data. A common mistake is not sorting uniquely at this stage, which leads to further sorting being required later. When starting with a sort procedure section, consider whether further sorts will be needed later in the code, and if so, reconsider the initial sort to avoid this.

proc sort data=rawdata.labs out=work.labs_sort nodupkey dupout=dups_labs;
    by usubjid visit param;
run;

data work.alb    (where=(paramcd="ALB"))
     work.gluc   (where=(paramcd="GLUC"))
     work.sodium (where=(paramcd="SODIUM"));
    set work.labs_sort;
run;

In the above example, rawdata.labs is read and sorted only once, and a single DATA step then splits the sorted data into three datasets, one for each parameter, all of which will be needed for the upcoming processing. (Note that PROC SORT itself only permits a single OUT= dataset, hence the separate splitting step.) This is a common technique when handling big data in SAS, but it also works well for big code, as the run time is improved, especially over multiple sorts of different raw datasets. In addition, there is a NODUPKEY and DUPOUT. Removing duplicates silently should be avoided, but applying the following code will verify whether the duplicates dataset is empty, and flag it in the log if not:

proc sql noprint;
    /* count the duplicate records removed by NODUPKEY */
    select count(*) into :n_dups trimmed from dups_labs;
    drop table dups_labs;
quit;
/* open-code %if requires SAS 9.4M5 or later */
%if &n_dups > 0 %then %put USER: rawdata.labs contains duplicates;

Or alternatively:

data _null_;
    set dups_labs;
    * this PUT only executes if dups_labs contains at least one record;
    put 'USER: rawdata.labs contains duplicates';
    stop;
run;

This makes it easier to spot issues in the code early on, as well as keeping the work library clear of unnecessary datasets. Many different prefixes can be used for such messages, including but not limited to USER, ISSUE, NOTE, WARNING and ERROR. However, the PHUSE GPP guidelines [5] suggest avoiding WARNING and ERROR, as they may cause confusion with SAS-generated warnings and errors. Additionally, a programmer may wish to add the date and their initials to a log-printed comment, if this is consistent with GPPs. It is important to review the relevant programming SOPs within the organisation before finalising a method.

As a final consideration, if SQL is being used regularly, or the programmer would like to implement more advanced techniques for handling bulk datasets, the SQL primary key system could be used. When defining or amending a table in a PROC SQL statement, the programmer can state which variables should act as the unique sort key. This information is stored in the SQL dictionaries, and can be referenced by PROC SQL in the future. This has powerful applications for bulk sorting, merging and accessing [3][4].
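
Below is a minimal sketch of defining a primary key through PROC SQL; the table and variable names are illustrative:

proc sql;
    create table work.labs_keyed
        (usubjid  char(20),
         visitnum num,
         paramcd  char(8),
         aval     num,
         constraint pk_labs primary key (usubjid, visitnum, paramcd));

    /* review the constraint definition */
    describe table constraints work.labs_keyed;
quit;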

 

Branching and Merging

An issue that programmers may come across when updating their code is finding that, after updating one dataset name, later merges unexpectedly cease to work. As code begins to branch out from the original trunk of the program, it may become harder to keep track of the various uses of every dataset. But how can this be planned for and fixed? What can be done to stop this issue in its tracks, or to fix it in hindsight?

Diagram 1

The first suggestion is perhaps a less sophisticated method, but also one of the easiest to implement. Take a pen and paper and begin to draw a simple flowchart or spider diagram, following the main chains of datasets and branching off as per the logic of the code. As SAS is procedural, the only deviation from a usual flowchart will be a macro, which can be included as a separate flowchart. Diagram 1 shows an example illustration of this technique. Different layers will form naturally in the flowchart, which should line up with the sections of the code. Note that not every data step is listed, just enough for the flowchart to connect.

Alternatively, the log can be utilised to get a better understanding of the code. Using simple statements such as:

data _null_;
    * splitting the literal stops the word itself appearing in the program source;
    put 'N' 'OTE AP: Section XXX has started';
    put 'N' 'OTE AP: Input datasets = ';
    put 'N' 'OTE AP: Output datasets = ';
run;

Or:

%put %str(N)OTE AP: Derivation for variables XXX, YYY, ZZZ complete;
%put %str(N)OTE AP: Merged variable AAA for calculating BBB;

This code will output information to the log, with initials printed on the line. Then, either manually or using a utility SAS program, the log can be filtered to show only the lines containing those initials; a sketch of such a utility follows. Consider also printing the section headers discussed earlier in this blog, with as much or as little information as the complexity of the code requires.
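
For instance, a minimal log-filtering utility might look like the following; the log path and the 'AP' initials are placeholders:

data _null_;
    * read the saved log file line by line;
    infile "/path/to/program.log" truncover;
    input logline $char200.;
    * write only the tagged lines to the current log;
    if index(logline, 'AP:') then put logline;
run;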

The final suggestion is to produce one system of overarching macros which can track a dataset and the section names throughout the code. The section below discusses how this would work and shows a prototype of what such a macro could look like, along with the advantages and disadvantages this would offer.

Understanding Prefix and Wildcard Macros

The macros below are a proof of concept for the ordering and numbering of dataset names. They may only be suitable in some cases and are not necessarily meant to be used in every program, but they are a further development of ideas already raised in this blog.

The idea of these macros is to generate a prefix, stored in the macro variable &prefix, for each dataset created in a program. When the work library is opened, every dataset will then appear in order of creation. This allows ease of debugging and lets programmers build a narrative from the work library alone.

For the following to work, each dataset name after the prefix must be entirely unique. Every time a new dataset is created, %prefix should be called. In addition, when a dataset needs to be read, the second macro, %prefix_wildcard, should be called once. This macro searches the work library for the unique dataset name and stores the actual dataset name, including prefix, in a macro variable matching the unique dataset name.

If the earlier guidance on dataset naming is followed, with names that are descriptive of the purpose and function of each dataset, the work library alone will make clear what is going on, without looking at the code. This turns the work library of a large SAS program with many datasets from chaos into a refined, ordered map of the program.

*==========================================================================;
* Macro Define Section: Section in which both the %prefix and              ;
* %prefix_wildcard macros are defined.                                     ;
*==========================================================================;

* The macro %prefix creates a prefix for new datasets of the form;
* section_sequence-number, e.g. the first prefix in section A will be;
* A_001_unique_dataset_name.;
%macro prefix;
%* Define global macro variables seqnumber, prefix and section, used to build a unique prefix for each new dataset;
  %global seqnumber prefix section;
%* Because %global creates seqnumber with a null value, test for a value rather than existence. Add 1 if a value exists, otherwise start the sequence at 001;
  %if %length(&seqnumber) > 0 %then %do;
      %let seqnumber=%sysfunc(putn(%eval(&seqnumber+1),z3.));
  %end;
  %else %do;
      %let seqnumber=001;
  %end;
%* Create the prefix macro variable;
  %let prefix=&section._&seqnumber._;
%mend prefix;

* The macro %prefix_wildcard essentially puts a wildcard before a dataset name.;
* Provided every dataset name after the prefix is unique, it creates a macro;
* variable from the unique_dataset_name with the full prefixed name resolved.;
%macro prefix_wildcard(unique_dataset_name=);
  %global &unique_dataset_name;
%* PROC SQL looks up the prefixed name of unique_dataset_name in sashelp.vmember;
  proc sql noprint;
     select memname
     into :&unique_dataset_name trimmed
     from sashelp.vmember
     where libname = "WORK" and
           memname like upcase('%'||"&unique_dataset_name");
  quit;
%mend prefix_wildcard;

*==========================================================================;
* Section A: Example of using the %prefix and %prefix_wildcard macros.     ;
*==========================================================================;
%let section = A;

%prefix;
data &prefix.sdtm_input_1;
    usubjid=1;
run;
%prefix_wildcard(unique_dataset_name = sdtm_input_1);

%prefix;
data &prefix.first_data_step;
    set &sdtm_input_1;
run;
%prefix_wildcard(unique_dataset_name = first_data_step);
%prefix;
proc sort data=&first_data_step out=&prefix.first_data_step_sort;
    by usubjid;
run;

Reference Code

Taking over and having to adapt large SAS code can be a daunting task. There could be any number, or even all, of the following problems: poorly commented code or no comments at all, datasets being overwritten any number of times, unclear structure and poor indentation.

It is therefore important to do a quick read of the code to determine which course of action to take. A useful feature of SAS Enterprise Guide (EG) is the indentation tool, which can instantly make the code slightly more readable and skim reading easier. If the working environment is not EG, the code can be copied into EG, indented, and then copied back into the working environment being used.

Following a skim read, i.e. a quick read skipping some detail of the code, the decision should be made to either format and improve the existing code or start from scratch. Below is the top level of each approach:

  • Starting from Scratch Method: Should a complex SAS program be in a poor state, the most efficient use of a programmer’s time may be to set up a new program with its own structure and write the code from scratch, reusing parts of the original code in the new program.
  • Format and Improve Method: If the choice is to format and improve the code, then essentially apply the techniques and ideas raised in this blog. The steps below give an idea of what to follow; apart from the first step, which should always be done first, the others can be done in whichever order is found most appealing.

  1. Check for dataset overwrites; if they occur, remove the overwriting. A first step could be simply adding suffixes such as _1 and _2 to duplicated dataset names.

  2. If indentation hasn’t already been applied while skim reading the code, apply it here.

  3. Separate the code out into sections and order data steps and procedures logically.

  4. Add comments; not only are these important for the finished code, they also help to personally keep track.

  5. Implement naming conventions for datasets.

 

 

How to Work Efficiently with Large SAS Datasets

In addition to the challenge of working with large SAS code, numerous SAS programmers experience problems when working with large SAS datasets that have millions of rows, hundreds of columns and are close to a gigabyte in size or even more.

While managing large SAS code focuses on organising and structuring the logic for readability and efficiency, working with large datasets presents a different set of challenges. Large code is about handling complexity in the programming itself, whereas large datasets, with millions of rows and hundreds of columns, can strain system resources, slow down processing times, and require significant storage. These issues demand solutions focused on optimising data size, memory usage, and access speed. This section will explore techniques for working efficiently with large SAS datasets, focusing on minimising their size and improving performance.

Often it takes enormous amounts of time to process these datasets, running into issues such as “insufficient space in file” errors, which can have an impact on delivery timelines. Also, storing such datasets may cause shortages of permanent storage space, or can exhaust the available memory on IT systems. Some tools often used to lower the storage space required for datasets, i.e. compression/archive tools, may have a limit on the size of data they can compress in the first place, depending on the file system of the hard disk storing the data (e.g. FAT32 or NTFS) and the compression method used to compress/archive the data (e.g. ZIP or RAR).

These problems can occur when working with large datasets and integrating or pooling the data from various clinical studies. They can also occur with studies that have many laboratory parameters or with those with high numbers of patients.

In order to tackle these problems, a large dataset can be made smaller by reducing the number of observations and/or variables, or by reducing the size of the variables, without losing any of its information.

Another challenge is optimising data access. Indexing frequently used variables can reduce access time and improve performance. Moreover, large datasets may benefit from compression techniques, as SAS offers compression options like COMPRESS=YES to save space.
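
As an illustration of the indexing suggestion, simple and composite indexes can be created with PROC DATASETS; the dataset and variable names here are hypothetical:

proc datasets library=work nolist;
    modify labs;
        * simple index on the subject identifier;
        index create usubjid;
        * composite index for subject/visit lookups;
        index create subjvis=(usubjid visitnum);
quit;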

The solutions below are a few clinical programming techniques to reduce the size of your datasets and work with them efficiently.

 

Reducing and Compressing the Size of a Dataset

A large SAS dataset can be made smaller by using SAS statements like LENGTH, or dataset compression via the COMPRESS= option, to save on storage space. Reducing the size of the dataset can also help reduce the time SAS takes to read or access data from it.

Size reduction of a SAS dataset can also be achieved by selecting only the required variables, for example by using the KEEP and DROP statements (or dataset options) to retain the appropriate variables and remove the unnecessary ones, which will reduce the size.
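
Below is a minimal sketch, assuming a source dataset source.study1 containing more variables than required:

data work.dm_small;
    * keep= on the SET statement means only these three variables are ever read;
    set source.study1 (keep=usubjid age sex);
run;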

There are two types of compression in SAS: CHAR and BINARY. CHAR compresses character variables more efficiently, while BINARY works well for numeric variables and mixed-type datasets, offering greater compression.

The "YES | CHAR" compression type uses RLE (Run Length Encoding), which reduces storage by compressing repeated characters. For example, the string WWWBBHHHMMSS (12 bytes) becomes 2W2B3H2M2S (10 bytes), saving space. This compression algorithm is specifically designed for character data in SAS.

Here’s an example of how to use it in a dataset:

data work.temp (compress=yes);
   set work.temp;
run;

The "BINARY" compression type combines RLE (Run Length Encoding) with sliding-window compression to reduce file size. For example, the phrase "count" repeated multiple times would be compressed using a small code to represent the repetition. While RLE alone cannot handle compressing repeated words, sliding-window compression efficiently encodes these repeated phrases. This method is especially effective for numeric data.

Here’s an example of applying binary compression in SAS:

data work.temp (compress=binary);
   set work.temp;
run;

 

How to Utilise SAS Views

Another efficient method for working with large datasets is utilising SAS Views. SAS Views do not store actual data but rather the code to generate the data, refreshing every time the view is accessed. This method can be particularly useful when dealing with external files. Here's an example:

data work.temp / view=work.temp;
   infile 'inputdata.txt' delimiter=';';
   input pt dob weight;
run;

To view the structure of a SAS View:

proc sql;
   describe view work.temp;
quit;

This approach can help streamline data processing by reducing the need for permanent storage while accessing live data as needed.

 

SAS Best Practices for Programming Efficiency

Other gains can be found through programming efficiency. This can take the form of saving storage space or memory, sub-setting observations and sorting large datasets efficiently.

An example of coding efficiency can be seen in the following comparison. This code:

data work.temp;
    length agecat 8;
    set source.study1 (keep=pt dob);
    * the age expression is evaluated twice and both IFs are always tested;
    if floor((today()-dob)/365.25) lt 18 then agecat = 1;
    if floor((today()-dob)/365.25) ge 18 then agecat = 2;
run;

Is much less efficient than:

data work.temp;
    length age agecat 8;
    set source.study1 (keep=pt dob);
    * the age is calculated once and ELSE IF avoids a redundant second test;
    age = floor((today()-dob)/365.25);
    if age lt 18 then agecat = 1;
    else if age ge 18 then agecat = 2;
run;

It is important to consider ahead of time the displays that need to be generated. By planning in advance, datasets can be split into smaller sub-datasets so that only the smaller datasets need to be used. For example, the lab dataset is notoriously large, but saving it as three datasets, “Lab – Blood Chemistry”, “Lab – Haematology” and “Lab – Other”, means less manipulation is required later. A sketch of this split follows.
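
Here is a minimal sketch of such a split, assuming a source dataset source.lab with a hypothetical category variable labcat:

data work.lab_chem work.lab_haem work.lab_other;
    set source.lab;
    * route each record to the relevant sub-dataset;
    select (upcase(labcat));
        when ('BLOOD CHEMISTRY') output work.lab_chem;
        when ('HAEMATOLOGY')     output work.lab_haem;
        otherwise                output work.lab_other;
    end;
run;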

To save storage space or memory, some suggestions would be to:

  • change numerical variables that do not take part in any arithmetic calculations to character.
  • delete unneeded SAS datasets from the WORK library as they become redundant within the program, or delete them at the end of the program.
  • use _NULL_ as a dataset name where no output dataset is needed.
  • use SAS views as alternatives to SAS datasets.
  • use SQL JOIN or DATA step MERGE to save space when combining datasets.

 

Sub-Setting Observations and Sorting Large Data

It is recommended to perform data subsetting early in the code. By sub-setting or filtering unwanted records out of a dataset, the size of the dataset can be reduced, saving storage space and improving the efficiency of the program. For this purpose, either a WHERE statement or an IF statement can be used.

Perform sorting and sub-setting in the same SORT procedure to save storage space and processing time, as shown below. Also, use the TAGSORT option in PROC SORT when there is insufficient disk space available to sort a large SAS dataset.
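
A minimal sketch combining both suggestions; the dataset and variable names are illustrative:

proc sort data=source.lab out=work.alb tagsort;
    * subset while sorting, so only ALB records are ever sorted;
    where paramcd='ALB';
    by usubjid visitnum;
run;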

Using BUFNO and BUFSIZE:

To further enhance efficiency, SAS provides options like BUFNO and BUFSIZE to optimise how much data is transferred per read/write operation. The BUFNO option specifies the number of buffers to be used during processing, while BUFSIZE controls the size of each buffer. For example:

OPTIONS BUFNO=2 BUFSIZE=16384;

In this scenario, two buffers of 16,384 bytes each are used, meaning 32,768 bytes are transferred in one I/O operation. Note that BUFSIZE can only be used for output datasets and becomes a permanent attribute, while BUFNO applies only to the current step or session, offering flexibility for temporary adjustments. These options can reduce the time it takes to process large datasets significantly.

 

 

Conclusion

Mastering good programming practices in SAS is crucial for enhancing coding efficiency, especially when dealing with large datasets and big SAS programs. By applying the SAS tips and tricks covered in this article, a programmer can significantly improve both the performance and maintainability of their programs. These SAS best practices help in managing challenges like “insufficient space in file” errors and handling big data more effectively. Programming efficiently improves the overall quality of clinical trial reporting, which is beneficial for gaining regulatory approval of investigational new medicines in drug development.

 

References

[1] The Advantages of Modularization in Programming, G S Jackson - https://www.techwalla.com/articles/the-advantages-of-modularization-in-programming

[2] SDTM Model Concepts and Terms, Judy Li - http://pharma-sas.com/sdtm-model-concepts-and-terms/

[3] Create Index Guide, SAS Support - https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473673.htm

[4] Assessing SQL Dictionaries, SAS Support - https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473711.htm