Issues of Large Datasets
Numerous SAS® programmers experience problems when working with large SAS® datasets that have millions of rows, hundreds of columns and are close to the size of a gigabyte or even more.
Often it takes enormous amounts of time to process these datasets which can have an impact on your delivery timelines. Also, storing such datasets may cause shortages of permanent storage space or can exhaust the available memory on your IS systems. Some tools which are often used to help lower the storage space required for datasets, i.e. compression/archive tools, may have a limit to the size of the data it can compress in the first place depending on the file system of the hard disk storing the data (e.g. FAT32 or NTFS), and the compression method used to compress/archive the data (e.g. ZIP or RAR).
These problems can occur when working with large datasets and integrating or pooling the data from various clinical studies. They can also occur with studies that have many laboratory parameters or with those with high numbers of patients.
In order to tackle these problems it is possible to make a large dataset smaller by reducing the number of observations and/or the variables, or by reducing the size of the variables without losing any of its information.
The solutions below are a few clinical programming techniques to reduce the size of your datasets and work with them efficiently.
Reducing/Compressing the size of a Dataset
You can make a large SAS® dataset smaller by using SAS® statements like LENGTH or dataset compression tools like COMPRESS option to save on the storage space. Reducing the size of the dataset can also help in reducing the time SAS® takes to read or access data from a dataset.
Size reduction of your SAS® datasets can also be achieved by the selection of required variables. For example when using the DROP and KEEP functions you are deciding on the appropriate variables to include in your datasets and remove unnecessary variables which will reduce their size.
Programming Efficiency Methods
Other techniques can be found through programming efficiency. This can be done in the form of saving storage space or memory, sub-setting observations and sorting the large datasets efficiently.
Consider the displays you wish to generate ahead of time. By forward thinking you may be able to store your datasets in smaller sub datasets and therefore only need to use the smaller datasets. i.e. The Lab dataset is notoriously large, but saving them as 3 datasets “Lab – Blood Chemistry”, “Lab – Haematology” and “Lab – Other” means less manipulation required as you have already thought about how you will be required to use them.
Saving storage space or memory:
- change numerical variables to character that do not perform any arithmetic calculations.
- delete the unneeded SAS® datasets from the WORK library as they become redundant within the program or delete them at the end of the program.
- use _NULL_ as a dataset name.
- use SAS® Views as alternatives to SAS® datasets.
- SQL JOIN or DATA step MERGE can help you save space when combining datasets.
Sub-setting observations and sorting large data
It is recommended to perform data subsets early in the code. By sub-setting or filtering any unwanted records from a dataset, we can reduce the size of the dataset thus saving storage space and improving efficiency of the program. For this purpose, either a WHERE statement or an IF statement can be used.
You can perform sorting and sub-setting in the same SORT procedure to save storage space and processing time. Also, you can use the TAGSORT option in a PROC SORT command when there is insufficient disk space available to sort a large SAS® dataset.
Related Blog Posts: