Creating a temporary SAS array of dynamic size

Arrays in SAS are incredibly useful things. For example, if you’re dealing with a concomitant medications dataset and want to check for certain medicines across multiple columns, you’d be hard-pressed to find a faster method than using arrays!

Within the SDTM.CM domain, medication names are spread across several columns, usually: CMTRT (Reported Name of Drug, Med, or Therapy), CMMODIFY (Modified Reported Name), and CMDECOD (Standardized Medication Name).

If we wanted to find and flag, for example, the following three medicines: Aspirin, Antacid, Potassium Chloride ; we could do it as follows:


data cm;
    set sdtm.cm;

    array all_meds {*} $ cmtrt cmmodify cmdecod;
    array check_meds {3} $ _temporary_ ("aspirin", "antacid", "potassium chloride");

    do i = 1 to dim(all_meds);
        do j = 1 to dim(check_meds);
            if upcase(all_meds[i]) = upcase(check_meds[j]) then occur = "Y";
        end;
    end;

    if occur ^= "Y" then delete;
run;

The above method works well to quickly check for multiple conmeds across multiple columns. However, usually a conmeds list is not restricted to only 3 items and counting the number of unique items is a slow and tedious process. If an item needs to be added or removed, then the space to be reserved needs to be updated.

Unfortunately specifying the list of conmeds in a temporary array prevents the use of the dynamic sizing, usually indicated by {*}. This is something that will hopefully be fixed in a newer version of SAS, but until such time I’ve taken it upon myself to create a macro which creates dynamically sized (sort of!) temporary arrays. It does this by counting the number of items and automatically reserving the space for it.

Note: if you are feeling lazy, you can still create a temporary array and simply oversize it, e.g., specify a size of 100 items even though you may only need half that. However, this will result in SAS posting a WARNING to the log about partial array initialization, which is not ideal in the pharmaceutical environment where we want clean logs!

If you’d prefer to avoid another O(n) loop which checks for the maximum length needed for character variables, simply set a static length, say $20.


/*
    Macro makeTempArray
    
    Purpose: To create temporary arrays without knowing the
             size of the array needed beforehand. This is a 
             limitation of the original SAS procedure for
             creating temporary arrays.
    
    Parameters:
        arrayname : an arbitrary name for your array
        ischar    : pass either Y for a character array 
                    or N for numeric
        items     : pass the list of items to be contained in
                    the array, wrapped in %str() and separated
                    by commas
*/

%macro makeTempArray(arrayname=, ischar=, items=);
    %let n=%sysfunc(countw(&items., %str(,), )); /*count the number of items to reserve space for*/
   
    %if &ischar.=Y %then %do; /*if this is a character array, we need the length of the longest item*/ 
        %let l = 1;
        %do j = 1 %to &n.;
            %let l0 = %sysfunc(length(%sysfunc(scan(&items., &j., %str(,), r))));
            %if  &l0. > &l. %then %let l = &l0.;
        %end;
    %end;
    
    array &arrayname. {&n.} %if &ischar.=Y %then $&l.; _temporary_ (
        %do i = 1 %to &n.;
            %let item = %sysfunc(scan(&items., &i., %str(,), r));
            %if &ischar.=Y %then %str("&item." ); %else &item.;
        %end;
    );
%mend makeTempArray;

With this macro, we can now modify our initial starting block of code as follows (items must be separated by commas):


data cm;
    set sdtm.cm;

    array all_meds {*} $ cmtrt cmmodify cmdecod;
    %makeTempArray(arrayname=%str(check_meds), ischar=%str(Y), items=%str(aspirin, antacid, potassium chloride));

    do i = 1 to dim(all_meds);
        do j = 1 to dim(check_meds);
            if upcase(all_meds[i]) = upcase(check_meds[j]) then occur = "Y";
        end;

        if occur = "Y" then leave;
    end;

    if occur ^= "Y" then delete;
run;

Hope this helps you next time you need to cross-check multiple items across multiple columns! Happy hacking!

Updates: My colleague, Mazi Ntintelo has rightly pointed out that the commas within the macro’s scan functions should be wrapped as %str(,) and also that the do loop checking for conmeds can be optimised with a leave statement. Thanks, Mazi!

Significant Figures in SAS

For three significant figures, the SAS Institute provides the following code snippet to accomplish the task.

However, it is often useful to round to more or less than 3 significant figures. I’ve developed a macro to do so for my own use and am sharing the code below.

%macro _nsigfig(varin=, varout=, n=);
if &varin. = 0 then &varout. = 0;
else do;
	if int(&varin.) ^= 0 then do;
	    &varout. = round(&varin., 10**(int(log10(abs(&varin.))) + (1 - &n.)));
	end;
	else do;
	    &varout. = round(&varin., 10**(-1*(abs(int(log10(abs(&varin.)))) + &n.)));
	end;
end;
%mend _nsigfig;

NONMEM Dataset Example (free, with R code)

I’m making available here a basic NONMEM dataset example, along with the R code used to create it. Hopefully it will be helpful to someone in future!

CSV download: http://jvdl.me/downloads/nonmem/dummy_nonmem.csv

XPT download: http://jvdl.me/downloads/nonmem/dummy_nonmem.xpt

library(dplyr)
library(tidyverse)

library(linpk)
library(haven)

rm(list = ls())

# basic structure
stdy <- c(1, 2, 3)
sex <- c(0, 1)
race <- c(1, 2, 3)
ntad <- c(0, 0.5, 1, 2, 4, 8, 12, 16, 24, 48, 72)
# end basic structure

# create dummy foundation with random wt, ht, bmi, age
foundation <- data.frame(stdy) %>% 
    full_join(data.frame(sex), by = character()) %>%
    full_join(data.frame(race), by = character()) %>%
    mutate(
        c = NA,
        id = row_number(),
        wt = runif(row_number(), 70, 100),
        ht = runif(row_number(), 150, 200),
        bmi = wt / ((ht / 100) ^ 2),
        age = runif(row_number(), 30, 60)
    ) %>% 
    mutate_at(
        vars(wt, ht, bmi), list(~ round(., 2))
    ) %>%
    mutate_at(
        vars(age), list(~ trunc(.))
    ) %>%
    full_join(data.frame(ntad), by = character())

# prepare pc
foundation.pc <- foundation %>%
    mutate(
        evid = 0L,
        cmt = 2L,
        mdv = 0, 
        amt = NA
    )

# prepare ex
foundation.ex <- foundation %>%
    filter(ntad == 0) %>%
    mutate(
        evid = 1L,
        cmt = 1L,
        addl = NA,
        ii = NA, 
        mdv = 1,
        amt = case_when(
            stdy == 1 ~ 100L,
            stdy == 2 ~ 200L,
            stdy == 3 ~ 300L,
            TRUE      ~ -99L
        )
    )

# set pc and ex and locf amt as dose
foundation.nonmem <- bind_rows(foundation.pc, foundation.ex) %>%
    arrange(stdy, id, ntad, evid) %>%
    mutate(
        dose = amt,
        tad = case_when(
            ntad == 0 & evid == 0 ~ -0.05,
            TRUE                  ~ ntad
        ),
    ) %>%
    group_by(stdy, id) %>%
    tidyr::fill(dose, .direction = c("downup")) %>%
    ungroup() 

# populate dummy concs with pklin::pkprofile
foundation.nonmem$dv = pkprofile(foundation.nonmem$tad, cl=0.5, vc=11, ka=1.3, dose = list(amt = foundation.nonmem$dose))

#introduce some randomness to pk concs
foundation.nonmem <- foundation.nonmem %>%
    mutate(
        dv = as.numeric(dv) * runif(row_number(), 1.1, 1.25) * (dose / 100),
        dv = ifelse(evid == 0, dv, NA), 
        lndv = case_when(
            dv == 0 ~ 0,
            TRUE    ~ log(dv)
        )
    ) %>%
    mutate_at(vars(dv, lndv), list(~ round(., digits = 2)))

# housekeeping, replace NAs, column names to uppercase, etc.
foundation.nonmem.final <- foundation.nonmem %>% 
    mutate(
        across(everything(), ~replace_na(.x, "."))
    ) %>%
    select(c, stdy, id, tad, ntad, dose, amt, addl, ii, evid, cmt, dv, lndv, mdv, sex, race, age, wt, ht, bmi)

names(foundation.nonmem.final) <- stringr::str_to_upper(names(foundation.nonmem.final))

# export to csv and xpt
setwd("~/R")
write_csv(foundation.nonmem.final, file = "dummy_nonmem.csv")
write_xpt(foundation.nonmem.final, path = "dummy_nonmem.xpt", name = "NM", version = 5)

One-time-pad encryption with R and basic JavaScript

One-time-pad encryption offers unbreakable* security if you can guarantee truly random number generation and security of the generated pad. Implementing it from the perspective of a cryptographic layman is a fun way to learn more about it and could add an extra layer of security to your most sensitive communications.

A format of one-time pad used by the U.S. National Security Agency, code named DIANA. The table on the right is an aid for converting between plain txt and cipher text using the characters at left as the key. The black circle is an artifact of copying, due to a hole punched in the original.
(Image and text copied verbatim from: https://en.wikipedia.org/wiki/File:NSA_DIANA_one_time_pad.tiff)

I started off by trying to mimic the NSA example above. I eventually settled on a 169 character key (mapped as a 13 x 13 grid) as I struggled to encode more characters into the QR codes. It still works fine for short messages and as a proof of concept, it’s more than enough.

The R code is short and simple and is provided below. It starts off by generating 13 x 13 random numbers ranging from 0 to 25, to be mapped as A to Z. It also uses the qrcode library to generate a QR code of the encryption key. This is then exported to a PDF file, to be printed and distributed to both the sender and receiver.

#==========================
# Simple one-time-pad generator
# with QR code generation for encryption key
#
# J. van der Linde (jvdl@jvdl.me)
#
# Updated 30 April 2021
#==========================

rm(list = ls())

library(dplyr)
library(qrcode)
library(grid)
library(gridExtra)

#==========================

setwd("~/R/Projects/onetimepad/")

#export pdf to file
pdf("~/R/Projects/onetimepad/otp.pdf")

#initialise with number of rows, cols, and sheets of paper to generate
dim_row <- 13
dim_col <- 13
total_chars <- dim_row * dim_col
total_sheets <- 3

for (i in 1:total_sheets) {
    print(paste("Producing sheet no.", i, "of", total_sheets))
    otp_num <- floor(runif(total_chars, min = 0, max = 26)) #generate from 0 to 25
    otp_char <- replace(otp_num, TRUE, LETTERS[unlist(otp_num + 1)]) #rewrite numbers 0-25 to A-Z
    otp_string <- paste(unlist(otp_char), collapse = '') #create a continuous string for the QR code
    
    otp_num_matrix <- matrix(otp_num, nrow = dim_row, ncol = dim_col, byrow = TRUE) #create a numeric matrix
    otp_char_matrix <- matrix(otp_char, nrow = dim_row, ncol = dim_col, byrow = TRUE) #create a char matrix
    
    otp_char_df <- data.frame(otp_char_matrix) #create dataframe of char matrix
    rownames(otp_char_df) <- seq(dim_row) #set rownames as 1 ... x
    colnames(otp_char_df) <- seq(dim_col) #set colnames as 1 ... y
    
    #output the table of chars to the PDF
    grid.table(otp_char_df)
    grid.text(paste("SHEET", i, "OF", total_sheets), x = unit(0.5, "npc"), y = unit(0.95, "npc"))
    
    #print QR code of OTP to PDF
    qrcode_gen(otp_string)
    
    if (i != total_sheets){
        grid.newpage()
    }
}

#close pdf device
dev.off()

Below is an excerpt from the PDF output. The idea is that two copies of this document should be printed: one for the sender and one for the receiver. The OTP should be handed over in person to ensure it is not intercepted. Both should store the OTP somewhere safe and destroy this electronic copy.

PDF of generated OTP encryption keys and their corresponding QR codes

Once the need arises to send a message with extra confidentiality, our protagonists, let’s call them Alice and Bob in line with tradition, have to agree on a code to use. They may have prearranged to start at the first sheet and continue sequentially. Or they can decide on a sheet spontaneously; for example, Alice can simply send Bob a text saying, “use sheet 3 today”.

In our scenario, let’s assume Alice needs to meet up with Bob. Perhaps they are concluding a confidential transaction or protesting for democracy in their country. She wishes to send the message, “SEE YOU AT ELEVEN AM IN THE TOWN SQUARE”. Firstly she uses a secure instant messenger like Signal to initiate the conversation with Bob.

Alice: Hi FreedomProtestor123. Use sheet 3 today. Message follows soon.

Bob: Hi DemocracyGirl42, will do.

Current events…

By using modular addition (our code is only for A-Z, numbered 0 to 25) Alice calculates the encrypted message. She adds the first letter of the message to the first letter of the key, modulo 26. She repeats for the 2nd letter of the message and the second letter of the key and so on. She then maps the resulting numbers of the modular addition back to letters.

This can also be done fairly easily with the help of computers. Alice could have used her phone to scan the QR code provided next to the old-fashioned grid and completed the process via app or webpage. Alice must ensure she can trust everything in this chain, from the QR code scanner to not having her internet traffic intercepted.

I link here to a basic HTML/JS example which can either be hosted online or stored offline for use. The code is deliberately kept simple and clean (no CSS) to allow easy verification of the source and to decide whether or not to trust it. You can find my implementation hosted at http://jvdl.me/otp.html ; since it is a simple HTML page any phone or computer can render it.

Using the key on sheet 3

(DKPKAREZTPABSCHIBLTHZZLNARDGRXVGIBGQDPWLESQWKKAVQDKSPHZNMRDBBNJPTWDANLOARNELLMGODWAMQPYACBABRVTYAPDQHVQESOWQJESMMMDLZULCVGTCYXVZYIJBKOODXCXYSAUCAFGNUUPIBXPHLSXPBBLTQTLUO)

and her message “SEE YOU AT ELEVEN AM IN THE TOWN SQUARE”, Alice finally ends up with the following encrypted message:

Alice: VOT IOL ES XAEWWP HU JY MOD SZJN JTAROZ

Bob: Got it, thanks.

Mysterious and cryptic; perhaps even romantic…

Alice can drop the spaces to prevent anything being inferred from word lengths. Bob will still be able to figure out the text. Alice could even make use of a generally lesser-used letter (for example Q or X) instead of a space to split words while still maintaining one continuous message. This avoids highlighting individually encrypted words.

Alice does not need the entire key since her message is short. Bob simply repeats this process with subtraction and modulo instead of addition and modulo. He subtracts the key from the encrypted message to derive the original, decrypted message. He now knows when and where to meet up with Alice.

Alice and Bob are both smart people. They keep all of the following in mind:

  • Alice and Bob ensure that they received the only copies of the pads. Alice generated it and handed it to Bob in person, or vice versa.
  • The message should not be longer than the key (i.e. shouldn’t loop around).
  • They need to guard the pads with their lives and store them securely, especially if being found out could cost them their lives.
  • They need to burn the used sheet after encryption/decryption has occurred.
  • They transmit their encrypted message via another secure platform, such as Signal, if possible.

Thanks for following along! Be safe and secure in your communications. You can easily reproduce the work done here, with free software, if you have a really confidential message you’re trying to get across.

Searching for a string in an entire SAS library

For the most part, experienced SAS programmers know where to look for the source data they need. In the pharmaceutical industry, we are familiar with CDISC standards and data structures.

However, should the data standard be unfamiliar or the source datasets include new or unusual parameters, it may be prudent to have SAS look through the data on your behalf to save time.

We can do this by making use of PROC CONTENTS and SAS macro loops with the forward re-scan rule. If you’re not familiar with the forward re-scan rule, see my other blog post covering the topic.

First, let us define the two input parameters we’ll need: the library we’re delving into, and the character string we’re looking for. These are the only two parameters the end user needs to edit.

/*Search parameters - LIBRARY and text to search for*/
%let lib = %str(sashelp);
%let searchtext = %str(cholesterol);

Next, we’ll find all the datasets in the library specified above by running PROC CONTENTS and packing the results into iterable macro variables. In this case we can allocate a maximum of 999 datasets, but you can increase this value as needed.

/*Find all datasets within the library and allocate into numbered macro variables DS1, DS2, ...*/
proc contents data=&lib.._all_ out=members noprint;
run;

proc sql noprint;
    select distinct memname into :ds1-:ds999 from members;
quit;
%let ds_num = &sqlobs.;

Specify an empty results dataset. As matches are found during iteration, they will be appended to this dataset.

/*Clear the results dataset. As matches are found, they will be appended here.*/
data results;
    set _null_;
run;

Next we will loop over the datasets allocated as DS1, DS2, DS… above. The logical outline of the loop is as follows:

/*The macro logic for looping over each dataset in DS1, DS2, ...*/
%macro loop;
    %do i = 1 %to &ds_num.;
        /*For each dataset in the library, output its metadata contents*/

        /*Check the metadata for matches of the search text. This includes the dataset name, as well as the column names and labels.*/

        /*Check the actual data values for matches of the search text.*/

        /*For the current dataset, set the metadata and row value results.*/

        /*For the output dataset append the results of each loop iteration.*/
    %end;
%mend loop;
%loop;

For each step above, detailed logic follows below:

/*For each dataset in the library, output its metadata contents*/
proc contents data=&lib..&&ds&i out=c noprint;
run;
/*Check the metadata for matches of the search text. This includes the dataset name, as well as the column names and labels.*/
data metadata;
    length msg $500. dataset varname varlabel $100.;
    set c;
            
    /*Dataset name checking*/
    if index(upcase(memname), upcase("&searchtext.")) > 0 then do;
        msg = "Dataset: Found &searchtext.";
        dataset = "&&ds&i";
        varname = "";
        varlabel = "";
        output;
    end;

    /*Column name and label checking*/
    if index(upcase(name), upcase("&searchtext.")) > 0 or index(upcase(label), upcase("&searchtext.")) > 0 then do;
        msg = "Col Meta: Found &searchtext.";
        dataset = "&&ds&i";
        varname = name;
        varlabel = label;
        output;
    end;
            
    keep dataset varname varlabel msg;
            
    /*Get rid of duplicate messages in the case of multiple identical matches.*/
    proc sort nodupkey; by msg dataset varname varlabel;
run;
/*Check the actual data values for matches of the search text.*/
data actualdata;
    length msg $500. dataset varname varlabel varval $100.;
    set &lib..&&ds&i;
            
    /*Create an array of all character variables in the current dataset.*/
    array allChar _CHARACTER_;

    /*Loop over each character column and check for matches of the search text.*/
    do over allChar;
        if index(upcase(allchar), upcase("&searchtext.")) > 0 then do;
            msg = "Value: Found &searchtext.";
            dataset = "&&ds&i";
            varname = vname(allchar);
            varlabel = vlabel(allchar);
            varval = allchar;
            output;
        end;
    end;
    keep dataset varname varlabel varval msg;
            
    /*Get rid of duplicate messages in the case of multiple identical matches.*/
    proc sort nodupkey; by msg dataset varname varlabel varval;
run;
/*For the current dataset, set the metadata and row value results.*/
data allmsgs;
    set metadata actualdata;
    msg = upcase(msg);
run;
/*For the output dataset, OUT, keep appending the results of each loop iteration.*/
data results;
    set 
        results 
        allmsgs
    ;
run;

The steps above conclude the inner logic of the loop. Finally, outside the loop, we can clean up the datasets we no longer need.

/*Clean up*/
proc datasets library=work nolist;
    delete members c metadata actualdata allmsgs;
quit;

We started off with the SASHELP library, looking for the string “cholesterol”. These are the results:

Since we are searching in SASHELP, the results also include values from the SAS helper datasets, i.e., VCOLUMN and VMACRO.

This search function is not case sensitive. Optimise and implement as needed for your particular scenario.