Learning how to double-set

This little toy program helped us to understand what goes on when doing a double set with a key statement. The program is here , the output here, and the log file here. A commented version follows. Click on the links to view the appropriate program, output, and log files.
  1. Reading in (and printing) the data The original datasets are created by a read in [Program file, Output (Data1, Data2)]. You'll be able to come back to these for easy reference.
  2. Creating the indexes [Program file, Logfile]. The Proc datasets procedure is the way to go for creating and deleting indexes. The command (after modify DATASET;) to delete is simply index delete Name_Of_index. Note that no output is created in the .lst file. Note also that the lookup file (data2) does not need to be indexed for this to work. In fact, it does not even need to be sorted. Try it out yourself by downloading the original SAS program to your home directory, modifying and running it.
  3. First try: [Program file, Output] This kind of set statement will work if the keys are unique on both files, since it goes in and picks out the first observation (in physical sort order at the time the index was created), discarding all other matches. (Compare with the original data files).
  4. Second try: [Program file, Output] Although the merge key is not unique on the Master data, we wondered what would happen if we specified the unique modifier. Nothing different happened here, so we still have to find out what happens otherwise. (Compare with the original data files).
  5. Third try: [Program file, Output] We then tried to get multiple observations per lookup observation. The _iorc_ variable is zero if a match was found, and positive if not. So we thought: read until you find a non-match, at which point you then go to the next lookup in the lookup data set. However, as you can see (Compare with the original data files), this did not work. The reason is that our do while loops gets executed at the time of the first non-match as well, leading to observation 4 in the output dataset. We don't really quite understand why it does not output the other observations. We would have thought (after the fact) that the logic would output all matching observations plus the first non-matching, but that does not seem to be the case. For you to figure out - we just retain that this does not work.
  6. Fourth try: [Program file, Output] The trick is to create an auxiliary variable, continue, which allows us to create a supplementary condition within the do while loop. This is the STRONGLY SUGGESTED way of using these double set statements! (Compare with the original data files)

Addendum by Kevin McKinney (Date: June 30, 2000)

The original double set code developed by Kevin and Lars is suitable for a unique list of key values in the TRANSACTION (lookup, data2) dataset and multiple records for the key variable in the MASTER (data1) dataset. The original double set code is not designed to handle lookups for the following situations.

NOTE: If the MASTER dataset is known to have only one record for each key value then the code in example one is more efficient although the original code will still work.

NOTE: Only the matches are output in these examples. Additional code will be needed to handle non-matches.

  1. The dataset in the first set statement (TRANSACTION, lookup) is not a unique list of key values and the dataset in the second set statement (MASTER) does have a unique set of key values. (The reason we use the double set here is because MASTER is far larger than TRANSACTION) Example: data3. The KEY is SEIN YEAR QUARTER. The key value is repeated for the first two records and again in the fifth record.

    Using the standard double set code SAS will not find a record in the master dataset for the second record in the transaction dataset. This occurs because SAS does not move the pointer from its previous position until it needs to look up a record in the transaction dataset with a different key value. Observation five, like observation two, has the same key value as observation one. A record from the master dataset will be returned for observation five since the pointer has been moved when the double set was issued for observations three and four(they have a different key value).

    In this case we can remove the continue loop from the original double-set code since the observations in the MASTER dataset are unique. We also must add the UNIQUE option to the second set statement. This code is more efficient than the original double set code since only one lookup is performed for each record in the transaction dataset.

    THUS: Use this in all cases when you know the MASTER dataset key is unique.

  2. The dataset in the first set statement (TRANSACTION, lookup) is not a unique list of key values and the dataset in the second set statement (MASTER) does NOT have a unique set of key values.

    Example: Imagine we are looking up a series of EIN's from the work history file in the SSEL. Some of these EIN's will be repeated since individuals work for the same EIN in different years and also because multiple people work at the same firm. The dataset has the same structure as in example one. The KEY (ein) is repeated for observation one, two, and five.

    This case requires resetting the pointer on each iteration of the data step. Unfortunately this requires two additional lookups in the MASTER dataset. Once to know when the end of all the records for that EIN is reached. Once more to reset the pointer.

    Resetting the pointer is accomplished by looking up a key value that does not exist in the MASTER dataset. I have chosen "zzzzzzzzz" for the EIN. This may or may not be appropriate in your application.

Any comments? Email Lars or Kevin.