Learning how to double-set
This little toy program helped us to understand what goes on when doing a double set with a key statement.
The program is here , the output here,
and the log file here. A commented version follows. Click on the links to
view the appropriate program, output, and log files.
- Reading in (and printing) the data
The original datasets are created by a read in [Program file,
You'll be able to come back to these for easy reference.
- Creating the indexes [Program file,
Logfile]. The Proc datasets procedure is the way to
go for creating and deleting indexes. The command (after modify DATASET;) to delete is simply index delete Name_Of_index.
Note that no output is created in the .lst file. Note also that the lookup file (data2) does not need to be indexed for this
to work. In fact, it does not even need to be sorted. Try it out yourself by downloading the
original SAS program to your home directory, modifying and running it.
- First try: [Program file,
This kind of set statement will work if the keys are unique on both files, since it goes in and picks out the first
observation (in physical sort order at the time the index was created), discarding all other matches.
(Compare with the original data files).
- Second try: [Program file,
Although the merge key is not unique on the Master data, we wondered what would happen if we specified the unique modifier.
Nothing different happened here, so we still have to find out what happens otherwise.
(Compare with the original data files).
- Third try: [Program file,
We then tried to get multiple observations per lookup observation. The _iorc_ variable is zero if a match was found,
and positive if not. So we thought: read until you find a non-match, at which point you then go to the next lookup in the
lookup data set. However, as you can see
(Compare with the original data files), this did not work.
The reason is that our do while loops gets executed at the time of the first non-match as well, leading to observation
4 in the output dataset. We don't really quite understand why it does not output the other observations. We would have thought
(after the fact) that the logic would output all matching observations plus the first non-matching, but that does not seem to
be the case. For you to figure out - we just retain that this does not work.
- Fourth try: [Program file,
The trick is to create an auxiliary variable, continue, which allows us to create a supplementary condition
within the do while loop. This is the STRONGLY SUGGESTED way of using these double set statements!
(Compare with the original data files)
Addendum by Kevin McKinney (Date: June 30, 2000)
The original double set code developed by Kevin and Lars is suitable
for a unique list of key values in the TRANSACTION (lookup, data2) dataset and
multiple records for the key variable in the MASTER (data1) dataset. The original double set code is not designed to handle lookups for the following
NOTE: If the MASTER dataset is known to have only one record for each
key value then the code in example one is more efficient although the
original code will still work.
NOTE: Only the matches are output in these examples. Additional code will be
needed to handle non-matches.
- The dataset in the first set statement (TRANSACTION, lookup) is not a unique
of key values and the dataset in the second set statement (MASTER) does
have a unique set of key values. (The reason we use the double set here is
because MASTER is far larger than TRANSACTION) Example:
data3. The KEY is
SEIN YEAR QUARTER. The key value is repeated for the first two records and
again in the fifth record.
Using the standard double set code SAS will not find a record in the master
dataset for the second record in the transaction dataset. This occurs
because SAS does not move the pointer from its previous position until it
needs to look up a record in the transaction dataset with a different key
value. Observation five, like observation two, has the same key value as
observation one. A record from the master dataset will be returned for
observation five since the pointer has been moved when the double set
was issued for observations three and four(they have a different key value).
In this case we can remove the continue loop from the
original double-set code since the observations in the
MASTER dataset are unique. We also must add the UNIQUE option to the
second set statement. This code is more efficient than the original
double set code since only one lookup is performed for each record in
the transaction dataset.
THUS: Use this in all cases when you know the
MASTER dataset key is unique.
- The dataset in the first set statement (TRANSACTION, lookup) is not a unique list
of key values and the dataset in the second set statement (MASTER) does NOT
have a unique set of key values.
Imagine we are looking up a series of EIN's from the work
history file in the SSEL. Some of these EIN's will be repeated since
individuals work for the same EIN in different years and also because
multiple people work at the same firm. The dataset has the same structure as in example one. The KEY (ein) is repeated for observation one, two, and five.
This case requires resetting the pointer on each iteration of the data step.
Unfortunately this requires two additional lookups in the MASTER dataset. Once
to know when the end of all the records for that EIN is reached. Once more
to reset the pointer.
Resetting the pointer is accomplished by looking up a key value that does not
exist in the MASTER dataset. I have chosen "zzzzzzzzz" for the EIN. This may
or may not be appropriate in your application.
Any comments? Email Lars or