Statistics and Data Analysis
Descriptive Statistics for Healthcare Data Analysis
Using the SAS® System
Rick Andrews, Office of Research, Development, and Information (ORDI)
Centers for Medicare and Medicaid Services (CMS)
ABSTRACT
This presents a method of analyzing Medicare databy accounting for constant changes in how Medicare enrollees elect to receive health care. When performing analysesof thesedata,researchersmust account for the number of beneficiaries for which theAgency isduty-bound to make payments and how that payment is made. The topics include claim aggregation, member months or person months, the calculation ofcost per person, utilization per thousand, and unit cost.These statistics are relevant when predicting the future costs of the program. A general understanding of SAS programming using version 9.1.3 is assumed.
1
CMSSUG-05006-SQLJanuary 10, 2006
Statistics and Data Analysis
INTRODUCTION
The Agencyis“at risk”, that is, obligated to make possible payments for, individuals who choose Fee-For-Service (FFS) in lieu of a Health Maintenance Organization (HMO) plan.In FFS the beneficiary makes almost all health care decisions independently. Participating providers submit the claim on behalf of FFS enrollees to fiscal agents called intermediaries and carriers who act on behalf of CMS. It is important to know how many individuals CMS are “at risk”, how many services they used, and how much those services cost in order to make projections on the future cost of healthcare. For example, in 1966 when the first citizen was eligible for Medicare,19.1 million beneficiaries (CMS, 2006)enrolled and in 2005 over 42.5 million (CMS, 2006) enrolled.The term member months is synonymous with an HMO and person months with traditional FFS Medicare. The two terms are used interchangeably here.
Figure 1: Percent Difference Comparison (shown below) illustrates the impact of using a cost per member statistic for healthcare data analysis. The payments used are intended to illustrate the impact of increases and/or decreases in membership in relation to a payment amount for a given time period. Each of the four examples reflects an increase in payment amount from the previous to current time period (e.g. quarter to quarter, year to year). Although the payments in the examples are forced to increase from period to period, the membership is free to increase or decrease, reflecting reality, from period to period. Take careful note of the percent differences in the payment amounts compared to the percent differences in cost per member. In example 1, the payment increase was 4.9 percent, but the cost per member increase was only 0.9 percent. This is because membership also increased and did so at a rate slightly lower than the payment increase. In example 4, there was only a 1.0 percent increase in payments, but since the membership dropped 3.5 percent, the result was a 4.7 increase in cost per member. As shown in the illustration, it is quite easy to misinterpret the impact of an increase in payments if membership is not also considered.
Previous Year Payment Amount / Current Year Payment Amount / Percent Difference in Payment Amount / Previous Year Eligibility / Current Year Eligibility / Percent Difference Eligibility / Previous Year Cost per Person / Current Year Cost per Person / Percent Difference in Cost per PersonExample 1 / $150,010 / $157,421 / 4.9% / 5,000 / 5,200 / 4.0% / $30.00 / $30.27 / 0.9%
Example 2 / $170,402 / $175,845 / 3.2% / 4,875 / 4,950 / 1.5% / $34.95 / $35.52 / 1.6%
Example 3 / $180,222 / $185,123 / 2.7% / 4,500 / 4,490 / -0.2% / $40.05 / $41.23 / 2.9%
Example 4 / $200,189 / $202,268 / 1.0% / 4,444 / 4,290 / -3.5% / $45.05 / $47.15 / 4.7%
MEMBER or PERSON MONTHS
The code for creating the import of data and ultimately the statistics will now be presented. A common source used to calculate member months at CMS is called the Denominator file. This file contains one record per beneficiary andincludesdata on all Medicare membersentitled and/or enrolled in a given year. The file can be obtained through the Data Extract System (DESY), a description of which is beyond the scope of this paper.
Many variables exist on the Denominatorfile, whichinclude the beneficiary identification number, age, race, and gender. Forpurposes of this paper, the variables imported for demonstration purposes are intended to answer the question:
“For each month in calendar year 2006, how many FFS beneficiaries are enrolled inSupplementary Medical Insurance (Part-B of Medicare)?”
Figure 2: SAS Input Statement for the Denominator Fileshows the import of twelve variables indicating the entitlement of a member, onefor each month of the year,and twelve variables indicating whether the member has chosen an HMO (e.g. Managed Care plan)in lieu of a FFS plan.
NOTE: The input statement assumes the data are stored on an IBM® mainframe; the format of $CHAR1 may need to change if the data are exported to another operating system.
The next step determines if the beneficiary is entitled for Part-B of Medicare. The Denominator data dictionary identifiesthe following as valid values for the entitlement indicators:
0 = NOT ENTITLED
1 = PART A ONLY
2 = PART B ONLY
3 = PART A AND PART B
A = PART A, STATE BUY-IN
B = PART B, STATE BUY-IN
C = PARTS A AND B, STATE BUY-IN
If the ENT_IND1 variable contains a 2, 3, B, or C the beneficiary was entitled to Part-B of Medicare in January of the year in question, in this case 2006.
The next part of the IFstatement determineswhether the beneficiary has elected an HMO plan or is in FFS.
The Denominator data dictionary identifiesthe following as valid values for the HMO indicators:
0 = NOT A MEMBER OF HMO
4 = PREVENTIVE SERVICES DEMONSTRATION(FFS)
MEMBER or PERSON MONTHS (cont.)
The continuation of the IF statement tests to see if the beneficiary is not a member of an HMO (0) or is part of a disease management demonstration (4),
both of which indicate the member has elected FFS and theMEM_MTHnvariable should be incremented by one. This continues for all twelve months of the year. When the last record is read, meaning the END_OF_FILE has been reached; the record is then explicitly output to a SAS data set.
The result is a file containing twelve variables representing the number of FFS beneficiaries entitled for Part-B of Medicare for each month of the year.MEM_MTH1represents January, MEM_MTH2 represents February, and so on.
The data can then be transposed into a format allowing for easier merging of claims data. Figure 5: SAS Code to Transpose Member Month File shows the syntax for turning the one record of twelve variables into twelve records, one for each month. PROC TRANSPOSE is used to output MEM_MTH1 through MEM_MTH12 into rows, or observations, in lieu of variables.
The date, 01JAN2006 for example, is set to the first day of the month in order to match claims data that will be formatted in the same mannerto allow the member months and claims data to be more easily merged. Note the output of the PROC TRANSPOSE contains a variable called _NAME_, whose value is that of the member month variable. When the value is MEM_MTH1,theSVC_MTH variable is set to 01JAN2006 and so on.The result is listed below in Table 2: Member Months.
MEMBER MONTHS (cont.)
The previous example explains the process of creating a member monthdata set from the Denominator file at CMS. The entitlement and HMO indicators are first read into a SAS data set. Then twelve variables are created to represent the count of beneficiaries for each month of the year based on given criteria. Those data are then transposed from twelve columns into twelve observations or rows. The final step is to create a variable representing the first day of the month each record represents.
The process was spelled out in this manner to explain the thought process of creating the member month file. There were almost one-hundred lines of code used in this method, though a shorter technique is almost always available with some additional thought.
Figure 6: Shortened SAS Code to Transpose Member Month File uses less than two dozen lines of code to perform the same outcome. The first enhancement appears within the INPUT statement. The entitlement and HMO indicators share a common prefix, therefore can be grouped together by separating the first and last by a hyphen and surrounding them with parenthesis. Since each set of variables all have the same format the $CHAR1 can also be surrounded by parenthesis and only need be used once.
Arrays are established to cycle over each of the twelve monthly entitlement and HMO indicators. The ENT{I}, HMO{I}, and MEM{I} array variables will resolve to ENT_IND1, HMO_IND1, andMEM_MTH1 respectfully during the first iteration of the DO loop. During the second iteration the ENT{I} variable will resolve to ENT_IND2and so on.
Each time the criteria is the metMEM{I}array variable is incremented by one. When SAS reaches the END_OF_FILE , it will then set the MEM_MTHS variable to that increment and create a variable called SVC_MTH equal to the first day of the month. Note the use of a macro variable called YR, which is set at the beginning of the program and used to create the SVC_MTH variable. In lieu of hard-coding a date when creating the SVC_MTH, functions are used. The TRIM(LEFT(I)) uses the variable I as the month, one for January, two for February; the INPUT function is then used to create a SAS date set to the first day of the month in order match to the claim aggregation.
Both methods produce the same results shownin Table 2: Member Months.
Keep in mind when creating member months to match the cohort of the claims. If the desire is to study the number of FFS beneficiaries in the state of Maryland who have or could potentially have breast cancer, the denominator (e.g. member months) should be limited to individuals residing in or having claims processed in Maryland and who have the greatest potential for breast cancer (e.g. females).
CLAIM IMPORT
A number of mechanisms exist to obtain Medicare claims data. At the root of all claim filesare the National Claims History (NCH). Another form of the NCH are the Standard Analytic Files (SAFs).The SAFs are created six months after the beginning of the year and are updated every quarter thereafter until an eighteen month window is reached. These files, typically referred to as the eighteen month SAFs, are approximately 99% complete. The reason for the cut-off is so the research community can confidently replicate a study exactly as it existed at the time it wasoriginally produced.
One method of obtaining NCH and SAF data is through the DESY system. These data also need to be read into SAS in order to perform analysis. The structure of these files is rather complex as there is a fixed, header portion and many trailer records, some representing claim lines (e.g. separate procedures billed by a physician on one claim). A complete description of the record layout is also beyond the scope of this paper as only the header portion of the claim is used.
Figure 8: SAS Input Statement for NCH Claim is a much shortened version of the NCH claim input statement. It is only capturing the claim through date (e.g. service date), the count of the number of claim lines and the payment amount. In this presentation the number of claim lines will be used as the number of services rendered (e.g. utilization). It may or may not be the actual number performed as the variable containing this information exists within the trailer records that are not being captured here.The count of claim lines is used for demonstration purposes.
NOTE: The input statement assumes data are stored on an IBM® mainframe; the “informats” may need to change if the data are exported to another operating system.
Table 3: Claims Data Set provides an example of the data used in this process. Notice the SVC_MTH variable has been set to the first day of the month in order to more easily merge with the member month file. To create the SVC_MTH, the MONTH, THRU_DT, and zero (0) options were used as arguments of the INTNX function when reading the claims input file
CLAIM AGGREGATION
Many methods of performing the same task such as aggregating claims data can be done using SAS procedures or a DATA STEP. The benefit of using one or the other is ease of use versus efficiency. Often the benefit of writing less programming code can lead to inefficient processing of data, where writing more programming code can often, though not always, lead to less data processing.
PROC SUMMARY
A typical mechanism for aggregating data in SAS is the SUMMARY procedure. It is important to know when using this procedure it is usually more efficient to physicallysort the data. The BY statementcanbereplaced with the CLASS statement to eliminate the need for a PROCSORT, though data will still be sorted behind the scenes in memory,and in many cases will run out of resources if the not pre-sorted.
The VAR statement is used to identify the numeric variables to be summarized. The OUTPUT statement tellsSAS where to send the data. In this case data are output to a SAS data set called MY_LIB.CLAIM_SUMMARY. The SUM=
option on the OUTPUT statement informs SAS which statistic is being requested. (Other statistics available are MEAN, MIN, MAX, among others.) The name of the output variables in the SUM= option corresponds to the order of the variables in the VAR statement. The variables here are output with the same names, though could have been named LINECNT_SUM and PMT_AMT_SUM. This is required when outputting other statistics. For example, if theMIN=option was used in conjunction with SUM=the variables could be namedLINECNT_MIN and PMT_AMT_MIN to identify them as the minimum values as shown below.
OUTPUT OUT = MY_LIB.CLAIM_SUMMARY
SUM = LINECNT_SUM PMT_AMT_SUM
MIN = LINECNT_MIN PMT_AMT_MIN;
PROC SQL
SAS also features the ANSI standard Structured Query Language (SQL) as an option. Again, it is typically more efficient to pre-sort data. It is not necessary to sort when using SQL, though data will still be sorted in memory behind the scenes, which may cause resource issues.
The CREATE TABLE clause in SQL is equalto the OUT= option in the OUTPUT statement ofPROC SUMMARY. The SELECT clause takes the place of the BY and VAR statements and is used to identify the variables to be output. In lieu of a SUM= option, SQL uses the SUM function to tell SAS which variables to aggregate. The FROM clause is used to identify the input data set and the GROUP BY clause informs SAS which variables to cluster together into groups. In this case, the only variable to be grouped isSVC_MTH. In most cases anytime a variable exists in the SELECT statement not being evaluated with an aggregate function like SUM, MIN, or MAX it needs to be included in the GROUP BY clause.
CLAIM AGGREGATION (cont.)
Whether to use the SUMMARY or SQLprocedure is often a personal preference as one may be more efficient than the other in one instance and not another. In both cases data need to be sorted in order to aggregate, though as mentioned, a physical sort does not need to occur, though is often required to prevent resource issues.In almost every case, if at all possible, a program should be written that does not require a sort as rearranging data causes additional input/output (I/O) operations.
DATA STEP
If the number of variables to be summarized is relatively small an array of accumulators can be created that are incremented when particular criteria have been satisfied. Here three array variables are created that will hold the month the service was performed, an accumulation of the claim line count, and the payment amount. The RETAIN statement tells SAS to keep the value summarized from the previous record.
A DO loop cycles over each of the twelve months in question, 2006 in this case. The MONTH function in the IF statement checks to see if the month is equal to the variable I in the DO loop. During the first iteration of the loop, the variable Iis set to one. If the THRU_DT is January the result of the MONTH function will be one and the IF statement will be satisfied. Once satisfied the MNTH{I} array variable is set to the first day of the month, and theLINECNT and PMT_AMTareadded to LINE{I} and PMNT{I} respectively.
Effectively, twelve variables are created to store the value of the service month, twelve variables are created to store the accumulation of the claim line count, and another twelve variables are created to store the accumulation of the payment amount. When the END_OF_FILE is reached, the last record will contain the summary of all twelve months and each will be output to a separate observation, one record for each month of the year.
The code can get much more complicated as the number of variables needed to be “grouped” increases. There is no tried and true method of determining which method to use; PROC SUMMARY, PROC SQL, or the DATA STEP. In order to get an idea of the value of additional coding versus the physical sorting of data, benchmarks should be created using each method on the computer the aggregation will take place. The point of the various examples is there are many ways to accomplish the same task with some being easier to code and others being more efficient.