getCPS is an R package for quickly fetching CPS Basic microdata from the Census Bureau’s API. Working with the API directly can be complicated and tedious, which is the main motivation behind this package.

To get started, make sure that getCPS has been installed properly and can be loaded:

library(getCPS)

If you were able to run the call above without errors, you should be good to go! If you need help installing the package, the installation instructions can be found on the main page of this website.

Pulling Data for a Single State Using get_cps_data_state()

Data for a single state can be pulled with get_cps_data_state() by passing a year range and a variable list as vectors, along with a FIPS state code. In this example we will use my home state of Nevada, whose FIPS state code is "32", and we will pull data on labor force employment status (PEMLR) and age (PRTAGE) for each respondent. You must also pass a Census API key to the function, or have one stored in your .Renviron file as "CENSUS_API_KEY". If you don’t have a Census API key, you can obtain one at: api.census.gov
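Before making a request, it can help to confirm that the key is actually visible to your R session. The following is a minimal sketch in base R, assuming the key is stored under the name CENSUS_API_KEY as described above; it only checks the environment variable and does not call getCPS or the Census API.

```r
# Read the key from the environment; Sys.getenv() returns "" when it is unset.
api_key <- Sys.getenv("CENSUS_API_KEY")

if (!nzchar(api_key)) {
  # A .Renviron entry is a plain NAME=value line; R reads the file at startup.
  message("CENSUS_API_KEY is not set. Add a line like ",
          "CENSUS_API_KEY=your_key_here to ~/.Renviron and restart R.")
} else {
  message("Census API key found (", nchar(api_key), " characters).")
}
```

Because .Renviron is only read when R starts, a newly added key will not show up until the session is restarted.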

Example - Fetching Data for a Single State

cps_data <- get_cps_data_state(year_range = c(2021:2023),
                               variable_list = c("PEMLR", "PRTAGE"),
                               state_code = "32",
                               census_api_key = "YOUR CENSUS API KEY")
head(cps_data, 10)
## [1] "Data not available for oct 2023"
## [1] "Data not available for nov 2023"
## [1] "Data not available for dec 2023"
##    state PEMLR PRTAGE  PWCMPWGT  PWSSWGT       DATE
## 1     32     5     74 1495.8822 1534.770 2021-01-01
## 2     32     5     77 1673.2943 1704.638 2021-01-01
## 3     32     1     29 1762.9311 1763.227 2021-01-01
## 4     32     1     31 2716.6244 2670.363 2021-01-01
## 5     32    -1     12    0.0000 1440.297 2021-01-01
## 6     32    -1      3    0.0000 2040.847 2021-01-01
## 7     32     5     77 1674.9031 1718.445 2021-01-01
## 8     32     5     76 1466.8712 1494.348 2021-01-01
## 9     32     1     32 2161.8659 2152.577 2021-01-01
## 10    32     1     38 2336.8468 2272.478 2021-01-01

This should have returned a data.frame object, along with some warnings about months that could not be retrieved. This guide was written on November 1st, 2023, so CPS microdata for October, November, and December were not yet available, which is why we see the warnings. These warnings can be useful for catching errors. For example, some variables are only available during certain time periods, and trying to retrieve them for a period when they are not available will trigger the same warning.
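One way to check for gaps programmatically is to compare the months you asked for against the months present in the DATE column. The following is a minimal sketch in base R. The returned vector is a stand-in for unique(cps_data$DATE), and the sketch assumes DATE holds the first day of each month, as in the output above.

```r
# Months requested: every month of 2023.
expected <- seq(as.Date("2023-01-01"), as.Date("2023-12-01"), by = "month")

# Stand-in for unique(cps_data$DATE): January through September 2023.
returned <- seq(as.Date("2023-01-01"), as.Date("2023-09-01"), by = "month")

# Keep the requested months that never appear in the returned data.
missing_months <- expected[!expected %in% returned]
print(format(missing_months, "%Y-%m"))
```

With real data, replace returned with unique(cps_data$DATE) and adjust expected to match your year_range.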

Understanding the output:

Our data frame cps_data has six columns: state, PEMLR, PRTAGE, PWCMPWGT, PWSSWGT, and DATE.

  • The state column holds the state FIPS code for the person whose information is recorded.

  • PEMLR was one of the variables we requested. It is a categorical variable: each value is a code that corresponds to a labor force status for the respondent. Later in this guide we will see how to convert these codes to labels, which makes the recorded responses easier to understand.

  • PRTAGE, as mentioned earlier, is the age of the respondent, topcoded in order to protect the respondent’s privacy. For example, someone with an age of 97 would be topcoded at 85.

  • PWCMPWGT and PWSSWGT are statistical weights for each response and can be used to estimate aggregate population totals. Even though the weights were not requested explicitly in our get_cps_data_state() call, the function has built-in logic that automatically pulls the suggested weights for each of the variables requested.

    • If you would like to know the suggested weight for a variable, you can use the suggested_weight() function. For example, to find the suggested weight for PEMLR you could use suggested_weight("PEMLR", year = "2023"), which prints the suggested weight. By default, year is set to “2023”, as suggested weights rarely differ from year to year in the CPS Basic microdata.
## [1] "Suggested weight for PEMLR is PWCMPWGT"
  • DATE is a column that keeps track of the month in which the response was recorded. This column is already of a date type, making it easy to sort, filter, and manipulate data based on specific dates.
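To make the role of the weights and the DATE column concrete, here is a minimal sketch in base R that filters to one month and sums PWCMPWGT (the suggested weight for PEMLR) within each labor force status code. The rows are copied from the head(cps_data, 10) output above. Five rows are far too few for a real estimate, so the numbers only illustrate the mechanics; with real data you would use every row for the month.

```r
# A few rows copied from the head(cps_data, 10) output above (Nevada, January 2021).
sample_rows <- data.frame(
  PEMLR    = c(5, 5, 1, 1, -1),
  PRTAGE   = c(74, 77, 29, 31, 12),
  PWCMPWGT = c(1495.8822, 1673.2943, 1762.9311, 2716.6244, 0.0000),
  DATE     = as.Date("2021-01-01")
)

# DATE is a Date column, so it can be compared directly against another Date.
january <- sample_rows[sample_rows$DATE == as.Date("2021-01-01"), ]

# Sum the weights within each PEMLR code to get a weighted person count per code.
weighted_totals <- aggregate(PWCMPWGT ~ PEMLR, data = january, FUN = sum)
print(weighted_totals)
```

Each row of weighted_totals is the sum of weights for one PEMLR code, which is how a weighted count of people in that category would be built up from individual responses.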

Using get_cps_data_all_states() to Obtain Data for All States

In some cases, you might be interested in pulling CPS data for all states. This can be achieved using the get_cps_data_all_states() function. Similar to get_cps_data_state(), you need to pass the year range and the variables you are interested in.

If you want to filter the data to include only specific states, you can use the state_filter argument, passing it as a character vector of state FIPS codes. If you leave state_filter as FALSE, the function will return data for all states.

Remember to provide your Census API key if you haven’t set it in your .Renviron file.

Example - Fetching Data for All States

The following example demonstrates how to fetch data for the years 2022 and 2023, including variables related to the respondent’s sex (PESEX) and age (PRTAGE). We will retrieve data for all states without filtering.

all_states_data <- get_cps_data_all_states(year_range = c(2022, 2023),
                                           variable_list = c("PESEX", "PRTAGE"),
                                           census_api_key = "YOUR CENSUS API KEY")
head(all_states_data)

The output will be a data.frame with the requested data. If you wish to fetch data only for specific states, you can specify them as follows:

specific_states_data <- get_cps_data_all_states(year_range = c(2022, 2023),
                                                variable_list = c("PESEX", "PRTAGE"),
                                                state_filter = c("32", "06"), # Example for Nevada and California
                                                census_api_key = "YOUR CENSUS API KEY")
head(specific_states_data)

The output will be similar to that generated by get_cps_data_state(), with the only difference being that data will be returned for multiple states rather than just one. Keep in mind that data covering all states can be very large and may require considerable computational resources, depending on the volume of data and the capacity of your system. If you encounter errors due to memory allocation, breaking up the data retrieval into chunks is recommended.
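One possible way to chunk a large pull is to fetch one year at a time, shrink each piece, and only then combine the results. The sketch below is base R and does not call getCPS: fetch_in_chunks() and fake_fetch() are hypothetical names introduced here for illustration. In real use, the fetcher would wrap a call such as get_cps_data_all_states(), assuming the function accepts a single year in year_range.

```r
# Hypothetical helper (not part of getCPS): fetch one year at a time, apply a
# processing step to each piece, and row-bind the processed pieces.
fetch_in_chunks <- function(years, fetch_fun, process_fun = identity) {
  pieces <- lapply(years, function(yr) process_fun(fetch_fun(yr)))
  do.call(rbind, pieces)
}

# Stand-in fetcher so the sketch runs without an API key.
fake_fetch <- function(yr) {
  data.frame(year = yr, PESEX = c(1, 2), PRTAGE = c(12, 45))
}

# Example processing step: keep only respondents aged 16 and over,
# so each chunk is smaller before the pieces are combined.
keep_adults <- function(chunk) chunk[chunk$PRTAGE >= 16, ]

combined <- fetch_in_chunks(2022:2023, fake_fetch, keep_adults)
print(combined)
```

The memory saving comes from the processing step: filtering or summarizing each year before combining means the full unfiltered data set never has to sit in memory at once.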

Obtaining Labels for Coded Variables

As we mentioned earlier in this guide, some variables, such as PEMLR, are categorical. When we run either get_cps_data_state() or get_cps_data_all_states(), these variables are retrieved as codes, where each code represents a different response category. Within getCPS there are two functions that can help us match the categorical variable codes to their corresponding labels.

Example - Retrieving variable labels using get_labels()

If we wanted to retrieve the variable labels for PEMLR in 2023 we could use get_labels() in the following manner:

get_labels("PEMLR", year_range = "2023")
## year_range argument defaulting to 2023
##   code                       label
## 1   -1             Not in Universe
## 2    1            Employed-At Work
## 3    2             Employed-Absent
## 4    3        Unemployed-On Layoff
## 5    4          Unemployed-Looking
## 6    5  Retired-Not In Labor Force
## 7    6 Disabled-Not In Labor Force
## 8    7    Other-Not In Labor Force

If a single year and variable are provided, the function returns a data frame with the corresponding code and label matches. For multiple years or variables, it returns a nested list of data frames. By default year_range is set to “2023”.

Example - Retrieving variable labels for multiple years using get_labels():

When using multiple variables or years in the year_range argument, each data frame can be accessed by using either [[ ]] or the $ operator, as in the following two examples:

labels_mult_years <- get_labels("PERRP", year_range = c(2019:2020))
labels_mult_years[["PERRP"]][["2019"]]
##    code                                  label
## 1     1    Ref Pers with other relativew in HH
## 2    10   Non-rel of ref. per w/own rels in HH
## 3    11                               Not used
## 4    12 NON-REL OF REF PER W/NO OWN RELS IN HH
## 5    13    Unmarried partner w/ own rels in HH
## 6    14    Unmar. partner w/ no own rels in HH
## 7    15   Housemate/roommate w/ own rels in HH
## 8    16  Hsemate/roommate w/ no own rels in HH
## 9    17       Roomer/boarder w/ own rels in HH
## 10   18      Roomer/brder w/ no own rels in HH
## 11    2 REF PERS WITH NO OTHER RELATIVES IN HH
## 12    3                                 SPOUSE
## 13    4                                  CHILD
## 14    5                             GRANDCHILD
## 15    6                                 PARENT
## 16    7                         BROTHER/SISTER
## 17    8                         OTHER RELATIVE
## 18    9                           FOSTER CHILD
labels_mult_years$PERRP$`2020`
##    code                                                   label
## 1    40                         Reference Person with Relatives
## 2    41                      Reference Person without Relatives
## 3    42                                     Opposite Sex Spouse
## 4    43           Opposite Sex Unmarried Partner with Relatives
## 5    44        Opposite Sex Unmarried Partner without Relatives
## 6    45                                         Same Sex Spouse
## 7    46               Same Sex Unmarried Partner with Relatives
## 8    47            Same Sex Unmarried Partner without Relatives
## 9    48                                                   Child
## 10   49                                              Grandchild
## 11   50                                                  Parent
## 12   51                                          Brother/Sister
## 13   52                      Other relative of Reference Person
## 14   53                                            Foster Child
## 15   54                       Housemate/Roommate with Relatives
## 16   55                    Housemate/Roommate without Relatives
## 17   56                           Roomer/Boarder with Relatives
## 18   57                        Roomer/Boarder without Relatives
## 19   58    Other Nonrelative of Reference Person with Relatives
## 20   59 Other Nonrelative of Reference Person without Relatives
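The two tables above show that the coding scheme for a variable can change between years. To compare years side by side, the per-year data frames can be stacked into a single table. The sketch below is base R and uses a small mock of the nested list, with a few rows copied from the output above; it assumes the structure is variable, then year, then a data frame with code and label columns, as the examples suggest.

```r
# Mock of the nested list structure, with a few rows copied from the output above.
labels_mock <- list(
  PERRP = list(
    `2019` = data.frame(code = c("3", "4"), label = c("SPOUSE", "CHILD")),
    `2020` = data.frame(code = c("42", "48"),
                        label = c("Opposite Sex Spouse", "Child"))
  )
)

# Add a year column to each per-year table, then stack them into one data frame.
per_year <- labels_mock[["PERRP"]]
stacked <- do.call(rbind, lapply(names(per_year), function(yr) {
  cbind(year = yr, per_year[[yr]])
}))
print(stacked)
```

With the real object, replacing labels_mock with labels_mult_years should give one long table in which the change in codes between 2019 and 2020 is easy to see.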

Automatic Labeling of Retrieved CPS Data Frames

Aside from retrieving the labels for categorical variables, it is also possible to automatically label the output of either get_cps_data_state() or get_cps_data_all_states(). This can be done by passing the output of either function into the label_data() function.

label_data(cps_data)
## Downloading Labels from JSON files
## Labels obtained
## Labeling Data...
## Labeling Complete!
##        state PRTAGE  PWCMPWGT  PWSSWGT       DATE code_PEMLR
##     1:    32     74 1495.8822 1534.770 2021-01-01          5
##     2:    32     77 1673.2943 1704.638 2021-01-01          5
##     3:    32     29 1762.9311 1763.227 2021-01-01          1
##     4:    32     31 2716.6244 2670.363 2021-01-01          1
##     5:    32     12    0.0000 1440.297 2021-01-01         -1
##    ---                                                      
## 45325:    32     11    0.0000 1885.300 2023-09-01         -1
## 45326:    32     12    0.0000 1679.196 2023-09-01         -1
## 45327:    32     14    0.0000 1708.978 2023-09-01         -1
## 45328:    32     14    0.0000 1708.978 2023-09-01         -1
## 45329:    32     15    0.0000 1996.577 2023-09-01          7
##                       label_PEMLR
##     1: Retired-Not In Labor Force
##     2: Retired-Not In Labor Force
##     3:           Employed-At Work
##     4:           Employed-At Work
##     5:            Not in Universe
##    ---                           
## 45325:            Not in Universe
## 45326:            Not in Universe
## 45327:            Not in Universe
## 45328:            Not in Universe
## 45329:   Other-Not In Labor Force

label_data() will try to automatically detect categorical variables and create two new columns for each, code_{variable} and label_{variable}, holding the original variable code and the matching label for that code. In the above example we can see that PEMLR was replaced by code_PEMLR and label_PEMLR.
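For a sense of what this labeling amounts to for a single variable, here is a minimal sketch in base R that attaches labels by hand using match(). It is not the package's implementation, only an illustration of a code-to-label lookup. The code/label pairs are copied from the get_labels("PEMLR") output earlier in this guide, and both sides are converted to character in case the codes are stored as strings.

```r
# Code/label pairs copied from the get_labels("PEMLR") output earlier in this guide.
pemlr_labels <- data.frame(
  code  = c(-1, 1, 2, 3, 4, 5, 6, 7),
  label = c("Not in Universe", "Employed-At Work", "Employed-Absent",
            "Unemployed-On Layoff", "Unemployed-Looking",
            "Retired-Not In Labor Force", "Disabled-Not In Labor Force",
            "Other-Not In Labor Force")
)

# A few PEMLR values as they appear in cps_data.
responses <- data.frame(code_PEMLR = c(5, 1, -1, 7))

# match() returns the row of pemlr_labels for each code; unmatched codes give NA.
idx <- match(as.character(responses$code_PEMLR), as.character(pemlr_labels$code))
responses$label_PEMLR <- pemlr_labels$label[idx]
print(responses)
```

A code that has no entry in the lookup table ends up with an NA label, which is a quick way to spot codes that belong to a different year's coding scheme.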