getCPS.Rmd
getCPS is a package for quickly fetching CPS Basic microdata from the Census Bureau’s API. Working with the API directly can be complicated and tedious, which is the main motivation behind this package.
To get started, we need to make sure that getCPS has been installed properly and that it can be loaded:
library(getCPS)
If you were able to run the call above without errors, you should be good to go! If you need help installing the package, the installation instructions can be found on the main page of this website.
get_cps_data_state()
Data for a single state can be pulled by using get_cps_data_state() and passing a year range and variable list as vectors, along with a FIPS state code. In this example we will use my home state of Nevada, whose FIPS state code is ‘32’, and we will pull data on labor force status (PEMLR) and age (PRTAGE) for each respondent. You must also pass a Census API key to the function, or have one stored in your .Renviron file as "CENSUS_API_KEY". If you don’t have a Census API key, you can obtain one at: api.census.gov
cps_data<-get_cps_data_state(year_range = c(2021:2023),
variable_list = c('PEMLR', 'PRTAGE'),
state_code = "32",
census_api_key = "YOUR CENSUS API KEY")
head(cps_data,10)
## [1] "Data not available for oct 2023"
## [1] "Data not available for nov 2023"
## [1] "Data not available for dec 2023"
## state PEMLR PRTAGE PWCMPWGT PWSSWGT DATE
## 1 32 5 74 1495.8822 1534.770 2021-01-01
## 2 32 5 77 1673.2943 1704.638 2021-01-01
## 3 32 1 29 1762.9311 1763.227 2021-01-01
## 4 32 1 31 2716.6244 2670.363 2021-01-01
## 5 32 -1 12 0.0000 1440.297 2021-01-01
## 6 32 -1 3 0.0000 2040.847 2021-01-01
## 7 32 5 77 1674.9031 1718.445 2021-01-01
## 8 32 5 76 1466.8712 1494.348 2021-01-01
## 9 32 1 32 2161.8659 2152.577 2021-01-01
## 10 32 1 38 2336.8468 2272.478 2021-01-01
This should have returned a data.frame object, along with some warnings about months that could not be retrieved. This guide was written on November 1st, 2023, so CPS microdata for October, November, and December of 2023 were not yet available, which is why we see the warnings. The warnings can be useful in catching errors. For example, some variables are only available during certain time periods, and trying to retrieve a variable for a period when it is not available will trigger this warning.
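As noted earlier, the key can be stored in .Renviron as CENSUS_API_KEY instead of being passed on every call. The following is a minimal sketch for checking whether that variable is visible to the current session. has_census_key() is a hypothetical helper written for this guide, not part of getCPS, and it relies only on base R's Sys.getenv().

```r
# Hypothetical helper (not part of getCPS): TRUE when the named environment
# variable is set to a non-empty value in the current R session.
has_census_key <- function(var = "CENSUS_API_KEY") {
  nzchar(Sys.getenv(var))
}

if (!has_census_key()) {
  message("CENSUS_API_KEY is not set; add it to .Renviron and restart R, ",
          "or pass census_api_key explicitly.")
}
```

R reads .Renviron at startup, so a line such as CENSUS_API_KEY=your_key only takes effect after the session is restarted.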
Our data frame cps_data has six columns: state, PEMLR, PRTAGE, PWCMPWGT, PWSSWGT, and DATE.
The state column contains the state FIPS code of the person whose information is recorded.
PEMLR was one of the variables we requested. It is a categorical variable in which each value is a code for the respondent’s labor force status. Later in this guide we will see how to convert these codes to labels, which makes the recorded responses easier to understand.
PRTAGE, as mentioned earlier, is the age of the respondent, topcoded to protect the respondent’s privacy. For example, someone with an age of 97 would be topcoded at 85.
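To make the idea of topcoding concrete, here is a small base R illustration that caps ages at 85, the value used in the example above. It only demonstrates the concept on made-up ages. It is not the Census Bureau's actual procedure, and the data returned by getCPS are already topcoded.

```r
# Made-up ages, for illustration only.
true_ages <- c(29, 74, 85, 97)

# pmin() takes the element-wise minimum, so any age above the cap
# is replaced by the cap and all other ages are left unchanged.
topcoded_ages <- pmin(true_ages, 85)
topcoded_ages
```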
PWCMPWGT and PWSSWGT are statistical weights for each response and can be used to estimate aggregate population totals. Even though the weights were not requested explicitly in our get_cps_data_state() call, the function has built-in logic that automatically pulls the suggested weights for each of the variables requested. The suggested weight for a variable can be looked up with the suggested_weight() function. For example, if you wanted to know the suggested weight for PEMLR you could use suggested_weight("PEMLR", year = "2023"). This will print the suggested weight for PEMLR. By default year is set to “2023”, as suggested weights rarely differ from year to year in the CPS Basic microdata.
suggested_weight("PEMLR")
## [1] "Suggested weight for PEMLR is PWCMPWGT"
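As mentioned above, the weights can be used to estimate aggregate population totals. The sketch below sums PWCMPWGT over the first five rows of the earlier output for respondents coded as employed (PEMLR codes 1 and 2). weighted_total() is a hypothetical helper, not a getCPS function, and five rows are far too few for a real estimate; the point is only the mechanics.

```r
# First five rows of the earlier output, typed in by hand.
sample_rows <- data.frame(
  PEMLR    = c(5, 5, 1, 1, -1),
  PWCMPWGT = c(1495.8822, 1673.2943, 1762.9311, 2716.6244, 0)
)

# Hypothetical helper (not part of getCPS): sum the weight column over the
# rows whose value in `var` is one of `codes`.
weighted_total <- function(data, var, codes, weight) {
  sum(data[[weight]][data[[var]] %in% codes])
}

# Weighted count of respondents coded as employed (codes 1 and 2).
employed <- weighted_total(sample_rows, "PEMLR", c(1, 2), "PWCMPWGT")
employed
```

With data covering several months, the total should be computed separately for each value of DATE, since the weights apply to each monthly sample and summing across months would count the population more than once.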
DATE is just a column that keeps track of the month in which the response was recorded. This column is already of data type
get_cps_data_all_states() to Obtain Data for All States
In some cases, you might be interested in pulling CPS data for all states. This can be achieved using the get_cps_data_all_states() function. As with get_cps_data_state(), you need to pass the year range and the variables you are interested in.
If you want to limit the data to specific states, you can use the state_filter argument, passing it a character vector of state FIPS codes. If you leave state_filter as FALSE, the function will return data for all states.
Remember to provide your Census API key if you haven’t set it in your .Renviron file.
The following example demonstrates how to fetch data for the years 2022 and 2023, including variables for the respondent’s sex (PESEX) and age (PRTAGE). We will retrieve data for all states without filtering.
all_states_data <- get_cps_data_all_states(year_range = c(2022, 2023),
variable_list = c("PESEX", "PRTAGE"),
census_api_key = "YOUR CENSUS API KEY")
head(all_states_data)
The output will be a data.frame with the requested data. If you wish to fetch data only for specific states, you can specify them as follows:
specific_states_data <- get_cps_data_all_states(year_range = c(2022, 2023),
variable_list = c("PESEX", "PRTAGE"),
state_filter = c("32", "06"), # Example for Nevada and California
census_api_key = "YOUR CENSUS API KEY")
head(specific_states_data)
The output will be similar to that generated by get_cps_data_state(), with the only difference being that data will be returned for all states and not just one. Keep in mind that data covering all states can be large and might require considerable memory, depending on the volume of data and the capacity of your system. If you encounter memory allocation errors, breaking up the data retrieval into chunks is recommended.
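One way to break a retrieval into chunks is to request one year at a time and combine the pieces afterwards. The sketch below is a generic base R pattern under stated assumptions: fetch_in_chunks() is a hypothetical helper, not part of getCPS, and it assumes the fetching function accepts a single year as year_range and returns data frames with identical columns. A stand-in function is used so the example runs without an API key.

```r
# Hypothetical helper (not part of getCPS): call fetch_fun once per year and
# stack the resulting data frames with rbind().
fetch_in_chunks <- function(years, fetch_fun, ...) {
  chunks <- lapply(years, function(yr) fetch_fun(year_range = yr, ...))
  do.call(rbind, chunks)
}

# Stand-in for a real fetching function, used only to show the mechanics.
fake_fetch <- function(year_range, variable_list) {
  data.frame(year = year_range, variable = variable_list)
}

combined <- fetch_in_chunks(2022:2023, fake_fetch,
                            variable_list = c("PESEX", "PRTAGE"))
nrow(combined)
```

In real use, fake_fetch would be replaced by get_cps_data_all_states, with census_api_key and any other arguments passed through the dots. Stacking the chunks still holds the full result in memory, so if memory remains the bottleneck, each chunk could be summarized or written to disk before the next one is fetched.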
As mentioned earlier in this guide, some variables, such as PEMLR, are categorical. When we run either get_cps_data_state() or get_cps_data_all_states(), these are retrieved as codes, where each code represents a different response category. getCPS has two functions that can help us match the categorical variable codes to their corresponding labels.
get_labels()
If we wanted to retrieve the variable labels for PEMLR in 2023, we could use get_labels() in the following manner:
get_labels("PEMLR", year_range = "2023")
## year_range argument defaulting to 2023
## code label
## 1 -1 Not in Universe
## 2 1 Employed-At Work
## 3 2 Employed-Absent
## 4 3 Unemployed-On Layoff
## 5 4 Unemployed-Looking
## 6 5 Retired-Not In Labor Force
## 7 6 Disabled-Not In Labor Force
## 8 7 Other-Not In Labor Force
If a single year and variable are provided, the function returns a data frame with the corresponding code and label matches. For multiple years or variables, it returns a nested list of data frames. By default year_range is set to “2023”.
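A code and label table like the one above can also be joined to data by hand. The sketch below rebuilds that table as a plain data frame and uses base R's match() to look up a label for each code. Both sides are converted to character because the column types returned by getCPS are not stated in this guide, so treating them as text is an assumption.

```r
# The PEMLR code/label table shown above, typed in by hand.
pemlr_labels <- data.frame(
  code  = c("-1", "1", "2", "3", "4", "5", "6", "7"),
  label = c("Not in Universe", "Employed-At Work", "Employed-Absent",
            "Unemployed-On Layoff", "Unemployed-Looking",
            "Retired-Not In Labor Force", "Disabled-Not In Labor Force",
            "Other-Not In Labor Force"),
  stringsAsFactors = FALSE
)

# PEMLR codes from the first five rows of cps_data shown earlier.
codes <- c(5, 5, 1, 1, -1)

# match() returns, for each code, its row position in the label table.
labels <- pemlr_labels$label[match(as.character(codes), pemlr_labels$code)]
labels
```

Any code that is missing from the table comes back as NA, which makes unmatched values easy to spot.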
get_labels():
When passing multiple variables, or multiple years in the year_range argument, each data frame can be accessed with either [[ ]] or the $ operator, as in the following two examples:
labels_mult_years <- get_labels("PERRP", year_range = c(2019:2020))
labels_mult_years[["PERRP"]][["2019"]]
## code label
## 1 1 Ref Pers with other relativew in HH
## 2 10 Non-rel of ref. per w/own rels in HH
## 3 11 Not used
## 4 12 NON-REL OF REF PER W/NO OWN RELS IN HH
## 5 13 Unmarried partner w/ own rels in HH
## 6 14 Unmar. partner w/ no own rels in HH
## 7 15 Housemate/roommate w/ own rels in HH
## 8 16 Hsemate/roommate w/ no own rels in HH
## 9 17 Roomer/boarder w/ own rels in HH
## 10 18 Roomer/brder w/ no own rels in HH
## 11 2 REF PERS WITH NO OTHER RELATIVES IN HH
## 12 3 SPOUSE
## 13 4 CHILD
## 14 5 GRANDCHILD
## 15 6 PARENT
## 16 7 BROTHER/SISTER
## 17 8 OTHER RELATIVE
## 18 9 FOSTER CHILD
labels_mult_years$PERRP$`2020`
## code label
## 1 40 Reference Person with Relatives
## 2 41 Reference Person without Relatives
## 3 42 Opposite Sex Spouse
## 4 43 Opposite Sex Unmarried Partner with Relatives
## 5 44 Opposite Sex Unmarried Partner without Relatives
## 6 45 Same Sex Spouse
## 7 46 Same Sex Unmarried Partner with Relatives
## 8 47 Same Sex Unmarried Partner without Relatives
## 9 48 Child
## 10 49 Grandchild
## 11 50 Parent
## 12 51 Brother/Sister
## 13 52 Other relative of Reference Person
## 14 53 Foster Child
## 15 54 Housemate/Roommate with Relatives
## 16 55 Housemate/Roommate without Relatives
## 17 56 Roomer/Boarder with Relatives
## 18 57 Roomer/Boarder without Relatives
## 19 58 Other Nonrelative of Reference Person with Relatives
## 20 59 Other Nonrelative of Reference Person without Relatives
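The two tables above share no codes: the 2019 codes run from 1 to 18, while the 2020 codes run from 40 to 59. This is why labels are looked up per year. The sketch below shows one way to check for overlap between any pair of years. It uses a hand-built nested list holding only a few of the rows above, shaped like the get_labels() output shown, so it runs without the package. Storing the codes as character is an assumption.

```r
# Hand-built stand-in for the nested list returned above (subset of rows only).
labels_mult_years <- list(
  PERRP = list(
    `2019` = data.frame(code  = c("3", "4", "5"),
                        label = c("SPOUSE", "CHILD", "GRANDCHILD")),
    `2020` = data.frame(code  = c("42", "48", "49"),
                        label = c("Opposite Sex Spouse", "Child", "Grandchild"))
  )
)

# Both access styles reach the same kind of element.
codes_2019 <- labels_mult_years[["PERRP"]][["2019"]]$code
codes_2020 <- labels_mult_years$PERRP$`2020`$code

# Codes that appear in both years; an empty result means the coding scheme changed.
shared_codes <- intersect(codes_2019, codes_2020)
length(shared_codes)
```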
Aside from retrieving the labels for categorical variables, it is also possible to automatically label the output of either get_cps_data_state() or get_cps_data_all_states(). This can be done by passing the output of either function into the label_data() function.
label_data(cps_data)
## Downloading Labels from JSON files
## Labels obtained
## Labeling Data...
## Labeling Complete!
## state PRTAGE PWCMPWGT PWSSWGT DATE code_PEMLR
## 1: 32 74 1495.8822 1534.770 2021-01-01 5
## 2: 32 77 1673.2943 1704.638 2021-01-01 5
## 3: 32 29 1762.9311 1763.227 2021-01-01 1
## 4: 32 31 2716.6244 2670.363 2021-01-01 1
## 5: 32 12 0.0000 1440.297 2021-01-01 -1
## ---
## 45325: 32 11 0.0000 1885.300 2023-09-01 -1
## 45326: 32 12 0.0000 1679.196 2023-09-01 -1
## 45327: 32 14 0.0000 1708.978 2023-09-01 -1
## 45328: 32 14 0.0000 1708.978 2023-09-01 -1
## 45329: 32 15 0.0000 1996.577 2023-09-01 7
## label_PEMLR
## 1: Retired-Not In Labor Force
## 2: Retired-Not In Labor Force
## 3: Employed-At Work
## 4: Employed-At Work
## 5: Not in Universe
## ---
## 45325: Not in Universe
## 45326: Not in Universe
## 45327: Not in Universe
## 45328: Not in Universe
## 45329: Other-Not In Labor Force
label_data() will try to automatically detect categorical variables and create two new columns, code_{variable} and label_{variable}, corresponding to the original variable code and the matching label for that code. In the above example we can see that PEMLR got replaced by code_PEMLR and
label_PEMLR