Data sources#
On this page you can find a description of common data sources and their location on the cluster.
Health data#
The following contains the description of the original/raw CMS data.
MBSF and MedPar
data_source |
|
description |
MedPar includes hospitalizations for fee-for-service (FFS) individuals (1999-2018). MBSF, or the enrollment file, also has mortality for everyone (1999-2018). |
fasse_location |
Append |
size |
733 GB |
files |
Medicare data folder tree
├── 1999
│ ├── denominator
│ └── inpatient
├── 2000
│ ├── denominator
│ └── inpatient
├── 2001
│ ├── denominator
│ └── inpatient
├── 2002
│ ├── denominator
│ └── inpatient
├── 2003
│ ├── denominator
│ └── inpatient
├── 2004
│ ├── denominator
│ └── inpatient
├── 2005
│ ├── denominator
│ └── inpatient
├── 2006
│ ├── denominator
│ └── inpatient
├── 2007
│ ├── denominator
│ └── inpatient
├── 2008
│ ├── denominator
│ └── inpatient
├── 2009
│ ├── denominator
│ └── inpatient
├── 2010
│ ├── denominator
│ └── inpatient
├── 4334
│ ├── 2011
│ ├── 2012
│ ├── 2015
│ └── Extract File Documentation
├── 4580
│ ├── 2013
├── 5819
│ ├── 2014
│ └── Extract File Documentation
├── 7087
│ ├── 2015
│ └── Extract File Documentation
├── 8183
│ ├── 2016
│ └── Extract File Documentation
├── 10411
│ └── 2017
├── 2018
│ └── extract_file_documentation
├── Medicare Claims
├── Medicare Enrollment
└── Xwalk
MCBS
See also
Check out the following resources about the CMS health data:
Exposure data#
The following is the description of the air pollution exposure data.
ZIP code-level PM2.5, PM2.5 components, Ozone, and NO2 in the contiguous US#
PM2.5, Ozone, NO2
dataset_author |
Yaguang Wei |
date_created |
Oct 19, 2022 |
data_source |
Gridded PM2.5, PM2.5 components, ozone, and NO2; Esri ZIP code area and point files; U.S. ZIP code database. |
spatial_coverage |
US |
spatial_resolution |
zipcode |
temporal_coverage |
2000-2016 for PM2.5, ozone, and NO2; 2000-2019 for PM2.5 components. |
temporal_resolution |
daily, annually |
description |
For general ZIP Codes with a polygon representation, we estimated their pollution levels by averaging the predictions of grid cells whose centroids lie inside the polygon of that ZIP Code; For other ZIP Codes such as Post Offices or large volume single customers, we treated them as a single point and predicted their pollution levels by assigning the predictions of the nearest grid cell. These are updated ZIP code-level predictions. We filled in the missing values for grids, and added about 200 zip codes that are missing in the Esri files each year. The geographic information for the additional zip codes is extracted from US ZIP code database. Version 2 update: The v2 files ( |
git_repository |
ZIP_add_missing and private ZIP_add_missing |
publication |
The data are officially published through NASA SEDAC at sedac.ciesin.columbia.edu. |
fasse_location |
Add |
files |
├── [2.3G] NO2
│ ├── [6.5M] Annual
│ │ ├── [394K] 2000.rds
│ │ ├── [391K] ...
│ │ └── [391K] 2016.rds
│ └── [2.3G] Daily
│ ├── [393K] 20000101.rds
│ ├── [392K] ...
│ └── [395K] 20161231.rds
├── [2.3G] O3
│ ├── [6.4M] Annual
│ │ ├── [390K] 2000.rds
│ │ ├── [385K] ...
│ │ └── [384K] 2016.rds
│ ├── [2.3G] Daily
│ │ ├── [394K] 20000101.rds
│ │ ├── [388K] ...
│ │ └── [371K] 20161231.rds
│ └── [6.5M] Summer
│ ├── [391K] 2000_summer.rds
│ ├── [387K] ...
│ ├── [386K] 2016_summer.rds
│ └── [ 101] readme.txt
├── [2.3G] PM25
│ ├── [6.5M] Annual
│ │ ├── [395K] 2000.rds
│ │ ├── [392K] ...
│ │ └── [392K] 2016.rds
│ └── [2.3G] Daily
│ ├── [397K] 20000101.rds
│ ├── [395K] ...
│ └── [393K] 20161231.rds
├── [ 88M] PM25_components
│ ├── [4.4M] 2000.rds
│ ├── [4.4M] ...
│ ├── [4.4M] 2019.rds
│ └── [ 850] readme.txt
└── [ 974] README.md
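The two assignment rules in the description above (average the grid cells whose centroids fall inside a ZIP Code polygon; use the nearest grid cell for point-type ZIP Codes) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the toy coordinates and values, and the use of planar distance on raw coordinates are all assumptions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]          # (x, y), e.g. (longitude, latitude)
GridCell = Tuple[Point, float]       # (centroid, predicted value)


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if pt lies inside the polygon (vertices in order)."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_zip_mean(polygon: List[Point], grid: List[GridCell]) -> Optional[float]:
    """Average the predictions of grid cells whose centroids lie inside the polygon."""
    values = [v for centroid, v in grid if point_in_polygon(centroid, polygon)]
    return sum(values) / len(values) if values else None


def point_zip_value(location: Point, grid: List[GridCell]) -> float:
    """Assign the prediction of the nearest grid cell (planar distance, an assumption)."""
    def sq_dist(cell: GridCell) -> float:
        (cx, cy), _ = cell
        return (cx - location[0]) ** 2 + (cy - location[1]) ** 2

    return min(grid, key=sq_dist)[1]


# Toy data (hypothetical): three grid cells and one square "ZIP Code" polygon.
grid = [((0.5, 0.5), 8.0), ((1.5, 0.5), 10.0), ((2.5, 0.5), 14.0)]
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]

zip_polygon_pm25 = polygon_zip_mean(square, grid)   # first two centroids are inside
zip_point_pm25 = point_zip_value((2.4, 0.6), grid)  # third cell is nearest
```

A real workflow would more likely rely on a GIS library and a projected coordinate system; the sketch only shows the logic of the two rules.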
PM2.5 Components - Obsolete#
PM2.5 component data
spatial_coverage |
US |
spatial_resolution |
zipcode |
temporal_coverage |
2000-2019 |
temporal_resolution |
annually |
size |
251 MB |
processing_description |
Superseded by |
fasse_location |
Append |
git_repository |
|
publication |
Amini, H., M. Danesh-Yazdi, Q. Di, W. Requia, Y. Wei, Y. Abu Awad, L. Shi, M. Franklin, C.-M. Kang, J. M. Wolfson, P. James, R. Habre, Q. Zhu, J. S. Apte, Z. J. Andersen, X. Xing, C. Hultquist, I. Kloog, F. Dominici, P. Koutrakis, J. Schwartz. 2022. Annual Mean PM2.5Components (EC, NH4, NO3, OC, SO4) 50m Urban and 1km Non-Urban Area Grids for Contiguous U.S., 2000-2019 v1. (Preliminary Release). Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/7wj3-en73 |
dataset_author |
Yaguang Wei |
header |
|
files |
├── 2000.csv
├── ...
└── 2019.csv
Predicted daily smoke PM2.5 over the Contiguous US, 2006 - 2020#
Predicted daily smoke PM2.5
dataset_author |
Marissa Childs |
date_created |
October 24, 2020 |
data_source |
other (exposure predictions) |
spatial_coverage |
Contiguous US |
spatial_resolution |
originally 10 km (gridded), aggregated to zcta, census tract, and county by area and population-weighted averages |
temporal_coverage |
2006 - 2020 |
temporal_resolution |
daily |
exposures |
PM2.5 from smoke |
processing_description |
none |
fasse_location |
Append |
publication |
|
git_repository |
|
size |
6 GB |
files |
├── 10km_grid
│ ├── 10km_grid_wgs84
├── county
│ └── tl_2019_us_county
├── tract
│ └── tracts
│ ├── tl_2019_01_tract
│ ├── tl_2019_04_tract
│ ├── ...
│ └── tl_2019_56_tract
└── zcta
└── tl_2019_us_zcta510
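The spatial_resolution entry above mentions area- and population-weighted averages. As a rough sketch of how the two weightings can differ for the same region, assuming each 10 km grid cell overlapping a region comes with an area share and a population count (the numbers below are made up and the helper name is hypothetical):

```python
from typing import Optional, Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> Optional[float]:
    """Weighted average; weights are normalized by their sum."""
    total = sum(weights)
    if total == 0:
        return None  # e.g. no population in any overlapping cell
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical region covered by two grid cells.
smoke_pm25 = [2.0, 6.0]       # predicted smoke PM2.5 in each cell
area_share = [0.75, 0.25]     # share of the region's area in each cell
population = [100, 300]       # people living in each cell's overlap

area_weighted = weighted_mean(smoke_pm25, area_share)
pop_weighted = weighted_mean(smoke_pm25, population)
```

Here the population-weighted value is higher because most residents live in the smokier cell, even though it covers less of the region.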
Space weather data#
Space weather data
dataset_author |
Carolina L Zilli Vieira |
date_created |
Oct 17 2022 |
data_source |
NASA - solar and geomagnetic activity data from https://omniweb.gsfc.nasa.gov/html/omni_source.html, DAAC NASA (solar radiation) from https://daac.ornl.gov/, BARTOL Neutron Station (neutrons) from https://neutronm.bartol.udel.edu/ |
spatial_coverage |
Global UTC (from raw data) converted to local time. |
spatial_resolution |
county |
temporal_coverage |
1996-2022 |
temporal_resolution |
daily |
processing_description |
The raw data are in UTC and carry no spatial information. To obtain location-specific values, we converted the global UTC data to US local time and then used the local time zones to identify counties. The numbers change slightly by location depending on the time zone. We provide daily data, which can be aggregated to monthly and annual data. |
fasse_location |
|
git_repository |
|
size |
1.18 GB |
files |
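To illustrate why the daily numbers differ slightly by time zone, the sketch below regroups the same UTC series into local calendar days. It assumes hourly raw readings and fixed UTC offsets that ignore daylight saving time; both are simplifications introduced here, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def daily_means_local(readings, utc_offset_hours):
    """Group UTC-stamped readings by local calendar day and average them."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    by_day = defaultdict(list)
    for ts_utc, value in readings:
        local_day = ts_utc.astimezone(local_tz).date()
        by_day[local_day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in by_day.items()}


# Hypothetical series: 48 hourly readings whose value is the hours elapsed since start.
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
readings = [(start + timedelta(hours=h), float(h)) for h in range(48)]

eastern = daily_means_local(readings, -5)   # fixed UTC-5, no daylight saving
pacific = daily_means_local(readings, -8)   # fixed UTC-8, no daylight saving
jan1 = datetime(2020, 1, 1).date()
```

The local day "January 1" covers a different 24-hour window of the UTC series in each zone, so `eastern[jan1]` and `pacific[jan1]` are not equal.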
PM2.5 US High Resolution Grid, 2000-2016#
PM2.5 US Grid
spatial_coverage |
US |
spatial_resolution |
1km x 1km |
temporal_coverage |
2000-2016 |
temporal_resolution |
annually |
size |
~80 MB/year |
fasse_location |
Append |
processing_description |
Merge by row the 1-column matrix PM2.5 values ( |
publication |
Q. Di, H. Amini, L. Shi, I. Kloog, R. Silvern, J. Kelly, M. B. Sabath, C. Choirat, P. Koutrakis, A. Lyapustin, Y. Wang, L. J. Mickley, J. Schwartz, An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 130, 104909 (2019). https://pubmed.ncbi.nlm.nih.gov/31272018/ |
files |
├── PredictionStep2_Annual_PM25_USGrid_20000101_20001231.rds
├── ...
├── PredictionStep2_Annual_PM25_USGrid_20160101_20161231.rds
├── readme.txt
└── USGridSite.rds
Confounder data#
Gridmet#
This project aggregates Gridmet data into social boundaries such as zip codes or census tracts, so that it can then be joined with data available at those social units, such as Medicare or Census data. Specifically, it does the following:
Start point: GRIDMET climate data (4x4km grid), Census Bureau Zip Code Tabulation Area (ZCTA) boundaries
Aggregation technique: area weight
Output: ZCTAs with the area-weighted average of each GRIDMET variable
Gridmet data
fasse_location |
|
dataset_author |
Nate Fairbank |
date_created |
July 15, 2022 |
spatial_coverage |
Continental US |
spatial_resolution |
4x4km aggregated to Zip Code Tabulation Area (ZCTA) |
temporal_coverage |
2000-2018 |
temporal_resolution |
daily |
data_source |
GRIDMET, Census Bureau |
data_source description
All original GRIDMET variables are preserved. There are a total of 18:
Primary Climate Variables (9): Maximum temperature, minimum temperature, precipitation accumulation, downward surface shortwave radiation, wind-velocity, wind direction, humidity (maximum and minimum relative humidity and specific humidity)
Derived variables (7): Reference evapotranspiration (ASCE Penman-Monteith), Energy Release Component*, Burning Index*, 100-hour and 1000-hour dead fuel moisture, mean vapor pressure deficit, 10-day Palmer Drought Severity Index *fuel model G (conifer forest)
Variables from data processing (2):
CRS: originally “coordinate reference system”, this had a value of “1” for every grid in GRIDMET. As these grids were tabulated into ZCTAs, these “1”s were tabulated as well. Thus, this number indicates how many grids (partial or whole) were part of the area aggregation for that zip code.
AreaProp: To do the area weighting, each ZCTA/grid pairing was given a percentage of how much of the ZCTA’s area was contained in that grid. For each ZCTA, these proportions sum to 1, meaning that 100% of the ZCTA’s area was accounted for. Thus this represents a “check” on the process. A small minority of the data does NOT sum to “1”. These are cases on the edge of the map, such as the Florida Keys, that GRIDMET’s data does not fully cover.
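As a rough illustration of how the two processing variables behave, the sketch below recomputes them from a crosswalk of (ZCTA, grid, proportion) rows and flags ZCTAs whose proportions do not sum to 1. The row layout, the ZCTA codes, and the tolerance are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (zcta, grid_id, share of the ZCTA's area in that grid)
crosswalk = [
    ("12345", "A", 0.4), ("12345", "B", 0.2), ("12345", "C", 0.4),
    ("99999", "D", 0.5), ("99999", "E", 0.3),   # made-up edge-of-map ZCTA
]


def summarize(rows):
    """Return (AreaProp, CRS) per ZCTA: summed proportions and grid counts."""
    area_prop = defaultdict(float)
    crs = defaultdict(int)
    for zcta, _grid_id, prop in rows:
        area_prop[zcta] += prop   # should total 1 when fully covered
        crs[zcta] += 1            # number of grids (partial or whole) used
    return dict(area_prop), dict(crs)


def incomplete_zctas(area_prop, tol=1e-6):
    """ZCTAs whose area is not fully covered by the grid."""
    return sorted(z for z, p in area_prop.items() if abs(p - 1.0) > tol)


area_prop, crs = summarize(crosswalk)
flagged = incomplete_zctas(area_prop)
```

In this toy input, `flagged` contains only the edge-of-map ZCTA, whose proportions sum to 0.8.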
For documentation on GRIDMET variables please refer to their materials.
Notes from the GRIDMET files:
author: John Abatzoglou - University of Idaho, jabatzoglou@uidaho.edu
The projection information for this file is: GCS WGS 1984.
Citation: Abatzoglou, J.T., 2013, Development of gridded surface meteorological data for ecological applications and modeling, International Journal of Climatology, DOI: 10.1002/joc.3413
Days correspond approximately to calendar days ending at midnight, Mountain Standard Time (7 UTC the next calendar day)
ZCTAs were used because they represent the government’s “best guess” at what the spatial boundaries of a zip code are. While zip codes are commonly perceived as denoting spatial boundaries, they are in fact just a collection of addresses. Furthermore, they are “working units” that are defined and changed based on the needs (and whims) of the postal service. There is a degree of compromise/subjectivity here. The best answer would be “don’t use zip codes as a unit of analysis”. If they must be used, ZCTAs represent the best solution.
NOT ALL ZIP CODES HAVE A CORRESPONDING ZCTA. ZCTAs are a trademark of the Census Bureau, an organization fundamentally concerned with PEOPLE. Zip Codes are a trademark of the US Postal Service, an organization fundamentally concerned with MAIL. Some zip codes map to a single address or a very small collection of addresses. These represent high-volume mail facilities (such as PO boxes), and are NOT included as separate ZCTAs. While frustrating from a pure data perspective (why is there all this unmatched data!?), this makes sense from a practical perspective. If a Medicare patient gave a PO Box as their address, and we used that PO Box’s zip code to infer what their exposure was, we’d be making an inappropriate inference, as that patient doesn’t actually live inside their PO Box. If matching all these “point” zip codes is necessary, a zip to ZCTA crosswalk is available here:
/n/dominici_nsaph_l3/Lab/data/shapefiles/
Because zip codes change constantly, ZCTAs have to be updated. They were first created following the 2000 census, and started receiving annual updates in 2007. Thus, this process uses the annual file for all data for that year, and the 2000 census file for years 2000-2006.
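The year-to-boundary rule described above can be written as a small helper. This is only a sketch of the stated rule; the function name is hypothetical and the lower bound of 2000 reflects the first ZCTA release mentioned in the text.

```python
def zcta_vintage(data_year: int) -> int:
    """Return the ZCTA boundary year to use for a given data year.

    Years 2000-2006 use the 2000 census boundaries; from 2007 on,
    the annual boundary file for that same year is used.
    """
    if data_year < 2000:
        raise ValueError("ZCTAs were first created following the 2000 census")
    return 2000 if data_year <= 2006 else data_year


vintages = {year: zcta_vintage(year) for year in (2000, 2006, 2007, 2018)}
```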
The Census has made major updates to the ZCTAs every decade. For the 2000 Census, they include suffixes such as “XX” and “HH” to indicate large, unpopulated land areas such as national parks and bodies of water.
“HH” suffix used to represent large water bodies
For more about ZCTAs, read here
processing_description
Stage 1: Crosswalk Development (done in ArcGIS):
GRIDMET’s 4x4km grid was imported and transformed into defined polygon formats (rather than raster or point features)
Census Bureau’s ZCTA shapefiles for that year were imported
The “tabulate intersection” tool was used to calculate, for each ZCTA/grid pair, the proportion of the ZCTA’s area that the grid square contributed. For example, if ZCTA 12345 overlapped 3 grids, there would be three rows: (12345, Grid A, .4), (12345, Grid B, .2), (12345, Grid C, .4).
The crosswalk produced in step 3 was exported
Stage 2: Area-weighted aggregation:
The crosswalk for that year is imported.
For each day, the GRIDMET file is imported.
The data for each grid (all 16 variables) is joined to the crosswalk by lat/long pair for that grid. Note that if a grid square overlaps, say, three ZCTAs, then its data will be repeated 3 times so that it can be weighted appropriately for each ZCTA.
The data is multiplied by the ZCTA proportion for that grid square.
The data is grouped by ZCTA with the aggregation method “sum”.
That day is appended to the netCDF file
An annual netCDF file is exported.
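The join, multiply, and sum steps of Stage 2 can be sketched for a single day as follows, reusing the ZCTA 12345 example from Stage 1. This is a simplified illustration in plain Python: the real process joins on lat/long pairs and writes netCDF, whereas here grids are keyed by a made-up ID, and the variable names and values are hypothetical.

```python
from collections import defaultdict

# Crosswalk rows: (zcta, grid_id, share of the ZCTA's area in that grid)
crosswalk = [
    ("12345", "A", 0.4), ("12345", "B", 0.2), ("12345", "C", 0.4),
    ("67890", "C", 1.0),   # grid C also overlaps a second, made-up ZCTA
]

# One day of gridded data (hypothetical variable names and values)
grid_day = {
    "A": {"max_temp": 10.0, "precip": 0.0},
    "B": {"max_temp": 20.0, "precip": 5.0},
    "C": {"max_temp": 30.0, "precip": 10.0},
}


def aggregate_day(crosswalk, grid_day):
    """Area-weighted average of every variable for every ZCTA."""
    out = defaultdict(lambda: defaultdict(float))
    for zcta, grid_id, prop in crosswalk:
        # Join: a grid's data is reused once per ZCTA it overlaps.
        for var, value in grid_day[grid_id].items():
            # Multiply by the ZCTA proportion, then sum within the ZCTA.
            out[zcta][var] += value * prop
    return {zcta: dict(vals) for zcta, vals in out.items()}


zcta_day = aggregate_day(crosswalk, grid_day)
```

Because the proportions for a fully covered ZCTA sum to 1, the grouped sum is the area-weighted average; for ZCTA 12345 the maximum temperature works out to 10 × 0.4 + 20 × 0.2 + 30 × 0.4 = 20.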
Shapefiles#
Zipcode_info
dataset_author |
Yaguang Wei |
date_created |
Jun 3, 2020 |
data_source |
The daily and annual estimations of ambient PM2.5 at ZIP Codes; U.S. ZIP code database. |
spatial_coverage |
US |
spatial_resolution |
zipcode |
temporal_coverage |
2000-2016 |
temporal_resolution |
annually |
description |
For general ZIP Codes with a polygon representation, we estimated their pollution levels by averaging the predictions of grid cells whose centroids lie inside the polygon of that ZIP Code; For other ZIP Codes such as Post Offices or large volume single customers, we treated them as a single point and predicted their pollution levels by assigning the predictions of the nearest grid cell. Further description is available on Spatial_aggregation. |
git_repository |
|
fasse_location |
|
files |
pobox_csv
├── pobox_csv
│ ├── ESRI00USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI01USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI02USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI03USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI04USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI05USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI06USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI07USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI08USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI09USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI10USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI11USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI12USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI13USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI14USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI15USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI16USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI17USZIP5_POINT_WGS84_POBOX.csv
│ ├── ESRI18USZIP5_POINT_WGS84_POBOX.csv
│ └── ESRI19USZIP5_POINT_WGS84_POBOX.csv
└── polygon
├── ESRI<yy>USZIP5_POLY_WGS84.cpg
├── ESRI<yy>USZIP5_POLY_WGS84.dbf
├── ESRI<yy>USZIP5_POLY_WGS84.prj
├── ESRI<yy>USZIP5_POLY_WGS84.sbn
├── ESRI<yy>USZIP5_POLY_WGS84.sbx
├── ESRI<yy>USZIP5_POLY_WGS84.shp
├── ESRI<yy>USZIP5_POLY_WGS84.shp.xml
└── ESRI<yy>USZIP5_POLY_WGS84.shx
yy: 00, 01, ..., 18, 19
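The file names above follow a regular pattern in the two-digit year, so they can be generated rather than typed. The helper below is a small sketch based only on the names shown in the tree; the function names are hypothetical and no directory prefix is assumed.

```python
POLYGON_EXTENSIONS = [".cpg", ".dbf", ".prj", ".sbn", ".sbx", ".shp", ".shp.xml", ".shx"]


def two_digit_year(year: int) -> str:
    """Map 2000..2019 to '00'..'19', the range listed for these files."""
    if not 2000 <= year <= 2019:
        raise ValueError("files are listed for 2000-2019 only")
    return f"{year % 100:02d}"


def pobox_csv_name(year: int) -> str:
    """Name of the PO Box point file for a given year."""
    return f"ESRI{two_digit_year(year)}USZIP5_POINT_WGS84_POBOX.csv"


def polygon_file_names(year: int) -> list:
    """Names of all shapefile components for a given year."""
    stem = f"ESRI{two_digit_year(year)}USZIP5_POLY_WGS84"
    return [stem + ext for ext in POLYGON_EXTENSIONS]


example_csv = pobox_csv_name(2000)
example_shp = [name for name in polygon_file_names(2019) if name.endswith(".shp")]
```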
ZIP to ZCTA crosswalk (2015)#
zip_to_zcta
rce_location |
|
fasse_location |
|
date_created |
Nov 2, 2015 |
spatial_coverage |
contiguous US |
size |
1.8 MB |
header |
|
files |
└── Zip_to_ZCTA_crosswalk_2015_JSI.csv
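A crosswalk like this is typically used as a lookup from ZIP code to ZCTA. The sketch below shows one way to do that with the standard library. The column names `ZIP` and `ZCTA` and the sample rows are assumptions, since the header of the CSV is not listed here; adjust them to the actual file.

```python
import csv
import io

# Toy stand-in for the crosswalk CSV (hypothetical columns and rows).
sample_csv = "ZIP,ZCTA\n12345,12345\n12399,12345\n01001,01001\n"


def load_crosswalk(handle, zip_col="ZIP", zcta_col="ZCTA"):
    """Read a ZIP-to-ZCTA mapping, keeping codes as zero-padded 5-character strings."""
    return {
        row[zip_col].strip().zfill(5): row[zcta_col].strip().zfill(5)
        for row in csv.DictReader(handle)
    }


def to_zcta(zip_code, mapping):
    """Look up a ZIP code; returns None when the crosswalk has no match."""
    return mapping.get(str(zip_code).strip().zfill(5))


zip_to_zcta = load_crosswalk(io.StringIO(sample_csv))
point_zip_match = to_zcta("12399", zip_to_zcta)   # point ZIP mapped to its enclosing ZCTA
padded_match = to_zcta(1001, zip_to_zcta)         # integer input regains its leading zero
```

Keeping codes as strings avoids losing leading zeros, a common problem when ZIP codes are read as numbers.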
Other data#
The following are other commonly-used public data sources, many of which may be found in the confounders folder on FASSE.