NSAPH PostgreSQL Data Platform#

The NSAPH Data Platform consists of a series of ETL pipelines that deploy a PostgreSQL database. The platform’s documentation is found here (and the documentation’s source code is here).

A deployed database resides on Harvard University’s FASRC clusters. Access to the FASRC clusters requires a FASRC username and password. The database is hosted in an RC environment that is accessed over SSH at nsaph.rc.fas.harvard.edu (not at fasselogin.rc.fas.harvard.edu). New DS support team members can request access to this environment via an RC ticket with the authorization of a PI. Additional documentation for superusers regarding administration of the database and onboarding of new users is found here and here.

Database querying on the nsaph host#

There are multiple ways to query the database. Two options are described below: Superset interactive sessions and Python scripts. Superset and/or database passwords are required.

Using Superset to explore the DB#

Superset allows you to work with a database interactively. It offers tools to query, visualize, and explore the data.

A Superset instance, installed and connected to the database, runs on the nsaph host.

Accessing Superset in Web Browser#

If you have been granted access to the nsaph host, first connect to the FASRC VPN (instructions here) and then to the nsaph host. Open a terminal and enter the following command:

ssh -L8088:localhost:8088 username@nsaph.rc.fas.harvard.edu

After logging in with this port forwarding in place, you can access the Superset interface locally. In your browser, open the following URL: http://localhost:8088/login/

Enter your Superset credentials.

Warning

If you don’t know your Superset credentials, contact a superuser. Superset credentials are different from database and RC credentials.

Once Superset authentication is complete, you can explore the datasets you have been granted access to. Start by going to SQL Editor under SQL Lab.

Fig. 11 Accessing SQL Lab.#

Warning

The Download to CSV option saves query results to your local machine. Do NOT use it when querying medical data.

Using Python to query the DB#

Packages that allow you to query the database come pre-installed in a conda environment on the nsaph host.

Note

If you would like to customize your own environment, follow the Customize your own conda environment section below. Not sure whether you need to? Do NOT customize it.

Step 1: Log in to the nsaph host

ssh username@nsaph.rc.fas.harvard.edu

Step 2: Initialize the Anaconda environment

/opt/anaconda3/condabin/conda init

Note

You may be prompted to restart your shell. If that is the case, exit the NSAPH host and log in again. You will only need to do step 2 once; all subsequent logins to the NSAPH host will activate Anaconda automatically.

Step 3: Check the available environments

conda env list

You should find an environment called superset_final. Activate it:

source activate superset_final

Ensure the nsaph and nsaph-utils packages are installed:

pip list | grep nsaph

Step 4: Create a symlink to an l3 <folder> to work in:

ln -s /n/dominici_nsaph_l3/<folder>

Step 5: cd into <folder> and create a database.ini file. The content should look something like:

[nsaph2]
host=localhost
database=nsaph2
user=username
password=*****

Use your database password.

Step 6: Create a Python script with functions that connect to the database and run SQL queries. A sample query file can be found here.
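For illustration, below is a minimal sketch of what such a script might look like (it is not the linked sample; psycopg2 is assumed to be available in superset_final, and the placeholder query and result.csv output name are hypothetical):

# query.py -- minimal sketch: connect using database.ini and dump a query to CSV
import csv
import sys
from configparser import ConfigParser

import psycopg2

QUERY = "SELECT 1 AS connectivity_check"  # placeholder: replace with your SQL

def connect(ini_path, section):
    # Read the [section] block of database.ini (host, database, user, password)
    parser = ConfigParser()
    parser.read(ini_path)
    return psycopg2.connect(**dict(parser.items(section)))

if __name__ == "__main__":
    ini_path, section = sys.argv[1], sys.argv[2]
    with connect(ini_path, section) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        with open("result.csv", "w", newline="") as f:  # hypothetical output name
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])
            writer.writerows(cur)

A script structured this way matches the invocation shown later in this step (python -u query.py database.ini nsaph2).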

Both query.py and database.ini should be located in l3 spaces.

Warning

While these two files can live outside of l3, executing query.py with database.ini credentials from outside of l3 will write the output file to that directory, which WILL be a violation of CMS data policies.
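One defensive option (an assumption, not something the sample script is known to do) is to have query.py refuse to run from outside of l3 before writing anything:

import os

# Abort early if the working directory is not under the l3 mount
if not os.getcwd().startswith("/n/dominici_nsaph_l3"):
    raise SystemExit("Refusing to run: current directory is outside of l3")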

To execute the query:

python -u query.py database.ini nsaph2

The query results are saved as a .csv file in the working directory. You can then use Jupyter Lab to explore the data.

Step 7: Go to your dedicated Jupyter Lab directory in NSAPH. Access Jupyter Lab via OOD here: https://fasseood.rc.fas.harvard.edu/
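Inside Jupyter Lab, a minimal sketch for a first look at the exported file with pandas (pandas is assumed to be available in your kernel; result.csv stands in for whatever file name your script writes):

import pandas as pd

# Load the CSV produced by query.py and inspect its shape and first rows
df = pd.read_csv("result.csv")
print(df.shape)
df.head()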

Customize your own conda environment#

In future cases where the nsaph packages are updated and are no longer compatible with the existing environment, follow the instructions below to set up a new environment. This also applies if you would like to install new packages beyond those in superset_final.

Step 1: Log in to the NSAPH host

ssh -L8088:localhost:8088 username@nsaph.rc.fas.harvard.edu

Step 2: Initialize the Anaconda environment

/opt/anaconda3/condabin/conda init

Note

You may be prompted to restart your shell. If that is the case, exit the NSAPH host and log in again. You will only need to do step 2 once; all subsequent logins to the NSAPH host will activate Anaconda automatically.

Step 3: Check the available environments

conda env list

Note

Make sure the name you are about to give the new environment doesn’t already exist. Otherwise the existing environment will be replaced.

Step 4: Create a new environment and specify a Python version

conda create --name envName python=3.x

The environment will take a few minutes to initialize. Activate the environment:

source activate envName

Note

Make sure the Python version is the exact one you specified when creating the new environment.
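For example, you can verify from within Python itself:

import sys

# Should print the exact version you specified with python=3.x
print(sys.version)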

Step 5: Install the most recent version of NSAPH packages

python -m pip install git+https://github.com/NSAPH-Data-Platform/nsaph-utils.git

python -m pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git

Check that the packages are correctly installed:

pip list | grep nsaph

You should see packages nsaph and nsaph-utils.

Step 6: Follow Steps 5 through 7 in the section above.

Add data to the platform#

When you have large volumes of CSV files, a standard data science workflow is to turn those files into a relational database. This increases record lookup speed by leveraging the PostgreSQL infrastructure.

For our specific databases hosted on FASSE, it is recommended that you follow the instructions below for ingesting non-health data (for ingesting health data, check out: NSAPH/data-platform-internal-docs).

Below we will walk through an example of ingesting PM2.5 data into the sandbox database (/n/dominici_nsaph_l3/Lab/data/pm25_components/pm25_components, GitHub: yycome/PM25_Components).

Note

There are both sandbox and nsaph2 databases hosted on FASSE. sandbox is used for testing, ingesting, and building the database without affecting the production database nsaph2.

Step 1: Create an empty folder and cd into it. This folder will store the files generated by the data loader.

Step 2: Activate the nsaph virtual environment for ingesting data:

source activate nsaph

Step 3: Copy your database.ini file into the directory. The content should at least contain the section:

[sandbox]
host=localhost
database=sandbox
user=username
password=*****

Step 4: We recommend doing a dry run on the CSV files before starting the loading process. You can do that with:

python -u -m nsaph.loader.project_loader --domain pm25 --data /n/dominici_nsaph_l3/Lab/data/pm25_components/pm25_components --registry pm25.yaml --dryrun --pattern "*.csv"

Note

The --domain and --registry arguments should share the same base name. E.g., --domain xyz needs to be paired with --registry xyz.yaml.

Doing a dry run has two advantages: 1) it makes sure your CSVs are ingestable by the data loader (if the loader cannot generate the YAML file from your CSVs, then asking it to ingest into the database will certainly fail); 2) you can manually examine the column types in the YAML file. In this example, column br has units of micrograms per cubic meter, so it is reasonable to expect type NUMERIC for this column.
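To make that manual inspection easier, you can pretty-print the generated registry without assuming anything about its internal layout (a sketch; PyYAML is assumed to be available in the environment):

import yaml

# Load pm25.yaml and re-emit it in a normalized, readable form for inspection
with open("pm25.yaml") as f:
    registry = yaml.safe_load(f)
print(yaml.safe_dump(registry, sort_keys=False))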

As a quick sanity check, at this point your directory should contain at least the following files:

- database.ini
- pm25.yaml
- a .log file

Step 5: Now we are ready to ingest the PM2.5 data into the database!

Note

The original path /n/dominici_nsaph_l3/Lab/data/pm25_components/pm25_components will not work because the filenames start with a year (e.g. 2018.csv). By PostgreSQL convention, table names cannot start with numbers; that is, pm25.2018 is NOT a valid table name. Therefore, you should create a copy of that folder with filenames that follow PSQL syntax (see the sketch below). Do NOT change the files in the original directory, as these data are used by other researchers.
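A minimal sketch of making such a copy with valid names (the pm25_renamed destination and the pm25_ prefix are hypothetical choices; any names that do not start with a digit will do):

import shutil
from pathlib import Path

SRC = Path("/n/dominici_nsaph_l3/Lab/data/pm25_components/pm25_components")
DST = Path("pm25_renamed")  # hypothetical destination folder
DST.mkdir(exist_ok=True)

# Copy 2018.csv -> pm25_2018.csv, etc., so derived table names are valid in PostgreSQL
for csv_file in SRC.glob("*.csv"):
    shutil.copy(csv_file, DST / f"pm25_{csv_file.name}")

Then pass the new folder to --data in the command below.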

python -u -m nsaph.loader.project_loader --domain pm25 --data /change/path/here --registry pm25.yaml --reset --pattern "*.csv" --db database.ini --connection sandbox

Step 6: In PSQL, try something like

SELECT zip, br FROM pm25.pm25_2002
LIMIT 10;

And you will find the data there!

For comprehensive documentation on the data loader (nsaph.loader.project_loader), check out: https://nsaph-data-platform.github.io/nsaph-platform-docs/common/core-platform/doc/ProjectLoader.html