How to access CMS data on FASSE#
The following are instructions for logging in to FASSE and setting up your own workspace.
Prerequisites. Join our project group#
Get a FASRC account by requesting it here.
Navigate to the Add Grants page in portal, you will need to login with your FASRC account
Expand the plus sign next to “Other”
Find the project group you want to be added to:
Select the checkbox for the project group you want to be added to
Your PI will have to approve the addition. Once you’re notified of the approval, it can take up to an hour for your permissions to be configured. If you’re not able to access the VPN or your home directory, try waiting an hour and logging in again.
Step 1. Connect to Harvard’s VPN#
vpn.rc.fas.harvard.eduin the Cisco AnyConnect text box (see figures).
Type your username in the format
username@fasse, password and verification code (same as for FASRC).
CMS prohibits accessing data while outside of the U.S., this includes not only opening data files but also submitting code/jobs to run on the data.
Step 2. Access FASSE#
There are a few ways to access FASSE. You can access it via VDI/OoD (in the web browser) by clicking the link here: https://fasseood.rc.fas.harvard.edu/
You can also access it via command line (Terminal) by typing:
To learn more about working in the command line, check out this Unix Shell tutorial.
The username, password and verification code are the same as in the previous step (and the same as for FASRC).
For more information, see the official documentation.
Step 3. Project workspace#
Your project name should be informative for the group members and outsiders. Think of a project name in the following format:
heat_alert-mortality-reinforcement_learning or shorter
In practice, you may have multiple exposures and outcomes. In that case, use your best judgement for your project name based on the guidelines. Avoid adding information such as usernames and current date or year.
Next, you should create a folder with your project name in the NSAPH projects folder at
You can do that by opening “File System” in FAS-RC Remote Desktop and navigating to the projects folder (see Fig.).
Create there a new folder with your project name (ie,
Use your project name folder in
/n/dominici_nsaph_l3/Lab/projects as a workspace
for your analysis data and code.
Step 4. Create a git repository on GitHub#
Navigate to NSAPH Projects GitHub organization in your web browser. NSAPH Projects GitHub organization is a shared account where all NSAPH members can collaborate across many projects at once. If you are not already a member of NSAPH Projects, ask one of the admins to add you to the organization.
Crete a new git repository under NSAPH Projects and name it with your project name.
Going forward, make sure to update your GitHub repository daily with your analysis code and documentation.
If you are not familiar with using
git, check out this git tutorial.
Also, check out our guidelines for collaborative work on GitHub.
You should link your GitHub account to the FASSE workspace by typing the commands below in FASSE’s command line. By doing this, all code contributions (commits) from FASSE will be linked to your GitHub account.
git config --global user.name "Mona Lisa" git config --global user.email "firstname.lastname@example.org"
While you may find FASRC documents suggesting the use of SSH to your repositories, FASSE environments are configured specifically so that the port used for SSH is blocked. Therefore, you should use the HTTPS version of git repo address when VCS your projects. This does mean that you are required to enter username and password each time a sync is performed between remote and the local.
The prompt for your password is NOT your actual Github password. Instead, you need to enter the generated token in replacement of the password. See how to generate token here.
Step 5. Analytic Data#
Much of the NSAPH data is already available on FASSE. Check out the data catalogue here.
If you’d like to use any of the analytic datasets, create a symbolic link (symlink) of that dataset instead of creating a new copy. A symbolic link is a reference to another file or directory that the operating system interprets as a path to that file or directory (a shortcut).
This is how you create a symlink from your
data folder (in the command line):
cd data ln -s ../analytic/DATA_FOLDER .
Step 6. Setting up R and RStudio#
To load R and install packages, follow these directions.
If you’re using RStudio, you’ll need your
R_LIBS_USER path to set up the interactive session.
In RStudio, if you want to see files outside of your home directory, you can click the three dots
on the upper right-hand side of the Files window in RStudio (under the refresh arrow) and type
in the directory path you want. If you want to save files outside your home directory, you can change
your working directory using the command
setwd([directory path]) in the Console.
Step 7. Organize your folder#
Consider organizing your project folder (and repository) as follows:
project-name ├── README.md ├── data/ ├── code/ ├── figures/ ├── reports/ ├── results/ └── .gitignore
Make sure to use the
.gitignore special files. A
README.md file is a standard documentation file where you should put information about the content of your
.gitignore file tells Git which files to ignore when committing your project to the GitHub
repository. It should be located in the root directory of your repo. Large data file and sensitive data should be
ignored by Git.
Be careful not to push sensitive data on GitHub. Don’t forget that Medicare/Medicaid data should not leave Harvard,
but your analysis code should be versioned with Git. A
.gitignore file helps with that.
Add path and/or file names of your data in the
.gitignore file. You can ignore the
data sub-folder and/or
all files of a certain format like
.rst. Add these as new lines in
.gitignore. For example:
data/ *.csv *.nc *.rst
Step 8. Scratch Space#
A scratch space provides a dedicated area for temporary storage and facilitates efficient data processing. It allows you to perform tasks such as data preprocessing, computationally intensive computations, and analysis, while storing intermediate files and temporary results. It is particularly useful in high-performance computing environments, scientific research, and data-intensive tasks where speed, efficiency, and data management are critical. The scratch datasystem is highly optimized for high-throughput file read/write. This means if you have to do lots of edits to large files or work with many small files, scratch may provide a more efficient means of operating.
When to use a scratch space? Whenever you need a temporary storage for intermediate data during processing or computations, use it to store data that is being actively manipulated or processed but is not needed for long-term storage.
Example scratch space usages
When creating a large dataset compiled from several analytic datasets. Merged data is large but takes time to compile so you want to save it to use for analyses but not take up valuable space on FASSE. Instead store the compiled dataset in scratch and access each time you need to run your analysis. if the merged dataset goes unmodified for 90 days it may be deleted, in which case you will need to rerun your script to merge the data.
For bootstrap analyses it is often quicker to create a set of bootstrap datasets. these datasets individually are large and collectively would overrun the usable space on FASSE. instead, store the temporary datasets in scratch and then have your bootstrap analyses load each individually. delete after use!
The lab has scratch spaces available within FASSE and CANNON:#
How to use the scratch folder? Create your own directory in this scratch space using your project name. For example, if your FASSE project name is
heat-stress-project, then create a folder within the FASSE scratch folder:
Instead of storing large files in your local project folder, write temporary files in this path.
Use a symlink to a scratch folder Ideally, you should not include full paths in a repository’s code. To avoid including full paths in your code, create a symlink to the scratch folder inside the
data/ folder of your project using.
data/ will contains a shortcut link to the scratch folder:
ls -s /n/holyscratch01/LABS/dominici_nsaph/Lab/<project_name> .
When you write or read a file use:
# In R library(tidyverse) write_csv("data/intermediate/<project_name>/<filename>") read_csv("data/intermediate/<project_name>/<filename>")
# In Python import pandas as pd df.write_csv("data/intermediate/<project_name>/<filename>") pd.read_csv("data/intermediate/<project_name>/<filename>")
To force Github to ignore file changes in the scrath folder, you will need to add a
.gitignore inside your data folder. Here is an example of the gitignore file:
* !*/ !.gitignore !README.md
The terminology of CPUs, nodes, and cores can be confusing in FASSE. One node is like a computer, and it contains one CPU and many cores. The multiple cores can do calculations at the same time. When requesting a job in FASSE, you request a portion of a node. However, the Slurm job management system refers to CPUs as the cores of the CPU.
Make sure to acknowledge the use of FASRC in your publications. From the FASRC website, “Please use the following text as a guideline: ‘The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.’”