LEGO Data Model#

Overview#

The LEGO Data Model is a structured framework designed to standardize data storage, structure, and management across various projects. By adopting a modular approach—similar to LEGO building blocks—we ensure consistency, reproducibility, and scalability in data organization.

Access the LEGO Catalog at 🔗 https://lego-catalog.netlify.app/

_images/lego.jpg

Why the LEGO Data Model?#

The LEGO Data Model is inspired by the modularity and standardization of LEGO blocks. Its key principles include:

  • Modular Structure: Data is organized into well-defined, reusable components.

  • Standardized Formats: Ensures interoperability between datasets.

  • Hierarchical Organization: Data is structured by domains, subdomains, and time resolutions.

  • Predefined Schema: Every dataset follows a standard schema with linking elements like county IDs.

_images/lego_system.png

Data Standards in the LEGO Data Model#

File Formats#

To optimize storage and processing, the LEGO Data Model supports:

  • Parquet: Columnar storage format for tabular data.

  • Shapefile (SHP): Used for spatial data.

Folder Structure & Naming Conventions#

All datasets in our lab follow a structured hierarchy to ensure logical arrangement and easy access.

Folder Hierarchy Example#

<main lab folder>/lego
├── <domain>
│   ├── <subdomain>__<data_source>
│   │   ├── <geo_resolution>__<time_resolution>
│   │   │   ├── <filename>_yyyy.parquet

Key Folder Components#

  • Domain: Broad research category (e.g., health, environment, social).

  • Subdomain: Specific dataset type (e.g., hospitalization, demographics, air pollution).

  • Data Source: The origin of the dataset (e.g., Medicare, Medicaid).

  • Geographic Resolution: The spatial granularity (e.g., county, state, ZCTA).

  • Time Resolution: The temporal frequency (e.g., annual, monthly, daily).

  • File Naming Convention: Each file is named <filename>_yyyy.parquet, where yyyy is the four-digit year, so that datasets can be identified consistently.
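
The components above can be combined into a datapath programmatically. The following is a minimal sketch, not part of the LEGO tooling itself: the function names `lego_path` and `parse_lego_path`, the root `/lab` (standing in for <main lab folder>), and the filename `counts` are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def lego_path(root, domain, subdomain, data_source,
              geo_resolution, time_resolution, filename, year):
    """Build a path following the folder hierarchy:
    <root>/lego/<domain>/<subdomain>__<data_source>/
    <geo_resolution>__<time_resolution>/<filename>_yyyy.parquet
    """
    if not 1000 <= int(year) <= 9999:
        raise ValueError("year must be a four-digit integer")
    return (
        PurePosixPath(root) / "lego" / domain
        / f"{subdomain}__{data_source}"
        / f"{geo_resolution}__{time_resolution}"
        / f"{filename}_{int(year)}.parquet"
    )


def parse_lego_path(path):
    """Split a datapath back into its components.

    Raises ValueError if the last four path segments do not follow
    the convention (double underscore separators, _yyyy suffix).
    """
    p = PurePosixPath(path)
    # The last four segments are domain, source folder, resolution folder, file.
    domain, source_dir, resolution_dir, _file_name = p.parts[-4:]
    subdomain, data_source = source_dir.split("__", 1)
    geo_resolution, time_resolution = resolution_dir.split("__", 1)
    # p.stem drops ".parquet"; the year is whatever follows the last underscore.
    filename, year = p.stem.rsplit("_", 1)
    return {
        "domain": domain,
        "subdomain": subdomain,
        "data_source": data_source,
        "geo_resolution": geo_resolution,
        "time_resolution": time_resolution,
        "filename": filename,
        "year": int(year),
    }


# Hypothetical example values; only the layout comes from the hierarchy above.
example = lego_path("/lab", "health", "hospitalization", "medicare",
                    "zcta", "annual", "counts", 2016)
print(example)
print(parse_lego_path(example))
```

Splitting on the double underscore keeps single underscores available inside component names, which is presumably why the hierarchy uses `__` as the separator.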

Notes#

  • Files are stored yearly: one file per year, with the four-digit year (yyyy) as the filename suffix.

  • All files that share a common datapath/filename should have identical variables/columns.
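
The second note can be checked mechanically. The sketch below assumes the column names of each file have already been collected into a dictionary keyed by file path (for Parquet files, `pyarrow.parquet.read_schema(path).names` is one way to obtain them). The helper names and the example paths and columns are hypothetical.

```python
import re
from collections import defaultdict

# Matches "<anything>_yyyy.parquet" and captures the part before the year.
YEAR_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<year>\d{4})\.parquet$")


def group_by_stem(paths):
    """Group yearly files that share the same datapath and filename stem."""
    groups = defaultdict(list)
    for path in paths:
        match = YEAR_SUFFIX.match(path)
        if match:  # files that do not follow the naming convention are skipped
            groups[match.group("stem")].append(path)
    return dict(groups)


def find_column_mismatches(columns_by_file):
    """Compare each file's columns with the earliest file in its group.

    Returns {path: {"missing": [...], "extra": [...]}} for files that differ.
    Column order is ignored; only the set of names is compared.
    """
    mismatches = {}
    for files in group_by_stem(columns_by_file).values():
        files = sorted(files)
        reference = set(columns_by_file[files[0]])
        for path in files[1:]:
            current = set(columns_by_file[path])
            if current != reference:
                mismatches[path] = {
                    "missing": sorted(reference - current),
                    "extra": sorted(current - reference),
                }
    return mismatches


# Hypothetical column listings for three yearly files of one dataset.
columns = {
    "zcta__annual/counts_2015.parquet": ["zcta", "year", "n"],
    "zcta__annual/counts_2016.parquet": ["zcta", "year", "n"],
    "zcta__annual/counts_2017.parquet": ["zcta", "year"],
}
print(find_column_mismatches(columns))
```

An empty result means every yearly file in each group exposes the same set of columns as the earliest year.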

Example: Navigating the Medicare Core Datasets#

  • Access the Dataset Overview: Navigate to the Home Contents tab and select medicare. View the description, metadata, file path, and keywords.

_images/lego_medicare_datapath.png

  • View the Content or Subdatasets: Click the Content or Subdatasets tabs to explore related files (e.g., mortality, admissions, outcomes).

_images/lego_medicare_content.png

  • Explore Specific Files: Click on a file to view its data dictionary (e.g., zcta_yearly counts_yyyy.parquet).

_images/lego_medicare_data_dictionary.png

Leveraging the LEGO Data Model for Data Requests#

The LEGO Data Model ensures consistency, collaboration, and reproducibility. Researchers can search for datapaths, data dictionaries, and the associated data pipelines on GitHub, and collaborate through a shared, centralized dataset structure.