LEGO Data Model#
Overview#
The LEGO Data Model is a structured framework designed to standardize data storage, structure, and management across various projects. By adopting a modular approach—similar to LEGO building blocks—we ensure consistency, reproducibility, and scalability in data organization.
Access the LEGO Catalog at 🔗 https://lego-catalog.netlify.app/

Why the LEGO Data Model?#
The LEGO Data Model is inspired by the modularity and standardization of LEGO blocks. Its key principles include:
Modular Structure: Data is organized into well-defined, reusable components.
Standardized Formats: Shared file formats ensure interoperability between datasets.
Hierarchical Organization: Data is structured by domains, subdomains, and time resolutions.
Predefined Schema: Every dataset follows a standard schema with linking elements like county IDs.

Data Standards in the LEGO Data Model#
File Formats#
To optimize storage and processing, the LEGO Data Model supports:
Parquet: Columnar storage format for tabular data.
Shapefile (SHP): Used for spatial data.
Folder Structure & Naming Conventions#
All datasets in our lab follow a structured hierarchy to ensure logical arrangement and easy access.
Folder Hierarchy Example#
<main lab folder>/lego
├── <domain>
│ ├── <subdomain>__<data_source>
│ │ ├── <geo_resolution>__<time_resolution>
│ │ │ ├── <filename>_yyyy.parquet
Key Folder Components#
Domain: Broad research category (e.g., health, environment, social).
Subdomain: Specific dataset type (e.g., hospitalization, demographics, air pollution).
Data Source: The origin of the dataset (e.g., Medicare, Medicaid).
Geographic Resolution: The spatial granularity (e.g., county, state, ZCTA).
Time Resolution: The temporal frequency (e.g., annual, monthly, daily).
File Naming Convention: Files are named <filename>_yyyy.parquet, where yyyy is the four-digit year, so each dataset can be identified consistently.
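The hierarchy above can be assembled programmatically. The following is a minimal sketch, not an official LEGO utility: the function name `lego_datapath`, the root `/lab`, and the lowercase example values are all hypothetical and only illustrate how the folder components combine into a datapath.

```python
from pathlib import PurePosixPath


def lego_datapath(root, domain, subdomain, data_source,
                  geo_resolution, time_resolution, filename, year):
    """Build <root>/lego/<domain>/<subdomain>__<data_source>/
    <geo_resolution>__<time_resolution>/<filename>_yyyy.parquet.

    Hypothetical helper: it only mirrors the folder hierarchy shown above.
    """
    return (
        PurePosixPath(root) / "lego" / domain
        / f"{subdomain}__{data_source}"            # double underscore joins the pair
        / f"{geo_resolution}__{time_resolution}"
        / f"{filename}_{int(year):04d}.parquet"    # four-digit year suffix
    )


# Example values are assumptions chosen for illustration only.
path = lego_datapath("/lab", "health", "hospitalization", "medicare",
                     "county", "annual", "hospitalization", 2015)
print(path)
```

Keeping path construction in one function means a change to the convention only has to be made in one place.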
Notes#
Files are stored yearly: each file holds one year of data, identified by the yyyy suffix in its name.
All files that share a common datapath/filename should have identical variables/columns.
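The column rule above can be checked automatically. The sketch below assumes the column names of each file have already been collected into plain lists (in practice they might be read from the Parquet metadata). The helper names `split_year` and `inconsistent_files`, the example filenames, and the column names are hypothetical. It compares columns as sets, so column order is ignored, and it uses the earliest year as the reference, which is also an assumption.

```python
import re

# Matches '<filename>_yyyy.parquet' and captures the stem and the year.
YEAR_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<year>\d{4})\.parquet$")


def split_year(filename):
    """Split '<filename>_yyyy.parquet' into (stem, year)."""
    match = YEAR_SUFFIX.match(filename)
    if match is None:
        raise ValueError(f"not a <filename>_yyyy.parquet name: {filename!r}")
    return match["stem"], int(match["year"])


def inconsistent_files(columns_by_file):
    """Report files whose columns differ from the earliest year with the same stem.

    columns_by_file maps a filename to its list of column names; all files
    are assumed to live in the same datapath folder.
    """
    groups = {}
    for name, columns in columns_by_file.items():
        stem, year = split_year(name)
        groups.setdefault(stem, []).append((year, name, set(columns)))

    problems = {}
    for members in groups.values():
        members.sort(key=lambda member: member[0])   # order by year
        reference = members[0][2]                    # earliest year's columns
        for _, name, columns in members[1:]:
            if columns != reference:
                problems[name] = {
                    "missing": sorted(reference - columns),
                    "unexpected": sorted(columns - reference),
                }
    return problems


# Hypothetical example: the 2012 file renamed one column.
columns_by_file = {
    "demographics_2010.parquet": ["county", "year", "population"],
    "demographics_2011.parquet": ["county", "year", "population"],
    "demographics_2012.parquet": ["county", "year", "pop_total"],
}
print(inconsistent_files(columns_by_file))
```

A check like this could run before new yearly files are added to a datapath, so that schema drift is caught early.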
Leveraging the LEGO Data Model for Data Requests#
The LEGO Data Model supports consistency, collaboration, and reproducibility. Researchers can search GitHub for datapaths, data dictionaries, and associated data pipelines, and collaborate through a shared, centralized dataset structure.
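Because every dataset follows the same hierarchy, available datapaths can also be listed by walking a local copy of the lego folder. This is a minimal sketch under that assumption; it does not query the LEGO Catalog or GitHub. The function name `list_datapaths` and the example folder and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_datapaths(lego_root):
    """Return one record per .parquet file found under lego_root.

    Assumes the layout
    <domain>/<subdomain>__<data_source>/<geo_resolution>__<time_resolution>/<file>.
    """
    root = Path(lego_root)
    records = []
    for file in sorted(root.glob("*/*__*/*__*/*.parquet")):
        domain, source_dir, resolution_dir = file.relative_to(root).parts[:3]
        subdomain, data_source = source_dir.split("__", 1)
        geo_resolution, time_resolution = resolution_dir.split("__", 1)
        records.append({
            "domain": domain,
            "subdomain": subdomain,
            "data_source": data_source,
            "geo_resolution": geo_resolution,
            "time_resolution": time_resolution,
            "file": file.name,
        })
    return records


if __name__ == "__main__":
    # Build a throwaway folder tree with hypothetical names to show the output shape.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp) / "lego" / "environment" / "air_pollution__example_source" / "county__annual"
        folder.mkdir(parents=True)
        (folder / "pm25_2020.parquet").touch()
        for record in list_datapaths(Path(tmp) / "lego"):
            print(record)
```

Each record exposes the folder components as separate fields, which makes it straightforward to filter by domain, data source, or resolution when preparing a data request.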