[Backend User Story] Revise ETL Usage #786

@NewtonLC

Connected to issue #782

User Story

As a backend developer, I want to remove ETL execution from the CI pipeline so that pull requests run faster, CI resources are used efficiently, and test workflows scale better as SafeHome grows.


Goalset

Current State

ETL scripts currently run on every push and pull request as part of the CI pipeline, even though the source datasets are rarely updated.

This setup is inefficient:

  • Slow pipelines: ETL adds several minutes to every PR and push.
  • Wasted resources: CI compute time and network usage are consumed even when changes only require linting, builds, or tests.
  • Low dataset churn:
    • Tsunamis dataset last updated in 2022
    • Liquefactions dataset last updated in 2024
    • Soft Stories dataset last updated in July 2025; it no longer appears to update regularly

The pipeline currently creates a new Postgres instance and loads ~50MB of data into it on every run. This is unnecessary, since the data is not persisted and is only used to run tests.

Desired Outcome

The goal is to separate ETL from CI by:

  • Removing ETL execution from default PR and push workflows
  • Using static inserts, seeded data, or fixtures for tests
  • Allowing ETL to run intentionally and independently when data updates are actually needed

This change should reduce CI time, lower resource usage, and make the pipeline more maintainable as the project evolves.
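
As a rough illustration of the "static inserts, seeded data, or fixtures" option, the sketch below seeds a throwaway database with a handful of rows instead of running the ETL. Everything in it is an assumption for illustration: the table name, columns, and rows are hypothetical placeholders, and `sqlite3` stands in for Postgres only so the example is self-contained.

```python
import sqlite3

# Hypothetical schema: the real SafeHome tables and columns may differ.
SEED_SCHEMA = """
CREATE TABLE soft_story_properties (
    id INTEGER PRIMARY KEY,
    address TEXT NOT NULL,
    status TEXT NOT NULL
);
"""

# Made-up placeholder rows, not real dataset records.
SEED_ROWS = [
    (1, "100 Example St", "compliant"),
    (2, "200 Sample Ave", "non-compliant"),
]


def make_seeded_connection() -> sqlite3.Connection:
    """Return an in-memory database preloaded with a few fixture rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SEED_SCHEMA)
    conn.executemany(
        "INSERT INTO soft_story_properties (id, address, status) VALUES (?, ?, ?)",
        SEED_ROWS,
    )
    conn.commit()
    return conn


def count_by_status(conn: sqlite3.Connection, status: str) -> int:
    """Example query a test might exercise against the seeded rows."""
    row = conn.execute(
        "SELECT COUNT(*) FROM soft_story_properties WHERE status = ?",
        (status,),
    ).fetchone()
    return row[0]
```

A test would call `make_seeded_connection()` in its setup and assert against the known rows, so it depends only on data checked into the repository rather than on a full ETL run.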


Acceptance Criteria

  • ETL scripts are not executed as part of the default CI workflow for pull requests and standard pushes.
  • Tests rely on explicit inserts, seeded data, or fixtures rather than requiring a full ETL run.
  • A clear and documented mechanism exists to run ETL separately and intentionally, such as:
    • Manual triggers
    • Scheduled jobs
    • Dedicated CI workflows
  • CI pipelines are measurably faster and focus on linting, builds, and tests.
  • Docker and initialization logic are updated to reflect the separation between application startup and ETL execution.
  • Documentation is added explaining:
    • Why ETL was removed from CI
    • How test data is provisioned
    • How and when ETL should be run going forward
    • Any implications for CI, Docker, and local development
  • There are no regressions in test reliability or production data workflows.
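
One possible shape for the "run ETL separately and intentionally" criterion is an opt-in entry point that does nothing unless explicitly enabled. The sketch below is only an assumption about how that could look: the `RUN_ETL` environment variable, the `--dataset` flag, the dataset identifiers, and the `run_etl` placeholder are all hypothetical and not taken from the SafeHome codebase.

```python
import argparse
import os

# Hypothetical identifiers for the datasets named in this issue.
DATASETS = ("tsunamis", "liquefactions", "soft_stories")


def etl_enabled(env=None) -> bool:
    """ETL is opt-in: it runs only when RUN_ETL (hypothetical) is truthy."""
    env = os.environ if env is None else env
    return env.get("RUN_ETL", "").strip().lower() in {"1", "true", "yes"}


def select_datasets(argv=None) -> list:
    """Return the datasets requested via --dataset, or all of them."""
    parser = argparse.ArgumentParser(description="Run ETL on demand (sketch).")
    parser.add_argument(
        "--dataset",
        choices=DATASETS,
        action="append",
        help="Dataset to refresh; repeat the flag for several. Default: all.",
    )
    args = parser.parse_args(argv)
    return list(args.dataset) if args.dataset else list(DATASETS)


def run_etl(dataset: str) -> None:
    """Placeholder for the real extract/transform/load step."""
    print(f"would run ETL for {dataset}")


def main(argv=None, env=None) -> list:
    """Run ETL for the selected datasets and return the ones processed."""
    if not etl_enabled(env):
        print("RUN_ETL is not set; skipping ETL")
        return []
    selected = select_datasets(argv)
    for dataset in selected:
        run_etl(dataset)
    return selected
```

With a gate like this, application startup and default CI runs skip ETL entirely, while a manual trigger, scheduled job, or dedicated workflow could set the variable and call `main()` for the datasets that actually changed.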

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    Status

    In Progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
