Description
Connected to issue #782
User Story
As a backend developer, I want to remove ETL execution from the CI pipeline so that pull requests run faster, CI resources are used efficiently, and test workflows scale better as SafeHome grows.
Goalset
Current State
ETL scripts currently run on every push and pull request as part of the CI pipeline, even though the datasets the ETL sources from haven't been updated in a long time.
This setup is inefficient:
- Slow pipelines: ETL adds several minutes to every PR and push.
- Wasted resources: CI compute time and network usage are consumed even when changes only require linting, builds, or tests.
- Low dataset churn:
  - Tsunamis dataset last updated in 2022
  - Liquefactions dataset last updated in 2024
  - Soft Stories last updated July 2025 and no longer appears to update regularly
We don't need to create and load a new Postgres instance with ~50MB of data in this pipeline, since we're not persisting it and are just using it to run tests.
Desired Outcome
The goal is to separate ETL from CI by:
- Removing ETL execution from default PR and push workflows
- Using static inserts, seeded data, or fixtures for tests
- Allowing ETL to run intentionally and independently when data updates are actually needed
This change should reduce CI time, lower resource usage, and make the pipeline more maintainable as the project evolves.
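To illustrate the "static inserts, seeded data, or fixtures" option, here is a minimal sketch of a seeded test database. It is only an illustration under stated assumptions: the `soft_stories` table, its columns, and the seed rows are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs on its own. The real schema and driver would come from the project.

```python
import sqlite3

# Hypothetical seed rows; the real schema and values would come from the project.
SEED_SOFT_STORIES = [
    (1, "123 Example St", "non-compliant"),
    (2, "456 Sample Ave", "compliant"),
]


def make_seeded_db() -> sqlite3.Connection:
    """Create an in-memory database preloaded with a few fixture rows.

    SQLite is used here only as a stand-in for Postgres (assumption).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE soft_stories ("
        "id INTEGER PRIMARY KEY, address TEXT, status TEXT)"
    )
    # Explicit inserts replace the full ETL run for test purposes.
    conn.executemany(
        "INSERT INTO soft_stories VALUES (?, ?, ?)", SEED_SOFT_STORIES
    )
    conn.commit()
    return conn


def count_by_status(conn: sqlite3.Connection, status: str) -> int:
    """Return how many seeded rows have the given status."""
    row = conn.execute(
        "SELECT COUNT(*) FROM soft_stories WHERE status = ?", (status,)
    ).fetchone()
    return row[0]
```

A test would call `make_seeded_db()` in its setup and query the handful of known rows, so no network access or dataset download is needed.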
Acceptance Criteria
- ETL scripts are not executed as part of the default CI workflow for pull requests and standard pushes.
- Tests rely on explicit inserts, seeded data, or fixtures rather than requiring a full ETL run.
- A clear and documented mechanism exists to run ETL separately and intentionally, such as:
  - Manual triggers
  - Scheduled jobs
  - Dedicated CI workflows
- CI pipelines are measurably faster and focus on linting, builds, and tests.
- Docker and initialization logic are updated to reflect the separation between application startup and ETL execution.
- Documentation is added explaining:
  - Why ETL was removed from CI
  - How test data is provisioned
  - How and when ETL should be run going forward
  - Any implications for CI, Docker, and local development
- There are no regressions in test reliability or production data workflows.
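One possible way to separate application startup from ETL execution is an explicit opt-in check in the initialization logic. The sketch below is an assumption, not the project's current behavior: the `RUN_ETL` environment variable and the `initialize` and `run_etl_fn` names are hypothetical and chosen only for illustration.

```python
import os
from typing import Callable, Mapping, Optional

# Values treated as "yes, run ETL" (assumed convention).
TRUTHY = {"1", "true", "yes"}


def should_run_etl(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when RUN_ETL (hypothetical variable) is explicitly set."""
    env = os.environ if env is None else env
    return env.get("RUN_ETL", "").strip().lower() in TRUTHY


def initialize(
    run_etl_fn: Callable[[], None],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Run ETL during initialization only if it was requested; skip by default."""
    if should_run_etl(env):
        run_etl_fn()
        return "etl"
    return "skipped"
```

With this shape, default PR and push workflows never set the variable and skip ETL, while a manual trigger, scheduled job, or dedicated workflow can set it intentionally.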