[25.1] Simplify docker build, use node as specified in requirements #21695

Merged
mvdbeek merged 3 commits into galaxyproject:release_25.1 from mvdbeek:fix_dockerfile_client_build
Feb 2, 2026

Conversation

@mvdbeek
Member

@mvdbeek mvdbeek commented Jan 29, 2026

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek mvdbeek requested review from ksuderman and nuwang January 29, 2026 16:30
# Clean up *all* node_modules, including plugins. Everything is already built+staged.
RUN find . -name "node_modules" -type d -prune -exec rm -rf '{}' +
# Remove pre-built visualization plugin static files (not present in base image, ~220MB)
RUN find config/plugins/visualizations -mindepth 2 -maxdepth 2 -name "static" -type d -exec rm -rf '{}' +
Member

Fabulous! This should significantly reduce size.


#======================================================
# Stage 3 - Build final image based on previous stages
# Stage 2 - Build final image based on previous stage
Member

A separate stage for client builds was originally introduced for speed, because most of the time the client and server builds were stuck on IO. IIRC, this almost halved the build time. My understanding is that the client build is much faster now, but how do the build times compare with a single stage vs multiple?

Contributor

@mvdbeek how long is the Docker build taking for you now? My preliminary timing (on your previous PR) is that, since Docker can no longer build the stages in parallel, the build time went from 15-ish minutes to over 40 minutes on my M1 Mac.

Member Author

10 minutes, more or less; it'll be much faster on 26.0.

Member Author

I also somewhat doubt that parallelizing the build really helped? You're still bottlenecked by IO?
https://github.com/mvdbeek/ansible-galaxy/actions/runs/21443318585/job/61752032548 took 17 minutes against master (so 25.1); dev was 13 minutes (more or less what 26.0 will be).

Member Author
@mvdbeek mvdbeek Jan 29, 2026

The last successful container build against 25.1 was https://github.com/galaxyproject/galaxy/actions/runs/21183296928, which took about 10 minutes. So it is slightly slower, but I think the upside is that we're tracking what Galaxy does? If you want to restore the two-stage build you'd have to run the dependency installation for the client build as well (you'd then be doing that twice) or use a system method to provide node and corepack.

Contributor

I also somewhat doubt that parallelizing the build really helped? You're still bottlenecked by IO? https://github.com/mvdbeek/ansible-galaxy/actions/runs/21443318585/job/61752032548 took 17 minutes against master (so 25.1); dev was 13 minutes (more or less what 26.0 will be).

We should still see some benefit, as the client build is much more CPU intensive: compiling JavaScript, tree-shaking, etc. The CPU work can overlap with the server build's network IO, and since they download from different sources (server from PyPI, client from npm) they don't compete for the same remote resources. Unless the network is really slow, parallel downloads shouldn't be a bottleneck.

I am trying to run some timing benchmarks, but my Mac decided that today would be the opportune time to run low on disk space, and everything I do is taking 60+ minutes...

If you want to restore the two-stage build you'd have to run the dependency installation for the client build as well (you'd then be doing that twice) or use a system method to provide node and corepack.

I was about to open a PR for that when I saw this. All we really need to do is install Node.js in the client build (we don't need all of the pinned requirements to build the client) and create a virtualenv for corepack to install its shims into.
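
A minimal sketch of that approach, assuming apt-provided Node.js and a throwaway virtualenv whose bin/ directory gives corepack somewhere writable to put its shims (the stage name, paths, and build command are illustrative, not the actual Dockerfile):

FROM ubuntu:24.04 AS client_build
RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm python3-venv git \
    && rm -rf /var/lib/apt/lists/*
# Dummy virtualenv: corepack installs its yarn shim into its bin/ dir.
RUN python3 -m venv /opt/shims
ENV PATH="/opt/shims/bin:$PATH"
RUN npm install -g corepack && corepack enable --install-directory /opt/shims/bin
# Build only the client; the pinned Python requirements are not needed here.
COPY client/ /galaxy/client/
WORKDIR /galaxy/client
RUN yarn install && yarn run build-production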

Member Author

  • All you really need to do is parse out the correct version of node, create a virtualenv, install the right node via nodejs-wheel, test it, curse at ansible and/or docker, and you'll spend more hours on this than you'll ever save building this manually (roughly the steps sketched below). I think you can get much faster build times, and a benefit for the whole community, if for instance you adapted the ansible role to (optionally) use uv. But that's just my 2c ...
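
For illustration, those steps would look roughly like this, assuming nodejs-wheel is the pinned source of node (as the PR title suggests) and that the pin lives in Galaxy's pinned requirements file; treat it as a sketch, not a tested recipe:

# Hypothetical: provide node from a small virtualenv instead of the full requirements install.
RUN python3 -m venv /opt/node-venv \
    # parse the exact nodejs-wheel pin out of the requirements (path assumed)
    && grep '^nodejs-wheel' lib/galaxy/dependencies/pinned-requirements.txt > /tmp/node-req.txt \
    && /opt/node-venv/bin/pip install -r /tmp/node-req.txt
ENV PATH="/opt/node-venv/bin:$PATH"
# "test it"
RUN node --version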

Member Author
@mvdbeek mvdbeek Jan 30, 2026

For apples-to-apples comparisons I've removed the GHA caching that we're not using on the Galaxy CI either, which brings the build process down to 11 minutes on 25.1 and 8 minutes on dev (https://github.com/mvdbeek/ansible-galaxy/actions/runs/21513617589 -- note the extra Docker daemon import that takes another minute). The last successful one here was 9 minutes, so this seems close enough?

Member Author

I also added uv as an option: https://github.com/mvdbeek/ansible-galaxy/actions/runs/21515067132 -- 7 minutes, which already includes a minute of just exporting the image to Docker.
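
Conceptually the uv variant is a small swap (illustrative; the real change lives in the linked ansible-galaxy branch, and --system assumes the container installs straight into the system interpreter):

# Hypothetical: use uv's pip-compatible interface for the dependency install
RUN pip install uv \
    && uv pip install --system -r lib/galaxy/dependencies/pinned-requirements.txt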

@github-actions github-actions bot added this to the 26.1 milestone Jan 29, 2026
Member
@nuwang nuwang left a comment

I tried Keith's suggestion of creating a dummy virtualenv for the corepack shims:

Build time with 1 stage and Marius's changes included

time docker build --no-cache -f .k8s_ci.Dockerfile -t quay.io/galaxyproject/galaxy-min:latest . --platform linux/amd64
22.33s user 54.36s system 12% cpu 10:38.98 total

Build time with 2 stages and Marius's changes included

time docker build --no-cache -f .k8s_ci.Dockerfile -t quay.io/galaxyproject/galaxy-min:latest . --platform linux/amd64
22.81s user 54.37s system 12% cpu 10:23.58 total

It doesn't look like 2 stages provide much benefit anymore.

However, in neither case could I get the client to actually load; it 404s for the static files:

docker run --rm -it -p 8080:8080 quay.io/galaxyproject/galaxy-min:latest

...
uvicorn.access INFO 2026-01-31 07:09:37,296 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET / HTTP/1.1" 200
uvicorn.access INFO 2026-01-31 07:09:37,312 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /.well-known/appspecific/com.chrome.devtools.json HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,318 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /static/dist/base.css?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,328 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:23004 - "GET /static/style/jquery-ui/smoothness/jquery-ui.css?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,329 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /static/dist/libs.bundled.js?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,329 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:57402 - "GET /static/dist/analysis.bundled.js?v=1769839531000 HTTP/1.1" 404

ARG GALAXY_PLAYBOOK_BRANCH

# Add Galaxy source code
COPY . $SERVER_DIR/
Member

Was there a reason to move this up? The layer caching is affected on subsequent builds because the apt installs below are always rerun.
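
The caching concern in Dockerfile terms, illustratively (package names are placeholders):

# Cache-friendly ordering: the rarely-changing apt layer first...
RUN apt-get update && apt-get install -y --no-install-recommends git make \
    && rm -rf /var/lib/apt/lists/*
# ...the source tree last, since COPY . invalidates every layer after it
# whenever any file in the build context changes.
COPY . $SERVER_DIR/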

Member Author

Nope, I just didn't remember that Docker caches stuff; it's been years since I built anything manually.

But it's before cloning the playbook, so we always get the latest changes.
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Feb 1, 2026
This test ensures that the client build output (base.css) can be fetched
from /static/dist/base.css. This catches issues where the client build
is not properly included in Docker images, which would result in 404
errors for static files.

See: galaxyproject#21695

https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
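
For reference, the shape of such a check is a single request against the running container (hedged; the actual test lives in the referenced commit):

# Hypothetical smoke check: fail if the client build output isn't served
curl --fail --silent http://localhost:8080/static/dist/base.css -o /dev/null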
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Feb 1, 2026
Adds a check in the container image CI workflow to verify that
/static/dist/base.css can be fetched from the deployed Galaxy instance.
This catches issues where the client build is not properly included
in the Docker image, which would result in 404 errors for static files.

See: galaxyproject#21695

https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Feb 1, 2026
Adds a check in the container image CI workflow to verify that
/static/dist/base.css can be fetched from the deployed Galaxy instance.
This catches issues where the client build is not properly included
in the Docker image, which would result in 404 errors for static files.

Also fixes the Dockerfile to explicitly enable client build with
-e galaxy_build_client=true. The simplified single-stage build was
missing this flag, causing the client build to be skipped since the
playbook defaults to not building the client.

See: galaxyproject#21695

https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Feb 1, 2026
Adds a check in the container image CI workflow to verify that
/static/dist/base.css can be fetched from the deployed Galaxy instance.
This catches issues where the client build is not properly included
in the Docker image, which would result in 404 errors for static files.

Also adds explicit staging step in the Dockerfile to copy client build
output from client/dist/ to static/dist/ after the playbook runs. This
ensures the client build is staged correctly even if the playbook doesn't
run the stage-build npm script.

See: galaxyproject#21695

https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Feb 1, 2026
Adds a check in the container image CI workflow to verify that
/static/dist/base.css can be fetched from the deployed Galaxy instance.
This catches issues where the client build is not properly included
in the Docker image, which would result in 404 errors for static files.

Also removes the redundant second COPY statement in the Dockerfile.
In the original multi-stage build, the second COPY was needed to bring
in static files from the client_build stage. In the simplified single-stage
build, the first COPY already includes everything from stage1, making
the second COPY redundant.

See: galaxyproject#21695

https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
@mvdbeek
Member Author

mvdbeek commented Feb 1, 2026

Ok, the static dir was getting copied inside galaxy/static instead of replacing it; simply copying the whole galaxy dir should be sufficient. I've added a simple test that checks that the build serves the client CSS correctly.

Member
@nuwang nuwang left a comment

Thanks @mvdbeek, this looks great to me. As a general observation, it looks like the image has ballooned over time, with release 21 being around 250MB and the latest releases being double that size. It's trending entirely in the wrong direction :-) I think we will need some follow-up investigations to identify causes and fixes. I would also like to consider getting rid of the docker-galaxy-k8s repo altogether, in favour of a dedicated sub-folder in galaxy where the playbook could live. That would significantly reduce the number of layers needed for this. Feedback would be appreciated.

@mvdbeek
Member Author

mvdbeek commented Feb 2, 2026

More dependencies, more visualizations (jupyterlite alone is now 70MB, probably ships a whole WASM Python), and node being a runtime requirement now. If you build from a non-release branch you'll also be including dev requirements (to be confirmed; I see mypy/mypyc).

@mvdbeek mvdbeek merged commit ab55b0d into galaxyproject:release_25.1 Feb 2, 2026
47 of 50 checks passed
@nsoranzo nsoranzo deleted the fix_dockerfile_client_build branch February 2, 2026 14:29
@galaxyproject galaxyproject deleted a comment from github-actions bot Feb 2, 2026
@ksuderman
Contributor

@mvdbeek I really wish this hadn't been merged...

I don't really care how long it takes GitHub to build the image; I do care how long it takes me to build an image on my M1 Mac, which I may do several times per day. Building on a Mac with Apple silicon is several times slower than on Ubuntu (see the timings below) due to the translation overhead incurred building an amd64 image on arm64. I am checking my Docker installation to see if the Tahoe update broke anything, and I could always set up a Jetstream2 VM or a GitHub Action to build images, but I would prefer we didn't double the build time unless there is a very compelling use case.

          Ubuntu   M1 Mac
parallel   5:05    23:42
serial    11:52    65:00

M1 Mac Tahoe 26.2

  • Docker v29.1.5
  • Buildx v0.30.1-desktop.2

Ubuntu 24.04 (Jetstream2 VM)

  • Docker v28.4.0
  • Buildx v0.27.0
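
One local mitigation, if an amd64 image isn't strictly required: build for the native architecture and skip the emulation entirely (flag usage mirrors the commands above; the tag is illustrative).

# Hypothetical local workaround: a native arm64 build avoids amd64 emulation on Apple silicon
docker build -f .k8s_ci.Dockerfile -t galaxy-min:dev --platform linux/arm64 .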

@mvdbeek
Member Author

mvdbeek commented Feb 2, 2026

If you have contributions that improve the build time, they are certainly welcome. A working build is preferable to one that isn't, and on GitHub there is no slowdown. With uv as the dependency manager (see my comments above), the build is actually faster than it was.

@ksuderman
Contributor

Well, building the stages in parallel is one way to decrease the build time 😉 There are ways to fix the build other than this. 🤷‍♂️
