[25.1] Simplify docker build, use node as specified in requirements #21695
Conversation
# Clean up *all* node_modules, including plugins. Everything is already built+staged.
RUN find . -name "node_modules" -type d -prune -exec rm -rf '{}' +
# Remove pre-built visualization plugin static files (not present in base image, ~220MB)
RUN find config/plugins/visualizations -mindepth 2 -maxdepth 2 -name "static" -type d -exec rm -rf '{}' +
Fabulous! This should significantly reduce size.
 #======================================================
-# Stage 3 - Build final image based on previous stages
+# Stage 2 - Build final image based on previous stage
A separate stage for client builds was originally introduced for speed - because most of the time, the client and server builds were stuck on IO. IIRC, this almost halved the build time. My understanding is that the client build is much faster now, but how do the build times compare with a single stage vs multiple?
@mvdbeek how long is the Docker build taking for you now? My preliminary timing (on your previous PR) is that, since Docker can no longer build the stages in parallel, the build time went from 15-ish minutes to over 40 minutes on my M1 Mac.
10 minutes, more or less; it'll be much faster on 26.0.
I also somewhat doubt that parallelizing the build really helped? You're still bottlenecked by IO?
https://github.com/mvdbeek/ansible-galaxy/actions/runs/21443318585/job/61752032548 took 17 minutes against master (so 25.1), dev was 13 minutes (more or less what 26.0 will be).
The last successful container build against 25.1 was https://github.com/galaxyproject/galaxy/actions/runs/21183296928, which took about 10 minutes. So it is slightly slower, but I think the upside is that we're tracking what Galaxy does? If you want to restore the two-stage build you'd have to run the dependency installation as well for the client build (you'd be doing that twice then), or use a system method to provide node and corepack.
> I also somewhat doubt that parallelizing the build really helped? You're still bottlenecked by IO? https://github.com/mvdbeek/ansible-galaxy/actions/runs/21443318585/job/61752032548 took 17 minutes against master (so 25.1), dev was 13 minutes (more or less what 26.0 will be).
We should still see some benefit, as the client build is much more CPU intensive: compiling JavaScript, tree-shaking, etc. The CPU work can overlap with the server build's network IO, and since they download from different sources (server from PyPI, client from npm) they don't compete for the same remote resources. Unless the network is really slow, parallel downloads shouldn't be a bottleneck.
I am trying to run some timing benchmarks, but my Mac decided that today would be the opportune time to run low on disk space, and everything I do is taking 60+ minutes...
> If you want to restore the two-stage build you'd have to run the dependency installation as well for the client build (you'd be doing that twice then), or use a system method to provide node and corepack.
I was about to open a PR for that when I saw this. All we really need to do is install Node.js in the client build (we don't need all of the pinned requirements to build the client) and create a virtual env for corepack to install shims into.
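For readers skimming the thread, a hypothetical sketch of what a separate client-build stage could look like under BuildKit; the base images, paths, and the client build command are placeholder assumptions, not the actual .k8s_ci.Dockerfile:

```Dockerfile
# Hypothetical sketch only - base images, paths, and build commands are assumptions.
# With BuildKit, stages that do not depend on each other are built concurrently,
# so the CPU-heavy client compile can overlap with the server stage's
# network-bound dependency install.

FROM node:22-slim AS client_build
WORKDIR /build
COPY client/ ./client/
# Placeholder for the actual client build script
RUN cd client && yarn install && yarn run build

FROM python:3.11-slim AS server_build
WORKDIR /galaxy/server
COPY . .
RUN python3 -m venv /galaxy/venv \
    && /galaxy/venv/bin/pip install -r lib/galaxy/dependencies/pinned-requirements.txt

FROM python:3.11-slim
COPY --from=server_build /galaxy /galaxy
# Stage the built client where Galaxy serves it from
COPY --from=client_build /build/client/dist /galaxy/server/static/dist
```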
All you really need to do is parse out the correct version of node, create a virtualenv, install the right node via nodejs-wheel, test it, curse at ansible and/or docker, and you'll spend more hours on this than you'll ever save building this manually. I think you can get much faster build times and benefit for the whole community if, for instance, you adapted the ansible role to (optionally) use uv. But that's just my 2c ...
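Purely to illustrate the steps listed above (not taken from this PR), a throwaway virtualenv built around nodejs-wheel might look like this; the version pin and whether the wheel exposes corepack are assumptions to verify:

```sh
# Sketch only - version pin and paths are assumptions.
python3 -m venv /tmp/nodevenv
. /tmp/nodevenv/bin/activate
# nodejs-wheel packages the node/npm binaries as a Python wheel; pin it to
# whatever version the client expects (e.g. from client/package.json "engines").
pip install "nodejs-wheel==22.*"
node --version
# If the wheel exposes corepack, point its shims at the venv's bin dir;
# otherwise install corepack via npm first.
corepack enable --install-directory /tmp/nodevenv/bin
yarn --version
```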
For apples-to-apples comparisons I've removed the GHA caching that we're not using on the Galaxy CI either, which brings the build process down to 11 minutes on 25.1 and 8 minutes on dev (https://github.com/mvdbeek/ansible-galaxy/actions/runs/21513617589 -- note the extra docker daemon import that takes another minute). The last successful one here was 9 minutes, so this seems close enough?
I also added uv as an option: https://github.com/mvdbeek/ansible-galaxy/actions/runs/21515067132 -- 7 minutes, which already includes a minute of just exporting the image to docker.
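For context, the uv variant is roughly the following idea; the requirements path and flags are assumptions about how it would be invoked, not the actual Ansible change:

```sh
# Sketch of using uv instead of pip for the pinned server dependencies.
pip install uv
# Same resolved requirements, installed much faster; --system targets the
# active interpreter instead of creating a new venv (path assumed).
uv pip install --system -r lib/galaxy/dependencies/pinned-requirements.txt
```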
I tried Keith's suggestion of creating a dummy virtualenv for the corepack shims:
Build time with 1 stage and Marius's changes included
time docker build --no-cache -f .k8s_ci.Dockerfile -t quay.io/galaxyproject/galaxy-min:latest . --platform linux/amd64
22.33s user 54.36s system 12% cpu 10:38.98 total
Build time with 2 stages and Marius's changes included
time docker build --no-cache -f .k8s_ci.Dockerfile -t quay.io/galaxyproject/galaxy-min:latest . --platform linux/amd64
22.81s user 54.37s system 12% cpu 10:23.58 total
It doesn't look like 2 stages provide much benefit anymore.
However, in neither case could I get the client to actually load - it 404s for the static files:
docker run --rm -it -p 8080:8080 quay.io/galaxyproject/galaxy-min:latest
...
uvicorn.access INFO 2026-01-31 07:09:37,296 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET / HTTP/1.1" 200
uvicorn.access INFO 2026-01-31 07:09:37,312 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /.well-known/appspecific/com.chrome.devtools.json HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,318 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /static/dist/base.css?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,328 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:23004 - "GET /static/style/jquery-ui/smoothness/jquery-ui.css?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,329 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:63737 - "GET /static/dist/libs.bundled.js?v=1769839531000 HTTP/1.1" 404
uvicorn.access INFO 2026-01-31 07:09:37,329 [pN:main.1,p:74,tN:MainThread] 172.253.118.141:57402 - "GET /static/dist/analysis.bundled.js?v=1769839531000 HTTP/1.1" 404
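One quick way to tell whether the 404s come from the client build missing in the image or from a path mismatch is to look inside the container; the /galaxy/server layout below is an assumption based on the usual galaxy-min image, not something verified here:

```sh
# Does the built client exist in the image at all, and where did it end up?
docker run --rm quay.io/galaxyproject/galaxy-min:latest \
    sh -c 'ls /galaxy/server/static/dist/ 2>/dev/null | head; \
           find /galaxy -name base.css -not -path "*/node_modules/*" 2>/dev/null'
```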
.k8s_ci.Dockerfile
ARG GALAXY_PLAYBOOK_BRANCH

# Add Galaxy source code
COPY . $SERVER_DIR/
Was there a reason to move this up? The layer caching is affected on subsequent builds because the apt installs below are always rerun.
Nope, I just didn't remember that docker caches stuff. It's been years since I built anything manually.
But it's before cloning the playbook, so we always get the latest changes.
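For reference, the cache-friendly ordering being discussed is simply to keep the apt layer above the source copy, since every layer after `COPY . $SERVER_DIR/` is invalidated whenever any source file changes; the package list below is illustrative, not the actual Dockerfile:

```Dockerfile
# Layers above the source copy stay cached across rebuilds.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git python3-venv \
    && rm -rf /var/lib/apt/lists/*

# The source changes on every build, so copy it as late as possible.
COPY . $SERVER_DIR/
```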
This test ensures that the client build output (base.css) can be fetched from /static/dist/base.css. This catches issues where the client build is not properly included in Docker images, which would result in 404 errors for static files. See: galaxyproject#21695 https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
Adds a check in the container image CI workflow to verify that /static/dist/base.css can be fetched from the deployed Galaxy instance. This catches issues where the client build is not properly included in the Docker image, which would result in 404 errors for static files. See: galaxyproject#21695 https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
Adds a check in the container image CI workflow to verify that /static/dist/base.css can be fetched from the deployed Galaxy instance. This catches issues where the client build is not properly included in the Docker image, which would result in 404 errors for static files. Also fixes the Dockerfile to explicitly enable client build with -e galaxy_build_client=true. The simplified single-stage build was missing this flag, causing the client build to be skipped since the playbook defaults to not building the client. See: galaxyproject#21695 https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
Adds a check in the container image CI workflow to verify that /static/dist/base.css can be fetched from the deployed Galaxy instance. This catches issues where the client build is not properly included in the Docker image, which would result in 404 errors for static files. Also adds explicit staging step in the Dockerfile to copy client build output from client/dist/ to static/dist/ after the playbook runs. This ensures the client build is staged correctly even if the playbook doesn't run the stage-build npm script. See: galaxyproject#21695 https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
Adds a check in the container image CI workflow to verify that /static/dist/base.css can be fetched from the deployed Galaxy instance. This catches issues where the client build is not properly included in the Docker image, which would result in 404 errors for static files. Also removes the redundant second COPY statement in the Dockerfile. In the original multi-stage build, the second COPY was needed to bring in static files from the client_build stage. In the simplified single-stage build, the first COPY already includes everything from stage1, making the second COPY redundant. See: galaxyproject#21695 https://claude.ai/code/session_01MUPhw6AEjCXdWutLRgu6Hz
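A minimal form of the check those commits describe could be a single curl in the workflow; the URL, port, and exact step placement are assumptions, not the actual CI YAML:

```sh
# Fail the job if the client build output isn't being served.
curl --fail --silent --show-error -o /dev/null \
    http://localhost:8080/static/dist/base.css \
  || { echo "client build missing: /static/dist/base.css did not return 200"; exit 1; }
```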
Ok, the static dir was getting copied inside galaxy/static instead of replacing it; simply copying the whole galaxy dir should be sufficient. I've added a simple test that checks that the build serves the client CSS correctly.
Thanks @mvdbeek, this looks great to me. As a general observation, it looks like the image has ballooned over time, with release 21 being around 250MB and the latest releases being double that size. It's trending entirely in the wrong direction :-) I think we will need some follow-up investigations to identify the causes and fix them. I would also like to consider getting rid of the docker-galaxy-k8s repo altogether, in favour of a dedicated sub-folder in galaxy where the playbook could live. That would significantly reduce the number of layers needed for this. Feedback would be appreciated.
More dependencies, more visualizations (jupyterlite alone is now 70MB, probably ships a whole WASM Python), and node being a runtime requirement now. If you build from a non-release branch you'll also be including dev requirements (to be confirmed, I see mypy/mypyc).
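As a starting point for that size investigation, something like the following shows which layers and directories dominate; the image tag and the /galaxy path are assumptions, and the in-container commands assume a Debian-based image with GNU coreutils:

```sh
# Per-layer sizes; the largest RUN/COPY instructions stand out immediately.
docker history quay.io/galaxyproject/galaxy-min:latest

# Heaviest directories inside the image (e.g. visualization plugins, node, dependencies).
docker run --rm quay.io/galaxyproject/galaxy-min:latest \
    sh -c 'du -xh -d 2 /galaxy 2>/dev/null | sort -rh | head -20'
```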
@mvdbeek I really wish this hadn't been merged... I don't really care how long it takes GitHub to build the image; I do care how long it takes me to build an image on my M1 Mac, which I may do several times per day. Building on a Mac with Apple silicon is dramatically slower than on Ubuntu due to the translation overhead incurred building an amd64 image on arm64. I am checking my Docker installation to see if the Tahoe update broke anything, and I could always set up a Jetstream2 VM or a GitHub action to build images, but I would prefer we didn't double the build time unless there is a very compelling use case.
(Build timings attached for an M1 Mac on Tahoe 26.2 and for Ubuntu 24.04 on a Jetstream2 VM.)
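Worth noting for local iteration only (and assuming no amd64-only binaries are required at build time): building a native arm64 image sidesteps the QEMU translation overhead entirely, while CI keeps publishing amd64:

```sh
# Native build for local testing on Apple silicon; published images stay amd64.
docker build -f .k8s_ci.Dockerfile -t galaxy-min:local-arm64 . --platform linux/arm64
```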
If you have contributions that improve the build time they are certainly welcome. A working build is preferable to one that isn't, and on GitHub there is no slowdown. With uv as the dep manager (see my comments above) the build is actually faster than it was.
Well, building the stages in parallel is one way to decrease the build time 😉 There are ways to fix the build other than this. 🤷♂️