kojipkgs: what it is and how it works
The last week or so I have spent a ton of time on kojipkgs (sorry again for any build failures this may have caused), so I thought it would be good to outline what it is and how it’s used and finally the working setup we have now.
kojipkgs is a concept that koji has of a host/url to go to to download packages. On small koji installs this host/url can be simply the koji hub host. Or it can be a seperate host or hosts, as long as it has access to all the packages koji wants. So in practice this means it has to share a NFS mount or other shared storage with the koji hub.
For many years Fedora infrastructure had a kojipackages instance that NFS mounted the koji storage, ran a apache web server locally on port 8080 and then had a squid instance in front of that, listening on port 80 and 443 and talking to apache for any items that were not in it’s cache. This worked out well because squid could cache commonly used objects and save access to the NFS server as well as return more quickly. However, it also has drawbacks: It’s a single point of failure and it’s exposed to the outside world and squid’s SSL code has not gotten anywhere near the same scrutiny that apache’s has.
A bit of a side note, but related, lets talk about how koji does builds for a minute. If you tell koji to do an official build of some package, first it makes a mock chroot and downloads and installs the packages in the srpm-build koji group (you can see these and what packages are in them with a ‘koji list-groups’ command). Note that this install is done with the HOST dnf into the chroot. The src.rpm is then built from pkgs git and lookaside cache and then koji starts builds for any needed arches. These builds first download the src.rpm via a koji urllib2 call and the make a mock chroot and download and install all the packages in the ‘build’ group. Then rpmbuild is called, etc.
So, week before last I updated builders (as I often do) and also upgraded the kojipkgs01 squid server (running rhel7) to the latest versions. However, we started seeing downloading problems at build time. Sometimes (but rarely) dnf would just fail to download a package it needed and fail the build. I figured this would be a great time to try and remove the single point of failure of that one server.
I spun up a second kojipkgs instance (kojipkgs02) setup just like the first one (rhel7, squid, apache, NFS). Then I setup our two proxies in that main datacenter to handle kojipkgs requests and use haproxy to send them in to kojipkgs01/02. Then, a switch of dns and it was switched over. Now, however we hit a new problem. Sometimes the src.rpm download via koji was not working. It was silently downloading only part of the src.rpm and then failing the build. Additionally, the build download issues were still happening. I tried isolating to just one proxy. Just one backend. Various haproxy balancer setups so that requests that went to one backend kept on that backend. I replaced both kojipkgs instances with Fedora 25 (much newer squid version). Koji has the ability to pass several urls to the mock setup it uses for builds, so we passed it multiple urls to kojipkgs there and that seemed to get rid of that problem, but the src.rpm download issue was still there.
Finally, we tried disabling squid’s “smp mode”. Suddenly everything started working nicely.
So in the end of this saga we have:
- Nice Highy Available kojipkgs. (2 proxies using 2 backends, so we can update, restart, reboot a proxy/backend anytime we like without causing problems).
- A bug report on librepo to retry downloads the number of times it claims to in dnf.conf default even if there’s only one baseurl and it was slow/timed out: https://bugzilla.redhat.com/show_bug.cgi?id=1414116
- A koji ticket to make the src.rpm more robust: https://pagure.io/koji/issue/290 and some PR’s to change it use requests and validate the downloaded rpm already.
- I still need to file an upstream bug on squid about smp mode, but I don’t have too much info to give them (there were no errors anywhere, just some small number of connections would hang/not finish).