
First full week of May infra bits 2025

Scrye into the crystal ball

This week was a lot of heads down playing with firmware settings and doing some benchmarking on new hardware. Also, the usual fires and meetings and such.

Datacenter Move

Spent a fair bit of time this week configuring and looking at the new servers we have in our new datacenter. We only have management access to them, but I still (somewhat painfully) installed a few with RHEL9 to do some testing and benchmarking.

One question I was asked a while back was around our use of linux software raid over hardware raid. Historically, there were a few reasons we chose mdadm raid over hardware raid:

  • It's possible/easy to move disks to a different machine in the event of a controller failure and recover data. Or replace a failed controller with a new one and have things transparently work. With hardware raid you need to have the exact same controller and same firmware version.

  • Reporting/tools are all open source for mdadm. You can tell when a drive fails, you can easily re-add one, reshape, etc (see the sketch just after this list). With hardware raid you are using some binary-only vendor tool, and all of them are different.

  • In the distant past being able to offload to a separate cpu was nice, but these days servers have vastly faster/better cpus, so software raid should actually perform better than hardware raid (barring different settings).
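
Not exactly what we run, but as a minimal sketch (hypothetical array and device names), the kind of visibility and recovery plain mdadm tooling gives you looks like:

    cat /proc/mdstat                                 # quick overview of all arrays and any resync progress
    mdadm --detail /dev/md0                          # state, member disks, rebuild status
    mdadm --manage /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # drop a failed disk
    mdadm --manage /dev/md0 --add /dev/sdc1          # add a replacement; resync starts automatically
    mdadm --grow /dev/md0 --raid-devices=4           # reshape onto more devices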

So, I installed one with mdadm raid and another with hardware raid and did some fio benchmarking. The software raid won overall. Hardware was actually somewhat faster on writes, but the software raid murdered it in reads. Turns out the default cache settings here were write-through for software and write-back for hardware, so the difference in writes seemed attributable to that.
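
The benchmarks were just fio runs against the arrays. Not our exact job files, but a representative random-read run (device name and parameters are placeholders) looks something like:

    fio --name=randread --filename=/dev/md0 --direct=1 --rw=randread \
        --bs=4k --iodepth=32 --numjobs=4 --ioengine=libaio \
        --runtime=120 --time_based --group_reporting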

We will hopefully finish configuring firmware on all the machines early next week, then the next milestone should be network on them so we can start bootstrapping up the services there.

Builders with >32bit inodes again

We had a few builders hit the 'larger than 32 bit inode' problem again. Basically btrfs starts allocating inode numbers when installed, and builders go through a lot of them by making and deleting and making a bunch of files during builds. When the inode numbers go past 2^32, i686 builds start to fail because they cannot get an inode. I reinstalled those builders and hopefully we will be ok for a while again. I really am looking forward to i686 builds completely going away.
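
A rough way to see how close a builder is getting (relying on btrfs handing out inode numbers sequentially): create a file and see what inode number it lands on.

    touch /var/tmp/inode-probe
    stat -c %i /var/tmp/inode-probe   # anything creeping toward 4294967296 means trouble for 32-bit userspace
    rm -f /var/tmp/inode-probe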

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114484593787412504

Review of the SLZB-06M

I've been playing with Homeassistant a fair bit of late and I've collected a bunch of interesting gadgets. Today I'd like to talk about / review the SLZB-06M.

So the first obvious question: what is a SLZB-06M?

It is a small, Ukrainian-designed device that is a "Zigbee 3.0 to Ethernet, USB, and WiFi Adapter". So, basically you connect it to your wired network, or via usb, or via wifi, and it gateways that to a Zigbee network. It's really just an esp32 with a shell and ethernet/wifi/bluetooth/zigbee, but all assembled for you and ready to go.

I'm not sure if my use case is typical for this device, but it worked out for me pretty nicely. I have a pumphouse that is down a hill and completely out of line-of-sight of the main house/my wifi. I used some network over power/powerline adapters to extend a segment of my wired network over the power lines that run from the house to it, and that worked great. But then I needed some way to gateway the zigbee devices I wanted to put there back to my homeassistant server.

The device came promptly and was nicely made. It has a pretty big antenna and everything is pretty well labeled. On powering it up, home assistant detected it no problem and added it. However, then I was a bit confused. I already have a usb zigbee adapter on my home assistant box and the integration was just showing things like the temp and firmware. I had to resort to actually reading the documentation! :)

Turns out the way the zigbee integration works is via zigbee2mqtt. You add the repo for that, install the add-on and then configure a user. Then you configure the device via its web interface on the network to match that. Then, the device shows up in a zigbee2mqtt panel. Joining devices to it is a bit different from a normal wifi setup: you need to tell it to 'permit join', either anything or specific devices. Then you press the pair button (or whatever) on the device and it joins right up. Note that devices can only be joined to one zigbee network, so you have to make sure you do not add them to other zigbee adapters you have. You can set a separate queue for each one of these adapters, so you can have as many networks as you have coordinator devices for.
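
For reference, the configuration boils down to pointing zigbee2mqtt at your MQTT broker and at the adapter over TCP. A rough sketch only; the hostnames, credentials, file path and port below are all made up, and the device's web interface and your broker give you the real values to use:

    cat >> configuration.yaml <<'EOF'
    mqtt:
      server: mqtt://homeassistant.local:1883
      user: zigbee2mqtt
      password: changeme
    serial:
      port: tcp://slzb-06m.local:6638
    EOF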

You can also have the SLZB-06M act as a bluetooth gateway. I may need to do that if I ever add any bluetooth devices down there.

The web interface lets you set various network config. You can set it as a zigbee coordinator or just a router in another network. You can enable/disable bluetooth, do firmware updates (but homeassistant will do these directly via the normal integration), adjust the leds on the device (off, or night mode, etc). It even gives you a sample zigbee2mqtt config to start with.

After that it's been working great. I now have a temp sensor and a smart plug (on a heater we keep down there to keep things from freezing when it gets really cold). I'm pondering adding a sensor for our water holding tank and possibly some flow meters for the pipes from the well and to the house from the holding tank.

Overall this is a great device and I recommend it if you have a use case for it.

Slava Ukraini!

Beginning of May infra bits 2025

Scrye into the crystal ball

Wow, it's already May now. Time races by sometimes. Here's a few things I found notable in the last week:

Datacenter Move

Actual progress to report this week! Managed to get access to the management interfaces on all our new hardware in the new datacenter. Most everything is configured right in dhcp config now (the aarch64 and power10 machines still need some tweaking there).

This next week will be updating firmware, tweaking firmware config, setting up access, etc on all those interfaces. I want to try and do some testing on various raid configs for storage and standardize the firmware configs. We are going to need to learn how to configure the lpars on the power10 machines next week as well.

Then, the following week hopefully we will have at least some normal network for those hosts and can start doing installs on them.

The week after that I hope to start moving some 'early' things: possibly openqa and coreos and some of our more isolated openshift applications. That will continue the week after that, then it's time for flock, some more moving and then finally the big 'switcharoo' week on the 16th.

Also some work on moving some of our power9 hardware (soon to be the older generation) into a place where it can be added to copr for more/better/faster copr builders.

OpenShift cluster upgrades

Our openshift clusters (prod and stg) were upgraded from 4.17 to 4.18. OpenShift upgrades are really pretty nice. There was not much in the way of issues (although a staging compute node got stuck on boot and had to be power cycled).

One interesting thing with this upgrade was that support for cgroups v1 was listed as going away in 4.19. It's not been the default in a while, but our clusters were installed so long ago that they were still using it as a default.

I like that the switch is basically editing one map, changing a 1 to a 2, and then openshift reboots nodes and it's done. Very slick. I've still not done the prod cluster, but likely will next week.
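
A rough sketch of what that change looks like (check the OpenShift docs for your exact version before doing this):

    # flip the cluster-wide cgroup mode; the machine config operator then
    # rolls a reboot through the nodes
    oc patch nodes.config.openshift.io cluster --type merge \
        -p '{"spec":{"cgroupMode":"v2"}}'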

Proxy upgrades

There's been some instability with our proxies, in particular in EU and APAC. Over the coming weeks we are going to be rolling out newer/bigger/faster instances, which should hopefully reduce or eliminate the problems folks have sometimes been seeing.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114445144640282791

Late April infra bits 2025

Scrye into the crystal ball

Another week has gone by. It was a pretty quiet one for me, but it had a lot of 'calm before the storm' vibes. The storm being of course that May will be very busy setting up the new datacenter to try and migrate to it in June.

Datacenter Move

Still don't have access to our new hardware, but I'm hoping early next week I will. I did find out a good deal more about the networking there and set up our dhcp server already with all the mac addresses and IPs for the management interfaces. As soon as that comes up they should just get the right addresses and be ready to work on.
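
Nothing fancy on the dhcp side; each management interface just gets a host stanza along these lines (hostname, MAC and address here are made up):

    cat >> /etc/dhcp/dhcpd.conf <<'EOF'
    host newserver01-mgmt {
      hardware ethernet aa:bb:cc:dd:ee:ff;
      fixed-address 10.3.160.21;
    }
    EOF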

Next week then will be spent setting firmware the way we want it and testing a few install parameters to settle how we want to install the hosts, then moving on to installing all the machines.

Then on to bootstrapping things up (we need a dns server, a tftp server, etc) and then installing openshift clusters and virthosts.

So, we are still on track for the move in June as long as the management access comes in next week as planned.

nftables in production

We rolled out our switch from iptables to nftables in production on thursday. Big shout out to James Antill for all the scripting work and getting things so they could migrate without downtime.
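
I won't try to reproduce that scripting here, but for a flavor of the kind of translation involved, iptables ships a helper that converts an existing ruleset into nft syntax for review (paths here assume the usual Fedora/RHEL locations):

    # convert a saved iptables ruleset to nft syntax and sanity check it
    iptables-restore-translate -f /etc/sysconfig/iptables > /tmp/ruleset.nft
    nft -c -f /tmp/ruleset.nft      # -c: parse/check only, do not apply
    nft list ruleset | head         # once applied, inspect the live ruleset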

The switch did take a bit longer than we would have liked, and there were a few small hiccups, but overall it went pretty well.

There are still a few openqa worker machines we are going to migrate next week, but otherwise we are all switched.

Staging koji synced

To allow for some testing, I did a sync of our production koji data over to the staging instance. This takes a long long time because it loads the prod db in, vacuums it, then modifies it for staging.

There was a bit of breakage at the end (I needed to change some sequences) but otherwise it went fine and now staging has all the same tags/etc as production does.
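
Sequence fix-ups like that usually amount to bumping each affected sequence past the highest id currently in its table. Roughly (table and sequence names here are illustrative, not the actual koji schema):

    psql koji -c "SELECT setval('build_id_seq', (SELECT COALESCE(MAX(id), 1) FROM build));"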

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114405223201008788

Later April infra bits 2025

Scrye into the crystal ball

Another busy week gone by, and I'm a day late with this blog post, but still trying to keep up with it. :)

Fedora 42 out! Get it now!

Fedora 42 was released on tuesday. The "early" milestone even. There was a last minute bug found (see: https://discussion.fedoraproject.org/t/merely-booting-fedora-42-live-media-adds-a-fedora-entry-to-the-uefi-boot-menu/148774 ). Basically, booting almost any Fedora 42 live media on a UEFI system adds a "Fedora" entry for the live media to your boot manager list. This is just from booting, not installing or doing anything else. On the face of it this is really not good: we don't want live media to affect systems without installing or choosing to do so. However, in this case the added entry is pretty harmless. It will result in the live media booting again after install if you leave it attached, and if not, almost all UEFI firmware will just see that the live media isn't attached and ignore that entry.
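
If the leftover entry bugs you, it's easy to clean up from the installed system (the boot number below is only an example; list first and make sure you pick the live media entry, not your installed Fedora one):

    efibootmgr              # list entries, note the BootXXXX number of the stray live media "Fedora" entry
    efibootmgr -b 0003 -B   # delete that entry (0003 is just an example)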

In the end we decided not to try and stop the release at the last minute for this, and I think it was the right call. It's not great, but it's not all that harmful either.

Datacenter Move news

Networking got delayed and the new date we hope to be able to start setting things up is this coming friday. Sure hope that pans out as our window to setup everything for the move is shrinking.

There was some more planning ongoing, but it will be great to actually start digging in and getting things all set up.

AI Scraper news

The scrapers seem to have moved on from pagure.io. It's been basically back to a normal load for the last week or more. Sadly, they seem to have discovered koji now. I had to block a few endpoints on the web frontend to stop them. Unfortunately there was a short outage of the hub caused by this, and there were 2 builds that were corrupted as a result. Pretty aggravating.

Nftables

Worked with James to roll out our iptables->nftables switch to production. All the builders are now using nftables. Hopefully we will roll out more next week.

That's it for this week, catch everyone next week!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114371722035321670

Early Mid April infra bits 2025

Scrye into the crystal ball

Another week has gone by, and here's some more things I'd like to highlight from the last week.

Datacenter Move

I wrote up a community blog post draft with updates for the community. Hopefully it will be up early next week and I will also send a devel-announce list post and discussion thread.

We had a bit of a snafu around network cards. We missed getting 10G nics in the new aarch64 boxes we got, so we are working to acquire those soon. The plan in the new datacenter is to have everything on dual 10G nics connected to different switches, so networking folks can update them without causing us any outages.

Some new power10 machines have arrived. I'm hopeful we might be able to switch to them as part of the move. We will know more about them once we are able to get in and start configuring them.

Next week I am hoping to get out of band management access to our new hardware in the new datacenter. This should allow us to start configuring firmware and storage and possibly do initial installs to start bootstrapping things up.

Exciting times. I hope we have enough time to get everything lined up before the june switcharoo date. :)

Fun with databases

We have been having a few applications crash/loop and others behave somewhat sluggishly of late. I finally took a good look at our main postgres database server (hereafter called db01). It's always been somewhat busy, as it has a number of things using it, but once I looked at i/o: yikes. (htop's i/o tab or iotop are very handy for this sort of thing). It showed that a mailman process was using vast amounts of i/o and basically keeping the machine at 100% all the time.

A while back I set db01 to log slow queries. Looking at that log showed that what it was doing was searching the mailman.bounceevents table for all entries where 'processed' was 'f'. That table is 50GB, with bounce events going back at least 5 or 6 years. Searching around I found a 7 year old bug filed by my co-worker Aurélien: https://gitlab.com/mailman/mailman/-/issues/343

That was fixed: bounces are processed. However, nothing ever cleans up this table, at least currently. So, I proposed we just truncate the table. However, others made a good case that the less invasive change (we are in freeze after all) would just be to add an index.

So, I did some testing in staging and then made the change in production. The queries went from ~300 seconds to pretty much 0. i/o was still high, but now in the 20-30% range most of the time.
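
The index itself is a one-liner; a sketch of the kind of thing (the index name is made up, and CONCURRENTLY avoids locking the table while it builds):

    psql mailman -c "CREATE INDEX CONCURRENTLY bounceevents_processed_idx ON bounceevents (processed);"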

It's amazing what indexes will do.

Fedora 42 go for next week!

Amazingly, we made a first rc for fedora 42 and... it was GO! I think we have done this once before in all of fedora history, but it's sure pretty rare. So, look for the new release out tuesday.

I am a bit sad in that there's a bug/issue around the Xfce spin and initial setup not working. Xfce isn't a blocking deliverable, so we just have to work around it. https://bugzilla.redhat.com/show_bug.cgi?id=2358688 I am not sure what's going on with it, but you can probably avoid it by making sure to create a user/set up root in the installer.

I upgraded my machines here at home and... nothing at all broke. I didn't even have anything to look at.

comments? additions? reactions?

As always, comment on mastodon.

Early April infra bits 2025

Scrye into the crystal ball

Another week gone by and it's saturday morning again. We are in final freeze for Fedora 42 right now, so things have been a bit quieter as folks (hopefully) are focusing on quashing release blocking bugs, but there was still a lot going on.

Unsigned packages in images (again)

We had some rawhide/branched images show up again with unsigned packages. This is due to my upgrading koji packages and dropping a patch we had that tells it to never use the buildroot repo for packages (unsigned) when making images, and to instead use the compose repo for packages.

I thought this was fixed upstream, but it was not. So, the fix for now was a quick patch and update of koji. I need to talk to koji upstream about a longer term fix, or perhaps the fix is better in pungi. In any case, it should be fixed now.

Amusing idempotency issue

In general, we try and make sure our ansible playbooks are idempotent. That is, if you run it once, it puts things in the desired state, and if you run it again (or as many times as you want), it shouldn't change anything at all, as the thing is already in the desired state.

There are all sorts of reasons why this doesn't happen, sometimes easy to fix and sometimes more difficult. We do run a daily ansible-playbook run over all our playbooks with '--check --diff', that is, check what (if anything) would change and what it was.
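
That daily report is nothing more exotic than runs like this for each playbook (the playbook path is just an example):

    ansible-playbook playbooks/groups/buildvm.yml --check --diff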

I noticed on this report that all our builders were showing a change in the task that installs required packages. On looking more closely, it turns out the playbook was downgrading linux-firmware every run, and dnf-automatic was upgrading it (because the new one was marked as a security update). This was due to us specifying "kernel-firmware" as the package name, but only the older linux-firmware package provided that name, not the new one. Switching that to the new/correct 'linux-firmware' cleared up the problem.

AI scraper update

I blocked a ton of networks last week, but then I spent some time looking more closely at what they were scraping. Turns out there were 2 mirrors of projects (one of the linux kernel and one of git) that the scrapers were really really interested in. Since those mirrors had no commits or updates in the 5 years since they were initially created, I just made them both 403 in apache and... the load is really dramatically better. Almost back to normal. I have no idea why they wanted to crawl those old copies of things already available elsewhere, and I doubt this will last, but for now it gives us a bit of time to explore other options (because I am sure they will be back).
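
The apache side of that is about as simple as it gets; a sketch (the config file name and repo paths are placeholders):

    cat > /etc/httpd/conf.d/block-scraped-mirrors.conf <<'EOF'
    # return 403 for a couple of stale mirrored repos the scrapers latched onto
    <LocationMatch "^/(old-kernel-mirror|old-git-mirror)(/|$)">
        Require all denied
    </LocationMatch>
    EOF
    systemctl reload httpd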

Datacenter Move

I'm going to likely be sending out a devel-announce / community blog post next week, but for anyone who is reading this a sneak preview:

We are hopefully going to gain at least some network on our new hardware around april 16th or so. This will allow us to get in and configure firmware, decide setup plans and start installing enough machines to bootstrap things up.

The plan currently is still to do the 'switcharoo' (as I am calling it) on the week of June 16th. That's the week after devconf.cz and two weeks after flock.

For Fedora linux users, there shouldn't be much to notice. Mirrorlists will all keep working, websites, etc should keep going fine. pagure.io will not be directly affected (it's moving later in the year).

For Fedora contributors, monday and tuesday we plan to "move" the bulk of applications and services. I would suggest just trying to avoid doing much on those days as services may be moving around or broken in various ways. Starting wednesday, we hope to make sure everything is switched and fix problems or issues. In some ideal world, we could just relax then, but if not, thursday and friday will continue stabilization work.

The following week, the newest of the old machines in our current datacenter will be shipped to the new one. We will bring those up and add capacity on them (many of them will add openqa or builder resources).

That is at least the plan currently.

Spam on matrix

There's been another round of spam on matrix this last week. It's not just Fedora that's being hit, but many other communities that are on Matrix. It's also not like older communication channels (IRC) didn't have spammers on them at times in the past either. The particularly disturbing part on the matrix end is that the spammers post _very_ disturbing images. So, if you happen to look before they get redacted/deleted it's quite shocking (which is of course what the spammer wants). We have had (for a long while) a bot in place and it redacts things pretty quickly usually, but then you sometimes have a lag in matrix federation, so folks on some servers still see the images until their server gets the redaction events.

There are various ideas floated to make this better, but due to the way matrix works, along with wanting to allow new folks to ask questions/interact, there are no simple answers. It may take some adjustments to the matrix protocol.

If you are affected by this spam, you may want to set your client to not 'preview' images (so it won't load them until you click on them), and be patient as our bot bans/kicks/redacts offenders.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114286697832557392

Late March infra bits 2025

Scrye into the crystal ball

Another week, another saturday blog post.

Mass updates/reboots

We did another mass update/reboot cycle. We try and do these every so often, as the fedora release schedule permits. We usually do all our staging hosts on a monday, on tuesday a bunch of hosts that we can reboot without anyone really noticing (ie, we have HA/failover/other paths, or the service is just something that we consume, like backups), and finally on wednesday we do everything else (hosts that do cause outages).

Things went pretty smoothly this time; I had several folks helping out and that's really nice. I have done them all by myself before, but it takes a while. We also fixed a number of minor issues with hosts: serial consoles not working right, nbde not running correctly, and zabbix users not being set up correctly locally. There was also a hosted server where reverse dns was wrong, causing ansible to have the wrong fqdn and messing up our update/reboot playbook. Thanks James, Greg and Pedro!

I also used this outage to upgrade our proxies from Fedora 40 to Fedora 41.

After that our distribution of instances is:

number / ansible_distribution_version

252 41

105 9.5

21 8.10

8 40

2 9

1 43

It's interesting that we now have 2.5x as many Fedora instances as RHEL. Although that's mostly the case due to all the builders being Fedora.

The Fedora 40 GA compose breakage

Last week we got very low on space on our main fedora_koji volume. This was mostly caused by the storage folks syncing all the content to the new datacenter, which meant that it kept snapshots as it was syncing.

In an effort to free space (before I found out there was nothing we could do but wait) I removed an old composes/40/ compose. This was the final compose for Fedora 40 before it was released, and the reason we kept it in the past was to allow us to make delta rpms more easily. It's the same content as the base GA stuff, but it's in one place instead of split between the fedora and fedora-secondary trees. Unfortunately, there were some other folks using this. Internally it was being used for some things, and iot was also using it to make their daily image updates.

Fortunately, I didn't actually fully delete it, I just copied it to an archive volume, so I was able to just point the old location to the archive and everyone should be happy now.

Just goes to show that if you set something up for yourself, often (unknown to you) others find it helpful as well, so retiring things is hard. :(

New pagure.io DDoS

For the most part we are handling load ok now on pagure.io. I think this is mostly due to us adding a bunch of resources, tuning things to handle higher load and blocking some larger abusers.

However, on friday we got a new fun one: a number of IPs were crawling an old (large) git repo, grabbing git blame on every rev of every file. This wasn't causing a problem on the webserver or bandwidth side, but instead causing problems for the database/git workers. Since they had to query the db for every one of those and get a bunch of old historical data, it saturated the cpus pretty handily. I blocked access to that old repo (that's not even used anymore) and that seemed to be that, but they may come back again doing the same thing. :(

We do have an investigation open for what we want to do long term. We are looking at anubis, rate limiting, mod_qos and other options.

I really suspect these folks are just gathering content which they plan to resell to AI companies for training. Then the AI company can just say they bought it from bob's scraping service and 'openwash' the issues. No proof of course, but just a suspicion.

Final freeze coming up

Finally, the final freeze for Fedora 42 starts next tuesday, so we have been trying to land anything last minute. If you're a maintainer or contributor working on Fedora 42, do make sure you get everything lined up before the freeze!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114247602988630824

Mid Late March infra bits 2025

Scrye into the crystal ball

Fedora 42 Beta released

Fedora 42 Beta was released on tuesday. Thanks to everyone in the Fedora community who worked so hard on it. It looks to be a pretty nice release, lots of things in it and working pretty reasonably already. Do take it for a spin if you like: https://fedoramagazine.org/announcing-fedora-linux-42-beta/

Of course with the Beta out the door, our infrastructure freeze is lifted, and so I merged 11 PRs that were waiting for that on wednesday. Also, next week we are going to get in a mass update/reboot cycle before the final freeze the week after.

Ansible galaxy / collections fun

One of the things I wanted to clean up was the ansible collections that were installed on our control host. We have a number that are installed via rpm (from EPEL). Those are fine, we know they are there and what version, etc. Then, we have some that are installed via ansible. We have a requirements.txt file and running the playbook on the control host installs those exact versions of roles/collections from ansible galaxy. Finally we had a few collections installed manually. I wanted to get those moved into ansible so we would always know what we have installed and what version it was. So, simple right? Just put them in requirements.txt. I added them in there and... it said they were just not found.

The problem turned out to be that we had roles in there, but no collections anymore, so I had not added a 'collections:' section, and it was trying to find roles with those collection names. The error "not found" was 100% right, but it took me a few to realize why they were not found. :)
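
In other words, once collections are in the mix the requirements file needs both top-level sections; something like this (the role/collection names, versions and file name here are just examples):

    cat > requirements.yml <<'EOF'
    roles:
      - name: some.role
        version: 1.2.3
    collections:
      - name: community.general
        version: 9.0.0
    EOF
    ansible-galaxy role install -r requirements.yml         # just the roles
    ansible-galaxy collection install -r requirements.yml   # just the collections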

More A.I. Scrapers

AI scrapers hitting open source projects is getting a lot of buzz. I hope that some of these scraper folks will realize it's counterproductive to scrape things at a rate that makes them not work, but I'm not holding my breath.

We ran into some very heavy traffic and ended up blocking brazil from pagure.io for a while. We also added some cpus and adjusted things to handle higher load. So far we are handling things ok now and I removed the brazil blockage. But no telling when they will be back. We may well have to look at something like anubis, but I fear the scrapers would just adjust to not be something it can catch. Time will tell.

That's it for this week folks...

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114207270082200302

Mid March infra bits 2025

Scrye into the crystal ball

AI Scraper scourge

The AI scraper (I can only assume that's what they are) scourge continued, and intensified, in the last week. This time they were hitting pagure.io really quite hard. We blocked a bunch of subnets, but it's really hard to block everything without impacting legit users, and indeed, we hit several cases where we blocked legit users. Quickly reverted, but still troublesome. On thursday and friday it got even worse. I happened to notice that most of the subnets/blocks were from .br (Brazil). So, in desperation, I blocked .br entirely and that brought things back to being more responsive. I know that's not a long term solution, so I will lift that block as soon as I see the traffic diminish (which I would think it would once they realize it's not going to work).

We definitely need a better solution here. I want to find the time to look into mod_qos, where we could at least make sure important networks aren't blocked and other networks get low priority. I also added a bunch more cpus to the pagure.io vm. That also seemed to help some.

F42 Beta on the way

Fedora 42 Beta is going to be released tuesday! Shaping up to be another great release. Do download and test if you wish.

Datacenter Move

The datacenter move we are going to be doing later this year has slipped a bit later in the year. Due to some logistics we are moving to a mid June window from the May window. That does give us a bit more time, but it's still going to be a lot of work in a short window. It's also going to be right after flock. We hope to have access to the new hardware in a few weeks here so we can start to install and set up things. The actual 'switcharoo' in June will be over 3 or so days, then fixing anything that was broken by the move, and hopefully all will be set before the F43 mass rebuild.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114167827757899998