#homelab


I have a large pile of old 2.5" laptop HDDs (NOT SSDs) in varying sizes. I think I might do something mad with them and build a 20-drive (or more) RAID NAS.

The only thing I really need is a 20-port (or more) SATA PCIe card... Does such a thing exist?
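
On the software side, mdadm can handle an array that wide; a rough sketch, assuming the old drives show up as /dev/sdb through /dev/sdu (device names, RAID level, and mount point are just placeholders):

  # hypothetical: build one RAID 6 array out of 20 old 2.5" drives
  sudo mdadm --create /dev/md0 --level=6 --raid-devices=20 /dev/sd[b-u]
  sudo mkfs.ext4 /dev/md0
  sudo mount /dev/md0 /mnt/nas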

Phew. Okay. That was an expensive evening. But now I've got 3x 8 GB Raspberry Pi 5 on the way as replacements for my control plane Pi 4s, plus the NVMe SSDs to hopefully get past my I/O issues.

I also got a 16 GB Pi 5, again with an SSD, for some future plans.

One of the reasons I didn't want to wait much longer is the possible supply chain issues coming our way. The last time that happened, the Pis were unobtainium.

Today, at least Vault and the MONs stayed up on the controller nodes, but everything else was restarted multiple times. I'm not looking forward to what's going to happen on Friday, when Homelab service update day rolls around and I put some real load on the control plane.

The plan for now is to go and get myself some Pi 5s, NVMe HATs, and a couple of NVMe SSDs. It was clear the day would come at some point, and I can then just put the Pi 4s into the cluster as workers.

Uuuuh, I was just checking my #dockerCompose services and read that the #Matrix Sliding Sync proxy is not required anymore when using the latest #Synapse and #ElementX.

Time to clean up some things and free up resources. 🧹
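
A minimal sketch of that cleanup, assuming the proxy runs as its own compose service (the service name sliding-sync is only an example; yours may differ):

  # stop and remove the now-unneeded proxy container (hypothetical service name)
  docker compose rm -s sliding-sync
  # then drop the service block from docker-compose.yml and redeploy
  docker compose up -d --remove-orphans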

matrix.org/blog/2024/11/14/mov

You can securely message me via matrix.to/#/@stefan:stefanberg

🏷️ #HomeLab

matrix.org · Sunsetting the Sliding Sync Proxy: Moving to Native Support, by Will Lewis
Continued thread

Also, it's worth saying: While yes, my control plane has been crashing quite a bit, the cluster always self-healed. The only reason I even know something is wrong is because of the Vault pods also running on the CP nodes. They need to be manually unsealed, so they won't become ready again until I can personally intervene. But that's pretty much on purpose. Plus, because there are three of them, the Vault cluster itself is still perfectly fine.
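
For reference, the manual step after a restart looks roughly like this (the vault namespace and the vault-0 pod name are assumptions about this setup; repeat the unseal with enough key shares to reach the threshold):

  kubectl -n vault exec -it vault-0 -- vault operator unseal
  kubectl -n vault exec -it vault-0 -- vault status   # "Sealed" should now read false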

Difficult decision: Do I start writing the blog post on the controller migration, or do I go digging into the logs of this morning's 05:20 am control plane crash?

Ah, who am I kidding. Log digging it is.

But there's a pattern emerging. All of the crashes I investigated yesterday happened around xx:20, and so did the one this morning. At least the ones I didn't trigger myself by re-deploying a large Helm chart or running "journalctl -ef".
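
One way to poke at the xx:20 hypothesis, sketched with assumed times (journalctl accepts plain HH:MM timestamps, and list-timers shows whether any periodic job lines up with that minute):

  # everything logged around the 05:20 crash window
  journalctl --since "05:15" --until "05:30" --no-pager
  # any systemd timers that could fire around xx:20?
  systemctl list-timers --all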

It took a couple of weeks to learn how the values.yaml files for Rook Ceph are supposed to be written, but I have finally managed to get a Rook Ceph storage cluster configured on a Talos Kubernetes cluster!

I can do this stuff!

Next step is getting it to work with an ArgoCD CI/CD pipeline.
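
For a quick sanity check of such a cluster, the Rook toolbox works well, assuming the rook-ceph-tools deployment is enabled in the values and the default rook-ceph namespace is used:

  kubectl -n rook-ceph get cephcluster
  # full Ceph health report from inside the toolbox pod
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status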

Whoo boy. I can now bring down my k8s control plane node with a simple "journalctl -ef". 😒

On the other hand, when I don't do anything with the cluster, it seems to be reasonably stable.

Also, the Cilium operator seems to be particularly sensitive; its pod has had 44 crashes since Saturday afternoon.
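
A quick way to see those restart counts, for anyone checking their own cluster (the label selectors are assumptions and may differ between Cilium versions):

  # RESTARTS column shows the crash count
  kubectl -n kube-system get pods -l name=cilium-operator
  kubectl -n kube-system get pods -l k8s-app=cilium -o wide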

Continued thread

Alright, I've been trawling through logs for quite a while now, and I think it's not a specific issue, but just "all this stuff together is too much for a Pi 4". The Ceph MONs are regularly failing their liveness probes, and so is Vault, because the local CSI plugins keep losing their MONs. The etcd and kube-apiserver logs are full of timeouts. It looks like my setup is not actually sustainable.
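
The probe failures and the I/O pressure are visible with roughly these checks (namespaces are assumed to be the defaults; iostat needs the sysstat package):

  # recent failed-probe events for the Ceph MONs
  kubectl -n rook-ceph get events --sort-by=.lastTimestamp | grep -i unhealthy
  # watch whether the node's disk is saturated during a bad spell
  iostat -x 5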

Now to decide what to do about it.

Replied in thread

Next, I run #syncthing on my laptop and on my #homelab #intelN100 #n100 mini PC / server, which sits in the cupboard and is very #lowpower. It runs #proxmox and also has a #samba share, which lets any other network devices see the media.

With syncthing running, I always have two copies of the media, but for backups I was using #rclone to send an encrypted copy to #googledrive, which I am in the process of switching over to #nextcloud running on #hetzner.

🧵 3/4
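
The rclone leg of that backup looks roughly like this; the remote names gdrive-crypt and nextcloud-crypt and the /srv/media path are placeholders, with each crypt remote wrapping the actual storage remote:

  # one-way encrypted copy of the media to the current Google Drive remote
  rclone sync /srv/media gdrive-crypt:media --progress
  # the Nextcloud target on Hetzner would sit behind rclone's WebDAV backend
  rclone sync /srv/media nextcloud-crypt:media --progress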

Continued thread

Aha. The first issue was that for some reason, I did not install kube-vip on 2 of the 3 CP nodes. So there was only ever one host holding/serving the k8s VIP. I'm also thinking about getting some load balancing going for the k8s API. Right now, whoever holds the VIP for the API via kube-vip gets all of the requests, if I understand it correctly. Perhaps I could improve the stability by load-balancing the VIP. That's of course not possible with ARP mode, so some more reading is necessary.
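
A quick check of where kube-vip actually runs and which node currently answers for the VIP, sketched with a placeholder address (kube-vip is commonly a static pod or DaemonSet on the CP nodes):

  # which CP nodes actually have a kube-vip pod
  kubectl -n kube-system get pods -o wide | grep kube-vip
  # run on a CP node: does it currently hold the API VIP? (address is a placeholder)
  ip addr show | grep 192.168.1.100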

Continued thread

Okay, from initial investigations it looks like the crash this morning at 10:17 AM was due to kube-vip failing to do its leader election because of timeouts when contacting the k8s API, and consequently the k8s API IP going offline. That wreaked havoc in the cluster for a bit. I'm still not 100% sure whether the I/O overload I'm seeing on the CP node was created by the k8s API going down, or whether it caused the API to go down.
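
The leader election state can be inspected as a Lease object; kube-vip's control plane lease is typically called plndr-cp-lock in kube-system, though the name depends on configuration:

  # who holds the kube-vip leader lease, and when it was last renewed
  kubectl -n kube-system get lease plndr-cp-lock -o yaml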