And this is how it looks when a VM is going wild. We still need to find good values to limit the impact a single VM can have.
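
For the curious: a minimal sketch of the kind of per-VM throttling we mean, assuming a Proxmox node and its `qm` CLI. The VM ID, the disk volume name and all limit values below are placeholders, not our production settings.

```python
# Sketch: throttle one "wild" VM via Proxmox's qm CLI (values are examples only).
import subprocess

VMID = "105"  # hypothetical VM ID

# Cap the VM at roughly 2 cores of CPU time and lower its scheduling weight.
subprocess.run(["qm", "set", VMID, "--cpulimit", "2", "--cpuunits", "512"], check=True)

# Throttle its first SCSI disk to 200 IOPS and ~50 MB/s in each direction.
# The drive string has to repeat the existing volume; "local-lvm:vm-105-disk-0" is made up.
subprocess.run(
    ["qm", "set", VMID,
     "--scsi0", "local-lvm:vm-105-disk-0,iops_rd=200,iops_wr=200,mbps_rd=50,mbps_wr=50"],
    check=True,
)
```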

And even better, this means we can remove an old VM!

But sometimes this sounds like a crime in our chat :) Poor Egon 😋

A day you can disable a legacy service is a good day! 🎉

Today we moved some old, ugly DHCP stuff for network booting during installation to the new setup. :)

And in the meantime we've built some really beautiful visualizations of our incoming queue :)

It took us a while but we managed to get some output from the PDU on our console server 😅

It took us some time, but you can (finally!) move your U6 account to U7 without changing your username.

Have a look! 🎉

wiki.uberspace.de/uberspace2ub

SNMP is such a nightmare... over and over again. But there is no choice, we need it to get our metrics :(
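
If you've never had the pleasure: here is a small sketch of the kind of SNMP poll this is about, assuming the classic pysnmp hlapi. The host name, community string and example OID are placeholders; a real PDU would expose its own power/outlet OIDs instead.

```python
# Sketch: fetch one value via SNMPv2c with pysnmp (host/community/OID are examples).
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    """Fetch a single OID via SNMPv2c and return its value as a string."""
    error_indication, error_status, _, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),          # SNMPv2c
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity(oid)),
        )
    )
    if error_indication or error_status:
        raise RuntimeError(f"SNMP error: {error_indication or error_status}")
    return str(var_binds[0][1])

if __name__ == "__main__":
    # sysDescr.0 as a harmless example OID.
    print(snmp_get("pdu.example.internal", "1.3.6.1.2.1.1.1.0"))
```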

Our support team asked us for some insights from their ticket system, so we built them a nice dashboard (to be extended in the future).

(And no, we won't analyze the individual performance of our team members.)

If you still have a U6 Uberspace and want to help us out a bit, just migrate it to U7 :)
U6 will be shut down at the end of the year, and we'd really like to have all of you on the new platform by then :) :uberspace:

At the DC he found out that one of our power feeds was broken. Luckily we had enough capacity on the other PDUs and power feeds to move the systems over. After that we had to boot the systems again.

On a few hosts the VMs or the DRBD needed some manual steps to recover, but at 03:12 CEST the rack at FRA2 was back up again as well.

That's the story of this night, and we hope that such a big outage isn't waiting for us again in the near future ;)

While we were waiting for the recovery at FRA3, we got alerted about another outage. This time at FRA2, where most of the U6 systems are located. The hosts in one rack weren't reachable at all; the other racks were not affected.

So we knew that this had to be a power issue or another broken switch.

It was lucky that our colleague was still in Frankfurt, so he could drive to the other DC with a spare switch.

You can see the problem with the high IOPS on this graph. The second peak is our backup :)

At 03:15 CEST most of the services were green again. This issue was resolved.

After that, nearly 70% of our 83 VMs on the cluster rebooted automatically. For the other 30% the Proxmox HA manager got stuck, so we had to start them manually. And now we had 83 VMs booting at the same time. Not a good idea on spinning rust. We maxed out the read performance of the HDD pool, and it took us over an hour and some manual poking until everything was green again.
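
One lesson here: don't start everything at once on HDDs. A rough sketch of how a staggered start could look, assuming the Proxmox `qm` CLI on the node; the VM IDs, batch size and delay are made-up placeholders.

```python
# Sketch: stagger VM starts so the HDD pool isn't hit by dozens of boots at once.
import subprocess
import time

VMIDS = ["101", "102", "103", "104"]   # placeholder: the VMs the HA manager left stopped
BATCH_SIZE = 5                         # how many VMs may boot in parallel
DELAY_BETWEEN_BATCHES = 120            # seconds to let the HDD pool settle

for i in range(0, len(VMIDS), BATCH_SIZE):
    batch = VMIDS[i:i + BATCH_SIZE]
    for vmid in batch:
        # check=False: a single failing VM shouldn't abort the whole run
        subprocess.run(["qm", "start", vmid], check=False)
    if i + BATCH_SIZE < len(VMIDS):
        time.sleep(DELAY_BETWEEN_BATCHES)
```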

Here we are again with the second part :)

Network was OK at 0:34 CEST.

But our monitoring still flagged many VMs and services as unreachable. All of these VMs are located on our new Proxmox cluster, and after a short look we saw that Ceph was in an unhealthy state, blocking all requests and failing to auto-recover.

This was caused by the fact that 50% of our OSDs (disks) were not online, so we had to start them manually. After that the cluster recovered in < 5 minutes. That was amazing!
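
"Start them manually" was basically this, per node. A small sketch, assuming the usual `ceph osd tree -f json` output and the standard `ceph-osd@<id>` systemd unit names; adapt before trusting it.

```python
# Sketch: find the OSDs Ceph reports as "down" and start their systemd units locally.
import json
import subprocess

tree = json.loads(
    subprocess.run(["ceph", "osd", "tree", "-f", "json"],
                   check=True, capture_output=True, text=True).stdout
)

down_osds = [n["id"] for n in tree["nodes"]
             if n.get("type") == "osd" and n.get("status") == "down"]

for osd_id in down_osds:
    print(f"starting osd.{osd_id}")
    # Only the OSDs that live on this host will actually start; the rest just fail quietly.
    subprocess.run(["systemctl", "start", f"ceph-osd@{osd_id}"], check=False)
```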

So we isolated everything on our core switches. The load was going down, and so was the packet loss. Then we re-enabled our network step by step and found a broken switch that needed a hard reboot by our remote hands. After that the network was fully reachable again. We still need to investigate why a redundant access switch was capable of bricking our whole network.

What happened after that will follow in a few hours :)

After some time of waiting for the technician to arrive at the DC, they informed us that the router we thought should be running was disabled, but the one we thought was broken was running. Nearly at the same time (at 00:06 CEST) one of us tested the OOBM connection to our second router and found it working (but with enormous load and 99% packet loss). From that debugging we now knew that the problem was in our internal network.

The problem: due to the outage, both routers weren't reachable from the outside. And a design issue that was already on our todo list meant we couldn't reach our power supply to trigger a remote reboot. So we sent one of our team to the data center and also asked our provider for remote hands (in the end the technician did a great job providing "crossing fingers as a service", as we could fix everything remotely).
