And even better, this means we can remove an old VM!
But sometimes this sounds like a crime in our chat :) Poor Egon 😋
And in the meantime we've built really beautiful visualizations of our incoming queue :)
It took us some time, but you can (finally!) move your U6 account to U7 without changing your username.
Have a look! 🎉
This Uberspace-to-Uberspace-thread is great. 😄
At the DC he found out that one of our power feeds was broken. Luckily we had enough capacity on the other PDUs and the power feeds to move the systems. After that we had to boot the systems again.
On a few hosts the VMs or the DRBD needed some manual steps to recover but at 03:12 CEST the rack at FRA2 was also back up again.
That's the story from this night, and we hope such a big outage isn't waiting for us again in the near future ;)
While we were waiting for the recovery at FRA3 we got alerted about another outage. This time at FRA2, where most of the U6 systems are located. The hosts in one rack weren't reachable at all. The other racks were not affected.
So we knew this had to be either a power issue or another broken switch.
Luckily our colleague was still in Frankfurt, so he could drive to the other DC with a spare switch.
You can see the problem with the high IOPS on this graph. The second peak is our backup :)
At 03:15 CEST most of the services were green again. This issue was resolved.
After that, nearly 70% of our 83 VMs on the cluster rebooted automatically. For the other 30% the Proxmox HA manager got stuck, so we had to start them manually. And now we had 83 VMs booting at the same time. Not a good idea on spinning rust. We maxed out our read performance on the HDD pool, and it took us over an hour and some manual poking until everything was green again.
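One lesson from that boot storm: start the stuck VMs in small batches with a pause in between instead of all at once. A minimal sketch of that idea (the VM IDs, batch size, and `echo` stand-in are assumptions; on Proxmox the real start command would be `qm start <id>`, and the real pause would be much longer than a second):

```shell
# Hypothetical batch restart: start VMs a few at a time so the HDD pool
# isn't hit by dozens of simultaneous boots.
vm_ids="101 102 103 104 105 106"   # placeholder IDs
batch_size=2
i=0
for id in $vm_ids; do
  echo "starting VM $id"           # stand-in for: qm start "$id"
  i=$((i + 1))
  if [ $((i % batch_size)) -eq 0 ]; then
    sleep 1                        # in production: wait for the batch to settle
  fi
done
```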
Here we are again with the second part :)
The network was OK again at 0:34 CEST.
But our monitoring still flagged many VMs and services as unreachable. All of these VMs are located on our new Proxmox cluster, and after a short look we saw that Ceph was in an unhealthy state, blocking all requests, and failing at auto-recovery.
The cause: 50% of our OSDs (disks) were not online, so we had to start them manually. After that the cluster recovered in < 5 minutes. That was amazing!
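For the curious: finding the offline OSDs is mostly a matter of filtering `ceph osd tree` for the "down" entries and restarting their systemd units. A minimal sketch, assuming systemd-managed `ceph-osd@<id>` units (the sample tree output below is made up for illustration; in production you'd feed in the live command output):

```shell
# Hypothetical recovery helper: list OSDs marked "down" in `ceph osd tree`
# output, then restart each one via its systemd unit.
osd_tree='ID CLASS WEIGHT  TYPE NAME  STATUS
 0  hdd   1.00000      osd.0      up
 1  hdd   1.00000      osd.1      down
 2  hdd   1.00000      osd.2      down'
# In production: osd_tree=$(ceph osd tree)
down_osds=$(printf '%s\n' "$osd_tree" | awk '$NF == "down" {print $4}')
echo "$down_osds"
# Then, per OSD (commented out here, as it needs a real cluster):
# for osd in $down_osds; do systemctl start "ceph-osd@${osd#osd.}"; done
```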
So we isolated everything on our core switches. The load was going down, and the packet loss too. Then we re-enabled our network step by step and found a broken switch that our remote hands had to hard-reboot. After that the network was fully reachable again. We still need to investigate why a redundant access switch was capable of bricking our whole network.
What happened after that will follow in a few hours :)
After some time waiting for the technician to arrive at the DC, they informed us that the route we thought should be running was disabled, but the one we thought was broken was running. Nearly at the same time (at 00:06 CEST) one of us tested the OOBM connection to our second router and found it working perfectly (but with enormous load and 99% packet loss). From the debugging we now knew that the problem was in our internal network.
The problem: due to the outage, both routers weren't reachable from the outside. And a design issue we had on our todo list meant we couldn't reach our power supply to trigger a remote reboot. So we sent one of our team to the data center and also asked our provider for remote hands (in the end the technician did a great job providing "crossing fingers as a service", as we could fix everything remotely).
We are the operations team of Uberspace. We will entertain you with insights into our daily work.
Check out @hallo for support!
This server is for internal use only at the moment.