Migrating MySQL onto SSDs to help with performance. Let's see how this works out.
42% of U7 hosts, which are on the new cluster, got their MySQL migrated onto SSDs now! 🎉 will probably be done today.
Dear >150 followers, you all realize this account is mostly going to be _very_ random bits out of our work life, which means mostly screenshots of terminals, code and exceptions, yes? yes?
fun fact: each run contains progressively larger hosts. The first ones were ~3GB of MySQL data, the current ones are ~17GB and the largest one is >80GB. That one will take a while.
Some explanation on what we're even up to. The process for this is fairly straightforward:
1. mount SSD storage
2. rsync w/ MariaDB running
3. stop MariaDB
4. rsync w/o MariaDB running
5. move /var/lib/mysql away
6. bind-mount SSD storage to there
7. start MariaDB
If we're lucky and no data changes between 2. and 3., step 4 is pretty much instant. That means almost no downtime. If we're unlucky, a bunch of stuff changed which leads to >20 minutes of downtime :(
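For the curious, the seven steps above could be written down as a dry-run plan roughly like this. It's only a sketch and not our actual playbook: the mount point `/mnt/ssd-mysql`, the `.old` suffix, the `mariadb` service name, the rsync flags and the placeholder UUID are all assumptions made up for illustration. It only prints the commands, it doesn't run anything.

```python
import shlex


def migration_plan(ssd_source, ssd_mount="/mnt/ssd-mysql",
                   datadir="/var/lib/mysql", service="mariadb"):
    """Return (step, argv) pairs for the migration, without executing them.

    All defaults are hypothetical; adjust them to the host at hand.
    """
    # The same rsync call is used twice: once while the database is
    # running (slow, bulk copy) and once after it stopped (only the delta).
    sync = ["rsync", "-a", "--delete", datadir + "/", ssd_mount + "/"]
    return [
        (1, ["mount", ssd_source, ssd_mount]),
        (2, sync),
        (3, ["systemctl", "stop", service]),
        (4, sync),
        (5, ["mv", datadir, datadir + ".old"]),
        # The old directory is gone after the move, so the bind mount
        # needs a fresh, empty mount point.
        (6, ["mkdir", datadir]),
        (6, ["mount", "--bind", ssd_mount, datadir]),
        (7, ["systemctl", "start", service]),
    ]


# "UUID=0000-EXAMPLE" is a placeholder, not a real filesystem UUID.
for step, argv in migration_plan("UUID=0000-EXAMPLE"):
    print(f"{step}. {shlex.join(argv)}")
```

The downtime is whatever happens between steps 3 and 7, which is why the second rsync being nearly a no-op matters so much.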
Admittedly, you can build the whole thing way more efficiently and without any downtime using a primary/secondary setup. But any run-time we would have saved doing that, we'd have probably lost in dev-time. The latter is more limited right now, so here we are.
Admittedly-admittedly it's not very smart to have MySQL on the application hosts in the first place. Changing that is a bigger (but planned!) project, though.
aaand about 7 hours ago at 4:30 reality kicked in.
Linux has got these /dev/sdb things you can use to talk to your disks, right? Well, their names can randomly change or reorder sometimes. If you're handling multiple disks, that's a thing you should know, right? Despite knowing that, I still managed to use /dev/sdb and .../sdc directly. Two colleagues even signed off on the related Ansible playbook.
So I got the whole thing deployed and went to sleep at 2am.
Guess what happened next.
After a reboot at 4:30, sdb and sdc switched places on machholz.uberspace.de. MySQL was subsequently very unhappy about suddenly not seeing any data anymore. That made our monitoring very unhappy. Which in turn led to two of us trying to figure out what the heck happened... at 4:45 am. So 30 minutes of debugging and MySQL downtime later, we fixed the problem on that host and all other ones by using UUIDs instead of /dev/sdc. The obvious way to go in the first place.
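A check like the following might have caught this before the reboot did. It's a minimal sketch, assuming the mounts end up in an fstab-style file (our playbook isn't shown here, so that's a guess for illustration). The function name, the regex and the example entries including the UUID are made up.

```python
import re

# Kernel-assigned names like /dev/sdb or /dev/sdc1 can change between boots.
# This pattern is an assumption and only covers a few common prefixes.
KERNEL_NAME = re.compile(r"^/dev/(sd|hd|vd|xvd)[a-z]+\d*$")


def unstable_fstab_entries(fstab_text):
    """Return (device, mountpoint) pairs that use kernel device names."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and KERNEL_NAME.match(fields[0]):
            flagged.append((fields[0], fields[1]))
    return flagged


# Hypothetical fstab content, not taken from any real host.
example = """
# <source>          <target>        <type> <options> <dump> <pass>
UUID=0000-EXAMPLE   /               ext4   defaults  0      1
/dev/sdc            /mnt/ssd-mysql  ext4   defaults  0      2
/mnt/ssd-mysql      /var/lib/mysql  none   bind      0      0
"""

print(unstable_fstab_entries(example))
# → [('/dev/sdc', '/mnt/ssd-mysql')]
```

To find the UUID to use instead, `blkid` or a look into `/dev/disk/by-uuid/` will tell you.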
So, what did we learn?
You can throw two seasoned and one kinda-seasoned admin at an almost trivial task and still manage to make rookie mistakes. And that's okay. Everyone missteps sometimes. Sometimes people even mess up really, really badly. Try to not beat yourself up about it.
We fixed the problem quickly, nobody was blamed and luckily at 4am hardly anyone cares anyway.
... aand I even managed to get some sleep after 4:30!
Fixed the playbooks to use UUIDs in the future, migrated a few more hosts and even debugged the general storage performance a bit 🏎️ 🎉
We're off into an extended weekend now. Thanks for playing and see you next week 👋
aand finally moving MySQL on wirtanen.uberspace.de onto SSDs. That's the big 71GB one I was scared of earlier.
This will also mark 62 of 87 hosts migrated 🎉
@dev Is there a list of which hosts got migrated already?
@suruatoel not publicly, no. Which one(s) are you on?
@dev I’m on “wolf”. Measured by the speed of Nextcloud it has been migrated already 👍
@suruatoel wolf isn't on the new cluster yet so it won't get migrated in the near future. Once we've moved it, it will get the upgrade as well.
@dev and we are observing the impact on our infrastructure :)
The write peaks you can see are the nextcloud cron jobs running every 15 minutes.
@ops SSD go brrrrr?
@dev maybe we should update the manual and add a little random to the cron job :D
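One way "a little random" could look: derive a stable minute offset per account, so everyone still runs every 15 minutes, just not all at :00, :15, :30 and :45. This is only a sketch of the idea, not something from the manual. The function name and the choice of hashing the username are assumptions.

```python
import hashlib


def jittered_minutes(username, interval=15):
    """Return a cron minute field with a per-user offset.

    The offset comes from a hash of the username, so it is spread
    across users but stays the same every time it is computed.
    """
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % interval
    return ",".join(str(minute) for minute in range(offset, 60, interval))


# "isabell" is just an example account name.
print(jittered_minutes("isabell"), "* * * *  <the usual cron command>")
```

A truly random offset picked once at setup time would spread the load just as well. The hash only makes the result reproducible.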
@dev yes! 😍
@dev yes, and these random bits will then be connected to a story, and this will be the story of what you are doing everyday, and you will eventually read or hear this story and then you'll look back onto your work week and you'll think: "who is this person tooting there and how did they gain access to our account...?"
@dev Add a tad of explanation and mention the tools you are using and the crowd gets hooked.
@dev To be honest, that toot only has a Shannon entropy of roughly 4.37bits per character which is just _fairly_ random.
Your warning not to use your toots for cryptographic purposes is very much appreciated though. We love your entropy!
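For anyone who wants to check numbers like that on their own toots: per-character Shannon entropy is H = -Σ p(c) · log2 p(c) over the relative frequency p(c) of each character. A minimal sketch follows. The function name is made up, and it counts Unicode code points, which may or may not be how the figure above was computed.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # Sum p * log2(p) over every distinct character, then negate.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Two equally likely characters carry exactly one bit each.
print(shannon_entropy("aabb"))
# → 1.0
```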
@dev WHAT WHAT HAPPENED???
@Sonnenbrand A SERVER WENT BELLY UP :( but we managed to turn it around!
@dev story of my Linux installations. They did not find disks on boot or needed a VEEEEERY long time to resolve something on boot. Then I turned on verbose output, switched to UUIDs as a consequence of what the log showed, and have been a happy dev ever since. It can happen to everyone. And I am still wondering why UUIDs aren't the default when creating fstab (in this case)
@dev Thx! 😊
@dev Their names can change? Is this another systemd fuckup?
@tux0r well, I wouldn't call it that. Referring to devices by UUID seems safer and more stable. The partitions have unique IDs, we should use them. I'm not sure if systemd made the change, but it changed at some point in the "recent" past, yes.
@dev Thanks Luto!