First of all, great news:
we are now running roughly 350 hosts on the Ubuntu Lucid (10.04 LTS) server flavour, on bare metal (HP rackmounts DL360/DL365/DL380/DL385 from G5 via G5p, G6 and G7, and HP BladeServers BL465c G5 and G7 with the Flex10 fabric) and on VMware machines.
This was not the case until last weekend.
In the past we were running Ubuntu Jaunty (9.04), and that had to change, because 9.04 reached end of life around the release of Ubuntu Maverick (10.10).
Well, normally it would be easy to follow the non-LTS releases with do-release-upgrade or apt-get update / apt-get dist-upgrade, but during our tests we found some really strange things.
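For the record, a chained upgrade like the one we tested (you cannot skip intermediate non-LTS releases) boils down to roughly this, run once per step:

    # Chained release upgrade 9.04 -> 9.10 -> 10.04, run once per step and reboot in between.
    sudo apt-get update && sudo apt-get dist-upgrade   # bring the current release fully up to date
    sudo do-release-upgrade                            # then move on to the next release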
We run many different services on Ubuntu, and some of them involve DRBD setups. These DRBD setups in particular gave us problems.
First, Heartbeat 1/2 no longer exists in 10.04 LTS, so we had to convert all our puppet recipes that deal with Heartbeat 1/2 to Pacemaker. This was one of the serious pain points.
Second, while we were test-upgrading from 9.04 via 9.10 to 10.04, we found out that during this upgrade all DRBD devices ended up horribly broken (we don't know why, but they were, and we had no time to investigate).
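To give an idea of what "checking the DRBD state" means here: with the standard drbd8-utils tooling, a quick sanity check looks roughly like this (the resource name r0 is just an example, not one of our actual resources):

    # Quick DRBD sanity check after a test upgrade; "r0" is only an example resource name.
    cat /proc/drbd        # module version plus per-device connection/role/disk overview
    drbdadm cstate r0     # connection state, should be "Connected"
    drbdadm dstate r0     # disk state, should be "UpToDate/UpToDate"
    drbdadm role r0       # local/peer role, e.g. "Primary/Secondary"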
Therefore, we decided that we had to completely redeploy our servers from scratch, during normal operations.
What does this mean:
- Set up the whole infrastructure, or update the existing infrastructure, to deploy Ubuntu 10.04
- Test-deploy VMware machines and bare metal test machines
- Test new hardware, especially the BL465c G7 blade servers from HP, because of the new Flex10 fabric NIC
- Test database setups with replication for our production services. Many things changed from MySQL 5.0 to 5.1. This was crucial for us, because some of our databases run under high load (IO-, CPU- and memory-wise)
- Test many Pacemaker setups and write puppet recipes for them (pacemaker + ipvs + ldirectord, pacemaker + drbd + mysql, pacemaker + apache2, pacemaker + bind, pacemaker + postfix, etc.); a sketch of one of these setups follows below this list
- Test FAI deployment of bare metal machines
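To give one concrete flavour of those Pacemaker recipes: a pacemaker + drbd + mysql setup, expressed as one-shot crm shell commands, looks roughly like the sketch below. All resource names, the DRBD resource r_mysql, the device, the mount point and the service IP are illustrative placeholders, not our production values.

    # Sketch of a pacemaker + drbd + mysql cluster configuration via the crm shell.
    # Every name, device, path and IP below is a placeholder.
    crm configure primitive p_drbd_mysql ocf:linbit:drbd \
        params drbd_resource=r_mysql op monitor interval=15s
    crm configure ms ms_drbd_mysql p_drbd_mysql \
        meta master-max=1 clone-max=2 notify=true
    crm configure primitive p_fs_mysql ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/var/lib/mysql fstype=ext3
    crm configure primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24
    crm configure primitive p_mysql lsb:mysql
    crm configure group g_mysql p_fs_mysql p_ip_mysql p_mysql
    crm configure colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
    crm configure order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start

Roughly speaking, the puppet recipes mentioned above generate variations of this kind of configuration, one per service combination.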
The problem with all of that: we only had 8 months of time, and we could not interrupt daily operations.
Result: many days with far too many hours, and a lot of brainf*ck involved.
When we started this adventure, we were 4 team members, and everybody got a share of the work.
My special topic was to rewrite the FAIManager I had written in 2008/2009. The result was DC².
I want to spare you the technical details of this adventure, but it was hard work, especially when you get new, largely untested hardware and run into problems with network boot setups.
In the last 5 days before the big bang started, I had to replace klibc's ipconfig network setup in the live-initramfs overlay with udhcpc. This was a success, but it cost working time.
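For the curious, the mechanics of that replacement in a nutshell: udhcpc does not configure the interface itself, it calls a small script with the lease data exported as environment variables, so the initramfs only needs that script plus a call along the lines of udhcpc -i eth0 -n -q -s /path/to/script. A stripped-down sketch of such a script (an illustration, not the actual live-initramfs code) looks like this:

    #!/bin/sh
    # Stripped-down udhcpc event script (sketch only, not the real live-initramfs code).
    # udhcpc calls it with $1 set to deconfig, bound or renew, and exports the lease
    # data as $interface, $ip, $subnet, $router and $dns.
    case "$1" in
        deconfig)
            ifconfig "$interface" 0.0.0.0
            ;;
        bound|renew)
            ifconfig "$interface" "$ip" netmask "${subnet:-255.255.255.0}"
            [ -n "$router" ] && route add default gw "$router"
            if [ -n "$dns" ]; then
                for d in $dns; do echo "nameserver $d"; done > /etc/resolv.conf
            fi
            ;;
    esac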
Anyhow, last weekend was the big moment for us. We started on Saturday around 10am (UTC+1), and after 36 hours we were finished.
All of our services are redundant, so we deployed the second line of our machines from scratch. We tested the product on this second line, and when we were sure that everything worked, we switched from the old Ubuntu 9.04 first-line machines to the newly deployed Ubuntu 10.04 LTS line.
After the switch we re-checked the product services, to be really sure that everything worked as before.
After the final test, we started to redeploy the first line. By Sunday evening we were ready to bring the newly deployed machines back up as the redundant line.
The last action on that Sunday was to drink some beer and smoke a cigar to celebrate our success.
All in all, it was a success: everything worked as expected, and the downtime was no more than 30 minutes.
Coming to an end: this project wouldn't have worked out without the many people involved.
- All OPS team members involved. Without their energy to work day and night, this wouldn't have worked out so nicely.
- All people working on Ubuntu and Debian, and especially my dear friends from the FAI project.
- A special thanks to Stéphane Graber and the people from the LTSP project, who already had udhcpc in their initramfs setup, which is where I got the idea and parts of the implementation.
- The people from Puppet Labs for their great software; FAI + Puppet are great!
- The people from the Qooxdoo project; this is a really nifty JavaScript framework
- The people from the Django project; the backend application runs on it
- David Fischer for his great rpc4django project, a really cool implementation of XML-RPC and JSON-RPC
- The developers of Google's Chromium browser, Mozilla Firefox and Firebug
- Hewlett-Packard for the great hardware