Omega Owners Forum
Chat Area => General Discussion Area => Topic started by: TheBoy on 08 October 2017, 12:35:38
-
Unfortunately, we suffered a bizarre incident last night that caused everything here to fail, including the alerting system, which meant it went undetected until early this morning.
Clearly we are back up and running, but we will have to go down again for an hour or so later to resync the databases - not quite sure when, as I still have a lot to do getting all the infrastructure up and running.
It looks like we had a near-simultaneous failure of 2 of our 3 storage solutions: one a hyperconverged, high-availability system, the other a standalone SAN. I'm trying to work back through the logs to see whether it genuinely was 3 simultaneous storage server failures (unlikely), 2 simultaneous hypervisor failures (unlikely), or something common that I haven't yet found (likely).
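The reasoning here is that several independent failures landing at almost the same moment point to a shared cause. A minimal sketch of that first log-triage step, assuming failure events have already been pulled out of each system's logs as (component, timestamp) pairs: sort by time and group any events whose gap is within a chosen window. The function name, component names, window and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def cluster_failures(events, window):
    """Group (component, timestamp) events into near-simultaneous clusters.

    Events are sorted by time; a new cluster starts whenever the gap to the
    previous event exceeds `window`. Clusters containing several independent
    components are the ones worth checking for a shared cause.
    """
    clusters = []
    for component, ts in sorted(events, key=lambda e: e[1]):
        if clusters and ts - clusters[-1][-1][1] <= window:
            clusters[-1].append((component, ts))
        else:
            clusters.append([(component, ts)])
    return clusters


# Illustrative only: made-up component names and timestamps.
t0 = datetime(2000, 1, 1, 23, 0)
events = [
    ("storage-a", t0 + timedelta(seconds=65)),
    ("storage-b", t0 + timedelta(seconds=100)),
    ("storage-c", t0 + timedelta(seconds=130)),
    ("other-service", t0 + timedelta(hours=3)),
]
for group in cluster_failures(events, timedelta(minutes=2)):
    print([name for name, _ in group])
# prints ['storage-a', 'storage-b', 'storage-c'] then ['other-service']
```

A cluster only tells you where to look; the shared cause (power, network, a common hypervisor) still has to be found in the logs themselves.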
Sorry, as always, for the outage, and the fact you've probably had to talk to wives/partners or go down the pub!
-
I missed you, obviously, but considering how long the Forum has been online and how few failures there have been, you do a great job. Thanks. :y 8)
Ron.
-
You shouldn’t really be hyperconverging on a Sunday, Jaime, try sitting in the bath till you go all wrinkly, much more relaxing. ;D
-
I agree, sterling job :y
-
Well done for getting it sorted, :y :y :y not the sort of stress you need, especially at the weekend. :( :( :(
-
Nice one :y
-
We owe you again TB! :-* :-* :y :y
-
The outage to resync the databases will likely be early tomorrow morning now, which hopefully will minimise inconvenience :)
-
I wondered where it'd gone. Good job someone knows what they're doing :y Don't know how you do it for the money :D ;D
-
I think it's a labour of love, sweat and tears. ::) ::) ::)
-
Mind you it could be TB's cull list that's overloading the system in the first place ;D
-
As soon as it went off line, I said to myself, I bet that's the hyperconvergence again. Great to be proved right. ::) ::) ::)
-
It was down last night when I got in just before 11pm, and the same this morning at about 8:30am, and the thought went through my mind: does he know it's down, and then gas bottles and garages. :o :o :o
-
does he know it's down and then gas bottles and garages. :o :o :o
That very thing was mentioned at work on one of our conf calls, as one of the guys couldn't make it due to having leccy and gas meters changed. "TB is good with gas and leccy" was the smart alec comment >:(
-
Well, to be fair :-X :-X ::) :D
-
Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.
-
Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
-
???
-
Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
Punishment will be swift if you do...
-
Punishment will be swift if you do...
Mmm newbie docker again
-
Mmm newbie docker again
Probably... Been a while :-X ::) ;D
-
Moving the webserver and primary database off the flash storage onto a traditional HDD RAID 10 storage device.
There should be no outage (it's moving right now), but obviously spinning media is slower than flash, so page load times may increase slightly.
This is to run diags on the flash storage that crashed at the weekend.
-
^^^
Wot he said. ::) ::) ::)
-
I would have done exactly the same. :y
;D ;D
-
Blimey, WTF is happening lately >:(
One of the hypervisors has sorta crashed, following a minor health alert this morning. What should happen is that if the hardware health is suspect, all the load moves over to another, healthy hypervisor. In this case it's just shat its pants. The VMs are still running, but unmanageable.
So, no choice but to power cycle the underlying hardware, which will ungracefully power cycle the VMs running on it.
From an OOF perspective, this includes the primary webserver, the primary database, and the secondary database (which affinity rules say should never be on the same hypervisor as the primary, so something else has gone wrong there).
I shall do this when I finish work today.
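For anyone wondering what the failover and affinity rules above are meant to achieve, here is a minimal sketch. It models hosts and VMs as plain in-memory objects and is not any particular hypervisor product's API; the `Host` class, the host and VM names, and the `ANTI_AFFINITY` rule are all hypothetical. It first reports any host running two VMs that a rule says must be kept apart (the two-databases situation), then plans moving every VM off an unhealthy host onto the least-loaded healthy host that keeps the rules satisfied.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    healthy: bool
    vms: set[str] = field(default_factory=set)


# Hypothetical rule: the VMs in each set must never share a hypervisor.
ANTI_AFFINITY = [{"db-primary", "db-secondary"}]


def violations(hosts, rules):
    """Return (host, rule members) for every host running 2+ VMs from one rule."""
    return [
        (h.name, sorted(rule))
        for h in hosts
        for rule in rules
        if len(h.vms & rule) > 1
    ]


def plan_evacuation(hosts, rules):
    """Move every VM off unhealthy hosts and return (vm, from, to) moves.

    Each VM goes to the least-loaded healthy host that does not already hold
    another member of one of its anti-affinity rules. Raises RuntimeError if
    a VM has no legal destination.
    """
    healthy = [h for h in hosts if h.healthy]
    moves = []
    for src in [h for h in hosts if not h.healthy]:
        for vm in sorted(src.vms):
            # Every VM this one must be kept apart from.
            peers = set().union(*(r for r in rules if vm in r)) - {vm}
            candidates = [h for h in healthy if not (h.vms & peers)]
            if not candidates:
                raise RuntimeError(f"no legal target for {vm}")
            dst = min(candidates, key=lambda h: (len(h.vms), h.name))
            dst.vms.add(vm)
            moves.append((vm, src.name, dst.name))
        src.vms.clear()
    return moves


# Illustrative only: one unhealthy host holding both databases.
hosts = [
    Host("hv1", healthy=False, vms={"web-primary", "db-primary", "db-secondary"}),
    Host("hv2", healthy=True, vms={"other-1"}),
    Host("hv3", healthy=True, vms={"other-2", "other-3"}),
]
print(violations(hosts, ANTI_AFFINITY))  # [('hv1', ['db-primary', 'db-secondary'])]
for move in plan_evacuation(hosts, ANTI_AFFINITY):
    print(move)
print(violations(hosts, ANTI_AFFINITY))  # []
```

Placing each VM on the least-loaded legal host is just one simple policy chosen for the sketch; real schedulers also weigh CPU, memory and storage.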
-
Good luck, and thank you in advance :y :y
-
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)
No outage needed. Happy days. Sometimes simple victories give me a real smile :)
-
Does that mean TB your cull is postponed?? :-\ :-\ ;D ;D ;)
-
You need to upgrade from win95 ;) ;D ;D
Well done TB :y
-
Does that mean TB your cull is postponed?? :-\ :-\ ;D ;D ;)
No. I'll just be happier executing it.
-
You need to upgrade from win95 ;) ;D ;D
Well done TB :y
I suspect part of the issue is quite the opposite: going too cutting edge. Already been bitten this month by applying a patch that kills the hypervisor if it's running Intel-based 10GE cards... ...fortunately in that case the hypervisor had no VMs due to the planned patching :).
-
No. I'll just be happier executing it.
Ooooooo! You are so heartless TB! :o :o :o
;D ;D ;D ;D ;D ;D ;)