Omega Owners Forum

Chat Area => General Discussion Area => Topic started by: TheBoy on 08 October 2017, 12:35:38

Title: OOF Outage
Post by: TheBoy on 08 October 2017, 12:35:38: Unfortunately we suffered a bizarre incident last night that caused everything here to fail... ...including the alerting system, which meant it went undetected until early this morning.

Clearly we are back up and running, but we will have to go down again for an hour or so later to resync the databases - not quite sure when, as I still have a lot to do getting all the infrastructure up and running.

It looks like we had a near simultaneous failure of 2 of our 3 storage solutions, one being a hyperconverged, high availability one, and the other being a standalone SAN solution. Trying to work back through the logs to see if it genuinely was 3 simultaneous storage server failures (unlikely) or 2 simultaneous hypervisor failures (unlikely), or something common that I haven't yet found (likely).

Sorry, as always, for the outage, and the fact you've probably had to talk to wives/partners or go down the pub!
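
A side note on the alerting dying along with everything else: the usual guard is a heartbeat checked from somewhere outside the main infrastructure, so silence itself raises the alarm. A minimal sketch only, assuming a five-minute threshold and made-up timestamps (nothing here is the actual OOF monitoring setup):

```python
from datetime import datetime, timedelta

def heartbeat_is_stale(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return True when the site has been silent for longer than max_silence.

    Meant to run on an independent box, so it still works when the
    monitored infrastructure (and its own alerting) is down.
    """
    return now - last_seen > max_silence

# Hypothetical timestamps, for illustration only.
last_seen = datetime(2017, 1, 1, 22, 0)
print(heartbeat_is_stale(last_seen, datetime(2017, 1, 1, 22, 3)))  # False
print(heartbeat_is_stale(last_seen, datetime(2017, 1, 2, 8, 30)))  # True
```

The point of the design is that the checker needs no response from the failed system: a missing heartbeat is the signal.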
Title: Re: OOF Outage
Post by: Bigron on 08 October 2017, 12:56:13: I missed you, obviously, but considering how long the Forum has been online and how few failures there have been, you do a great job. Thanks. :y 8)

Ron.
Title: Re: OOF Outage
Post by: STEMO on 08 October 2017, 13:17:25: You shouldn’t really be hyperconverging on a Sunday, Jaime, try sitting in the bath till you go all wrinkly, much more relaxing. ;D
Title: Re: OOF Outage
Post by: b4ndit on 08 October 2017, 13:24:53: Quote from: Bigron on 08 October 2017, 12:56:13
I missed you, obviously, but considering how long the Forum has been online and how few failures there have been, you do a great job. Thanks. :y 8)

Ron.
I agree sterling job :y
Title: Re: OOF Outage
Post by: Rods2 on 08 October 2017, 14:32:48: Well done for getting it sorted, :y :y :y not the sort of stress you need, especially at the weekend. :( :( :(
Title: Re: OOF Outage
Post by: VXL V6 on 08 October 2017, 17:36:05: Nice one :y
Title: Re: OOF Outage
Post by: Lizzie Zoom on 08 October 2017, 18:15:14: We owe you again TB! :-* :-* :y :y
Title: Re: OOF Outage
Post by: TheBoy on 08 October 2017, 18:21:25: The outage to resync the databases will likely be early tomorrow morning now, which hopefully will minimise inconvenience :)
Title: Re: OOF Outage
Post by: BazaJT on 08 October 2017, 18:42:20: I wondered where it'd gone. Good job someone knows what they're doing :y Don't know how you do it for the money :D ;D
Title: Re: OOF Outage
Post by: Rods2 on 08 October 2017, 18:58:26: Quote from: BazaJT on 08 October 2017, 18:42:20
I wondered where it'd gone. Good job someone knows what they're doing :y Don't know how you do it for the money :D ;D

I think it's a labour of love, sweat and tears. ::) ::) ::)
Title: Re: OOF Outage
Post by: BazaJT on 08 October 2017, 19:02:01: Mind you it could be TB's cull list that's overloading the system in the first place ;D
Title: Re: OOF Outage
Post by: Shackeng on 08 October 2017, 19:32:02: As soon as it went off line, I said to myself, I bet that's the hyperconvergence again. Great to be proved right. ::) ::) ::)
Title: Re: OOF Outage
Post by: Rods2 on 08 October 2017, 21:42:47: Quote from: Shackeng on 08 October 2017, 19:32:02
As soon as it went off line, I said to myself, I bet that's the hyperconvergence again. Great to be proved right. ::) ::) ::)

It was down last night when I got in just before 11pm and the same this morning at about 8:30am and the thought went through my mind, does he know it's down and then gas bottles and garages. :o :o :o
Title: Re: OOF Outage
Post by: TheBoy on 09 October 2017, 17:30:40: Quote from: Rods2 on 08 October 2017, 21:42:47
does he know it's down and then gas bottles and garages. :o :o :o
That very thing was mentioned at work on one of our conf calls, as one of the guys couldn't make it due to having leccy and gas meters changed. "TB is good with gas and leccy" was the smart alec comment >:(
Title: Re: OOF Outage
Post by: Lazydocker on 09 October 2017, 17:46:10: Quote from: TheBoy on 09 October 2017, 17:30:40
Quote from: Rods2 on 08 October 2017, 21:42:47
does he know it's down and then gas bottles and garages. :o :o :o
That very thing was mentioned at work on one of our conf calls, as one of the guys couldn't make it due to having leccy and gas meters changed. "TB is good with gas and leccy" was the smart alec comment >:(

Well, to be fair :-X :-X ::) :D
Title: Re: OOF Outage
Post by: TheBoy on 09 October 2017, 17:56:31: Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.
Title: Re: OOF Outage
Post by: Lazydocker on 09 October 2017, 18:31:12: Quote from: TheBoy on 09 October 2017, 17:56:31
Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.

Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
Title: Re: OOF Outage
Post by: Bigron on 09 October 2017, 18:39:45: ???
Title: Re: OOF Outage
Post by: TheBoy on 09 October 2017, 18:53:05: Quote from: Lazydocker on 09 October 2017, 18:31:12
Quote from: TheBoy on 09 October 2017, 17:56:31
Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.

Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
Punishment will be swift if you do...
Title: Re: OOF Outage
Post by: biggriffin on 09 October 2017, 20:53:43: Quote from: TheBoy on 09 October 2017, 18:53:05
Quote from: Lazydocker on 09 October 2017, 18:31:12
Quote from: TheBoy on 09 October 2017, 17:56:31
Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.

Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
Punishment will be swift if you do...
.

Mmm newbie docker again
Title: Re: OOF Outage
Post by: Lazydocker on 09 October 2017, 22:34:43: Quote from: biggriffin on 09 October 2017, 20:53:43
Quote from: TheBoy on 09 October 2017, 18:53:05
Quote from: Lazydocker on 09 October 2017, 18:31:12
Quote from: TheBoy on 09 October 2017, 17:56:31
Cough. That was over 4yrs ago. Ancient history. To be forgotten by all.

Of course... I'll never mention it again. Or tree shaped air fresheners :-X ::) :D
Punishment will be swift if you do...
.

Mmm newbie docker again

Probably... Been a while :-X ::) ;D
Title: Re: OOF Outage
Post by: TheBoy on 11 October 2017, 18:08:50: Moving the webserver and primary database off the flash storage onto a traditional HDD RAID 10 storage device.

There should be no outage (it's moving right now), but obviously spinning media is slower than flash, so page load times may increase slightly.

This is to run diags on the flash storage that crashed at the weekend.
Title: Re: OOF Outage
Post by: Shackeng on 11 October 2017, 18:29:45: ^^^
Wot he said. ::) ::) ::)
Title: Re: OOF Outage
Post by: Migv6 le Frog Fan on 11 October 2017, 18:33:34: I would have done exactly the same. :y

;D ;D
Title: Re: OOF Outage
Post by: TheBoy on 16 October 2017, 14:18:50: Blimey, WTF is happening lately >:(

One of the hypervisors has sorta crashed, following a minor health alert this morning. What should happen is that, if the hardware health is suspect, all the load moves over to another, healthy hypervisor. In this case it's just shat its pants. The VMs are still running, but unmanageable.

So, no choice but to power cycle the underlying hardware, which will ungracefully power cycle the VMs running on it.

From an OOF perspective, this includes the primary webserver, the primary database, and the secondary database (which affinity rules say should never be on the same hypervisor as the primary, so something else has gone wrong there).

I shall do this when I finish work today.
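
For anyone wondering what that affinity rule amounts to: an anti-affinity rule just says certain VMs must never share a host, and checking for a breach is simple. A minimal sketch, assuming made-up VM and host names (`db-primary`, `hv1` and so on are hypothetical, and this is not any particular hypervisor's API):

```python
from collections import defaultdict

def anti_affinity_violations(placement, rules):
    """Find broken anti-affinity rules.

    placement: dict mapping VM name -> host name.
    rules: list of sets; the VMs in each set must not share a host.
    Returns a sorted list of (host, [vms]) for every breach.
    """
    violations = []
    for group in rules:
        by_host = defaultdict(list)
        for vm in group:
            if vm in placement:
                by_host[placement[vm]].append(vm)
        for host, vms in by_host.items():
            if len(vms) > 1:
                violations.append((host, sorted(vms)))
    return sorted(violations)

# Hypothetical placement mirroring the situation described above.
placement = {"web-primary": "hv1", "db-primary": "hv1", "db-secondary": "hv1"}
rules = [{"db-primary", "db-secondary"}]
print(anti_affinity_violations(placement, rules))
# [('hv1', ['db-primary', 'db-secondary'])]
```

A check like this only reports the breach; it says nothing about why the scheduler allowed it in the first place.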
Title: Re: OOF Outage
Post by: Entwood on 16 October 2017, 14:26:19: Quote from: TheBoy on 16 October 2017, 14:18:50
Blimey, WTF is happening lately >:(

One of the hypervisors has sorta crashed, following a minor health alert this morning. What should happen is that, if the hardware health is suspect, all the load moves over to another, healthy hypervisor. In this case it's just shat its pants. The VMs are still running, but unmanageable.

So, no choice but to power cycle the underlying hardware, which will ungracefully power cycle the VMs running on it.

From an OOF perspective, this includes the primary webserver, the primary database, and the secondary database (which affinity rules say should never be on the same hypervisor as the primary, so something else has gone wrong there).

I shall do this when I finish work today.

Good luck, and thank you in advance :y :y
Title: Re: OOF Outage
Post by: TheBoy on 16 October 2017, 17:43:29: Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)
Title: Re: OOF Outage
Post by: Lizzie Zoom on 16 October 2017, 17:45:33: Quote from: TheBoy on 16 October 2017, 17:43:29
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)

Does that mean TB your cull is postponed?? :-\ :-\ ;D ;D ;)
Title: Re: OOF Outage
Post by: TD on 16 October 2017, 17:49:46: Quote from: TheBoy on 16 October 2017, 17:43:29
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)

You need to upgrade from win95 ;) ;D ;D

Well done TB :y
Title: Re: OOF Outage
Post by: TheBoy on 16 October 2017, 18:03:03: Quote from: Lizzie Zoom on 16 October 2017, 17:45:33
Quote from: TheBoy on 16 October 2017, 17:43:29
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)

Does that mean TB your cull is postponed?? :-\ :-\ ;D ;D ;)
No. I'll just be happier executing it.
Title: Re: OOF Outage
Post by: TheBoy on 16 October 2017, 18:05:30: Quote from: TD on 16 October 2017, 17:49:46
Quote from: TheBoy on 16 October 2017, 17:43:29
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)

You need to upgrade from win95 ;) ;D ;D

Well done TB :y
I suspect part of the issue is quite the opposite, going too cutting edge. Already been bitten this month by applying a patch that kills the hypervisor if running Intel based 10GE cards... ...fortunately in that case the hypervisor had no VMs due to the planned patching :).
Title: Re: OOF Outage
Post by: Lizzie Zoom on 16 October 2017, 18:23:36: Quote from: TheBoy on 16 October 2017, 18:03:03
Quote from: Lizzie Zoom on 16 October 2017, 17:45:33
Quote from: TheBoy on 16 October 2017, 17:43:29
Panic over - managed to finally get sufficient control of the bastid hypervisor to allow all the VMs to be moved gracefully, and then rebooted the hypervisor :)

No outage needed. Happy days. Sometimes simple victories give me a real smile :)

Does that mean TB your cull is postponed?? :-\ :-\ ;D ;D ;)
No. I'll just be happier executing it.

Ooooooo! You are so heartless TB! :o :o :o

;D ;D ;D ;D ;D ;D ;)