Status update: 28 Nov, 3PM PST – All systems are now back at full speed. We’ve added extra measures to reduce the risk of this happening again, and we’ll continue monitoring all systems.

As some of you know, we’ve switched our hosting provider from Rackspace to Google. Soon afterwards, we started experiencing occasional issues where people would get logged out of SYNCbits. The initial issue was caused by an automatic upgrade of the SYNCbits systems that made some modules incompatible with each other. That in turn caused the encryption library to stop working properly: it was unable to decrypt data and wrongly told MoneyWiz that the password had changed.

We fixed that issue back in August when it first happened. But then it happened again a month later, and again today. So we wanted to update you and let you know what we found out…

Here is how the SYNCbits infrastructure is built:

Traffic from MoneyWiz goes to CloudFlare.
CloudFlare routes this traffic to a so-called Load Balancer at Google.
The Load Balancer then checks which server is least busy and sends the request to that server (we’re talking about the servers that only serve HTTPS requests, not the actual databases).
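
For the technically curious, the "least busy" routing step can be sketched roughly as below. This is only an illustration of the general idea, not how Google's Load Balancer is actually implemented; the `Server` class, the `active_requests` counter and the function names are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for one HTTPS server behind the load balancer."""
    name: str
    active_requests: int = 0


def pick_least_busy(servers):
    """Return the server currently handling the fewest requests."""
    if not servers:
        raise ValueError("no servers available")
    # min() returns the first server with the lowest count when there is a tie.
    return min(servers, key=lambda s: s.active_requests)


def route(servers):
    """Send one request to the least busy server and return that server's name."""
    server = pick_least_busy(servers)
    server.active_requests += 1
    return server.name
```

Calling `route(servers)` repeatedly spreads requests across the pool, because each routed request makes the chosen server a little busier than before.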

The Load Balancer has another function though: it monitors all servers, and if they get too busy it starts creating new ones to handle the extra traffic. When traffic calms down, it deletes some servers to return to the normal level of usage.
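
As a rough sketch of that scaling behavior, the decision could look something like the function below. The thresholds, the server limits and the name `desired_server_count` are assumptions picked for illustration; they are not the values or the logic Google uses.

```python
def desired_server_count(current, avg_load, min_servers=2, max_servers=10,
                         scale_up_at=0.8, scale_down_at=0.3):
    """Return how many servers should run, given the average load (0.0 to 1.0).

    All thresholds and limits here are hypothetical example values.
    """
    if avg_load > scale_up_at:
        target = current + 1      # servers are too busy: add one
    elif avg_load < scale_down_at:
        target = current - 1      # traffic has calmed down: remove one
    else:
        target = current          # load is in the normal range
    # Never go below the minimum or above the maximum pool size.
    return max(min_servers, min(max_servers, target))
```

The gap between the two thresholds keeps the pool from repeatedly adding and removing servers when the load hovers around a single value.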

The Load Balancer creates a new server from an image of that server. Every time we made changes to the HTTPS servers, we updated the image. It seems, though, that this was not enough, and it’s possibly a bug on Google’s side. Today we found out that even though the image was up to date with the latest changes on our end, every time our servers experienced very high load, the Load Balancer would create new ones using the first image, the one that caused the issue back in August. Google did that even though that image is technically deleted on our end to prevent that particular issue. Based on our investigation so far, it seems that Google cached that first image, and our later updates to the image were never reflected in that cached copy.
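
To make the suspected failure mode concrete, here is a toy model of it. It is purely illustrative: `ImageStore`, `CachingScaler` and `safe_to_serve` are hypothetical names, and we do not know how Google handles images internally. The sketch only shows how a scaler that remembers the first image it saw would keep launching outdated servers, and how a simple version check could catch that before a new server takes traffic.

```python
class ImageStore:
    """Hypothetical place where the current server image version is published."""

    def __init__(self):
        self.latest = None

    def publish(self, version):
        self.latest = version


class CachingScaler:
    """Toy scaler that caches the first image it sees (the suspected behavior)."""

    def __init__(self, store):
        self.store = store
        self._cached = None

    def launch(self):
        """Return the image version a newly created server would run."""
        if self._cached is None:
            self._cached = self.store.latest  # read once, never refreshed
        return self._cached


def safe_to_serve(launched_version, expected_version):
    """Guard: only let a new server take traffic if it runs the expected image."""
    return launched_version == expected_version
```

In this model, publishing a newer image after the first launch changes nothing for the scaler, so a check like `safe_to_serve` is what would flag the mismatch.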

So, basically, the last two outages happened during peak traffic, which led to SYNCbits using an old image with deprecated libraries, and the whole August issue just repeated itself again and again.

Today we completely rebuilt the load balancing infrastructure and set it up so that Google won’t resize it automatically. Instead, we’ll do it manually when alerted of peak traffic, to prevent this from happening again. Based on everything we’ve found so far, we believe this should resolve these issues once and for all. Of course, if they reappear we’ll have to dig deeper, and we’re also considering switching from Google to AWS because of the recent issues we’ve had with them.

The cherry on top today, which caused extra delays, was that after rebuilding the entire system we needed to point CloudFlare to the new system. Apparently during that time (and still as of writing this) CloudFlare had an issue on their end too, which results in DNS changes not being propagated. So even though we’d built a new system and told CloudFlare to send traffic to it, they kept sending it to the old one that we had deleted.

 

I must apologize for all the inconvenience you must have experienced from being logged out. I also want to emphasize that none of this poses a security risk; it’s simply a system misbehavior.

Thank you for your continued support and patience! As of now, SYNCbits is operational but slow, as we’ve assigned the old system’s IP to one particular server. We’re still waiting for CloudFlare to resolve their issue so we can properly forward all traffic to the new Load Balancer. Until then SYNCbits will work, just slower than usual.

 


9 Comments

  • Is my data safe? No other question?

  • Kieron Mitchell
    November 29, 2018 12:49 am

    It is rare for a company to take the time and trouble to give such a detailed description of the problem/s that have caused an outage. Working in the technology sector, all that has been written makes pretty good sense to me and is certainly not a unique occurrence. Well done to all concerned for the investigations and fixes and an additional well done for writing this piece.

  • Great RCA. Thank you.

  • Good effort MoneyWiz team – your quick response, transparency and technical update are top notch!

  • Just a suggestion. Can you make it in such a way that when there is an issue with SyncBits, our apps will use the local information stored in our devices? It is such a hassle every time we get logged out and can’t log back in to access our data. The current setup means that we cannot use MoneyWiz practically forever, and that when you later on decide to shut down SyncBits (I really hope not), we will lose everything. Make it so that we can truly continue to use this product offline.

    • Nelio Possobom
      April 3, 2019 6:28 pm

      I still have no access, there are problems with SYNCbits. Have I lost everything? What can I do to recover it? I have been a user for several years.

  • SYNCbits crashed again today. It is so unstable and inconvenient.

