Incidents | Boomerang
Incidents reported on the status page for Boomerang
https://status.bmrg.app/

Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 13:02:00 -0000

## Summary

On 29 November 2025 at approximately 1:00 pm AEDT, our monitoring and synthetic health checks detected a complete outage affecting the Boomerang dashboard and app. All customers were impacted: users could not access the website, change settings, or use Boomerang within their communities. We currently serve over 1,500 communities and more than 400,000 users, over 30,000 of whom have actively interacted with Boomerang, so the incident had broad reach. During the outage we saw a spike in support demand and expect some financial impact, although there was no data loss, data corruption, or duplication. Services were restored around 12:00 am AEDT on 30 November, resulting in approximately eleven hours of downtime.

## Timeline and Response

Our engineers engaged as soon as monitoring alerted, and standard remediation steps were attempted, including restarting the affected virtual private server (VPS). During these efforts we discovered that we could neither SSH into the machine nor make changes via the provider’s web panel: both had been locked by our upstream cloud host. All Boomerang services were running on this machine, so the lockout effectively removed our direct operational control.

Our upstream provider subsequently advised that a scheduled, provider-managed server-level backup, configured to run daily during off-peak hours, had encountered a fault. Rather than completing or cleanly failing, the backup entered a stalled state that locked the machine and its associated control functions. This backup process is operated by the provider and was not triggered by any deployment, configuration change, or operational action on our side. At around midnight the provider cleared the fault and the machine returned to a healthy state; they did not proactively notify us of the final fix, and we are still awaiting their detailed root cause analysis.

## Root Cause and Contributing Factors

While the immediate technical cause sat with our upstream provider, this incident exposed several clear weaknesses in our own architecture and operations. Boomerang was effectively single-homed on one high-performance machine with no practical failover to an alternative provider or region. Environment variables, configuration, and firewall rules were tightly coupled to that machine and its IP address, and we did not have an equivalent secondary host standing by.

In theory we could have attempted an emergency migration, but in practice this would have meant re-provisioning infrastructure, re-creating configuration and secrets, and reworking firewall dependencies under time pressure, with a high risk of extended downtime or secondary failures. In that context, we judged that waiting for the provider to resolve the backup fault was the least risky option, even though it prolonged customer impact.
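To make the coupling problem above more concrete, the sketch below shows one way firewall allow-rules could be derived from stable DNS names instead of hard-coded addresses, so that moving a dependency to a new host would only require a DNS update and a rule refresh. This is a hypothetical illustration, not a description of our current tooling: the hostnames, ports, and use of ufw are all assumptions.

```python
"""Hypothetical sketch: derive firewall allow-rules from DNS names, not fixed IPs.

The hostnames, ports, and use of ufw below are illustrative assumptions,
not a description of Boomerang's actual infrastructure.
"""
import socket
import subprocess

# Dependencies to allow through the host firewall, keyed by a stable DNS name
# rather than a hard-coded IP address.
ALLOWED_SOURCES = {
    "db.example.internal": 5432,          # placeholder database host
    "monitoring.example.internal": 9100,  # placeholder metrics scraper
}

def current_ips(hostname: str) -> set[str]:
    """Resolve a hostname to its current set of IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}

for hostname, port in ALLOWED_SOURCES.items():
    for ip in current_ips(hostname):
        # Apply (or refresh) the rule; ufw skips rules that already exist,
        # so re-running after a DNS change simply adds the new address.
        subprocess.run(
            ["ufw", "allow", "from", ip, "to", "any", "port", str(port), "proto", "tcp"],
            check=True,
        )
```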
Operationally, we also saw that SSH access was not broadly and appropriately distributed among engineers, and we did not have a formal incident command structure in place; the response relied on ad-hoc coordination rather than a defined playbook. From a communication standpoint, we used Discord announcements and our public status page, but updates lacked an ETA and provided limited technical detail, which we recognise was frustrating for affected communities.

## Customer Impact

All customers were affected during the incident:

* The dashboard and website were unavailable.
* Discord App commands could not be executed.
* Customers were unable to update settings or manage Boomerang for their communities.
* There was a spike in support contact, and likely financial impact, although we did not observe any data loss or corruption.

## Corrective and Preventative Actions

Since the incident, we have begun strengthening our resilience and recovery posture. Key actions include:

* Increasing both the frequency and retention of customer database backups, independent of provider-managed machine snapshots, with a target recovery point of approximately fifteen minutes to enable faster migration to a different provider if required (a minimal sketch of this approach follows the Conclusion below).
* Designing a fallback architecture that reduces dependence on a single hosting provider, including the ability to stand up services on an alternative platform more rapidly and reduced IP-specific coupling in firewall and network configuration.
* Formalising an incident response protocol, including clearer roles and improved access management so the right engineers have the permissions they need during an outage.
* Reviewing and improving our customer communication approach for major incidents so that, even when a precise ETA is not available, we can be more transparent about what has failed, what we know, and which options we are actively considering.

## Conclusion

This outage left all Boomerang communities without service for an unacceptable length of time, and our dependency on a single provider without adequate redundancy was a key contributing factor. While the immediate trigger was a provider-side backup failure, it is our responsibility to design and operate Boomerang so that a single external fault does not take the entire platform offline. The corrective and preventative actions above are intended to materially reduce the likelihood and impact of similar incidents in future, and we will continue to refine our approach once we receive the provider's final, detailed report.
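As referenced in the first corrective action above, the sketch below illustrates one way a roughly fifteen-minute recovery point could be achieved independently of provider-managed machine snapshots. It is a minimal, hypothetical example rather than our production tooling: it assumes a PostgreSQL database and an S3-compatible bucket hosted with a second provider, and every name and environment variable shown is a placeholder.

```python
"""Hypothetical sketch: periodic logical backups shipped to off-provider storage.

Assumes PostgreSQL (pg_dump available on PATH) and an S3-compatible bucket
hosted with a second provider. All names, environment variables, and paths
are placeholders, not Boomerang's actual configuration.
"""
import os
import subprocess
import time
from datetime import datetime, timezone

import boto3  # any S3-compatible object store can be targeted via endpoint_url

INTERVAL_SECONDS = 15 * 60  # roughly a fifteen-minute recovery point objective
DATABASE_URL = os.environ["DATABASE_URL"]                       # placeholder
BUCKET = os.environ.get("BACKUP_BUCKET", "example-db-backups")  # placeholder

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["BACKUP_S3_ENDPOINT"],        # second provider's endpoint
    aws_access_key_id=os.environ["BACKUP_S3_KEY"],
    aws_secret_access_key=os.environ["BACKUP_S3_SECRET"],
)

while True:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/db-{stamp}.dump"

    # Take a logical dump in pg_dump's compressed custom format.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={dump_path}", DATABASE_URL],
        check=True,
    )

    # Ship the dump to object storage that does not depend on the primary host.
    s3.upload_file(dump_path, BUCKET, f"postgres/{stamp}.dump")
    os.remove(dump_path)

    time.sleep(INTERVAL_SECONDS)
```

Because the dumps land in object storage with no dependency on the primary host or its provider, they would remain available for restoration elsewhere even if the primary machine is locked, as it was during this incident.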
Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 13:00:00 -0000

The issue has been fully resolved and all services are operating normally.
A post-mortem outlining the cause and planned preventative measures will be published within the next four business days.

Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 05:00:00 -0000

The incident is being treated as a priority by the upstream provider. No further ETA has been provided at this stage.

Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 03:00:00 -0000

Our upstream provider has escalated the issue to its engineering team, advising that the behaviour is abnormal and requires deeper analysis.

Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 02:30:00 -0000

The fault has been isolated and referred to our upstream provider’s support team for remediation.
Service Disruption
https://status.bmrg.app/incident/776194
Sat, 29 Nov 2025 02:00:00 -0000

An outage affecting the dashboard and Discord App has been detected. Initial investigation is under way.

Cloudflare Outage
https://status.bmrg.app/incident/768847
Tue, 18 Nov 2025 19:28:00 -0000

Our upstream provider, Cloudflare, has resolved the incident. All previously affected services have been restored to normal operation.

Cloudflare Outage
https://status.bmrg.app/incident/768847
Tue, 18 Nov 2025 11:48:00 -0000

You can track the upstream incident here: https://www.cloudflarestatus.com/incidents/8gmgl950y3h7.

Cloudflare Outage
https://status.bmrg.app/incident/768847
Tue, 18 Nov 2025 11:46:00 -0000

Our upstream provider, Cloudflare, is experiencing a widespread outage. The duration of this disruption remains uncertain. Users may encounter intermittent issues accessing the dashboard. Our Discord App is unaffected and will continue to operate normally.