What is BGP? The Reason Behind Facebook’s Outage
By Nick Anderson
Yesterday, at around 8:30 AM Pacific Time, social media users noticed that Facebook was acting strangely. The platform was inaccessible through the app, and visiting facebook.com returned an error. It wasn’t just Facebook: its sibling platforms, Instagram and WhatsApp, also stopped working.
Speculation was rife. Some thought that Facebook Inc. had suffered a massive cyberattack that took its systems offline. But as the hours went by, things became clearer. It turned out the cause of the outage was technical, something that occurred on Facebook’s end and made Facebook disappear from the internet for several hours. Meet BGP: the reason behind Facebook’s outage.
Unfamiliar with the term? Allow us to explain.
What is BGP?
BGP stands for Border Gateway Protocol, and it’s the backbone that makes the internet as we know it work.
The internet is literally a network of computers sprawling across continents, interconnected through various standardized protocols. It is made of networks that communicate with other networks to form a web. If a computer on Network A wanted to communicate with a computer on Network C, its traffic might traverse Network B, which is connected to both. It’s a cooperative arrangement that makes the internet work.
One of the key pillars of the internet is BGP. It lets the routers that connect various networks, known as Autonomous Systems (AS), exchange routing information. These routers keep an up-to-date list of routes so that data can take an optimal path to its destination. One network can let another network know of its presence and of the addresses that can be reached through it.
Every autonomous system has an Autonomous System Number (ASN) that identifies it to other networks, and a routing policy that includes the list of IP address ranges (prefixes) it controls.
BGP is the protocol that lets one AS share its routes with another AS. Routers within an AS refer to their routing tables to determine the best route for data to reach its destination. Consider five interconnected ASes: A, B, C, D, and E. If data from A needs to reach C, the BGP routers will, all else being equal, prefer the shorter path A > B > C over A > B > D > C.
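The A-to-C example can be sketched in a few lines of Python. This is only an illustration: the topology in `LINKS` is an assumption (the article names the five networks but not every link between them), and the function name is hypothetical. Real BGP routers do not search a global map like this; they compare the paths their neighbors advertise and weigh several attributes, of which path length is just one.

```python
from collections import deque

# Hypothetical topology: which networks are directly connected to which.
LINKS = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def shortest_as_path(links, source, destination):
    """Return the path with the fewest hops, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == destination:
            return path
        for neighbor in links.get(current, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(" > ".join(shortest_as_path(LINKS, "A", "C")))  # A > B > C
```

The breadth-first search explores paths in order of length, so the two-hop route through B is found before the longer detour through D.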
The routing tables are updated whenever an AS changes the routes it advertises.
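To make that concrete, here is a minimal sketch of a routing table that reacts to announcements and withdrawals. The `RoutingTable` class and its methods are hypothetical, and the prefix and AS numbers come from ranges reserved for documentation; a real router tracks far more state than this.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps each prefix to the AS paths known for it."""

    def __init__(self):
        self.routes = {}  # ip_network -> list of AS-path tuples

    def announce(self, prefix, as_path):
        network = ipaddress.ip_network(prefix)
        paths = self.routes.setdefault(network, [])
        if as_path not in paths:
            paths.append(as_path)

    def withdraw(self, prefix, as_path):
        network = ipaddress.ip_network(prefix)
        paths = self.routes.get(network, [])
        if as_path in paths:
            paths.remove(as_path)
        if not paths:
            # No path left: the prefix is no longer reachable.
            self.routes.pop(network, None)

    def lookup(self, address):
        """Return the shortest AS path for the most specific matching prefix."""
        target = ipaddress.ip_address(address)
        matches = [net for net in self.routes if target in net]
        if not matches:
            return None
        most_specific = max(matches, key=lambda net: net.prefixlen)
        return min(self.routes[most_specific], key=len)


table = RoutingTable()
table.announce("192.0.2.0/24", (64500, 64501))
print(table.lookup("192.0.2.10"))  # (64500, 64501)

table.withdraw("192.0.2.0/24", (64500, 64501))
print(table.lookup("192.0.2.10"))  # None
```

Once the only path to a prefix is withdrawn, lookups for addresses in that prefix return nothing, which is roughly what the rest of the internet saw when Facebook's routes disappeared.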
What Happened During the Facebook Outage?
Facebook made some internal changes, and the new configuration it pushed affected its DNS services. This was supported by a Reddit user who claimed to work at Facebook and to be part of the team of engineers responding to the outage. The post, which was later deleted, said that the outage was “very likely due to a configuration change that went into effect shortly before the outages happened.”
As an AS, Facebook stopped advertising its routes to the rest of the internet. Any attempt to reach Facebook resulted in a “Cannot find website” error because DNS resolvers could not obtain the IP address for facebook.com. And while Instagram’s and WhatsApp’s routing was intact, the DNS outage kept them inaccessible. That’s not surprising, since Facebook has gradually brought its platforms together over the years.
Cloudflare also confirmed that its DNS resolver, 1.1.1.1, was unable to return Facebook’s IP address.
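From a client's point of view, a failure like this shows up as a name lookup that returns nothing. The sketch below uses Python's standard `socket.getaddrinfo`; the `resolve` wrapper and the `failing_lookup` stub are hypothetical helpers, and the stub only simulates what a resolver error looks like rather than reproducing the actual outage.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if the lookup fails."""
    try:
        results = lookup(hostname, 443)
    except socket.gaierror:
        # The resolver could not produce an address for this name.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first item of sockaddr.
    return sorted({entry[4][0] for entry in results})


def failing_lookup(hostname, port):
    """Stand-in for a resolver that cannot find the name (simulated)."""
    raise socket.gaierror("simulated resolver failure")


addresses = resolve("facebook.com", lookup=failing_lookup)
print(addresses or "Cannot find website")  # Cannot find website
```

Calling `resolve("facebook.com")` without the stub performs a real lookup through the system resolver and, on a normal day, returns one or more addresses.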
The Facebook engineer on Reddit also explained that fixes needed to be performed physically at the data center. Engineers couldn’t do it remotely via a web interface because the data centers could no longer be reached from external networks. It was also reported that Facebook employees checking into the building could not log in due to the outage. Some employees resorted to using platforms like Outlook and Discord for communication, as Facebook’s internal communications platform, “Workplace,” was down.
In a follow-up post, Facebook explained that a bug in its audit tool, which is built to stop bad commands from running, allowed a faulty command to go through. As a result, the connection between Facebook’s DNS servers and its data centers was severed. And because Facebook’s design stops DNS servers from sending BGP advertisements to the internet when they cannot speak to the data centers, all of Facebook’s properties became inaccessible. The only remedy was to send a team of engineers to apply the fix on-site.
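The safety behavior described here, where a DNS server stops advertising itself once it loses contact with the data centers, can be sketched as a simple decision rule. Everything in this example is an assumption made for illustration: the function, the node names, and the health signal are hypothetical and are not taken from Facebook's systems.

```python
def next_bgp_action(is_advertising, reachable_data_centers):
    """Decide what a DNS node should do with its BGP advertisement."""
    healthy = len(reachable_data_centers) > 0
    if healthy and not is_advertising:
        return "announce"
    if not healthy and is_advertising:
        return "withdraw"
    return "no change"


# Hypothetical DNS nodes, all advertising before the faulty command ran.
nodes = {"dns-1": True, "dns-2": True}

# After the backbone went down, no node could reach any data center.
actions = {name: next_bgp_action(adv, []) for name, adv in nodes.items()}
print(actions)  # {'dns-1': 'withdraw', 'dns-2': 'withdraw'}
```

A rule like this is sensible when a single site fails, since traffic moves to healthy nodes. When every node loses its connection at once, they all withdraw together and the whole service drops off the internet.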
Is Facebook Back Online?
Once Facebook’s engineers reached the data center and began fixing the issue, the platform started coming back online. Approximately six hours after the outage began, the social media platforms slowly became accessible again, with full service taking a bit longer to return.
As of right now, Facebook, Instagram, and WhatsApp are fully functional and accessible.
The whole outage is a reminder of how deeply Facebook is integrated into the lives of many. Facebook Inc. owns the two biggest social media platforms as well as the top chat platform. Small businesses rely on these platforms to make everyday sales.
Social media platforms weren’t the only casualties: Facebook-owned Oculus was also affected. Its VR headsets stopped working because they could not log in to Facebook’s platform.
Facebook apologized for the outage and confirmed that customer data was not compromised.