In our previous post, we discussed how we started using network validation before going live. By implementing network validation for our core network, we kept complete control over the network’s performance at scale.
Among other things, that post summarized how we use Suzieq to validate key aspects of the network. This time, we go into more detail about how we use Suzieq to perform network validation, and give a brief account of Batfish from our evaluation.
To give you some numbers, we have nine data centers (DCs) around the world, with more coming soon. Every data center is different in size, ranging from a couple of racks to tens of racks. With automation on top, these differences in size make no noticeable difference, no matter how quickly changes are pushed to production. For the end customer, using services from a company that continuously performs network validation helps build a foundation of trust in the reliability of its products.
Suzieq
Continuously Running Probe Vs. Snapshot
One of the first decisions we had to make, as with any tool used for network validation, was whether to run the probe in snapshot (run-once) mode or continuously.
A continuously running probe has a higher engineering cost, regardless of the tool, but it is the right approach. In this approach, the poller must be running all the time and must be highly available, i.e. the poller must recover from failures…
Running the probe in “snapshot” mode is trivial from a maintainability perspective. It can run independently in any environment, on a local machine (workstation) or in CI/CD, without having to keep any long-running services in mind. In our case, we poll the data once and then run the Python tests. We have deployments spread across many geographic regions (Asia, Europe, the US), with several DCs in each of these regions. We use Jenkins for our CI/CD pipeline. To make sure we run the same tests in all regions, we launch multiple Jenkins agents. If we had used a continuously running probe, the engineering cost of setting it up and maintaining it would have been higher.
An example of running sq-poller (which runs in a loop for each DC or region).
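# DATACENTERS is assumed to be a space-separated list, so it is left
# unquoted here to let the shell split it into one word per DC.
for DC in ${DATACENTERS}; do
    # Build the Suzieq hosts file for this DC from the Ansible inventory.
    python generate_hosts_for_suzieq.py --datacenter "$DC"
    # Poll the devices once and store the raw output.
    ../bin/sq-poller --devices-file "hosts-$DC.yml" \
        --ignore-known-hosts \
        --run-once gather \
        --exclude-services devconfig
    # Process the gathered output.
    ../bin/sq-poller --input-dir ./sqpoller-output
    # Run the tests for this DC and stop at the first failing DC.
    python -m pytest -s -v --no-header "test_$DC.py" || exit 5
done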
You may be wondering if this combination of commands is necessary.
generate_hosts_for_suzieq.py is a wrapper that generates the hosts file from the Ansible inventory, with some extra sugar on top, such as omitting specific hosts and setting ansible_host dynamically (our OOB network is highly available, which means we have multiple ports through which to reach it).
The generated file is similar to:
- namespace: xml
  hosts:
    - url: ssh://root@xml-oob.example.org:2232 keyfile=~/.ssh/id_rsa
    - url: ssh://root@xml-oob.example.org:2223 keyfile=~/.ssh/id_rsa
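A wrapper in the spirit of generate_hosts_for_suzieq.py could look roughly like the sketch below. It is not the actual script: the function name build_suzieq_inventory, the shape of the inventory (plain dictionaries instead of a real Ansible inventory), the skip set, and the per-host port mapping are all assumptions made for illustration.

```python
"""Sketch: render a Suzieq hosts file from a simplified inventory.

All names and data shapes here are hypothetical, not the real script.
"""


def build_suzieq_inventory(namespace, hosts, oob_ports, skip=frozenset(),
                           user="root", keyfile="~/.ssh/id_rsa"):
    """Return the hosts file content as text, one `url` entry per host.

    hosts     -- mapping of inventory hostname -> OOB address (assumed shape)
    oob_ports -- mapping of inventory hostname -> SSH port on the OOB network
    skip      -- hostnames that should not be polled
    """
    lines = [f"- namespace: {namespace}", "  hosts:"]
    for name in sorted(hosts):
        if name in skip:
            continue  # omit hosts we deliberately do not poll
        url = f"ssh://{user}@{hosts[name]}:{oob_ports[name]}"
        lines.append(f"    - url: {url} keyfile={keyfile}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_suzieq_inventory(
        "xml",
        hosts={"sw1": "xml-oob.example.org", "sw2": "xml-oob.example.org"},
        oob_ports={"sw1": 2232, "sw2": 2223},
    ))
```

Keeping the rendering in one small function makes it easy to unit-test the host filtering separately from the polling itself.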
Why chain two sq-poller runs with run-once? There is already an open issue that should solve this. Eventually, you will only need to add a single option, snapshot, and that’s it.
Workflow to validate changes
Each new pull request (PR) creates a fresh, clean Python virtual environment (Pyenv) and starts the tests. The same thing happens when the PR is merged.
The simplified workflow is:
- Make changes.
- Commit changes, create a PR on GitHub.
- Probe and run Pytest tests with Suzieq (/tests/run-tests.sh).
- We require tests to be green before merging is allowed.
- Merge PR.
- Iterate over all our DCs one by one: deploy and re-run the post-deploy Pytests.
Something like:
stage('Run pre-flight production tests') {
    when {
        expression {
            env.BRANCH_NAME != 'master' && !(env.DEPLOY_INFO ==~ /skip-suzieq/)
        }
    }
    parallel {
        stage('EU') {
            steps {
                sh './tests/prepare-tests-env.sh && ./tests/run-tests.sh ${EU_DC}'
            }
        }
        stage('Asia') {
            agent { label 'deploy-sg' }
            // ...
        }
    }
}
Handling false positives
Every test can produce a false positive, that is, the test reports a problem that is not real. This is true whether it is a test to detect a disease or a test to check a change. We assume that false positives will occur, and that’s normal. So how do we handle them, and when?
In our environment, false positives mostly occur due to timeouts, connection errors during the scrape (probe) phase, or when bootstrapping a new device. In such cases, we re-run the tests until they pass (green in the Jenkins pipeline). But if the failure is permanent (probably a real one), the tests stay red. This means that the PR is not merged and the changes are not deployed.
However, when a failure is a confirmed false positive, we use a Git commit tag, Deploy-Info: skip-suzieq, to tell the Jenkins pipeline to skip the tests.
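The two mechanisms above, re-running on transient errors and an explicit skip marker, can be sketched in a few lines. This is only an illustration of the idea, not our pipeline code: the helper names should_skip_suzieq and run_with_retries are hypothetical, and treating TimeoutError and ConnectionError as the transient errors is an assumption.

```python
import re

# Matches a "Deploy-Info:" line in a commit message that mentions skip-suzieq.
SKIP_RE = re.compile(r"^Deploy-Info:.*\bskip-suzieq\b", re.MULTILINE)

# Error types treated as transient in this sketch (an assumption).
TRANSIENT = (TimeoutError, ConnectionError)


def should_skip_suzieq(commit_message):
    """Return True when the commit message carries the skip marker."""
    return SKIP_RE.search(commit_message) is not None


def run_with_retries(run_tests, attempts=3):
    """Call run_tests() until it succeeds or the attempts are used up.

    Only transient errors are retried. Anything else, such as an
    AssertionError from a genuinely failing test, propagates at once,
    so a real failure stays red.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return run_tests()
        except TRANSIENT as err:
            last_error = err  # remember it and try again
    raise last_error
```

Bounding the number of attempts matters: an unbounded retry loop would hide a permanent failure behind what looks like a flaky probe.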
Add new tests
We first run new or changed tests locally before they hit the Git repository. For a test to be useful, it has to be exercised multiple times, unless it’s really trivial. For example:
def bgp_sessions_are_up(self):
    # Test if all BGP sessions are UP
    assert (
        get_sqobject("bgp")().get(namespace=self.namespace, state="NotEstd").empty
    )
But if we are talking about something like:
def uniq_asn_per_fabric(self):
    # Test if we have a unique ASN per fabric
    asns = {}
    for spine in self.spines.keys():
        for asn in (
            get_sqobject("bgp")()
            .get(hostname=[spine], query_str="afi == 'ipv4' and safi == 'unicast'")
            .peerAsn
        ):
            if asn == 65030:
                continue
            if asn not in asns:
                asns[asn] = 1
            else:
                asns[asn] += 1
    assert len(asns) > 0
    for asn in asns:
        # Every peer ASN must be seen exactly once per spine
        assert asns[asn] == len(self.spines.keys())
This should be carefully reviewed. Here we check that each AS number is unique within the DC. ASN 65030 is skipped because it is used by routing-on-the-host instances to advertise anycast services such as DNS, load balancers, etc. This is a snippet of the test output (summary):
test_phx.py::test_bgp_sessions_are_up PASSED
test_phx.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_phx.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_phx.py::test_uniq_asn_per_fabric PASSED
test_phx.py::test_upstream_ports_are_in_correct_state PASSED
test_phx.py::test_evpn_fabric_links PASSED
test_phx.py::test_default_route_ipv4_from_upstreams PASSED
test_phx.py::test_ipv4_host_routes_received_from_hosts PASSED
test_phx.py::test_ipv6_host_routes_received_from_hosts PASSED
test_phx.py::test_evpn_fabric_bgp_sessions PASSED
test_phx.py::test_vlan100_assigned_interfaces PASSED
test_phx.py::test_evpn_fabric_arp PASSED
test_phx.py::test_no_failed_interface PASSED
test_phx.py::test_no_failed_bgp PASSED
test_phx.py::test_no_active_critical_alerts_firing PASSED
test_imm.py::test_bgp_sessions_are_up PASSED
test_imm.py::test_loopback_ipv4_is_uniq_per_device PASSED
test_imm.py::test_loopback_ipv6_is_uniq_per_device PASSED
test_imm.py::test_uniq_asn_per_fabric FAILED
test_imm.py::test_upstream_ports_are_in_correct_state PASSED
test_imm.py::test_default_route_ipv4_from_upstreams PASSED
test_imm.py::test_ipv4_host_routes_received_from_hosts PASSED
test_imm.py::test_ipv6_host_routes_received_from_hosts PASSED
test_imm.py::test_no_failed_bgp PASSED
test_imm.py::test_no_active_critical_alerts_firing PASSED
Here we notice that test_imm.py::test_uniq_asn_per_fabric failed for this DC. Since we use auto-derived ASNs per switch (no static AS numbers in the Ansible inventory), a race condition can produce duplicate ASNs, which is bad.
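To make the counting logic easier to reason about without a live poller, here is a minimal standalone sketch of the same idea. It assumes every leaf peers exactly once with every spine, and it replaces the Suzieq query with a plain dictionary. The names asn_counts and inconsistent_asns, and the input shape, are hypothetical.

```python
from collections import Counter

# Skipped, as in the test above: used by hosts advertising anycast services.
ANYCAST_ASN = 65030


def asn_counts(peer_asns_by_spine, ignored=(ANYCAST_ASN,)):
    """Count how many spine sessions each peer ASN appears on."""
    counts = Counter()
    for peer_asns in peer_asns_by_spine.values():
        counts.update(asn for asn in peer_asns if asn not in ignored)
    return counts


def inconsistent_asns(peer_asns_by_spine):
    """Return the ASNs whose session count differs from the spine count.

    If every leaf peers once with every spine, a unique leaf ASN shows up
    exactly len(spines) times. A duplicated ASN shows up more often, and
    a leaf with a missing session shows up less often.
    """
    expected = len(peer_asns_by_spine)
    counts = asn_counts(peer_asns_by_spine)
    return {asn: seen for asn, seen in counts.items() if seen != expected}
```

Returning the offending ASNs together with their counts, rather than a bare boolean, makes a red test much quicker to diagnose.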
Or something like:
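def loopback_ipv6_is_uniq_per_device(self):
    # Test if we don't have duplicate IPv6 loopback address
    addresses = get_sqobject("address")().unique(
        namespace=[self.namespace],  # assumed: the value was lost in the original
        columns=["ip6AddressList"],  # assumed: the value was lost in the original
        count=True,
        type="loopback",
    )
    # The original filtered `addresses` here; the filter expression was lost.
    assert (addresses.numRows == 1).all()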
This checks that no IPv6 loopback address is duplicated across devices in the same data center. The rule is valid and has proven itself at least a couple of times. Duplicates mostly appear when we bootstrap a new switch and the Ansible host file gets copied and pasted.
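The same check can be expressed without Suzieq, which is handy for testing the rule itself. The sketch below is an illustration only: the function name duplicate_loopbacks and the input shape (pairs of hostname and address) are assumptions. It also normalises addresses with the standard ipaddress module, so two different spellings of the same IPv6 address still count as a duplicate.

```python
from collections import defaultdict
from ipaddress import ip_interface


def duplicate_loopbacks(loopbacks):
    """Map each address used by more than one device to those devices.

    loopbacks -- iterable of (hostname, "address/prefixlen") pairs
                 (an assumed shape, not the Suzieq output format)
    """
    owners = defaultdict(set)
    for hostname, address in loopbacks:
        # Normalise so "2001:db8::1/128" and "2001:0db8:0::1/128" compare equal.
        owners[ip_interface(address).ip].add(hostname)
    return {str(ip): sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}
```

For example, passing ("leaf1", "2001:db8::1/128") and ("leaf2", "2001:0db8:0::1/128") reports both hosts under the single address 2001:db8::1.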
New tests are mainly added after a failure occurs, when something must be done to detect it quickly or mitigate it early in the future. For example, when switching from an L3-only design to an EVPN design, we might be surprised when ARP/ND exhaustion hits the wall, or when L3 routes drop from several thousand to just a few.
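A guard against the “routes drop from several thousand to just a few” scenario can be as small as a ratio check against a known baseline. The function below is a hypothetical sketch, not one of our tests. The name route_count_dropped and the default 50% threshold are assumptions, and the counts would come from whatever source you trust, for example a routes query per namespace.

```python
def route_count_dropped(baseline, current, max_drop=0.5):
    """Return True when `current` fell by more than `max_drop` of `baseline`.

    baseline -- route count from a known-good state (must be positive)
    current  -- route count observed now
    max_drop -- tolerated fractional drop, 0.5 meaning 50% (assumed default)
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline > max_drop
```

A relative threshold travels better across data centers of different sizes than a fixed minimum route count would.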
Batfish
We have already evaluated Batfish twice. The first time was more of an overview and a trial to see what opportunities it offers us. The first impression was something like “What about my configuration?”, because at the time Batfish didn’t support any configuration syntax for FRR. Cumulus Linux and many other large projects use FRR. It has become the de facto best open source routing suite, and that’s why Batfish now also lists FRR as a supported vendor. However, the FRR model needs more work before it can be used in production (at least in our environment).
Later, a month or two ago, we started looking into the product again to see what could actually be done with it. From an operational perspective, it’s a really cool product because it lets the operator build a model of the network by parsing the configuration files. Besides that, you can create snapshots, make some changes, and see how your network behaves. For example, you can shut down a link or a BGP peer and predict the effect before the change goes live.
We also started looking at Batfish as an open source project that we could contribute changes to. We found a couple of behavior models missing for our use cases, and many more are missing. We’re big fans of IPv6, but unfortunately IPv6 isn’t (yet?) well covered in the FRR model in Batfish.
This isn’t the first time we’ve run into missing IPv6 support, and we’re guessing it won’t be the last. We are waiting and hoping that Batfish will get IPv6 support soon.
