Cloudprober explained: how we use it

Cloudprober is software used to monitor the availability and performance of various system components. We use it to monitor the loading time of our clients’ websites. It started as a free, open-source application at Google, built to help customers monitor their projects and infrastructure.

The main task of Cloudprober is to run probes, which exercise protocols like HTTP, Ping, UDP, and DNS to verify that systems work as expected from the clients’ point of view. It is even possible to run a specific custom check (e.g. Redis or MySQL) via an external probe. We focus on the HTTP probe.


Probe configuration

Each probe is defined as the combination of these fields (a minimal example follows the list):

  • Type: for example, HTTP, PING, or UDP.
  • Name: each probe must have a unique name.
  • interval_msec: how often the probe runs (in milliseconds).
  • timeout_msec: probe timeout (in milliseconds).
  • Targets: targets to run the probe against.
  • Validator: probe validators.
  • <type>_probe: settings specific to the probe type (for example, http_probe).
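
For illustration, here is a minimal sketch of what an HTTP probe definition can look like in Cloudprober’s configuration format; the probe name, target, and intervals are placeholder values:

probe {
  name: "homepage_https"
  type: HTTP
  targets {
    host_names: "example.com"
  }
  interval_msec: 30000  # run every 30 seconds
  timeout_msec: 5000    # give up after 5 seconds
  http_probe {
    protocol: HTTPS
    relative_url: "/"
  }
}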

Surfacers

Surfacers are built-in mechanisms designed to export data to multiple monitoring systems, and several surfacers can be configured at the same time. Cloudprober is primarily intended to run probes and produce standard, usable metrics based on the results of those probes. Surfacers provide an easy-to-use interface that makes probe data available to systems built to consume monitoring data.

Cloudprober currently supports several surfacer types, including Stackdriver (Google Cloud Monitoring), CloudWatch (AWS Cloud Monitoring), Prometheus, PostgreSQL, and file output.
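
As a sketch, enabling a surfacer is a small block in the same configuration file; the metrics prefix below is just an illustrative value:

surfacer {
  type: PROMETHEUS
  prometheus_surfacer {
    metrics_prefix: "cloudprober_"  # prepended to every exported metric name
  }
}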

Validators

Cloudprober’s validators allow you to run checks on the output of a probe request, if any. More than one validator can be configured, but all of them must succeed for the probe to be marked as successful.

The regex validator is the most common and works for most probe types. When you load a site and expect a particular string in the response, the regex validator lets you express that check as a pattern.
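
For example, a regex validator attached to a probe might look like this minimal sketch (the pattern is illustrative):

validator {
  name: "body_contains_title"
  regex: "<title>.+</title>"  # response body must contain a non-empty title tag
}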

The HTTP validator, which is applicable only to HTTP probes, helps to check response headers (success/failure) and status codes (success/failure).

Lastly, the data integrity validator is mainly used for UDP or PING probes, when we expect the payload to follow a repetitive pattern (for example, 1,2,3,1,2,3,1,2,3).

Target Discovery

As cloud-native software, Cloudprober supports automatic target discovery. This is one of its most important features in today’s dynamic environments: Cloudprober can discover targets from Kubernetes, Google Compute Engine, AWS EC2, file-based discovery, and more. If that’s not enough, it also exposes an internal discovery service, so you can plug other discovery sources from your infrastructure into it.

The core idea behind Cloudprober’s target discovery is to use an independent source of truth to determine which targets should be monitored. You can find more details in Cloudprober’s target discovery documentation.
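
As a simple sketch, file-based discovery lets an external process maintain the target list while Cloudprober re-reads it periodically; the file path and intervals below are placeholders:

probe {
  name: "discovered_sites"
  type: HTTP
  targets {
    file_targets {
      file_path: "/etc/cloudprober/targets.textpb"  # list maintained by an external process
      re_eval_sec: 60                               # re-read the file every 60 seconds
    }
  }
  interval_msec: 30000
  timeout_msec: 5000
}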

Reasons why we chose Cloudprober

In October 2020, we were looking for an external monitoring system to collect uptime and speed statistics for all user websites. Prometheus Blackbox Exporter with Consul-based discovery was considered as one of the main alternatives for monitoring sites. However, Cloudprober looked like a promising lightweight option: it integrated with Stackdriver, which made it easy to store logs, it had no performance restrictions, and the data team could access it with no additional requirements.

Several factors explain why we chose Cloudprober as the preferred alternative:

  • Headless and light. Most of the alternatives we looked at shipped a complete solution for the problem they were trying to solve: web interface, user management, custom charts, a mandatory backup/database solution, and so on. Cloudprober does only one thing: it runs and measures probes. The workflow is designed to be simple and lightweight to keep resource usage low, and the implementation is just a statically linked binary (thanks to Golang).
  • Composable. Useful tools are built into this monitoring software, but additional pieces can be composed around it to do more.
  • Extensible. Cloudprober’s extensible design lets users add features as needed to suit their individual requirements. In addition, extensive documentation and a user community are available.
  • Alive and maintained. Before committing to a technology, it’s wise to check whether its GitHub project is still active. Another factor is how community-oriented it is: issue and PR counts, outside contributors, and general activity. Cloudprober passed all of these checks.
  • Supports all modern ecosystems. Cloudprober, as its name suggests, was designed for cloud-native applications from day one. It can run as a container (e.g. on Kubernetes), works with most public cloud providers for target and metadata discovery, and integrates easily with modern tools such as Prometheus and Grafana. IPv6 is also not a problem for Cloudprober.

Testing whether it works for us

Testing Cloudprober was an ongoing process. To decide whether Cloudprober fit our needs, we checked the fidelity of its metrics and the possible installation/configuration scenarios at our scale.

We modified the Cloudprober code to add some basic concurrency control. Different patterns were tried to keep the load moderate during latency measurement, settling on a concurrency of 5+5 (HTTP + HTTPS). On heavily loaded servers, it took about 30 minutes to probe roughly 3,900 HTTPS sites and about 70 minutes to probe roughly 7,100 HTTP sites.

The main challenge we identified was probe scheduling: Cloudprober waits for the configured check interval and then starts all probes at the same time. We didn’t see this as a big issue for Cloudprober itself, as Consul, Prometheus, and Blackbox Exporter behave the same way, but it can have an impact on the hosting server as a whole.

Cloudprober was later rolled out against about 1.8 million sites, and we found that a GCP instance with 8 cores and 32 GiB of RAM handles it just fine (60% CPU idle).

How we apply Cloudprober

In our setup, HTTP metrics are sent to PostgreSQL (technically Cloud SQL on GCP). Metric filtering is applied, and Cloudprober’s internal metrics are exported to the Prometheus surfacer. To check whether the sites are actually hosted with us, we send a specific request header to each site and expect a specific header in the response.
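
The exact header names are internal, but the shape of such a check would look roughly like this sketch; X-Probe-Marker and X-Hosted-By are hypothetical placeholders:

probe {
  name: "hosted_with_us"
  type: HTTP
  targets {
    host_names: "example.com"
  }
  interval_msec: 30000
  timeout_msec: 5000
  http_probe {
    protocol: HTTPS
    headers {                # hypothetical marker header sent with every request
      name: "X-Probe-Marker"
      value: "cloudprober"
    }
  }
  validator {
    name: "hosting_header_present"
    http_validator {
      success_header {       # hypothetical header we expect back from our own servers
        name: "X-Hosted-By"
        value_regex: ".+"
      }
    }
  }
}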

Metric output (surfacer)

Initially we thought we would use the Prometheus surfacer. However, the full set of collected metrics was around 1 GB in size, which was too much for our Prometheus + M3DB setup. While it is possible to make that work, it isn’t worth it, so we decided to go ahead with PostgreSQL. We also evaluated Stackdriver, but PostgreSQL was a better fit for our tools and purposes.
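
Pointing Cloudprober at PostgreSQL is a matter of configuring the postgres surfacer; this is only a sketch, and the connection string is a placeholder:

surfacer {
  type: POSTGRES
  postgres_surfacer {
    connection_string: "postgresql://cloudprober:password@127.0.0.1/metrics?sslmode=disable"
    metrics_table_name: "metrics"
  }
}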

By default, the Cloudprober PostgreSQL surfacer expects a table like this:

CREATE TABLE metrics (
  time        TIMESTAMP WITH TIME ZONE,
  metric_name text NOT NULL,
  value       DOUBLE PRECISION,
  labels      jsonb,
  PRIMARY KEY (time, metric_name, labels)
);

There are some drawbacks with this type of storage:

  1. All labels are stored in a single jsonb column.
  2. The jsonb type is awkward to index and query.
  3. More data is stored than we need.
  4. All data ends up in one big table, which is hard to maintain.
  5. All values are stored as strings, which takes up a lot of storage space.

At first, all the inserts land in a single table. PostgreSQL (like many other databases) offers a powerful technique for this: table partitioning. Another notable technique is ENUM types, which let you store “string-like” data in a compact way (4 bytes per element). By combining these two with an insert trigger, we solved all the drawbacks mentioned above.

We create two custom data types:

CREATE TYPE http_scheme AS ENUM (
  'http',
  'https'
);

CREATE TYPE metric_names AS ENUM (
  'success', 'timeouts', 'latency', 'resp-code', 'total', 'validation_failure',
  'external_ip', 'goroutines', 'hostname', 'uptime_msec', 'cpu_usage_msec',
  'instance', 'instance_id', 'gc_time_msec', 'mem_stats_sys_bytes',
  'instance_template', 'mallocs', 'frees', 'internal_ip', 'nic_0_ip',
  'project', 'project_id', 'region', 'start_timestamp', 'version',
  'machine_type', 'zone'
);

We create the insert function used by the trigger:

CREATE OR REPLACE FUNCTION insert_fnc() RETURNS trigger AS $$
BEGIN
  IF new.labels->>'dst' IS NULL THEN
    RETURN NULL;
  END IF;
  new.scheme = new.labels->>'scheme';
  new.vhost  = rtrim(new.labels->>'dst', '.');
  new.server = new.labels->>'server';
  IF new.labels ? 'code' THEN
    new.code = new.labels->>'code';
  END IF;
  new.labels = NULL;
  RETURN new;
END;
$$ LANGUAGE 'plpgsql';

And the main table:

CREATE TABLE metrics (
  time        TIMESTAMP WITH TIME ZONE,
  metric_name metric_names NOT NULL,
  scheme      http_scheme NOT NULL,
  vhost       text NOT NULL,
  server      text NOT NULL,
  value       DOUBLE PRECISION,
  labels      jsonb,
  code        smallint
) PARTITION BY RANGE (time);

For partitioning, we can use the following script (creates partitions for the next 28 days and attaches the trigger):

DO $$
DECLARE
  f record;
  i interval := '1 day';
BEGIN
  FOR f IN
    SELECT t AS int_start,
           t + i AS int_end,
           to_char(t, '"y"YYYY"m"MM"d"DD') AS table_name
    FROM generate_series(date_trunc('day', now()), now() + interval '28 days', i) t
  LOOP
    RAISE notice 'table: % (from % to %)', f.table_name, f.int_start, f.int_end;
    EXECUTE 'CREATE TABLE IF NOT EXISTS metrics_' || f.table_name
         || ' PARTITION OF metrics FOR VALUES FROM (''' || f.int_start || ''') TO (''' || f.int_end || ''')';
    EXECUTE 'CREATE TRIGGER metrics_' || f.table_name || '_ins BEFORE INSERT ON metrics_' || f.table_name
         || ' FOR EACH ROW EXECUTE FUNCTION insert_fnc()';
  END LOOP;
END;
$$ LANGUAGE 'plpgsql';
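
With that layout in place, queries against the partitioned table stay straightforward. As a rough sketch (the vhost value is a placeholder), pulling the recent samples for a single site no longer requires digging into jsonb:

SELECT time, metric_name, scheme, value, code
FROM metrics
WHERE vhost = 'example.com'
  AND time >= now() - interval '1 hour'
ORDER BY time DESC
LIMIT 100;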

We are currently in the process of automating host monitoring by taking all the hosts and related information from Consul and using consul-template to generate the Cloudprober configuration…
