Member replication lag metric can report wrong value when nodes go away #697

nick-jones · 2023-08-11T15:16:11Z

Describe the bug

Specifically the mongodb_mongod_replset_member_replication_lag metric, which is only exposed if compatible mode is enabled with v2.

The wrong lag value is basically due to some part of the lag calculation being zero, resulting in the value being the current unix timestamp (the absolute value of the result is used, so either side of the substraction could be zero).

When the lag is being reported incorrectly, the state label is set to (not reachable/healthy) - so overall this is easy to mitigate. Equally it's possible to just not use this metric considering it's only available via compatible mode - the calculations can be done via mongodb_rs_members_optimeDate instead. But either way, it does seem like a potential bug that the lag is falsely reported as high as it is.

To Reproduce

docker-compose.yaml

version: "3.9"
services:
  mongo0:
    image: mongo:6
    container_name: mongo0
    command: --bind_ip_all --replSet=rs
    ports:
      - "27000:27017"
    expose:
      - "27017"
    volumes:
      - /data/db
  mongo0_exporter:
    image: percona/mongodb_exporter:2.37.0
    container_name: mongo0_exporter
    command:
      - --mongodb.uri=mongodb://mongo0:27017
      - --discovering-mode
      - --collect-all
      - --compatible-mode
      - --web.listen-address=:8080
    ports:
      - "18080:8080"
  mongo1:
    image: mongo:6
    container_name: mongo1
    command: --bind_ip_all --replSet=rs
    expose:
      - "27017"
    volumes:
      - /data/db
  mongo1_exporter:
    image: percona/mongodb_exporter:2.37.0
    container_name: mongo1_exporter
    command:
      - --mongodb.uri=mongodb://mongo1:27017
      - --discovering-mode
      - --collect-all
      - --compatible-mode
      - --web.listen-address=:8080
    ports:
      - "18081:8080"
  mongo2:
    image: mongo:6
    container_name: mongo2
    command: --bind_ip_all --replSet=rs
    expose:
      - "27017"
    volumes:
      - /data/db
  mongo2_exporter:
    image: percona/mongodb_exporter:2.37.0
    container_name: mongo2_exporter
    command:
      - --mongodb.uri=mongodb://mongo2:27017
      - --discovering-mode
      - --collect-all
      - --compatible-mode
      - --web.listen-address=:8080
    ports:
      - "18082:8080"
    expose:
      - "18082"

run.sh

#!/usr/bin/env bash

docker-compose rm --volumes --stop --force mongo0 mongo1 mongo2 mongo0_exporter mongo1_exporter mongo2_exporter

echo "Starting containers"
docker-compose up --detach

sleep 5

echo "Initialising replica set"
mongosh 'mongodb://localhost:27000/' --quiet --eval '
  rs.initiate(
     {
        _id: "rs",
        version: 1,
        members: [
           { _id: 0, host : "mongo0:27017" },
           { _id: 1, host : "mongo1:27017" },
           { _id: 2, host : "mongo2:27017" }
        ]
     }
  )
'

sleep 15

check() {
  name=$1
  port=$2
  start=$(date +%s)
  check_for_seconds="20"
  while true; do
    res=$(curl -s "http://localhost:${port}/metrics" | grep '^mongodb_mongod_replset_member_replication_lag')
    hit=0
    while IFS= read -r line; do
      lag=$(($(echo "${line}" | cut -d'}' -f2 | tr -d ' +' | bc)))
      if [ "${lag}" -gt "1000" ]; then
        echo "Invalid lag observed via ${name}:"
        echo " ${line}"
        hit=1
        break
      fi
    done <<< "${res}"
    current=$(date +%s)
    elapsed=$(($current-$start))
    if [ "${hit}" -eq "1" ] || [ "${elapsed}" -gt "${check_for_seconds}" ]; then
      break
    fi
  done
}

check "mongo0_exporter" 18080 &
check1_pid=$!
check "mongo1_exporter" 18081 &
check2_pid=$!
check "mongo2_exporter" 18082 &
check3_pid=$!

echo "Restarting container"
docker-compose --ansi never restart mongo0 2> /dev/null & # could be any member
restart_pid=$!

wait $restart_pid $check1_pid $check2_pid $check3_pid

docker-compose rm --volumes --stop --force mongo0 mongo1 mongo2 mongo0_exporter mongo1_exporter mongo2_exporter

Look for "Invalid lag observed via ..." in the output. You may need to run the script a few times before the condition hits.

Expected behavior

I'm wondering, if lag cannot be calculated due to data availiblity issues, should the metric be omitted perhaps? I'm slightly cautious about the fact people may leverage the state label to keep an eye on health, though I suspect there are better metrics for that. If there is a desire to still expose the metric even when lag cannot be calculated, then I'm unsure - it could just be left as it is.

Logs

Output from running the above script:

./run.sh
No stopped containers
Starting containers
[+] Running 6/6
 ⠿ Container mongo0           Started                                                                                                                                                                      1.0s
 ⠿ Container mongo1_exporter  Started                                                                                                                                                                      1.3s
 ⠿ Container mongo2_exporter  Started                                                                                                                                                                      1.1s
 ⠿ Container mongo1           Started                                                                                                                                                                      1.3s
 ⠿ Container mongo0_exporter  Started                                                                                                                                                                      1.0s
 ⠿ Container mongo2           Started                                                                                                                                                                      1.0s
Initialising replica set
{ ok: 1 }
Restarting 
Invalid lag observed via mongo2_exporter:
 mongodb_mongod_replset_member_replication_lag{name="mongo0:27017",set="rs",state="(not reachable/healthy)"} 1.69176624e+09
[+] Running 6/6
 ⠿ Container mongo0           Stopped                                                                                                                                                                     10.8s
 ⠿ Container mongo1_exporter  Stopped                                                                                                                                                                      0.7s
 ⠿ Container mongo0_exporter  Stopped                                                                                                                                                                      0.5s
 ⠿ Container mongo2_exporter  Stopped                                                                                                                                                                      0.7s
 ⠿ Container mongo2           Stopped                                                                                                                                                                     10.7s
 ⠿ Container mongo1           Stopped                                                                                                                                                                     10.9s
Going to remove mongo1, mongo2, mongo2_exporter, mongo0, mongo0_exporter, mongo1_exporter
[+] Running 6/0
 ⠿ Container mongo1_exporter  Removed                                                                                                                                                                      0.0s
 ⠿ Container mongo1           Removed                                                                                                                                                                      0.0s
 ⠿ Container mongo0           Removed                                                                                                                                                                      0.0s
 ⠿ Container mongo0_exporter  Removed                                                                                                                                                                      0.0s
 ⠿ Container mongo2           Removed                                                                                                                                                                      0.0s
 ⠿ Container mongo2_exporter  Removed

1.69176624e+09 = 1691766240 = 2023-08-11T15:04:00Z

Environment

OS: OSX
environment: Docker
MongoDB version: 6

The text was updated successfully, but these errors were encountered:

nick-jones added bug community triage labels Aug 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Member replication lag metric can report wrong value when nodes go away #697

Member replication lag metric can report wrong value when nodes go away #697

nick-jones commented Aug 11, 2023 •

edited

Member replication lag metric can report wrong value when nodes go away #697

Member replication lag metric can report wrong value when nodes go away #697

Comments

nick-jones commented Aug 11, 2023 • edited

nick-jones commented Aug 11, 2023 •

edited