Support fetching multiple modules in one scrape #945

Merged: 8 commits, Aug 23, 2023

Contributor

@servak servak commented Aug 14, 2023

I implemented the ability to scrape multiple modules in a single request.
Whether modules should be fetched in parallel depends on the user's requirements, so the number of modules fetched concurrently is configurable. #731

> ./snmp_exporter -h
usage: snmp_exporter [<flags>]

Flags:
  -h, --[no-]help               Show context-sensitive help (also try --help-long and --help-man).
...
      --[no-]dry-run            Only verify configuration is valid and exit.
      --concurrency=1           Specify the number of modules to fetch concurrently <- Added

test

generator.yml

modules:
  sysUpTime:
    walk:
    - sysUpTime
  sysDescr:
    walk:
    - sysDescr

output

curl 'http://localhost:9116/snmp?target=cumulus&module=sysUpTime&module=sysDescr'
# HELP snmp_packet_duration_seconds A histogram of latencies for SNMP packets.
# TYPE snmp_packet_duration_seconds histogram
snmp_packet_duration_seconds_bucket{le="0.0001"} 0
snmp_packet_duration_seconds_bucket{le="0.0002"} 0
snmp_packet_duration_seconds_bucket{le="0.0004"} 0
snmp_packet_duration_seconds_bucket{le="0.0008"} 0
snmp_packet_duration_seconds_bucket{le="0.0016"} 0
snmp_packet_duration_seconds_bucket{le="0.0032"} 0
snmp_packet_duration_seconds_bucket{le="0.0064"} 0
snmp_packet_duration_seconds_bucket{le="0.0128"} 0
snmp_packet_duration_seconds_bucket{le="0.0256"} 0
snmp_packet_duration_seconds_bucket{le="0.0512"} 0
snmp_packet_duration_seconds_bucket{le="0.1024"} 0
snmp_packet_duration_seconds_bucket{le="0.2048"} 0
snmp_packet_duration_seconds_bucket{le="0.4096"} 0
snmp_packet_duration_seconds_bucket{le="0.8192"} 0
snmp_packet_duration_seconds_bucket{le="1.6384"} 0
snmp_packet_duration_seconds_bucket{le="+Inf"} 0
snmp_packet_duration_seconds_sum 0
snmp_packet_duration_seconds_count 0
# HELP snmp_packet_retries_total Number of SNMP packet retries.
# TYPE snmp_packet_retries_total counter
snmp_packet_retries_total 0
# HELP snmp_packets_total Number of SNMP packet sent, including retries.
# TYPE snmp_packets_total counter
snmp_packets_total 0
# HELP snmp_scrape_duration_seconds Total SNMP time scrape took (walk and processing).
# TYPE snmp_scrape_duration_seconds gauge
snmp_scrape_duration_seconds{module="sysDescr"} 0.081744083
snmp_scrape_duration_seconds{module="sysUpTime"} 0.0794535
# HELP snmp_scrape_packets_retried Packets retried for get, bulkget, and walk.
# TYPE snmp_scrape_packets_retried gauge
snmp_scrape_packets_retried{module="sysDescr"} 0
snmp_scrape_packets_retried{module="sysUpTime"} 0
# HELP snmp_scrape_packets_sent Packets sent for get, bulkget, and walk; including retries.
# TYPE snmp_scrape_packets_sent gauge
snmp_scrape_packets_sent{module="sysDescr"} 1
snmp_scrape_packets_sent{module="sysUpTime"} 1
# HELP snmp_scrape_pdus_returned PDUs returned from get, bulkget, and walk.
# TYPE snmp_scrape_pdus_returned gauge
snmp_scrape_pdus_returned{module="sysDescr"} 1
snmp_scrape_pdus_returned{module="sysUpTime"} 1
# HELP snmp_scrape_walk_duration_seconds Time SNMP walk/bulkwalk took.
# TYPE snmp_scrape_walk_duration_seconds gauge
snmp_scrape_walk_duration_seconds{module="sysDescr"} 0.081711667
snmp_scrape_walk_duration_seconds{module="sysUpTime"} 0.079426875
# HELP snmp_unexpected_pdu_type_total Unexpected Go types in a PDU.
# TYPE snmp_unexpected_pdu_type_total counter
snmp_unexpected_pdu_type_total 0
# HELP sysDescr A textual description of the entity - 1.3.6.1.2.1.1.1
# TYPE sysDescr gauge
sysDescr{sysDescr="Cumulus-Linux **********"} 1
# HELP sysUpTime The time (in hundredths of a second) since the network management portion of the system was last re-initialized. - 1.3.6.1.2.1.1.3
# TYPE sysUpTime gauge
sysUpTime 1.387417375e+09

Confirmation of Significant Changes

This change adds a module label to the scrape metrics that snmp_exporter outputs by default. Could you please confirm that this is the right direction?

Member

@SuperQ SuperQ left a comment

Thanks, we'll also need to update the README documentation.

We should make it clear in the docs that this implementation of multi-module handling does not do any de-duplication of walks between different modules.

@servak
Contributor Author

servak commented Aug 15, 2023

Thanks for the review.
Understood. I will add the documentation as well.
By the way, I was also thinking of supporting a comma delimiter, since it would be quick to add. What do you think?
#731 (comment)

@SuperQ
Member

SuperQ commented Aug 16, 2023

I think supporting both multiple module URL params and comma separation would be just fine.

@servak
Contributor Author

servak commented Aug 16, 2023

Thank you for clarifying those points.
I will make some changes to the implementation and test them. Please take another look once I have done so.

@SuperQ
Member

SuperQ commented Aug 16, 2023

Pro tip, git config --global format.signoff true for DCO signoffs.

@candlerb
Contributor

Question: do we actually want the multiple module scrapes to be in parallel? It ought to work of course, but it makes me a bit uncomfortable as I worry about tickling bugs in vendor implementations.

Taking this to an extreme, why doesn't snmp_exporter do all its snmpwalks and snmpgets in parallel for a single module (or does it in fact do this?)

@SuperQ
Member

SuperQ commented Aug 16, 2023

> Question: do we actually want the multiple module scrapes to be in parallel

Probably not most of the time. But the default concurrency of 1 is fine. If users know what they're doing, they can do more.

Really, we'll eventually want this to be a config option. I'll say this ahead of time: it should not be a scrape parameter. That would be very dangerous.

> Taking this to an extreme, why doesn't snmp_exporter do all its snmpwalks and snmpgets in parallel for a single module (or does it in fact do this?)

I think this has been requested previously. I think if we implemented a module walk de-duplication we would maybe implement it at that time. For now, the current behavior of serial SNMP ops is fine.

@servak
Contributor Author

servak commented Aug 17, 2023

If the changes look broadly OK, I will rebase and add the sign-off to all commits.
I will also mark the PR as ready for review at that time.

Member

@SuperQ SuperQ left a comment

Minor nit, otherwise, looks good.

@SuperQ SuperQ requested a review from beorn7 August 17, 2023 14:16
@servak
Contributor Author

servak commented Aug 17, 2023

Here is the log from a test run.
You can see that two modules run in parallel: the heavy if_mib starts first, but the lighter modules complete first.

generator.yml

modules:
  sysUpTime:
    walk:
    - sysUpTime

  sysDescr:
    walk:
    - sysDescr

  # Default IF-MIB interfaces table with ifIndex.
  if_mib:
    walk: [interfaces, ifXTable]
    lookups:
      - source_indexes: [ifIndex]
        lookup: ifAlias
      - source_indexes: [ifIndex]
        # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
        lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
      - source_indexes: [ifIndex]

http request

> curl -s 'localhost:9116/snmp?target=192.168.64.3&module=if_mib,sysDescr,sysUpTime,sysDescr' | head
# HELP ifAdminStatus The desired state of the interface - 1.3.6.1.2.1.2.2.1.7
# TYPE ifAdminStatus gauge
ifAdminStatus{ifAlias="eth0",ifDescr="Intel Corporation 82540EM Gigabit Ethernet Controller",ifIndex="2",ifName="eth0"} 1
ifAdminStatus{ifAlias="lo",ifDescr="lo",ifIndex="1",ifName="lo"} 1

snmp-exporter log

> go run ./ --concurrency=2 --log.level=debug --config.file generator/snmp.yml
ts=2023-08-17T14:15:00.717Z caller=main.go:176 level=info msg="Starting snmp_exporter" version="(version=, branch=, revision=unknown)" concurrency=2
ts=2023-08-17T14:15:00.717Z caller=main.go:177 level=info build_context="(go=go1.20.5, platform=darwin/arm64, user=, date=, tags=unknown)"
ts=2023-08-17T14:15:00.723Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9116
ts=2023-08-17T14:15:00.723Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9116
ts=2023-08-17T14:15:09.516Z caller=collector.go:512 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Starting scrape"
ts=2023-08-17T14:15:09.516Z caller=collector.go:512 level=debug auth=public_v2 target=192.168.64.3 module=sysDescr msg="Starting scrape"
ts=2023-08-17T14:15:09.516Z caller=collector.go:193 level=debug auth=public_v2 target=192.168.64.3 module=sysDescr msg="Getting OIDs" oids=1
ts=2023-08-17T14:15:09.516Z caller=collector.go:227 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.2
ts=2023-08-17T14:15:09.634Z caller=collector.go:203 level=debug auth=public_v2 target=192.168.64.3 module=sysDescr msg="Get of OIDs completed" oids=1 duration_seconds=118.315333ms
ts=2023-08-17T14:15:09.635Z caller=collector.go:516 level=debug auth=public_v2 target=192.168.64.3 module=sysDescr msg="Finished scrape" duration_seconds=0.118678584
ts=2023-08-17T14:15:09.635Z caller=collector.go:512 level=debug auth=public_v2 target=192.168.64.3 module=sysUpTime msg="Starting scrape"
ts=2023-08-17T14:15:09.635Z caller=collector.go:193 level=debug auth=public_v2 target=192.168.64.3 module=sysUpTime msg="Getting OIDs" oids=1
ts=2023-08-17T14:15:09.639Z caller=collector.go:203 level=debug auth=public_v2 target=192.168.64.3 module=sysUpTime msg="Get of OIDs completed" oids=1 duration_seconds=4.183417ms
ts=2023-08-17T14:15:09.640Z caller=collector.go:516 level=debug auth=public_v2 target=192.168.64.3 module=sysUpTime msg="Finished scrape" duration_seconds=0.005313333
ts=2023-08-17T14:15:09.645Z caller=collector.go:241 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.2 duration_seconds=129.05475ms
ts=2023-08-17T14:15:09.645Z caller=collector.go:227 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Walking subtree" oid=1.3.6.1.2.1.31.1.1
ts=2023-08-17T14:15:09.664Z caller=collector.go:241 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Walk of subtree completed" oid=1.3.6.1.2.1.31.1.1 duration_seconds=19.145375ms
ts=2023-08-17T14:15:09.667Z caller=collector.go:516 level=debug auth=public_v2 target=192.168.64.3 module=if_mib msg="Finished scrape" duration_seconds=0.151105834

@SuperQ
Member

SuperQ commented Aug 17, 2023

@candlerb How do you like this compared to your version?

@candlerb
Contributor

It's a bigger change - for example it creates per-module scrape stats - but if that's OK with you then it's OK with me :-)

I can't see any circumstance in which I'd use concurrency>1 (as it can't be controlled on a target-by-target basis), but it doesn't add much complexity to the code, so it's fine as-is.

@SuperQ
Member

SuperQ commented Aug 17, 2023

Previously, I would have objected more strongly to the per-module stats. But with the new separated auths, the number of modules will likely be greatly reduced in real-world installs.

Signed-off-by: Kakuya Ando <fservak@gmail.com>
Signed-off-by: Kakuya Ando <fservak@gmail.com>
Member

@SuperQ SuperQ left a comment

Awesome

Member

@bwplotka bwplotka left a comment

I'm not familiar with the full codebase, but I was asked to review the Go concurrency. One nit, but overall it looks good!

}()
}

go func() {
Member

Nit: I don't think we need this goroutine; it does not give us anything. We could distribute the work from within the "Collect(" scope (this method), close the channel afterwards, and then wait. This is only a readability/complexity nit, though.

Contributor Author

Indeed, the last goroutine seems unnecessary. Could you please confirm that my understanding is correct?

--- a/collector/collector.go
+++ b/collector/collector.go
@@ -527,12 +527,10 @@ func (c Collector) Collect(ch chan<- prometheus.Metric) {
                }()
        }

-       go func() {
-               for _, module := range c.modules {
-                       workerChan <- module
-               }
-               close(workerChan)
-       }()
+       for _, module := range c.modules {
+               workerChan <- module
+       }
+       close(workerChan)
        wg.Wait()

Member

Correct. Do you mind writing a quick unit test for it?

Contributor Author

Thanks.
I think a unit test for Collector is difficult, because Collector does not have an SNMP fetcher interface and instead calls GoSNMP directly in ScrapeTargets.
Please let me know if I have missed something and there is another way to make this easier.

Member

I agree, writing a test for this is a bit difficult. I'm OK without an additional test here.

prometheus.GaugeValue,
time.Since(start).Seconds())
}

// Collect implements Prometheus.Collector.
func (c Collector) Collect(ch chan<- prometheus.Metric) {
Member

I assume there is a unit test for this somewhere?

Signed-off-by: Kakuya Ando <fservak@gmail.com>
@SuperQ SuperQ changed the title from "Implemented a feature to fetch from multiple modules" to "Support fetching multiple modules in one scrape" Aug 23, 2023
@SuperQ SuperQ merged commit f4d7c3f into prometheus:main Aug 23, 2023
6 checks passed
@servak servak deleted the multi-module branch August 23, 2023 12:00
SuperQ added a commit that referenced this pull request Aug 29, 2023
* [CHANGE] Sanitize invalid UTF-8 #968
* [FEATURE] Support fetching multiple modules in one scrape #945
* [FEATURE] Support loading multiple configuration files #970

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ SuperQ mentioned this pull request Aug 29, 2023
stephan-windischmann-sky pushed a commit to stephan-windischmann-sky/snmp_exporter that referenced this pull request Oct 27, 2023
* Implemented a feature to fetch from multiple modules
* Changed snmp_collection_duration_seconds from Summary to Histogram

---------

Signed-off-by: Kakuya Ando <fservak@gmail.com>
Signed-off-by: Stephan Windischmann <windi@Stephans-MacBook-Pro.local>
stephan-windischmann-sky pushed a commit to stephan-windischmann-sky/snmp_exporter that referenced this pull request Oct 27, 2023
* [CHANGE] Sanitize invalid UTF-8 prometheus#968
* [FEATURE] Support fetching multiple modules in one scrape prometheus#945
* [FEATURE] Support loading multiple configuration files prometheus#970

Signed-off-by: SuperQ <superq@gmail.com>
Signed-off-by: Stephan Windischmann <windi@Stephans-MacBook-Pro.local>

4 participants