Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with Query Analytics (QAN) in PMM. #2517

Open
1 task done
artemsafiyulin opened this issue Oct 3, 2023 · 10 comments
Open
1 task done

Problem with Query Analytics (QAN) in PMM. #2517

artemsafiyulin opened this issue Oct 3, 2023 · 10 comments
Assignees
Labels
bug Bug report

Comments

@artemsafiyulin
Copy link

Description

Hello! I use PMM (2.37.0) installed with docker for monitoring and backup mongodb (6.0.6).
When I open tab with query analytics I sometimes get error undefined is not an object (evaluating 'e.active')
Снимок экрана 2023-10-03 в 15 35 35

After few page refreshes I get normal QAN window, but with one important problem for me.
When QAN page loaded correctly - I don't see queries from my database (only default databases).
Снимок экрана 2023-10-03 в 15 36 14

But if I change time period my database will finded in filters.
After finding this problem I go to the PMM server and open container with pmm, found in this container clickhouse with database pmm and table metrics. And I found that in this table for all database contains last metrics for today (03.10), but for 'my_database' last metrics contain for 28.09.

58f15d02fc35 :) select period_start from pmm.metrics ORDER BY period_start desc limit 1;

SELECT period_start
FROM pmm.metrics
ORDER BY period_start DESC
LIMIT 1

Query id: 18309b89-ed09-4dbc-a419-332c1c27862a

┌────────period_start─┐
│ 2023-10-03 09:45:00 │
└─────────────────────┘

1 rows in set. Elapsed: 0.019 sec. Processed 2.93 million rows, 11.71 MB (156.44 million rows/s., 625.76 MB/s.) 

58f15d02fc35 :) select period_start from pmm.metrics where database = 'my_database' ORDER BY period_start desc limit 1;

SELECT period_start
FROM pmm.metrics
WHERE database = 'my_database'
ORDER BY period_start DESC
LIMIT 1

Query id: b4388425-337b-4af2-8291-ef306bdcc8e1

┌────────period_start─┐
│ 2023-09-28 01:03:00 │
└─────────────────────┘

1 rows in set. Elapsed: 0.026 sec. Processed 2.93 million rows, 7.38 MB (114.78 million rows/s., 289.21 MB/s.) 

58f15d02fc35 :)

Please help me found problem why metrics collecting for default databases, but not collect for my database.

Expected Results

I expect that PMM will collect metrics from all databases (including created myself databases) all time.

Actual Results

Right now PMM collect metrics from my database not all time (only random time ranges)

Version

PMM server and client version 2.37.0

Steps to reproduce

No response

Relevant logs

No response

Code of Conduct

  • I agree to follow Percona Community Code of Conduct
@BupycHuk
Copy link
Member

BupycHuk commented Oct 3, 2023

Hello @artemsafiyulin, could you run pmm-admin summary on PMM client and share logs from qan-mongodb-profiler

@artemsafiyulin
Copy link
Author

artemsafiyulin commented Oct 4, 2023

Hello @BupycHuk, when I was run pmm-admin summary I get error:

[root@mongo-01 asafiiulin]# pmm-admin summary
[GET /logs.zip][401] Logs default  &{Code:7 Error:Access denied. Message:Access denied.}
summary_mongo-01_2023_10_04_07_23_12.zip created.

But archive generated. I checked QAN_MONGODB_PROFILER_AGENT and find many same warnings:

WARN[2023-09-28T00:45:27.463+00:00] couldn't retrieve data from cursor (CappedPositionLost) Executor error during getMore :: caused by :: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(731735296)  agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin db=my_database type=qan_mongodb_profiler_agent

I was think that my oplog size is too small, checked it and it look ok:

rs0 [direct: primary] test> rs.printReplicationInfo()
actual oplog size
'17769.6298828125 MB'
---
configured oplog size
'17769.6298828125 MB'
---
log length start to end
'501441 secs (139.29 hrs)'
---
oplog first event time
'Thu Sep 28 2023 12:32:32 GMT+0000 (Coordinated Universal Time)'
---
oplog last event time
'Wed Oct 04 2023 07:49:53 GMT+0000 (Coordinated Universal Time)'
---
now
'Wed Oct 04 2023 07:49:53 GMT+0000 (Coordinated Universal Time)'
rs0 [direct: primary] test>

@BupycHuk BupycHuk changed the title Problem with Qery Analytics (QAN) in PMM. Problem with Query Analytics (QAN) in PMM. Oct 4, 2023
@BupycHuk
Copy link
Member

BupycHuk commented Oct 4, 2023

@artemsafiyulin could you please try to restart pmm-agent?

@artemsafiyulin
Copy link
Author

@BupycHuk Yes it help, so as I understand pmm-agent wasn't have time to get information from oplog before them rewrite. And after that pmm-agent wasn't try get new information. After restart I see in pmm clickhouse that we don't have information between time when pmm-agent firstly get this error and time when I restart agent. Can we change this action? For example - if pmm-agent wasn't get some information from oplog - just skip it and continue collect newest information.

@BupycHuk
Copy link
Member

BupycHuk commented Oct 4, 2023

Yes, we'll discuss internally to prioritize it.

@BupycHuk
Copy link
Member

BupycHuk commented Oct 4, 2023

@artemsafiyulin I checked the code and seems like it supposed to work the way you wrote. Could you check logs if there are any repeating error except the one you shared?

@artemsafiyulin
Copy link
Author

@BupycHuk great news! It means that I have problem on my side and it can were fixed.
And after yesterday restart i get this problem again. I investigated systemd logs of pmm-agent on mongodb master server and see next situation:

at 13:51 pmm-agent was restarted:

Oct 04 13:51:18 prod-mongodb-build-01 systemd[1]: Started pmm-agent.

from 13:54 to 14:31 I get multiple same errors that "couldn't add document to aggregator" and one error about "Action terminated":

Oct 04 13:54:27 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-04T13:54:27.420+00:00] couldn't add document to aggregator           agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin type=qan_mongodb_profiler_agent
.
.
.
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-04T14:10:43.693+00:00] Action terminated with error: cannot explain this type of query
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: (Location40413) BSON field 'findAndModify.$db' is a duplicate field
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: github.com/percona/pmm/agent/runner/actions.(*mongodbExplainAction).Run
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]:         /tmp/go/src/github.com/percona/pmm/agent/runner/actions/mongodb_explain_action.go:99
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: github.com/percona/pmm/agent/runner.(*Runner).handleAction.func1
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]:         /tmp/go/src/github.com/percona/pmm/agent/runner/runner.go:238
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: runtime/pprof.Do
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]:         /usr/local/go/src/runtime/pprof/runtime.go:40
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]: runtime.goexit
Oct 04 14:10:43 prod-mongodb-build-01 pmm-agent[2321084]:         /usr/local/go/src/runtime/asm_amd64.s:1594  component=runner id=/action_id/28a169b1-90a7-4989-8ddc-9eb539649a38 type=mongodb-explain
.
.
.
Oct 04 14:31:18 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-04T14:31:18.854+00:00] couldn't add document to aggregator           agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin type=qan_mongodb_profiler_agent

from 14:36 to 14:37 I get next few same errors:

Oct 04 14:36:08 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-04T14:36:08.118+00:00] time="2023-10-04T14:36:08Z" level=warning msg="cannot create metrics for oplog: connection(127.0.0.1:27017[-4]) incomplete read of message header: context canceled"  agentID=/agent_id/dc060cf8-8221-4bbc-a774-4fb649c24873 component=agent-process type=mongodb_exporter
Oct 04 14:36:08 prod-mongodb-build-01 pmm-agent[2321084]: ERRO[2023-10-04T14:36:08.119+00:00] time="2023-10-04T14:36:08Z" level=error msg="Cannot get node type to check if this is a mongos: canceled while checking out a connection from connection pool: context canceled; maxPoolSize: 100, connections in use by cursors: 0, connections in use by transactions: 0, connections in use by other operations: 0" component=diagnosticDataCollector  agentID=/agent_id/dc060cf8-8221-4bbc-a774-4fb649c24873 component=agent-process type=mongodb_exporter

from 14:37 to 23:59:08 also many same errors about "couldn't add document to aggregator" and few errors from previously block "cannot create metrics for oplog" and "Cannot get node type to check if this is a mongos"

from 00:01:06 to 01:11:30 many same errors:

Oct 04 23:59:08 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-04T23:59:08.803+00:00] couldn't add document to aggregator           agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin type=qan_mongodb_profiler_agent
Oct 05 00:01:06 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-05T00:01:06.996+00:00] couldn't retrieve data from cursor (CappedPositionLost) Executor error during getMore :: caused by :: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(801320721)  agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin db=my_database type=qan_mongodb_profiler_agent
.
.
.
Oct 05 01:11:30 prod-mongodb-build-01 pmm-agent[2321084]: WARN[2023-10-05T01:11:30.557+00:00] couldn't retrieve data from cursor (CappedPositionLost) Executor error during getMore :: caused by :: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(801324404)  agentID=/agent_id/84387772-8197-494a-b789-b7c314231b3a component=agent-builtin db=my_database type=qan_mongodb_profiler_agent

After 01:11:30 log is empty. And as I can see in pmm clickhouse on PMM server in this time clickhouse has last metric from my database at same time:

58f15d02fc35 :) select period_start from pmm.metrics where database = 'my_database' ORDER BY period_start desc limit 1;

SELECT period_start
FROM pmm.metrics
WHERE database = 'my_database'
ORDER BY period_start DESC
LIMIT 1

Query id: cbb423e9-217f-4010-91fc-9abd9b7e3d2a

┌────────period_start─┐
│ 2023-10-05 01:11:00 │
└─────────────────────┘

1 rows in set. Elapsed: 0.030 sec. Processed 2.97 million rows, 8.16 MB (98.16 million rows/s., 269.30 MB/s.) 

58f15d02fc35 :)

@artemsafiyulin
Copy link
Author

@BupycHuk you haven't ideas ?

@BupycHuk
Copy link
Member

@artemsafiyulin, no, no ideas why it might happen. Just a question: how many DBs do you have?

@artemsafiyulin
Copy link
Author

@BupycHuk I have 3 mongodb instances (1 master and 2 replicas). In mongodb cluster i have 1 not default database.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

3 participants