Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

invalid message format #339

Closed
phips28 opened this issue Jul 23, 2022 · 17 comments · Fixed by #332
Closed

invalid message format #339

phips28 opened this issue Jul 23, 2022 · 17 comments · Fixed by #332

Comments

@phips28
Copy link
Contributor

phips28 commented Jul 23, 2022

Hi,

I am receiving odd pg errors after I upgraded to version 7.4 (from 6.x) "invalid message format" - and I have no ideas how to debug this, or where this is coming from.

It only happens rarely, and overall the job processing works.

I run 20 node instances with the pg-boss worker on each instance:

this.boss = new PgBoss({
	database: process.env.DATABASE,
	user: process.env.DATABASE_USER,
	password: process.env.DATABASE_PASSWORD,
	host: process.env.DATABASE_HOST,
	port: process.env.DATABASE_PORT,
	application_name: `xyz_${packageJson.version}_pgboss`,
	noSupervisor: !isWorker,
	noScheduling: !isWorker,
	max: 5,
	uuid: 'v4',
	newJobCheckInterval: 2000,
	monitorStateIntervalSeconds: 5,
	expireCheckIntervalSeconds: 10,
	archiveCompletedAfterSeconds: 300,
	deleteAfterDays: 90, // Delete archived jobs
});

Full error:

PgBossWorker: on error {
   length: 81,
   name: 'error',
   severity: 'ERROR',
   code: '08P01',
   detail: undefined,
   hint: undefined,
   position: undefined,
   internalPosition: undefined,
   internalQuery: undefined,
   where: undefined,
   schema: undefined,
   table: undefined,
   column: undefined,
   dataType: undefined,
   constraint: undefined,
   file: 'pqformat.c',
   line: '640',
   routine: 'pq_getmsgend',
   message: 'invalid message format (Queue: __pgboss__monitor-states, Worker: 1aaed893-107a-4cdf-8509-e8b2f1da09c8)',
   stack: 'error: invalid message format (Queue: __pgboss__monitor-states, Worker: 1aaed893-107a-4cdf-8509-e8b2f1da09c8)\n' +
     '    at Parser.parseErrorMessage (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:287:98)\n' +
     '    at Parser.handlePacket (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:126:29)\n' +
     '    at Parser.parse (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:39:38)\n' +
     '    at Socket.<anonymous> (/var/www/cointrackr/node_modules/pg-protocol/dist/index.js:11:42)\n' +
     '    at Socket.emit (node:events:527:28)\n' +
     '    at Socket.emit (node:domain:475:12)\n' +
     '    at addChunk (node:internal/streams/readable:315:12)\n' +
     '    at readableAddChunk (node:internal/streams/readable:289:9)\n' +
     '    at Socket.Readable.push (node:internal/streams/readable:228:10)\n' +
     '    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)',
   queue: '__pgboss__monitor-states',
   worker: '1aaed893-107a-4cdf-8509-e8b2f1da09c8'
 }

Also seeing this for other queues like: __pgboss__cron, and my own queues fetch_xyz.

Do you know where to look for to fix this?

@timgit
Copy link
Owner

timgit commented Jul 23, 2022

This error is happening at the driver level and not inside pg-boss. All references I've found searching the pg driver and other similar errors mention unicode characters. Does one of your queues have any backslash characters? If it's concurrency-related, is a connection pooler such as pgbouncer being used in front of postgres? I'd be interested in finding out if reducing your instance count makes this go away, for example.

@phips28
Copy link
Contributor Author

phips28 commented Jul 23, 2022

I only use queue names with underscore's.
There is no pgbouncer in place.
I could not reproduce this on my local machine with 1 and 2 instances. And I cant scale down our production workers to test this.

This happened just after upgrading from 6 to 7 of pg-boss, nothing else changed on our prod server nor postgres db. Thats why I wondered if there might be a code change that causes this. But couldnt find out.

@timgit
Copy link
Owner

timgit commented Jul 23, 2022

Out of the 20 instances, how many have isWorker=true? Also, does increasing the monitorStateIntervalSeconds value improve anything? One more thing that would be interesting is seeing if you could run 20-40 instances on your local machine to repro

@timgit
Copy link
Owner

timgit commented Jul 23, 2022

By the way, expireCheckIntervalSeconds is obsolete

@phips28
Copy link
Contributor Author

phips28 commented Jul 23, 2022

All 20 instances are workers (=true)

Increased monitorStateIntervalSeconds, lets wait for logs. => edit: didnt change anything

I am now running 25 instances on localhost. But no errors in sight 🙈

I also found another error on the prod server:

PgBossWorker: on error {
  length: 146,
  name: 'error',
  severity: 'ERROR',
  code: '22P02',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: "unnamed portal parameter $2 = '...'",
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'int8.c',
  line: '124',
  routine: 'scanint8',
  message: 'invalid input syntax for type bigint: "\\_\\_pgboss\\_\\_cron" (Queue: __pgboss__cron, Worker: c2b757e7-5536-4f4c-a0de-10e38f600b54)',
  stack: 'error: invalid input syntax for type bigint: "\\_\\_pgboss\\_\\_cron" (Queue: __pgboss__cron, Worker: c2b757e7-5536-4f4c-a0de-10e38f600b54)\n' +
    '    at Parser.parseErrorMessage (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:287:98)\n' +
    '    at Parser.handlePacket (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:126:29)\n' +
    '    at Parser.parse (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:39:38)\n' +
    '    at Socket.<anonymous> (/var/www/cointrackr/node_modules/pg-protocol/dist/index.js:11:42)\n' +
    '    at Socket.emit (node:events:527:28)\n' +
    '    at Socket.emit (node:domain:475:12)\n' +
    '    at addChunk (node:internal/streams/readable:315:12)\n' +
    '    at readableAddChunk (node:internal/streams/readable:289:9)\n' +
    '    at Socket.Readable.push (node:internal/streams/readable:228:10)\n' +
    '    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)',
  queue: '__pgboss__cron',
  worker: 'c2b757e7-5536-4f4c-a0de-10e38f600b54'
}

Also not sure where this is coming now. Never happened before with 6.x

@timgit
Copy link
Owner

timgit commented Jul 23, 2022

That error in interesting. 6.2.2 added an algo that uses a bigint for advisory locking to allow multiple instances to run on the same server and not block each other for cron processing. However, the internal timekeeper worker has no reference to this code and no other bigint data types are in the schema. This error looks like the queue name is trying to be explicitly cast to bigint for some reason. Is this error repeatedly thrown, or did it only happen during upgrade?

@phips28
Copy link
Contributor Author

phips28 commented Jul 23, 2022

There are multiple of these errors:

13|worker | 2022-07-23T19:55:15.937:   
message: 'invalid input syntax for type bigint: "\\_\\_pgboss\\_\\_send-it" (Queue: __pgboss__send-it, Worker: 400b1c56-594f-4b81-83fe-77a1236a3f62)',

8|worker  | 2022-07-23T19:13:49.380:   
message: 'invalid input syntax for type bigint: "fetch\\_mexc" (Queue: fetch_mexc, Worker: 198df7b9-1e62-4f34-b212-358d339050fd)',

6|worker  | 2022-07-23T17:47:41.894:   
message: 'invalid input syntax for type bigint: "{"fc4a41d7-6480-470a-bf89-2cef7e4393a7"}" (Queue: __pgboss__cron, Worker: cbdeba82-df8a-40ba-9514-eaf4ca3e3b8b)',

14|worker | 2022-07-23T17:46:44.664:   
message: 'invalid input syntax for type bigint: "fetch\\_ascendex" (Queue: fetch_ascendex, Worker: 6b223526-8dd3-4a94-bf21-fdc6d4b3a651)',

@timgit
Copy link
Owner

timgit commented Jul 24, 2022

Try passing in the connection as the db prop in the constructor to log each statement out.

    const options = {
        db: {
            executeSql: (text, values) => {
                console.log(JSON.stringify({text, values}))
                return pool.query(text,values)
            }
        }
    }

@phips28
Copy link
Contributor Author

phips28 commented Jul 24, 2022

Ok I am now logging the queries, lets wait for the next error.

BTW. I also got more funny errors 🙈 I dont understand where this is coming from. My other pg-boss instances (other projects, same DB host) running without those crazy errors...

PgBossWorker: on error {
   length: 137,
   name: 'error',
   severity: 'FATAL',
   code: '08P01',
   detail: 'Message length does not match input length.',
   hint: undefined,
   position: undefined,
   internalPosition: undefined,
   internalQuery: undefined,
   where: undefined,
   schema: undefined,
   table: undefined,
   column: undefined,
   dataType: undefined,
   constraint: undefined,
   file: 'auth-scram.c',
   line: '362',
   routine: 'pg_be_scram_exchange',
   message: 'malformed SCRAM message (Queue: fetch_phemex, Worker: 5d7bd58f-f13d-4662-8dce-40414b6fb228)',
   stack: 'error: malformed SCRAM message (Queue: fetch_phemex, Worker: 5d7bd58f-f13d-4662-8dce-40414b6fb228)\n' +
     '    at Parser.parseErrorMessage (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:287:98)\n' +
     '    at Parser.handlePacket (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:126:29)\n' +
     '    at Parser.parse (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:39:38)\n' +
     '    at Socket.<anonymous> (/var/www/cointrackr/node_modules/pg-protocol/dist/index.js:11:42)\n' +
     '    at Socket.emit (node:events:527:28)\n' +
     '    at Socket.emit (node:domain:475:12)\n' +
     '    at addChunk (node:internal/streams/readable:315:12)\n' +
     '    at readableAddChunk (node:internal/streams/readable:289:9)\n' +
     '    at Socket.Readable.push (node:internal/streams/readable:228:10)\n' +
     '    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)',
   queue: 'fetch_phemex',
   worker: '5d7bd58f-f13d-4662-8dce-40414b6fb228'
 }

PgBossWorker: on error {
   length: 121,
   name: 'error',
   severity: 'FATAL',
   code: '08P01',
   detail: undefined,
   hint: undefined,
   position: undefined,
   internalPosition: undefined,
   internalQuery: undefined,
   where: undefined,
   schema: undefined,
   table: undefined,
   column: undefined,
   dataType: undefined,
   constraint: undefined,
   file: 'auth-scram.c',
   line: '247',
   routine: 'pg_be_scram_init',
   message: 'client selected an invalid SASL authentication mechanism (Queue: fetch_cryptocom, Worker: be2631ab-9a70-4ff3-a4a3-d93c099b9b52)',
   stack: 'error: client selected an invalid SASL authentication mechanism (Queue: fetch_cryptocom, Worker: be2631ab-9a70-4ff3-a4a3-d93c099b9b52)\n' +
     '    at Parser.parseErrorMessage (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:287:98)\n' +
     '    at Parser.handlePacket (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:126:29)\n' +
     '    at Parser.parse (/var/www/cointrackr/node_modules/pg-protocol/dist/parser.js:39:38)\n' +
     '    at Socket.<anonymous> (/var/www/cointrackr/node_modules/pg-protocol/dist/index.js:11:42)\n' +
     '    at Socket.emit (node:events:527:28)\n' +
     '    at Socket.emit (node:domain:475:12)\n' +
     '    at addChunk (node:internal/streams/readable:315:12)\n' +
     '    at readableAddChunk (node:internal/streams/readable:289:9)\n' +
     '    at Socket.Readable.push (node:internal/streams/readable:228:10)\n' +
     '    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)',
   queue: 'fetch_cryptocom',
   worker: 'be2631ab-9a70-4ff3-a4a3-d93c099b9b52'
 }

@phips28
Copy link
Contributor Author

phips28 commented Jul 24, 2022

Now I could reproduce it on my localhost with one worker! Yey.

Simple: one of my sync jobs hit another URL (via axios) to trigger a sync on another machine.
But when that request failed (404, 502, 504) it threw the error from axios, and pg-boss couldnt handle this, because it includes the request/response object in that error, resulting in a:

 TypeError: Converting circular structure to JSON
    --> starting at object with constructor 'TLSSocket'
    |     property 'parser' -> object with constructor 'HTTPParser'
    --- property 'socket' closes the circle
    at JSON.stringify (<anonymous>)
    ..... and so on

I found this out while adding a try catch to the db.js =>

  async executeSql (text, values) {
    if (this.opened) {
      try {
        return await this.pool.query(text, values)
      } catch (e) {
        console.error('pgboss executeSql', text, values, e);
        throw e
      }
    }
  }

(I could not figure out how to do this in the constructor as you suggested here, because not sure how I get the pool)

This then showed this weird error message: 'invalid message format (Queue: __pgboss__cron, Worker: b22a2e8a-0d77-400c-95b9-ededf01507e8)', - and confused me, because it showed another Queue here?! so not sure how that made it into the log. (and also all the other errors I posted in this issue were caused by this)

Solution:
I am catching the AxiosError and building my own Error with the needed message.


Now is the question, what changed from 6.x to 7.x that broke that? Because that code was the same before and was running for months.

Maybe you can add a little check (or try/catch) if there is a circular dependency when stringify'ing the response before storing into the job table?
And/or improve the error handling a bit, to see the real error instead of the weird invalid message format and confusing queue.

Thanks for your time!

@phips28 phips28 closed this as completed Jul 24, 2022
@timgit
Copy link
Owner

timgit commented Jul 24, 2022

Glad you tracked it down! You can re-open this so we can track it for a fix.

@phips28 phips28 reopened this Jul 25, 2022
@timgit timgit mentioned this issue Jul 28, 2022
Merged
@rgharris
Copy link

I am seeing the same error as above and it seems to cause job.done to hang. Should this have been addressed in a737ae3?

I have created a test using an Axios error that fails via a time out master...rgharris:pg-boss:circular-error:

❯ npm test -- -g "circular"

> pg-boss@8.4.2 test
> standard && mocha -g circular


      failure via done() should pass error with circular payload to failed job (schema: pgboss53095a6204)...
created cronjob in queue some-queue
caught error, marking job as done
executeSql: TypeError: Converting circular structure to JSON
    --> starting at object with constructor 'Object'
    |     property 'httpsAgent' -> object with constructor 'Agent'
    |     property 'sockets' -> object with constructor 'Object'
    |     ...
    |     property 'errored' -> object with constructor 'Object'
    --- property 'config' closes the circle
executeSql: error: invalid message format
running job 649816dd-6864-4ed6-8c9a-e4dc3f8eba83

  failure
    1) failure via done() should pass error with circular payload to failed job


  0 passing (11s)
  1 failing

  1) failure
       failure via done() should pass error with circular payload to failed job:
     Error: Timeout of 10000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/Users/example/pg-boss/test/failureTest.js)
      at listOnTimeout (node:internal/timers:559:17)
      at processTimers (node:internal/timers:502:7)

Thank you!

@phips28
Copy link
Contributor Author

phips28 commented Mar 18, 2023

I also saw this lately in our logs. Where jobs were not done/failed correctly

@timgit
Copy link
Owner

timgit commented Mar 18, 2023

What version of pg-boss are you using? I pushed a test branch with these changes and it doesn't fail. I added a small type update to job.done(), but this sample reminds me that job.done() needs to be retired in the next semver major. It's not clear enough in the docs that job.done() is obsolete if you pass an async function to work()

@rgharris
Copy link

Thanks for looking at this! I am using node v16.19.1 and pg-boss 8.4.2. It looks like the difference is that my test uses a different version of axios that has a circular reference in the error. The test fails with your branch after changing the axios version:
npm uninstall axios && npm i axios@^0.27.2

The actual code where I ran into this isn't using job.done() but I followed the pattern of the other tests for simplicity.

@phips28
Copy link
Contributor Author

phips28 commented Mar 27, 2023

I am not using the done() function, just the normal Promise approach. And I am getting this error "invalid message format" due to the circular reference in the AxiosError.
In another project I did a try/catch (comment), but I cant/wont add it to the current project.
Is there a fix from pg-boss side, maybe to stringify the error (if circular ref) before storing to the job output column?

"axios": "^0.27.2",
"pg-boss": "^8.4.2",

@siddisking
Copy link

siddisking commented Apr 10, 2023

I had the exact error. I was using Axios 1.2.0 and PG boss 6.2.2.

It happens due to axios unable to handle some cases read here

Not sure if pg boss version upgrade solves it but as its on axios's end, a quick upgrade to axios 1.2.2 would fix the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants