
Huge RSS usage retained (suspect JSONB fields) #21471

Open
leighman opened this issue Oct 12, 2023 · 23 comments
Labels: bug/2-confirmed (Bug has been reproduced and confirmed.) · kind/bug (A reported bug.) · team/client (Issue for team Client.) · topic: Json (Scalar type `Json`) · topic: memory leak · topic: performance/memory

leighman commented Oct 12, 2023

Bug description

When using Prisma for some queries, we see RSS memory usage vastly exceeding our heap usage, and this memory is not freed.

(Screenshot, 2023-10-10 13:50: graph of RSS vs. heap memory usage)

The two spikes you can see at the start of the cliff in the graph represent the read and write steps of our operation (a worst-case example). As you can see, the RSS memory is retained until the pod restarts, whereas heap usage is freed correctly.

We don't have a minimal test case yet, but it seems to relate to JSONB columns.
Our tables have some JSONB fields that can be quite large (up to 0.5 MB), and we may load hundreds or thousands of records, process them and update.
We have tested various versions of Node and Prisma (back to 4.8), as well as splitting create and update rather than using upsert.
Adding a select to the read operation (since we don't need to read the larger fields most of the time) seemed to eliminate the first spike, leading me to suspect something to do with JSON marshalling or the size of the data being fetched.

How to reproduce

These steps are suspected, not yet verified with a minimal test case:

  • Create a table with a JSONB column
  • Create records with JSON values around 20–500 KB
  • Load, process and save a large number of records from that table

You should see heap usage drop back down (suggesting no memory leak in the application) while RSS vastly exceeds it and is retained.
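
A minimal sketch of the suspected reproduction, assuming a hypothetical model with a Json column (the model and field names below are illustrative, not from our real schema):

// schema (sketch): model Doc { id Int @id @default(autoincrement())  payload Json }
import { PrismaClient, Prisma } from "@prisma/client";

const prisma = new PrismaClient();

async function main() {
  // seed records with JSON payloads roughly in the 20-500 KB range
  const bigPayload = {
    items: Array.from({ length: 5000 }, (_, i) => ({ id: i, value: "x".repeat(50) })),
  };
  await prisma.doc.createMany({
    data: Array.from({ length: 200 }, () => ({ payload: bigPayload })),
  });

  // load, "process" and save a large batch
  const docs = await prisma.doc.findMany();
  for (const doc of docs) {
    await prisma.doc.update({
      where: { id: doc.id },
      data: { payload: doc.payload as Prisma.InputJsonValue },
    });
  }

  // expectation: heapUsed returns to baseline here, but rss stays high
  console.log(process.memoryUsage());
}

main().finally(() => prisma.$disconnect());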

Expected behavior

RSS memory should also return to baseline after the operation.

Prisma information

// Add your schema.prisma
// Add your code using Prisma Client

Environment & setup

  • OS: node-slim Docker image (Debian-based?)
  • Database: CockroachDB v23.1
  • Node.js version: Node 18

Prisma Version

locally

Environment variables loaded from .env
Environment variables loaded from prisma/.env
prisma                  : 5.3.1
@prisma/client          : 5.3.1
Current platform        : darwin-arm64
Query Engine (Node-API) : libquery-engine 61e140623197a131c2a6189271ffee05a7aa9a59 (at node_modules/@prisma/engines/libquery_engine-darwin-arm64.dylib.node)
Schema Engine           : schema-engine-cli 61e140623197a131c2a6189271ffee05a7aa9a59 (at node_modules/@prisma/engines/schema-engine-darwin-arm64)
Schema Wasm             : @prisma/prisma-schema-wasm 5.3.1-2.61e140623197a131c2a6189271ffee05a7aa9a59
Default Engines Hash    : 61e140623197a131c2a6189271ffee05a7aa9a59
Studio                  : 0.494.0
Preview Features        : metrics, tracing
@leighman leighman added the kind/bug A reported bug. label Oct 12, 2023
@janpio janpio added bug/0-unknown Bug is new, does not have information for reproduction or reproduction could not be confirmed. topic: performance/memory labels Oct 13, 2023
janpio (Member) commented Oct 13, 2023

Thanks for opening an issue about this observation.

Let's dig in:

  • Load, process and save

Can you clarify a bit exactly how you are loading, processing and saving the data? I see you mentioned upsert somewhere, but I am not super sure if that would be needed for the reproduction.

Adding a select to the read operation

Same question here. What exactly are you reading, and how?

Preview Features : metrics, tracing

Are you actively using metrics and tracing in this setup?


Is the increased RSS usage causing any concrete problems? Would additional activity that needs more memory fail because of the currently high retained memory? Or would the memory be freed on demand when needed?

@janpio janpio added the team/client Issue for team Client. label Oct 13, 2023
leighman (Author) commented Oct 13, 2023

We have an event re-running operation based on inserting a new event in a timeline (insertNewEvent). (The implementation was initially quite naive with the intent to refine it)

Load:
We load a set of Foos at a point in time and all FooEvents that happened after that time for that set of Foos. The large JSON field snapshot on FooEvent is a snapshot at each event's point in time, containing the set of Foos serialised to a JSON array.
We load them with prisma.fooEvent.findUnique().snapshot to get the Foos and prisma.fooEvent.findMany to get all subsequent events.

Process (no/very minimal prisma involved):
We insert a new event and replay all the events against the Foos in memory.

Save:
We then have a reconciliation step where we create or update Foos and FooEvent snapshots with prisma again.

We were saving everything with upsert via Promise.all with a 10-request concurrency limit, but since another bug mentioned memory leaks with upsert we switched to createMany and update, which I think had some effect by reducing the number of queries.

We were fetching the entire FooEvent including snapshot, but since the snapshot is regenerated after each event re-run anyway, we added a select to the prisma.fooEvent.findMany that does NOT include snapshot, and that seems to have reduced the memory usage of the read almost entirely.
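
Roughly, the change inside our load function looked like this (the FooEvent field names other than snapshot are illustrative, and fooIds / since come from the surrounding code):

// before: pulls back the whole row, including the large snapshot JSON
const events = await prisma.fooEvent.findMany({
  where: { fooId: { in: fooIds }, occurredAt: { gt: since } },
});

// after: select everything except snapshot; this removed most of the read spike
const eventsWithoutSnapshot = await prisma.fooEvent.findMany({
  where: { fooId: { in: fooIds }, occurredAt: { gt: since } },
  select: { id: true, fooId: true, occurredAt: true, event: true },
});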

We are using metrics. We have a tracing setup, but the environment variable for it is not enabled in the environment shown in the screenshot, so the @prisma/instrumentation library is installed but not registered with registerInstrumentations.

After running enough of these operations the pod will OOM. Increased RSS usage is fine in itself, but the fact that it is not freed after the operation has ended is the problem. In the screenshot the operation begins and ends at around a quarter of the way through the timeline, where you see the small heap increase. The rest of the plateau is retained after the operation has ended (though with the same prisma object) until a pod restart due to a redeploy.

We are currently not in a position to easily try a raw query so we can't confirm that reducing all the updates helps.
We haven't yet tested recreating the prisma object to see whether that results in memory freeing.
We haven't tested fully removing the instrumentation.
We have created a synthetic example where we didn't really see the memory retention, but that was before we suspected the issue is linked to JSON size, so the payloads were possibly too small.

janpio (Member) commented Oct 13, 2023

If you could go to that reproduction again and add the suspected field type, that would be super useful.

We won't be able to do anything here before we can see this ourselves, meaning a reproduction that shows the problem, which is why I was asking for all these details. Building one ourselves would still take a lot of time, with no guarantee that we can actually reproduce the problem. You are in a much better position to do that without missing the one tiny detail that might be relevant.

leighman (Author) commented:

The synthetic example uses a lot of application code, so it's not something we're in a position to share currently.
Converting the snapshot field from Json to String in Prisma and doing the parse/stringify on the application side seems to have resolved the problem, so the reproduction described above should show the issue, and converting the field to TEXT should resolve it.
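
For reference, the workaround is roughly the following (run inside an async function; field names follow the Foo/FooEvent description above, and eventId / snapshotObject come from the surrounding code). The snapshot column is declared as String in the Prisma schema (TEXT in the database) and (de)serialisation happens in the application:

// schema change (sketch): snapshot Json  ->  snapshot String

// write: stringify on the application side
await prisma.fooEvent.update({
  where: { id: eventId },
  data: { snapshot: JSON.stringify(snapshotObject) },
});

// read: parse on the application side
const event = await prisma.fooEvent.findUniqueOrThrow({ where: { id: eventId } });
const snapshot = JSON.parse(event.snapshot);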

@janpio janpio added the topic: Json Scalar type `Json` label Oct 17, 2023
janpio (Member) commented Oct 17, 2023

Good that you found a workaround for now.
That indeed points further to the Json columns as the culprit here, so I already added that label to this issue.

(We are happy to sign an NDA if that would allow you to share the reproduction, email me at my first name @prisma.io and we can make that happen. If not, we'll have to find another reproduction somehow.)

@janpio janpio added bug/1-unconfirmed Bug should have enough information for reproduction, but confirmation has not happened yet. and removed bug/0-unknown Bug is new, does not have information for reproduction or reproduction could not be confirmed. labels Oct 28, 2023
forehalo commented:

Hi @janpio, I've run into a similar memory leak, but with a PostgreSQL bytea field.

Here is how I reproduce the leak:

// create a table with bytea field
model Data {
  id        Int      @id @default(autoincrement())
  data      Bytes    @db.ByteA
  createdAt DateTime @default(now())
  updatedAt DateTime @updatedAt

  @@map("data")
}
// detecting
const { PrismaClient } = require('@prisma/client');

async function main() {
  // the table was seeded beforehand with a row containing 1MB of data
  const prisma = new PrismaClient();
  await prisma.$connect();

  // do not read bytea field
  async function readNoBin() {
    for (let i = 0; i < 1000; i++) {
      const _ = await prisma.data.findFirst({
        select: {
          id: true,
          createdAt: true,
          updatedAt: true,
        },
        where: {
          id: 1,
        },
      });
    }
  }
  await readNoBin();
  console.log(process.memoryUsage());

  // read bytea field
  async function read() {
    for (let i = 0; i < 1000; i++) {
      const _ = await prisma.data.findFirst({
        where: {
          id: 1,
        },
      });
    }
  }
  await read();
  await prisma.$disconnect();
  console.log(process.memoryUsage());

  // hold the process so memory usage can still be inspected
  while (true) {}
}

main();

The result shows that after reading the bytea field 1,000 times, RSS increased from 60 MB (used by the program at startup) to 200+ MB, and it never goes back down. At the same time, the 1,000 reads without the bytea field don't change RSS usage at all.

@Jolg42 Jolg42 added bug/0-unknown Bug is new, does not have information for reproduction or reproduction could not be confirmed. and removed bug/1-unconfirmed Bug should have enough information for reproduction, but confirmation has not happened yet. labels Nov 23, 2023
Jolg42 (Member) commented Nov 23, 2023

@leighman I attempted a reproduction and didn't find any memory leak so far on macOS or Ubuntu

Here is the setup I used:
https://github.com/Jolg42/repro-21471

  • Prisma 5.3.1
  • Node.js v18.16.0 (I didn't find exactly which v18 version you were using)
  • a JSON of 1MB in my test
  • I tried findFirst / findMany and then later createMany + update + findMany.
  • I removed previewFeatures = ["metrics", "tracing"] as it didn't change the results.
  • Important to note, I used NODE_OPTIONS="--max-old-space-size=200" to put a limit for V8, it helps to show that rss is reclaimed here, even if there is still free memory available on the machine.

Could you provide more information?

  • Is it still happening?
    • With which version of Prisma?
    • With which version of Node.js?
  • Can you reproduce it in a minimal reproduction? Feel free to fork my repository.

Jolg42 (Member) commented Nov 23, 2023

@forehalo I attempted a separate reproduction for your case and didn't find any memory leak so far.

Please create a new issue and provide more information.
Can you reproduce it in a minimal reproduction? Feel free to fork my repository.
https://github.com/Jolg42/repro-21471-bytea

Let me know if you need help in the new issue.

janpio (Member) commented Dec 7, 2023

@forehalo @leighman Did you see Joel's message? We need your help to reproduce this problem. Thanks!

leighman (Author) commented Dec 7, 2023

@janpio Thanks. Yes, I will take a look at reproducing it with that repository but unfortunately just haven't had a chance.

demurgos commented Dec 9, 2023

Hello,
I spent the last week investigating OOM crashes in one of my applications. It had high RSS and low V8 heap usage, similar to this issue. The app has a memory limit of 16 GiB and could reach it in a few minutes (when the expected memory usage is more like 1 GiB).

We use Postgres and had a few tables with json columns. Replacing them with text and manually handling serialization at the application layer fixed the issue. The code is available and you can see the commit with the fix, but a full repro would require data.

I hope you figure out the cause of the problem, but there definitely seems to be something wrong with JSON support.

Jolg42 (Member) commented Dec 11, 2023

@demurgos We're definitely interested in more information here 👀
Could you open a new issue? It would help us gather all the information needed to attempt a reproduction in your case.

For example, could you share with us some example data that you are storing as JSON?

For the following fields:

  • The model Fight
    • fighters
    • steps
  • The model WorkerJob
    • payload
  • The model BruteSpritesheet
    • json

Basically, I would use this to try a reproduction based on this template https://github.com/Jolg42/repro-21471, and it would help to have some sample data to attempt a minimal reproduction.

Also, please let us know which version of Prisma & Node.js you were running.

Jolg42 (Member) commented Dec 14, 2023

@demurgos just checking, let us know if you want to provide more information, based on my previous comment. It would be helpful for us. We are especially curious about the Node.js version you used, because Node.js v16 might have some issues and upgrading to v18+ should fix that.

Jolg42 (Member) commented Dec 14, 2023

@forehalo Did you see my previous message?

@leighman 🤞🏼 that you get a chance to provide more info related to my previous message.
It would help if you could provide an example JSON to use; we could then try it in the reproduction.

demurgos commented:

@Jolg42
Hey, I'm only managing the hosting part of the project and I am not very familiar with Prisma or the code using it. I'll ask the main dev if he can help me provide a repro. It's also something we're doing in our free time, so I can't commit to giving you a repro quickly, but I'll try to work on this.

The issue really showed up when he added a table to keep track of background jobs in the DB, with the following model:

model WorkerJob {
  id      Int    @id @default(autoincrement())
  worker  String @db.VarChar(255)
  payload Json   @db.Json
}

worker is the job type to run, and payload is some data passed to the job.
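
Usage is nothing special; roughly the following (the worker name and payload shape here are just an example):

// enqueue a job; the payload is an arbitrary, sometimes large, JSON document
await prisma.workerJob.create({
  data: {
    worker: "example-worker",
    payload: { kind: "example", items: [1, 2, 3] },
  },
});

// worker side: read jobs back, including the Json payload
const jobs = await prisma.workerJob.findMany({ take: 100 });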

We initially used Node 16 and noticed the memory issue. We then updated to Node 21 and the issue was still present. Since the commit I linked above (5 days ago, switching json to string) the server has been stable with memory around 1 GiB.

I'll open a new issue once we have a reproduction.

@Druue Druue self-assigned this Dec 22, 2023
Druue (Contributor) commented Jan 2, 2024

Hey @demurgos @leighman do you have any further updates here? 👀

Cheers in advance!

leighman (Author) commented Jan 4, 2024

Unfortunately we're extremely busy and I'm unlikely to look at a reproduction any time soon, since we were able to work around the issue by using TEXT and parsing on the JS side.
I would appreciate it if the issue could remain open, since it seems pretty clear to me that there is a problem here that switching to TEXT solved completely.
I haven't forgotten it and will try to get a chance to play with a reproduction.

aqrln (Member) commented Feb 2, 2024

High RSS and low V8 heap usage hint that the problem is likely on the engine side and not on the JavaScript side. The fact that changing json columns to text columns and parsing JSON on the client side fixed the issue for @demurgos and @leighman hints at that as well. The sharp stair pattern on the graph looks like heap fragmentation to me and not a memory leak.

While a runnable reproduction would certainly be ideal, I believe there may be almost enough information for us to attempt reproducing it ourselves again. What would really help is if you could give us an idea of what your JSON objects look like (their shape, complexity, levels of nesting, and, if they are complex, whether the individual objects they are composed of are rather big or small), and maybe an example of such an object if possible.

What we could also try is to build the query engine with jemalloc allocator and have you check if it fixes or alleviates the issue for you, since jemalloc is designed to reduce fragmentation.

leighman (Author) commented Feb 12, 2024

Yeh, my suspicion would be the string allocations in the engine somehow.

The JSON values are about 20-500kb in size. In the example above this object is a top-level array containing 10-10000 objects. The objects represent a whole entity so can vary in complexity depending on the type. The bulk of the object is a parameters field which is an object subset where all scalar fields are serialised as strings.

event: 'blah',
snapshot: '[{"id":"some-uuid","some":true,"fields":[{"else":1}],"parameters":{"field":"string","other":"string","nestedObj":{"nestedArray":[{"inHere":"shouldn't be too long string"}]}}}, {etc}, {etc}, {etc}]'

Individual objects are probably more likely to be small and numerous though some entities can be quite large. Generally the more numerous the entity, the smaller the parameters.
The growth in entity parameters is mostly just additional string fields.
Strings themselves shouldn't be too large.
Nesting within parameters is usually around 3 levels. Maybe up to 5.
Arrays within parameters usually have only around 10 objects.

As described we've since worked around this issue by casting the field so it wouldn't be trivial to check with a different allocator.

aqrln (Member) commented Feb 12, 2024

Yeh, my suspicion would be the string allocations in the engine somehow.

serde_json value allocations in general, which includes the strings for keys (these are not interned in any way) as part of that too. We still need to produce a reproduction to confirm the theory, and most importantly so that we can also confirm the fix and add a regression test to our memory tests on CI, but from the information in this thread it seems clear what causes the problem.

There are some architectural reasons why the engine currently deserializes the JSON values internally but that shouldn't be necessary and we should fix this. Until then, using String fields for large JSON objects and deserializing on the application side is a good workaround.

As described we've since worked around this issue by casting the field so it wouldn't be trivial to check with a different allocator.

Understood. On second thought, I'm also not sure that linking to a different allocator is a good idea anyway, given that the engine is a shared library loaded into the Node.js process, which uses the system allocator. Good to know that the workaround works well for you; we'll keep you informed in this thread when we have any news.

@apolanc apolanc unassigned Jolg42 and Druue Mar 8, 2024
pzgorzaly commented:

Hi, I've encountered a similar issue to the one described in this thread. While using Prisma in conjunction with PostgreSQL and a JSON column, RSS memory usage is huge (reaching 1 GB). Below are pictures for a single operation and for multiple operations using Autocannon (https://www.npmjs.com/package/autocannon).

(screenshots: memory usage with Prisma and a JSON column, single operation and Autocannon run)

As a benchmark, I tested the same data with Kysely (https://kysely.dev/), and the memory usage does not exceed 300 MB.

(screenshots: memory usage with Kysely on the same data, single operation and Autocannon run)

The use of TEXT columns mitigates the issue, but this solution is not acceptable in the long term.

I've documented the described issue here: https://github.com/pzgorzaly/prisma-issue-minimal-reproducible-repo
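
For reference, the multi-operation run is driven by Autocannon along these lines (the endpoint URL and options are just an example invocation, not the exact config from the repo):

import autocannon from "autocannon";

async function run() {
  // hammer an endpoint that reads rows containing the JSON column,
  // then watch the server process's RSS afterwards
  const result = await autocannon({
    url: "http://localhost:3000/json-endpoint", // hypothetical endpoint
    connections: 10,
    duration: 20, // seconds
  });
  console.log(`average throughput: ${result.requests.average} req/s`);
}

run();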

Jolg42 (Member) commented Mar 18, 2024

@pzgorzaly Thanks for the minimal reproduction, we'll try this out soon to confirm if we get the same results on our end.

@apolanc apolanc added bug/1-unconfirmed Bug should have enough information for reproduction, but confirmation has not happened yet. and removed bug/0-unknown Bug is new, does not have information for reproduction or reproduction could not be confirmed. labels Mar 18, 2024
@Druue Druue self-assigned this Apr 10, 2024
Druue (Contributor) commented Apr 22, 2024

Hey, I've been looking into this per the repo you linked @pzgorzaly

Here are some clinic snapshots from my testing

Prisma w/ JSON

Node v20.12.2

(clinic snapshot: Prisma with JSON, Node v20.12.2)

Node v18.20.2

I didn't see anything massively different on Node 18 in comparison, other than a higher event loop delay (1500 as opposed to 1200).

Prisma w/o JSON

Node v20.12.2

I can note a pretty significant memory usage decrease here, both compared to JSONB and compared to Node 18, but the Node change doesn't seem to be related to this JSON issue. The change in shape is also pretty striking: RSS here w/o JSON follows the heap much more closely, compared to w/ JSON where it's a lot more floaty, with some retention for sure.
(clinic snapshot: Prisma without JSON, Node v20.12.2)

Node v18.20.2

(clinic snapshot: Prisma without JSON, Node v18.20.2)

I used the repro you provided to write a test in our memory test suite, which I think further signals that the issue lies in the engines as well 👍

Next step is to investigate engines directly :)

@Druue Druue added bug/2-confirmed Bug has been reproduced and confirmed. and removed bug/1-unconfirmed Bug should have enough information for reproduction, but confirmation has not happened yet. labels Apr 23, 2024