copy_data memory bloat in v1.4 #473
Comments
We had a blocking flush in pg-1.3.x at every call to put_copy_data. This made sure that all data was sent before the next put_copy_data. In ged#462 (and pg-1.4.0 to .2) the behaviour was changed to rely on the non-blocking flushes libpq does internally. This gives a decent performance improvement, especially on Windows. Unfortunately ged#473 showed that memory bloat can happen when the data is sent more slowly than put_copy_data is called. As a trade-off, this proposes to do a blocking flush only every 100 calls. If libpq is running in blocking mode (PG::Connection.async_api = false), put_copy_data does a blocking flush every time new memory is allocated. Unfortunately we don't have that kind of information, since we don't have access to libpq's PGconn struct and the return codes give us no indication of when this happens. So doing a flush every fixed number of calls is a very simple heuristic. Fixes ged#473
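The fix itself lives in the gem's C extension, but the heuristic it describes can be sketched at the Ruby level: issue a blocking flush after every fixed number of put_copy_data calls so unsent rows cannot pile up in libpq's output buffer. A rough illustration, assuming a local database and illustrative table, payload, and interval values:

```ruby
require 'pg'

FLUSH_INTERVAL = 100 # flush after this many rows, mirroring the every-100-calls heuristic

conn = PG.connect(dbname: 'test') # connection parameters are assumptions
conn.exec('CREATE TEMP TABLE items (id int, payload text)')

conn.copy_data('COPY items FROM STDIN WITH (FORMAT csv)') do
  1_000_000.times do |i|
    conn.put_copy_data("#{i},#{'x' * 100}\n")
    # Blocking flush every FLUSH_INTERVAL rows: if the socket sends data more
    # slowly than we queue it, this bounds how much can accumulate unsent.
    conn.flush if ((i + 1) % FLUSH_INTERVAL).zero?
  end
end
```

This is only an application-level approximation of what the linked fix does inside put_copy_data itself.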
I can reproduce this issue. It's not as dramatic as you measured, but still measurable. The root cause is #462. I made it when I noticed that
My proposal is to fix it like #474. Not ideal, but I think it's the only practical trade-off we can do. CC @SamSaffron
Thanks for the quick fix @larskanis - #474 does indeed fix the issue in my minimal repro script ❤️
Aha I see! Indeed, adding a
We use `Connection#copy_data` to stream large volumes of data into a temporary table. We recently observed significant performance degradation and increased memory use for this system. Here's a minimal reproduction:
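The repro script itself did not survive this extract; a minimal sketch of the kind of workload described, with an assumed local database, row count, and payload size (the RSS check shells out to ps, so it is Linux/macOS only), might look like:

```ruby
require 'pg'

conn = PG.connect(dbname: 'test') # assumed local database
conn.exec('CREATE TEMP TABLE copy_target (id int, payload text)')

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

conn.copy_data('COPY copy_target FROM STDIN WITH (FORMAT csv)') do
  1_000_000.times do |i|                      # row count chosen for illustration
    conn.put_copy_data("#{i},#{'x' * 100}\n") # ~100-byte payload per row
  end
end

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
rss_kb  = `ps -o rss= -p #{Process.pid}`.to_i # resident set size in kilobytes
puts format('copied in %.1fs, RSS %.1f MB', elapsed, rss_kb / 1024.0)
```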
With version 1.3.5, this script takes ~10s on my machine and reports ~47 MB RSS at the end. With version 1.4.0 (and 1.4.1, 1.4.2), it takes ~80s and reports ~182 MB RSS at the end. The RSS appears to scale with the amount of data being copied.