
Investigate bulk load performance #1263

Open · Tracked by #2028
hlinnaka opened this issue Feb 11, 2022 · 12 comments
Labels: c/storage/pageserver (Component: storage: pageserver)

@hlinnaka (Contributor) commented Feb 11, 2022

Follow-up on #1221. The first immediate cause of #1221 was that a GetPage request could not be interrupted. That was fixed. A second cause was that backpressure was not enabled or not working correctly, which allowed a lot of unprocessed WAL to accumulate, so a GetPage request with a fresh LSN had to wait for the WAL backlog to be processed. Fixing backpressure will take care of that.

But the third cause was that the pageserver could not keep up with processing the WAL that was generated. If it could keep up, the backpressure would not be needed, and the GetPage request would have finished quickly. So why did the pageserver not keep up? A bug? Is WAL ingestion in the pageserver too slow, and we need to speed it up? Was the production server overloaded?

@hlinnaka added the c/storage/pageserver label Feb 11, 2022
@knizhnik (Contributor):
I have provided some results of my investigation in ticket #553. Should I copy them here?
Briefly:

  • The insertion speed of the pageserver is fast enough, comparable with vanilla Postgres.
  • It is still slower than the speed at which the safekeeper saves data. So if the backpressure limits are large enough, a transaction can be committed by the safekeeper much faster than the data is stored by the pageserver; as a result, a subsequent select can be blocked for a substantial time (seconds) in the wait_for_lsn request.
  • Reconstructing a page from WAL records is about 10 times slower than reading it from an image layer, so checkpointing is important (the sketch after this list illustrates why).
  • On localhost, the performance of a parallel scan is comparable with an internal traversal of the pages inside the pageserver:

pageserver from delta layers: 220 sec
pageserver from image layers: 29 sec
zenith select count(*) (parallel plan): 42 sec
vanilla select count(*) (parallel plan): 15 sec
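
To make that ~10x gap concrete, here is a purely illustrative Rust sketch (hypothetical types, not the pageserver's actual layer code): an image-layer read is a single copy, while a delta-layer read has to replay every WAL record accumulated since the last full image, so its cost grows with the length of the delta chain.

```rust
// Purely illustrative types; the real pageserver structures are different.
struct WalRecord {
    bytes: Vec<u8>,
}

// Stand-in for WAL redo: mutate the page using the record payload.
fn apply_record(page: &mut [u8], rec: &WalRecord) {
    for (i, b) in rec.bytes.iter().enumerate() {
        page[i % page.len()] ^= *b;
    }
}

// Image-layer path: one lookup, one copy.
fn read_from_image_layer(image: &[u8]) -> Vec<u8> {
    image.to_vec()
}

// Delta-layer path: start from the last image and replay every record
// written since then; the work grows with the number of deltas.
fn reconstruct_from_deltas(base_image: &[u8], deltas: &[WalRecord]) -> Vec<u8> {
    let mut page = base_image.to_vec();
    for rec in deltas {
        apply_record(&mut page, rec);
    }
    page
}

fn main() {
    let image = vec![0u8; 8192];
    let deltas: Vec<WalRecord> = (0..10)
        .map(|i| WalRecord { bytes: vec![i as u8; 64] })
        .collect();
    assert_eq!(
        read_from_image_layer(&image).len(),
        reconstruct_from_deltas(&image, &deltas).len()
    );
}
```

This is why materializing image layers (checkpointing) bounds the reconstruction work per page.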

I tried to implement yet another prefetch: neondatabase/postgres#133, but at least on localhost it has a negative impact on performance: the same select count(*) takes 64 seconds.

So at least on localhost, prefetch seems to make no sense for seqscan. It will be interesting to repeat this experiment in the cloud.

@knizhnik (Contributor) commented Feb 17, 2022

I also noticed that wal_log_hints significantly (~10x) slows down selects: because they set hint bits, the pages are WAL-logged as full-page images (FPI), and as a result, instead of just reading data from disk we also have to write the same amount of data.
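
As a back-of-envelope illustration (illustrative numbers, not a measurement): with wal_log_hints=on, the first select that sets hint bits on a page emits a full-page image, so a "read-only" scan ends up writing roughly as much WAL as it reads.

```rust
fn main() {
    // Illustrative figures: 8 KB pages and a ~30 GB table.
    let page_size: u64 = 8192;
    let table_pages: u64 = 4_000_000;
    // Worst case: every page still needs its hint bits set, so the first
    // SELECT emits one full-page image per page it touches.
    let bytes_read = table_pages * page_size;
    let fpi_wal_written = table_pages * page_size;
    println!(
        "scan reads ~{:.1} GB and also writes ~{:.1} GB of FPI WAL",
        bytes_read as f64 / 1e9,
        fpi_wal_written as f64 / 1e9
    );
}
```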

@knizhnik (Contributor):
Insertion speed depends mostly on the number of records, not on the actual size of the written data.
Please compare the results for:

  1. create table t as select generate_series(1,100000000);
    time: 230 sec
    storage size: 7.9 Gb
    WAL size: 6.1 Gb
  2. create table t as select generate_series(1,100000000), repeat(' ', 256);
    time: 366 sec
    storage size: 33 Gb
    WAL size: 31 Gb

CPU usage of the pageserver is about 100%, so it looks like it is the bottleneck now.
The profile looks really strange:

   9,48%  WAL receiver th  libc.so.6          [.] pthread_rwlock_unlock@@GLIBC_2.34
   5,90%  WAL receiver th  libc.so.6          [.] pthread_rwlock_rdlock@GLIBC_2.2.5
   5,44%  WAL receiver th  pageserver         [.] pageserver::page_cache::PageCache::lock_for_write
   3,93%  WAL receiver th  libc.so.6          [.] pthread_rwlock_wrlock@@GLIBC_2.34
   3,28%  WAL receiver th  pageserver         [.] sharded_slab::pool::Pool<T,C>::get
   3,28%  WAL receiver th  pageserver         [.] <pageserver::layered_repository::ephemeral_file::EphemeralFile as std::os::imp::unix:
   3,17%  WAL receiver th  libc.so.6          [.] __memmove_avx_unaligned_erms
   2,54%  WAL receiver th  pageserver         [.] <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscri
   2,37%  WAL receiver th  pageserver         [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write
   1,95%  WAL receiver th  pageserver         [.] <core::fmt::Formatter as core::fmt::Write>::write_fmt
   1,87%  WAL receiver th  pageserver         [.] hashbrown::map::make_hash
   1,62%  WAL receiver th  pageserver         [.] <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscri
   1,55%  WAL receiver th  pageserver         [.] core::fmt::write
   1,51%  WAL receiver th  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
   1,40%  WAL receiver th  pageserver         [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_struct
   1,38%  WAL receiver th  pageserver         [.] sharded_slab::page::Shared<T,C>::mark_clear
   1,37%  WAL receiver th  pageserver         [.] <tracing_subscriber::fmt::fmt_layer::Layer<S,N,E,W> as tracing_subscriber::layer::Lay
   1,25%  WAL receiver th  pageserver         [.] bytes::bytes_mut::shared_v_drop
   1,25%  WAL receiver th  libc.so.6          [.] pthread_mutex_unlock@@GLIBC_2.2.5
   1,21%  WAL receiver th  pageserver         [.] pageserver::walrecord::decode_wal_record
   1,19%  WAL receiver th  pageserver         [.] tracing_subscriber::filter::env::EnvFilter::cares_about_span
   1,10%  WAL receiver th  pageserver         [.] bytes::bytes_mut::shared_v_clone
   1,08%  WAL receiver th  pageserver         [.] crc32c::hw_x86_64::crc32c
   1,07%  WAL receiver th  pageserver         [.] pageserver::walreceiver::thread_main
   1,05%  WAL receiver th  pageserver         [.] pageserver::walingest::WalIngest::ingest_record

So most of the time is spent in synchronization, despite the fact that there should be no conflicting access (GC and the checkpointer are disabled).
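
The profile is dominated by rwlock traffic even though nothing should be contended. A generic micro-benchmark sketch (not pageserver code) of why per-record synchronization shows up so prominently: even an uncontended lock round-trip per WAL record costs much more than taking the lock once per batch.

```rust
use std::sync::RwLock;
use std::time::Instant;

fn main() {
    let records: u64 = 10_000_000;
    let counter = RwLock::new(0u64);

    // One write-lock acquisition per "record".
    let start = Instant::now();
    for _ in 0..records {
        *counter.write().unwrap() += 1;
    }
    let per_record = start.elapsed();

    // One acquisition per batch of 1024 "records".
    let start = Instant::now();
    let mut done = 0u64;
    while done < records {
        let mut guard = counter.write().unwrap();
        for _ in 0..1024 {
            *guard += 1;
        }
        drop(guard);
        done += 1024;
    }
    let per_batch = start.elapsed();

    println!("per-record locking: {:?}, per-batch locking: {:?}", per_record, per_batch);
}
```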

@knizhnik (Contributor):
An interesting result based on the profile and inspection of random stack traces: just commenting out one line:

                   //let _enter = info_span!("processing record", lsn = %lsn).entered();

This increases the insert speed almost twice: 136 seconds vs. 230.
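
One possible way to keep the tracing information without paying for a span per WAL record is to enter a span per batch and downgrade the per-record detail to cheap events. This is only a sketch of the pattern using the tracing and tracing-subscriber crates; the function names and batching granularity are hypothetical, not the pageserver's actual code.

```rust
use tracing::{debug, info_span, Level};

// Stand-in for WalIngest::ingest_record; the real function is different.
fn apply_record(_lsn: u64, _rec: &[u8]) {}

fn ingest_batch(records: &[(u64, Vec<u8>)]) {
    // One span for the whole batch instead of one per WAL record.
    let _enter = info_span!("ingest_batch", n_records = records.len()).entered();
    for (lsn, rec) in records {
        apply_record(*lsn, rec);
        // A cheap event (filtered out at the default level) replaces the
        // per-record span.
        debug!(lsn = *lsn, len = rec.len(), "applied record");
    }
}

fn main() {
    tracing_subscriber::fmt().with_max_level(Level::INFO).init();
    let records = vec![(1u64, vec![0u8; 64]), (2, vec![0u8; 64])];
    ingest_batch(&records);
}
```

Whether this is the right fix for the pageserver is a separate question; the sketch only shows moving span creation out of the hot loop.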

@knizhnik (Contributor) commented Feb 22, 2022

I found a bug in my prefetch implementation, and after fixing it, prefetch shows a positive impact on performance. I used the following test to measure performance:

\timing
set max_parallel_workers_per_gather=0;
create table t as select generate_series(1,100000000),repeat(' ',256);
select count(*) from t;
| seqscan_prefetch_buffers | Query execution time (sec) |
|---:|---:|
| 0 | 456 |
| 4 | 256 |
| 8 | 234 |
| 16 | 221 |

So, prefetch speeds up seqscan by up to 2x.

With parallel plans the improvement is not as large: 54 vs. 67 seconds. But that may be caused by the limited resources of my notebook.

So I updated the PR for local prefetch (neondatabase/postgres#133) and am waiting for your review.
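
As a rough illustration of what the seqscan_prefetch_buffers window does, here is a hedged Rust sketch (the helper names are hypothetical stand-ins, not the code from neondatabase/postgres#133): the scan keeps up to N GetPage requests in flight so that their round-trip latency overlaps with processing of pages that have already arrived.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for sending a GetPage@LSN request without waiting.
fn send_getpage_request(blkno: u32) -> u32 {
    blkno // pretend the "handle" is just the block number
}

// Hypothetical stand-in for receiving the corresponding 8 KB page.
fn receive_page(handle: u32) -> Vec<u8> {
    vec![(handle % 251) as u8; 8192]
}

fn seqscan_with_prefetch(nblocks: u32, prefetch_buffers: u32) -> u64 {
    let mut in_flight: VecDeque<u32> = VecDeque::new();
    let mut next_to_request = 0u32;
    let mut checksum = 0u64;

    for _blkno in 0..nblocks {
        // Top up the prefetch window before consuming the next page.
        // A window of 0 is treated as 1, i.e. plain synchronous reads.
        while in_flight.len() < prefetch_buffers.max(1) as usize && next_to_request < nblocks {
            in_flight.push_back(send_getpage_request(next_to_request));
            next_to_request += 1;
        }
        // Consume the oldest outstanding request; its latency was hidden
        // behind the processing of the previous pages.
        let page = receive_page(in_flight.pop_front().unwrap());
        checksum += page[0] as u64;
    }
    checksum
}

fn main() {
    // Mirrors the seqscan_prefetch_buffers values tried above.
    for bufs in [0, 4, 8, 16] {
        println!("prefetch_buffers={} checksum={}", bufs, seqscan_with_prefetch(1000, bufs));
    }
}
```

With compute and pageserver on the same machine the latency being hidden is small, which is consistent with the modest gain for parallel plans.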

@knizhnik (Contributor) commented Feb 22, 2022

Something more about network communication latencies (once again, the results were produced on my notebook). This is a ping-pong test which simulates get_page_at_lsn requests to the pageserver (16-byte requests and an 8Kb response). The test loads 3703704 pages, which corresponds to the size of table t used in the zenith test above, so we can estimate the overhead of network communication.

| interface | request/response batch size | time (sec) |
|---|---|---:|
| loopback | 8/8 | 40 |
| loopback | 8/1 | 94 |
| loopback | 1/1 | 247 |
| unix sockets | 8/8 | 18 |
| unix sockets | 8/1 | 21 |
| unix sockets | 1/1 | 96 |

Please notice that the 8/1 mode is what the prefetch implementation currently uses (to avoid changes in the pageserver). I tried to play with the libpq/socket rx/tx buffer sizes, but it looks like they do not have much effect on performance.
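
For reference, below is a self-contained Rust sketch of this kind of loopback ping-pong test (16-byte requests, 8 KB responses). It approximates the 8/8 and 1/1 modes from the table by varying how many requests are written before the corresponding responses are read back; it is a re-creation of the methodology, not the original benchmark code.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Instant;

const REQ: usize = 16;
const RESP: usize = 8192;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();

    // "Pageserver": answer every 16-byte request with an 8 KB page.
    thread::spawn(move || {
        let (mut sock, _) = listener.accept().unwrap();
        let mut req = [0u8; REQ];
        let page = [0u8; RESP];
        while sock.read_exact(&mut req).is_ok() {
            sock.write_all(&page).unwrap();
        }
    });

    let mut sock = TcpStream::connect(addr).unwrap();
    sock.set_nodelay(true).unwrap();

    let total_requests = 100_000; // scaled down from 3703704 for a quick run
    for batch in [8usize, 1] {
        let start = Instant::now();
        let mut done = 0;
        while done < total_requests {
            let n = batch.min(total_requests - done);
            // Send `n` requests back to back, then read the `n` responses;
            // batch = 1 is the fully synchronous case.
            for _ in 0..n {
                sock.write_all(&[0u8; REQ]).unwrap();
            }
            let mut resp = vec![0u8; RESP];
            for _ in 0..n {
                sock.read_exact(&mut resp).unwrap();
            }
            done += n;
        }
        println!("batch size {}: {:?}", batch, start.elapsed());
    }
}
```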


@knizhnik (Contributor) commented Feb 22, 2022

With a small (3Gb) table which completely fits in memory, the effect of prefetch is almost the same: 21 vs 42 sec.

Vanilla is able to sequentially scan this table in just 3 seconds (2807 msec).

The time of a local scan of the relation inside Postgres is 8 seconds.

It is strange, but supporting batch mode in the page_server<->compute protocol (using pgb.write_message_noflush instead of pgb.write_message for all prefetched pages except the last) has only a negative impact on performance.
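
For clarity, here is a generic Rust sketch of the batching idea being described (buffer all prefetched pages and flush once at the end, analogous to using pgb.write_message_noflush for every page except the last); this is not the pageserver's actual protocol code.

```rust
use std::io::{self, BufWriter, Write};

// "write_message_noflush": buffer each prefetched page and flush only once
// at the end of the batch, instead of flushing after every page.
fn send_prefetched_pages<W: Write>(sink: W, pages: &[Vec<u8>]) -> io::Result<()> {
    let mut out = BufWriter::new(sink);
    for page in pages {
        out.write_all(page)?;
    }
    out.flush() // one flush (and thus one syscall burst) per batch
}

fn main() -> io::Result<()> {
    let pages: Vec<Vec<u8>> = (0..8).map(|_| vec![0u8; 8192]).collect();
    // Stand-in for the compute connection; io::sink() just discards bytes.
    send_prefetched_pages(io::sink(), &pages)
}
```

Why this hurt rather than helped on localhost is still open; one possible (unverified) factor is that delaying the flush also delays when the compute can start processing the first page of the batch.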

neondatabase-bot added this to the 2022/07 milestone Jul 5, 2022
neondatabase-bot modified the milestones: 2022/07, 2022/08 Jul 25, 2022
@hlinnaka (Contributor, Author):
I added a summary of all the ongoing investigations and patches to the issue description. Please update if I missed something.

@hlinnaka (Contributor, Author):

> I added a summary of all the ongoing investigations and patches to the issue description. Please update if I missed something.

scratch that, I moved the summary to #2221 instead

@knizhnik (Contributor):

> But the third cause was that the pageserver could not keep up with processing the WAL that was generated. If it could keep up, the backpressure would not be needed, and the GetPage request would have finished quickly. So why did the pageserver not keep up? A bug? Is WAL ingestion in the pageserver too slow, and we need to speed it up? Was the production server overloaded?

Backpressure is still needed, because we have one more component processing WAL: the safekeepers.
They are actually the first component which persists data on non-volatile media. The speed at which the compute produces new WAL is limited only by CPU and network throughput, while the speed of the safekeeper is limited by its disk write rate, which is usually much lower than the network bandwidth and CPU speed. So without backpressure a lot of WAL would accumulate on the compute side, which can cause disk space exhaustion and also getpage_at_lsn timeout expiration.
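
A minimal model of the backpressure mechanism described here (illustrative names and limits, not Neon's actual settings or code): the compute compares the LSN of the WAL it has generated against the LSN the storage side has durably processed, and stalls before generating more WAL once the lag exceeds a configured limit.

```rust
// Illustrative names and numbers only; not Neon's actual settings or code.
#[derive(Clone, Copy)]
struct Lsn(u64);

struct BackpressureLimit {
    max_write_lag_bytes: u64,
}

// The compute stalls before generating more WAL once the gap between the
// WAL it has written and the WAL the storage side has durably processed
// exceeds the configured limit.
fn should_throttle(compute_lsn: Lsn, remote_consistent_lsn: Lsn, limit: &BackpressureLimit) -> bool {
    compute_lsn.0.saturating_sub(remote_consistent_lsn.0) > limit.max_write_lag_bytes
}

fn main() {
    let limit = BackpressureLimit { max_write_lag_bytes: 500 * 1024 * 1024 };
    let compute_lsn = Lsn(12_000_000_000);
    let remote_consistent_lsn = Lsn(11_200_000_000); // ~800 MB of unprocessed WAL
    if should_throttle(compute_lsn, remote_consistent_lsn, &limit) {
        println!("backend waits before writing more WAL");
    }
}
```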

neondatabase-bot modified the milestones: 2022/08, 2023/03 Dec 20, 2022
@shanyp (Contributor) commented Dec 26, 2023

@jcsp same here with the batch ingestion question.

@jcsp (Contributor) commented Jan 2, 2024

I think this ticket is still valid: we don't have good testing in place for the raw ingest performance.
