
Investigate bulk load performance #1263

Open · Tracked by #2028
hlinnaka opened this issue Feb 11, 2022 · 12 comments
Labels: c/storage/pageserver (Component: storage: pageserver)

@hlinnaka (Contributor) commented Feb 11, 2022

Follow-up on #1221. The first immediate cause of #1221 was that a GetPage request could not be interrupted. That was fixed. A second cause was that backpressure was not enabled or not working correctly, which allowed a lot of unprocessed WAL to accumulate, so a GetPage request with a fresh LSN had to wait for the WAL backlog to be processed. Fixing backpressure will take care of that.

But the third cause was that the pageserver could not keep up with processing the WAL that was generated. If it could keep up, the backpressure would not be needed, and the GetPage request would have finished quickly. So why did the pageserver not keep up? A bug? Is WAL ingestion in the pageserver too slow, and we need to speed it up? Was the production server overloaded?

@hlinnaka added the c/storage/pageserver label Feb 11, 2022
@knizhnik (Contributor):
I have provided some results of my investigation in ticket #553. Should I copy them here?
Briefly:

  • The insertion speed of the pageserver is fast enough, comparable with vanilla Postgres.
  • It is still slower than the speed at which the safekeeper saves data. So if the backpressure limits are large enough, a transaction can be committed by the safekeeper much faster than the data is stored by the pageserver; as a result, a subsequent select can be blocked for a substantial time (seconds) in the wait_for_lsn request.
  • Reconstructing a page from WAL records is about 10 times slower than reading it from an image layer, so checkpointing is important (the sketch after this list illustrates why).
  • On localhost, the performance of a parallel scan is comparable with an internal traversal of the pages inside the pageserver:

pageserver from delta layers: 220 sec
pageserver from image layers: 29 sec
zenith select count(*) (parallel plan): 42 sec
vanilla select count(*) (parallel plan): 15 sec
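
To make that ~10x gap concrete, here is a purely illustrative Rust sketch (hypothetical types, not the pageserver's actual layer code): an image-layer read is a single copy, while a delta-layer read has to replay every WAL record accumulated since the last full image, so its cost grows with the length of the delta chain.

```rust
// Purely illustrative types; the real pageserver structures are different.
struct WalRecord {
    bytes: Vec<u8>,
}

// Stand-in for WAL redo: mutate the page using the record payload.
fn apply_record(page: &mut [u8], rec: &WalRecord) {
    for (i, b) in rec.bytes.iter().enumerate() {
        page[i % page.len()] ^= *b;
    }
}

// Image-layer path: one lookup, one copy.
fn read_from_image_layer(image: &[u8]) -> Vec<u8> {
    image.to_vec()
}

// Delta-layer path: start from the last image and replay every record
// written since then; the work grows with the number of deltas.
fn reconstruct_from_deltas(base_image: &[u8], deltas: &[WalRecord]) -> Vec<u8> {
    let mut page = base_image.to_vec();
    for rec in deltas {
        apply_record(&mut page, rec);
    }
    page
}

fn main() {
    let image = vec![0u8; 8192];
    let deltas: Vec<WalRecord> = (0..10)
        .map(|i| WalRecord { bytes: vec![i as u8; 64] })
        .collect();
    assert_eq!(
        read_from_image_layer(&image).len(),
        reconstruct_from_deltas(&image, &deltas).len()
    );
}
```

This is why materializing image layers (checkpointing) bounds the reconstruction work per page.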

I tried to implement yet another prefetch: neondatabase/postgres#133, but at least on localhost it has a negative impact on performance: the same select count(*) takes 64 seconds.

So at least on localhost, prefetch seems to make no sense for seqscan. It will be interesting to repeat this experiment in the cloud.

@knizhnik (Contributor) commented Feb 17, 2022

I also noticed that wal_log_hints significantly (~10x) slows down selects: because they set hint bits, the pages are WAL-logged as full-page images (FPI), and as a result, instead of just reading data from disk we also have to write the same amount of data.
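
As a back-of-envelope illustration (illustrative numbers, not a measurement): with wal_log_hints=on, the first select that sets hint bits on a page emits a full-page image, so a "read-only" scan ends up writing roughly as much WAL as it reads.

```rust
fn main() {
    // Illustrative figures: 8 KB pages and a ~30 GB table.
    let page_size: u64 = 8192;
    let table_pages: u64 = 4_000_000;
    // Worst case: every page still needs its hint bits set, so the first
    // SELECT emits one full-page image per page it touches.
    let bytes_read = table_pages * page_size;
    let fpi_wal_written = table_pages * page_size;
    println!(
        "scan reads ~{:.1} GB and also writes ~{:.1} GB of FPI WAL",
        bytes_read as f64 / 1e9,
        fpi_wal_written as f64 / 1e9
    );
}
```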

@knizhnik (Contributor):
Insertion speed depends mostly on the number of records, not on the actual size of the written data.
Please compare the results for:

  1. create table t as select generate_series(1,100000000);
    time: 230 sec
    storage size: 7.9 Gb
    WAL size: 6.1 Gb
  2. create table t as select generate_series(1,100000000), repeat(' ', 256);
    time: 366 sec
    storage size: 33 Gb
    WAL size: 31 Gb

CPU usage of the pageserver is about 100%, so it looks like it is the bottleneck now.
The profile looks really strange:

   9,48%  WAL receiver th  libc.so.6          [.] pthread_rwlock_unlock@@GLIBC_2.34
   5,90%  WAL receiver th  libc.so.6          [.] pthread_rwlock_rdlock@GLIBC_2.2.5
   5,44%  WAL receiver th  pageserver         [.] pageserver::page_cache::PageCache::lock_for_write
   3,93%  WAL receiver th  libc.so.6          [.] pthread_rwlock_wrlock@@GLIBC_2.34
   3,28%  WAL receiver th  pageserver         [.] sharded_slab::pool::Pool<T,C>::get
   3,28%  WAL receiver th  pageserver         [.] <pageserver::layered_repository::ephemeral_file::EphemeralFile as std::os::imp::unix:
   3,17%  WAL receiver th  libc.so.6          [.] __memmove_avx_unaligned_erms
   2,54%  WAL receiver th  pageserver         [.] <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscri
   2,37%  WAL receiver th  pageserver         [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write
   1,95%  WAL receiver th  pageserver         [.] <core::fmt::Formatter as core::fmt::Write>::write_fmt
   1,87%  WAL receiver th  pageserver         [.] hashbrown::map::make_hash
   1,62%  WAL receiver th  pageserver         [.] <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscri
   1,55%  WAL receiver th  pageserver         [.] core::fmt::write
   1,51%  WAL receiver th  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
   1,40%  WAL receiver th  pageserver         [.] <&mut bincode::de::Deserializer<R,O> as serde::de::Deserializer>::deserialize_struct
   1,38%  WAL receiver th  pageserver         [.] sharded_slab::page::Shared<T,C>::mark_clear
   1,37%  WAL receiver th  pageserver         [.] <tracing_subscriber::fmt::fmt_layer::Layer<S,N,E,W> as tracing_subscriber::layer::Lay
   1,25%  WAL receiver th  pageserver         [.] bytes::bytes_mut::shared_v_drop
   1,25%  WAL receiver th  libc.so.6          [.] pthread_mutex_unlock@@GLIBC_2.2.5
   1,21%  WAL receiver th  pageserver         [.] pageserver::walrecord::decode_wal_record
   1,19%  WAL receiver th  pageserver         [.] tracing_subscriber::filter::env::EnvFilter::cares_about_span
   1,10%  WAL receiver th  pageserver         [.] bytes::bytes_mut::shared_v_clone
   1,08%  WAL receiver th  pageserver         [.] crc32c::hw_x86_64::crc32c
   1,07%  WAL receiver th  pageserver         [.] pageserver::walreceiver::thread_main
   1,05%  WAL receiver th  pageserver         [.] pageserver::walingest::WalIngest::ingest_record

So most of the time is spent in synchronization, despite the fact that there should be no conflicting access (GC and the checkpointer are disabled).
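
The profile is dominated by rwlock traffic even though nothing should be contended. A generic micro-benchmark sketch (not pageserver code) of why per-record synchronization shows up so prominently: even an uncontended lock round-trip per WAL record costs much more than taking the lock once per batch.

```rust
use std::sync::RwLock;
use std::time::Instant;

fn main() {
    let records: u64 = 10_000_000;
    let counter = RwLock::new(0u64);

    // One write-lock acquisition per "record".
    let start = Instant::now();
    for _ in 0..records {
        *counter.write().unwrap() += 1;
    }
    let per_record = start.elapsed();

    // One acquisition per batch of 1024 "records".
    let start = Instant::now();
    let mut done = 0u64;
    while done < records {
        let mut guard = counter.write().unwrap();
        for _ in 0..1024 {
            *guard += 1;
        }
        drop(guard);
        done += 1024;
    }
    let per_batch = start.elapsed();

    println!("per-record locking: {:?}, per-batch locking: {:?}", per_record, per_batch);
}
```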

@knizhnik (Contributor):
An interesting result based on the profile and inspection of random stack traces: just commenting out one line:

                   //let _enter = info_span!("processing record", lsn = %lsn).entered();

This increases the insert speed almost twice: 136 seconds vs. 230.
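
One possible way to keep the tracing information without paying for a span per WAL record is to enter a span per batch and downgrade the per-record detail to cheap events. This is only a sketch of the pattern using the tracing and tracing-subscriber crates; the function names and batching granularity are hypothetical, not the pageserver's actual code.

```rust
use tracing::{debug, info_span, Level};

// Stand-in for WalIngest::ingest_record; the real function is different.
fn apply_record(_lsn: u64, _rec: &[u8]) {}

fn ingest_batch(records: &[(u64, Vec<u8>)]) {
    // One span for the whole batch instead of one per WAL record.
    let _enter = info_span!("ingest_batch", n_records = records.len()).entered();
    for (lsn, rec) in records {
        apply_record(*lsn, rec);
        // A cheap event (filtered out at the default level) replaces the
        // per-record span.
        debug!(lsn = *lsn, len = rec.len(), "applied record");
    }
}

fn main() {
    tracing_subscriber::fmt().with_max_level(Level::INFO).init();
    let records = vec![(1u64, vec![0u8; 64]), (2, vec![0u8; 64])];
    ingest_batch(&records);
}
```

Whether this is the right fix for the pageserver is a separate question; the sketch only shows moving span creation out of the hot loop.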

@knizhnik (Contributor) commented Feb 22, 2022

I found a bug in my prefetch implementation, and after fixing it, prefetch shows a positive impact on performance. I used the following test to measure performance:

\timing
set max_parallel_workers_per_gather=0;
create table t as select generate_series(1,100000000),repeat(' ',256);
select count(*) from t;
| seqscan_prefetch_buffers | Query execution time (sec) |
|---:|---:|
| 0 | 456 |
| 4 | 256 |
| 8 | 234 |
| 16 | 221 |

So, prefetch speeds up seqscan by up to 2x.

With parallel plans the improvement is not as large: 54 vs. 67 seconds. But that may be caused by the limited resources of my notebook.

So I updated the PR for local prefetch (neondatabase/postgres#133) and am waiting for your review.
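
As a rough illustration of what the seqscan_prefetch_buffers window does, here is a hedged Rust sketch (the helper names are hypothetical stand-ins, not the code from neondatabase/postgres#133): the scan keeps up to N GetPage requests in flight so that their round-trip latency overlaps with processing of pages that have already arrived.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for sending a GetPage@LSN request without waiting.
fn send_getpage_request(blkno: u32) -> u32 {
    blkno // pretend the "handle" is just the block number
}

// Hypothetical stand-in for receiving the corresponding 8 KB page.
fn receive_page(handle: u32) -> Vec<u8> {
    vec![(handle % 251) as u8; 8192]
}

fn seqscan_with_prefetch(nblocks: u32, prefetch_buffers: u32) -> u64 {
    let mut in_flight: VecDeque<u32> = VecDeque::new();
    let mut next_to_request = 0u32;
    let mut checksum = 0u64;

    for _blkno in 0..nblocks {
        // Top up the prefetch window before consuming the next page.
        // A window of 0 is treated as 1, i.e. plain synchronous reads.
        while in_flight.len() < prefetch_buffers.max(1) as usize && next_to_request < nblocks {
            in_flight.push_back(send_getpage_request(next_to_request));
            next_to_request += 1;
        }
        // Consume the oldest outstanding request; its latency was hidden
        // behind the processing of the previous pages.
        let page = receive_page(in_flight.pop_front().unwrap());
        checksum += page[0] as u64;
    }
    checksum
}

fn main() {
    // Mirrors the seqscan_prefetch_buffers values tried above.
    for bufs in [0, 4, 8, 16] {
        println!("prefetch_buffers={} checksum={}", bufs, seqscan_with_prefetch(1000, bufs));
    }
}
```

With compute and pageserver on the same machine the latency being hidden is small, which is consistent with the modest gain for parallel plans.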

@knizhnik (Contributor) commented Feb 22, 2022

Something more about network communication latencies (once again, the results were produced on my notebook). This is a ping-pong test which simulates get_page_at_lsn requests to the pageserver (16-byte requests and an 8Kb response). The test loads 3703704 pages, which corresponds to the size of table t used in the zenith test above, so we can estimate the overhead of network communication.

| interface | request/response batch size | time (sec) |
|---|---|---:|
| loopback | 8/8 | 40 |
| loopback | 8/1 | 94 |
| loopback | 1/1 | 247 |
| unix sockets | 8/8 | 18 |
| unix sockets | 8/1 | 21 |
| unix sockets | 1/1 | 96 |

Please notice that the 8/1 mode is what the prefetch implementation currently uses (to avoid changes in the pageserver). I tried to play with the libpq/socket rx/tx buffer sizes, but it looks like they do not have much effect on performance.
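
For reference, below is a self-contained Rust sketch of this kind of loopback ping-pong test (16-byte requests, 8 KB responses). It approximates the 8/8 and 1/1 modes from the table by varying how many requests are written before the corresponding responses are read back; it is a re-creation of the methodology, not the original benchmark code.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Instant;

const REQ: usize = 16;
const RESP: usize = 8192;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();

    // "Pageserver": answer every 16-byte request with an 8 KB page.
    thread::spawn(move || {
        let (mut sock, _) = listener.accept().unwrap();
        let mut req = [0u8; REQ];
        let page = [0u8; RESP];
        while sock.read_exact(&mut req).is_ok() {
            sock.write_all(&page).unwrap();
        }
    });

    let mut sock = TcpStream::connect(addr).unwrap();
    sock.set_nodelay(true).unwrap();

    let total_requests = 100_000; // scaled down from 3703704 for a quick run
    for batch in [8usize, 1] {
        let start = Instant::now();
        let mut done = 0;
        while done < total_requests {
            let n = batch.min(total_requests - done);
            // Send `n` requests back to back, then read the `n` responses;
            // batch = 1 is the fully synchronous case.
            for _ in 0..n {
                sock.write_all(&[0u8; REQ]).unwrap();
            }
            let mut resp = vec![0u8; RESP];
            for _ in 0..n {
                sock.read_exact(&mut resp).unwrap();
            }
            done += n;
        }
        println!("batch size {}: {:?}", batch, start.elapsed());
    }
}
```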


@knizhnik (Contributor) commented Feb 22, 2022

With a small (3Gb) table which completely fits in memory, the effect of prefetch is almost the same: 21 vs 42 sec.

Vanilla is able to sequentially scan this table in just 3 seconds (2807 msec).

The time of a local scan of the relation inside Postgres is 8 seconds.

It is strange, but supporting batch mode in the page_server<->compute protocol (using pgb.write_message_noflush instead of pgb.write_message for all prefetched pages except the last) has only a negative impact on performance.
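
For clarity, here is a generic Rust sketch of the batching idea being described (buffer all prefetched pages and flush once at the end, analogous to using pgb.write_message_noflush for every page except the last); this is not the pageserver's actual protocol code.

```rust
use std::io::{self, BufWriter, Write};

// "write_message_noflush": buffer each prefetched page and flush only once
// at the end of the batch, instead of flushing after every page.
fn send_prefetched_pages<W: Write>(sink: W, pages: &[Vec<u8>]) -> io::Result<()> {
    let mut out = BufWriter::new(sink);
    for page in pages {
        out.write_all(page)?;
    }
    out.flush() // one flush (and thus one syscall burst) per batch
}

fn main() -> io::Result<()> {
    let pages: Vec<Vec<u8>> = (0..8).map(|_| vec![0u8; 8192]).collect();
    // Stand-in for the compute connection; io::sink() just discards bytes.
    send_prefetched_pages(io::sink(), &pages)
}
```

Why this hurt rather than helped on localhost is still open; one possible (unverified) factor is that delaying the flush also delays when the compute can start processing the first page of the batch.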

neondatabase-bot added this to the 2022/07 milestone Jul 5, 2022
neondatabase-bot modified the milestones: 2022/07, 2022/08 Jul 25, 2022
@hlinnaka (Contributor, Author):
I added a summary of all the ongoing investigations and patches to the issue description. Please update if I missed something.

@hlinnaka (Contributor, Author):

> I added a summary of all the ongoing investigations and patches to the issue description. Please update if I missed something.

scratch that, I moved the summary to #2221 instead

@knizhnik (Contributor):

> But the third cause was that the pageserver could not keep up with processing the WAL that was generated. If it could keep up, the backpressure would not be needed, and the GetPage request would have finished quickly. So why did the pageserver not keep up? A bug? Is WAL ingestion in the pageserver too slow, and we need to speed it up? Was the production server overloaded?

Backpressure is still needed, because we have one more component processing WAL: the safekeepers.
They are actually the first component which persists data on non-volatile media. The speed at which the compute produces new WAL is limited only by CPU and network throughput, while the speed of the safekeeper is limited by its disk write rate, which is usually much lower than the network bandwidth and CPU speed. So without backpressure a lot of WAL would accumulate on the compute side, which can cause disk space exhaustion and also getpage_at_lsn timeout expiration.
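
A minimal model of the backpressure mechanism described here (illustrative names and limits, not Neon's actual settings or code): the compute compares the LSN of the WAL it has generated against the LSN the storage side has durably processed, and stalls before generating more WAL once the lag exceeds a configured limit.

```rust
// Illustrative names and numbers only; not Neon's actual settings or code.
#[derive(Clone, Copy)]
struct Lsn(u64);

struct BackpressureLimit {
    max_write_lag_bytes: u64,
}

// The compute stalls before generating more WAL once the gap between the
// WAL it has written and the WAL the storage side has durably processed
// exceeds the configured limit.
fn should_throttle(compute_lsn: Lsn, remote_consistent_lsn: Lsn, limit: &BackpressureLimit) -> bool {
    compute_lsn.0.saturating_sub(remote_consistent_lsn.0) > limit.max_write_lag_bytes
}

fn main() {
    let limit = BackpressureLimit { max_write_lag_bytes: 500 * 1024 * 1024 };
    let compute_lsn = Lsn(12_000_000_000);
    let remote_consistent_lsn = Lsn(11_200_000_000); // ~800 MB of unprocessed WAL
    if should_throttle(compute_lsn, remote_consistent_lsn, &limit) {
        println!("backend waits before writing more WAL");
    }
}
```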

neondatabase-bot modified the milestones: 2022/08, 2023/03 Dec 20, 2022
@shanyp (Contributor) commented Dec 26, 2023

@jcsp same here with the batch ingestion question.

@jcsp (Contributor) commented Jan 2, 2024

I think this ticket is still valid: we don't have good testing in place for the raw ingest performance.
