PostgreSQL Database Backup and Restore

page owner: TFRS DevOps Specialist

Backup and Restore

PostgreSQL provides various backup and restore options. We chose the following two solutions for the TFRS project:

  • SQL Dump backup and restore
  • Continuous Archiving and Point-in-Time Recovery (PITR)

Common definitions for both solutions

  • Master PostgreSQL is the pod of dc/postgresql
  • Standby PostgreSQL is the pod of dc/postgresql-standby
  • [pgdatahome] is /var/lib/pgsql/data/userdata

Create storage and mount to master and standby

  • Create a Gluster CNS storage volume postgresql-backup-cns with access mode RWX (ReadWriteMany) and at least 5G of capacity
  • Mount the above storage at /postgresql-backup on the master
  • Mount the above storage at /postgresql-backup on the standby
  • Create the following folders under /postgresql-backup (a sketch of the layout follows this list)
    • SQLDump, containing the SQL dump files, named like tfrs-[DateTime].gz
    • basebackup, containing the base backup files, named like tfrs-[env]-[DateTime].tar
    • walbackup, containing archived transaction log files from the master's /var/lib/pgsql/data/userdata/pg_xlog
    • walbackup-standby, containing archived transaction log files from the standby's /var/lib/pgsql/data/userdata/pg_xlog
    • Note: if any of the above names needs to include a release number, add it
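
A minimal sketch of creating this layout, assuming the shared volume is mounted at /postgresql-backup and [env] is dev (both values are assumptions; adjust to your environment):

    # Sketch only: create the backup folder layout on the shared volume.
    ENV=dev   # stands in for the [env] placeholder used throughout this page
    mkdir -p /postgresql-backup/SQLDump/$ENV \
             /postgresql-backup/basebackup/$ENV \
             /postgresql-backup/walbackup/$ENV \
             /postgresql-backup/walbackup-standby/$ENV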

SQL Dump backup and restore

Backup on master

  • Scale dc/tfrs down to zero
  • Login to the master
    • Run "pg_dump tfrs | gzip > /postgresql-backup/SQLDump/[env]/tfrs-[DateTime].gz" (see the sketch below)
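
A minimal sketch of the backup step, assuming [env] is dev and deriving [DateTime] from the pod's clock:

    # Sketch only: dump the tfrs database into a timestamped, compressed file.
    ENV=dev                             # assumption for [env]
    DATETIME=$(date +%Y-%m-%d-%H%M%S)   # assumption for the [DateTime] format
    pg_dump tfrs | gzip > /postgresql-backup/SQLDump/$ENV/tfrs-$DATETIME.gz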

Recover on standby

  • Login to the standby
    • Run "psql -c 'drop database tfrs'" to remove the existing tfrs database on the standby server
    • Run "psql -c 'create database tfrs'" to create a new tfrs database on the standby server
    • Run "gunzip -c /postgresql-backup/SQLDump/[env]/tfrs-[DateTime].gz | psql tfrs"
    • Connect to the tfrs database and verify that the credit_trade table has the same up-to-date records as the master (see the sketch below)
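
The restore steps above, consolidated into a hedged sketch; ENV is an assumption and tfrs-[DateTime].gz is whichever dump file you produced during backup:

    # Sketch only: rebuild the tfrs database from a chosen dump file.
    ENV=dev   # assumption for [env]
    psql -c 'drop database tfrs'
    psql -c 'create database tfrs'
    gunzip -c /postgresql-backup/SQLDump/$ENV/tfrs-[DateTime].gz | psql tfrs
    # Spot-check: compare this count against the master's credit_trade table.
    psql tfrs -c 'select count(*) from credit_trade'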

Continuous Archiving and Point-in-Time Recovery (PITR)

PostgreSQL maintains write-ahead logs (WAL) in the pg_xlog directory. The logs record every change made to the database. These logs exist primarily for crash-safety purposes: if the system crashes, the database can be restored to consistency by "replaying" the log entries made since the last checkpoint.

Continuous Archiving

  • Reference material: 25.3.3. Making a Base Backup Using the Low Level API and 25.3.4. Recovering Using a Continuous Archive Backup at https://www.postgresql.org/docs/9.6/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-DATA
  • Scale dc/tfrs down to zero
  • Login to the master
    • Update the server configuration by running the commands below from psql; ALTER SYSTEM persists them to [pgdatahome]/postgresql.auto.conf (a consolidated sketch follows this list)
      • ALTER SYSTEM SET wal_level = 'replica';
      • ALTER SYSTEM SET archive_mode = 'on';
      • ALTER SYSTEM SET max_wal_senders = 3;
      • ALTER SYSTEM SET archive_command = 'test ! -f /postgresql-backup/walbackup/[env]/%f && cp %p /postgresql-backup/walbackup/[env]/%f';
    • Scale the master down to 0, wait 10 seconds, and scale it back up to 1 so the settings take effect. Make sure /var/lib/pgsql/data/userdata/pg_xlog contains log files.
    • run "psql tfrs" command to connect to database
      • run "SELECT pg_start_backup('[backup-datetime]')"
      • open another terminal (don't quit pgsql session) to run the command "tar --version" to make sure it is after v1.23 an then "cd /var/lib/pgsql/data; tar --warning=no-file-changed --warning=no-file-removed -cf /postgresql-backup/basebackup/[env]/tfrs-[env]-[datatime].tar userdata"
      • run "SELECT pg_stop_backup()"
    • Make sure /postgresql-backup/basebackup/[env] contains the base backup tar file, and /postgresql-backup/walbackup/[env] contains the backup history file written by pg_stop_backup(), for example:
      -rw-------. 1 1000230000 root 16777216 Apr 9 21:56 000000010000000000000006
      -rw-------. 1 1000230000 root 299 Apr 9 21:56 000000010000000000000006.00000028.backup
      -rw-------. 1 1000230000 root 16777216 Apr 9 21:56 000000010000000000000007
      -rw-------. 1 1000230000 root 16777216 Apr 9 21:51 000000010000000000000008
    • Make sure /postgresql-backup/walbackup/[env] contains the log files archived from the [pgdatahome]/pg_xlog folder, for example:
      -rw-------. 1 1000230000 2178 16777216 Apr 9 21:46 000000010000000000000002
      -rw-------. 1 1000230000 2178 16777216 Apr 9 21:47 000000010000000000000003
      -rw-------. 1 1000230000 2178 16777216 Apr 9 21:47 000000010000000000000004
      -rw-------. 1 1000230000 2178 16777216 Apr 9 21:51 000000010000000000000005
      -rw-------. 1 1000230000 2178 16777216 Apr 9 21:56 000000010000000000000006
      -rw-------. 1 1000230000 2178 299 Apr 9 21:56 000000010000000000000006.00000028.backup
    • If you receive the errors below when running "SELECT pg_stop_backup()", verify the archive command and check whether the file to be archived already exists under the target folder:
      NOTICE: pg_stop_backup cleanup done, waiting for required WAL segments to be archived
      WARNING: pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
      HINT: Check that your archive_command is executing properly. pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
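
The archiving setup and base backup above, consolidated into a hedged sketch to be run on the master; ENV and DATETIME stand in for the [env] and [DateTime] placeholders:

    # Sketch only: enable WAL archiving (ALTER SYSTEM persists to postgresql.auto.conf).
    ENV=dev
    psql -c "ALTER SYSTEM SET wal_level = 'replica'"
    psql -c "ALTER SYSTEM SET archive_mode = 'on'"
    psql -c "ALTER SYSTEM SET max_wal_senders = 3"
    psql -c "ALTER SYSTEM SET archive_command = 'test ! -f /postgresql-backup/walbackup/$ENV/%f && cp %p /postgresql-backup/walbackup/$ENV/%f'"
    # ... scale the master down and back up so the settings take effect, then:
    DATETIME=$(date +%Y-%m-%d-%H%M%S)
    # On 9.6 an exclusive base backup can be started and stopped from separate sessions.
    psql tfrs -c "SELECT pg_start_backup('$DATETIME')"
    (cd /var/lib/pgsql/data && tar --warning=no-file-changed --warning=no-file-removed \
        -cf /postgresql-backup/basebackup/$ENV/tfrs-$ENV-$DATETIME.tar userdata)
    psql tfrs -c "SELECT pg_stop_backup()"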

Point-in-Time Recovery (PITR)

The recovery has two steps: restoring the base backup and applying the WAL files.

  • Base backup restore (a consolidated sketch follows this list)
    • Scale down both the master and the standby to 0
    • Run "oc debug dc/postgresql-standby"
      • cd to /var/lib/pgsql/data and remove the userdata directory
      • Restore the base backup into the current folder by running "tar -xvf /postgresql-backup/basebackup/[env]/tfrs-[env]-[DateTime].tar"
      • Remove all files in the userdata/pg_xlog folder but keep the archive_status folder
      • Rename postmaster.pid to postmaster.pid.frombackup and postmaster.opts to postmaster.opts.frombackup
      • If the master's [pgdatahome]/pg_xlog folder contains log files, copy them to /postgresql-backup and then copy them here. This preserves the most recent transaction logs that did not get a chance to be archived.
  • Apply log files on top of the base backup
    • Still in the "oc debug dc/postgresql-standby" pod
      • Create a file called recovery.conf under the [pgdatahome] folder.
        • The file contains one line: "restore_command = 'cp /postgresql-backup/walbackup/[env]/%f %p'". Recovery uses this command to copy the archived transaction logs back into the [pgdatahome]/pg_xlog folder.
        • Make sure this file has the same ownership as the other files under the same folder. It is also better to run "chmod 666 recovery.conf" to grant read and write permission to others.
      • Open the userdata/postgresql.conf file, update archive_command to 'test ! -f /postgresql-backup/walbackup-standby/%f && cp %p /postgresql-backup/walbackup-standby/%f', and save the file.
  • Kick off the recovery
    • Scale up the standby to 1
    • Verify that [pgdatahome]/recovery.conf has been renamed to [pgdatahome]/recovery.done
    • Check for errors under /var/lib/pgsql/data/userdata/pg_log
    • oc rsh to the standby, connect to the tfrs database, and verify that the credit_trade table has the same up-to-date records as the original
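
The restore and recovery staging above, as one hedged sketch to be run inside the "oc debug dc/postgresql-standby" pod; ENV and the base backup file name are assumptions:

    # Sketch only: restore the base backup and stage recovery.conf for PITR.
    ENV=dev
    cd /var/lib/pgsql/data
    rm -rf userdata
    tar -xvf /postgresql-backup/basebackup/$ENV/tfrs-$ENV-[DateTime].tar
    # Keep pg_xlog/archive_status but drop the WAL segments bundled in the backup.
    find userdata/pg_xlog -maxdepth 1 -type f -delete
    mv userdata/postmaster.pid userdata/postmaster.pid.frombackup
    mv userdata/postmaster.opts userdata/postmaster.opts.frombackup
    # Recovery fetches archived WAL segments back into pg_xlog via this command.
    echo "restore_command = 'cp /postgresql-backup/walbackup/$ENV/%f %p'" > userdata/recovery.conf
    chmod 666 userdata/recovery.conf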

Switch TFRS to use standby database

After the standby database has been recovered from the master database, here are the steps to get the standby database working with the TFRS backend application (a sketch follows this list):

  • Scale down dc/tfrs to 0 (this should be done already)
  • Scale down master to 0 (this should be done already)
  • Scale down standby to 0 (this should be done already)
  • Scale up standby to 1
  • Open the postgresql service YAML file
    • Update selector -> name from postgresql to postgresql-standby
    • Save the YAML file and verify that the service is selecting the standby
  • Scale up dc/tfrs to 1
  • At this point, the standby database has been created from backup and is ready to be connected to by the TFRS backend.
  • Test login to make sure you can see the up-to-date credit trades.
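
A minimal sketch of the switch using oc patch rather than hand-editing the YAML; the service name postgresql is taken from the steps above:

    # Sketch only: repoint the postgresql service at the standby pods.
    oc scale dc/tfrs --replicas=0
    oc scale dc/postgresql --replicas=0
    oc scale dc/postgresql-standby --replicas=0
    oc scale dc/postgresql-standby --replicas=1
    oc patch svc postgresql -p '{"spec":{"selector":{"name":"postgresql-standby"}}}'
    oc scale dc/tfrs --replicas=1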

File System Level Backup

This is the third backup option. It backs up the entire file system under [pgdatahome].

  • Run rsync while the database is running
  • Run rsync --checksum after the database is shut down; --checksum copies only the changed files, which reduces the database downtime.
    Note: this option hadn't been tested at the time of writing this document. A sketch follows.
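
A hedged two-pass sketch (untested, per the note above); the fsbackup target folder is an assumption:

    # Sketch only: pass 1, bulk copy while the database is still running.
    rsync -a /var/lib/pgsql/data/userdata/ /postgresql-backup/fsbackup/
    # ... scale the database down to 0, then pass 2:
    # --checksum re-copies only files whose contents changed, shortening downtime.
    rsync -a --checksum /var/lib/pgsql/data/userdata/ /postgresql-backup/fsbackup/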

Sample Postgresql Corruption Messages

  • Could not locate a valid checkpoint record:
    2018-03-09 20:35:37 UTC LOG: invalid primary checkpoint record
    2018-03-09 20:35:37 UTC LOG: invalid secondary checkpoint record
    2018-03-09 20:35:37 UTC PANIC: could not locate a valid checkpoint record
    2018-03-09 20:35:37 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:38 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:39 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:40 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:41 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:42 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:43 UTC postgres FATAL: the database system is starting up
    2018-03-09 20:35:44 UTC LOG: startup process (PID 22) was terminated by signal 6: Aborted
    2018-03-09 20:35:44 UTC LOG: aborting startup due to startup process failure

Requested PostgreSQL Extensions

  • create extension pg_trgm
  • create extension hstore
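
A minimal sketch of installing both extensions in the tfrs database; IF NOT EXISTS keeps the commands safe to re-run:

    # Sketch only: install the requested extensions.
    psql tfrs -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm'
    psql tfrs -c 'CREATE EXTENSION IF NOT EXISTS hstore'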

Referenced Material