Administering any server-based application or service is normally a black art--and PostgreSQL is no different. This article discusses basic administration tasks in pgsql.
The purpose of this article is to discuss basic administration tasks that need to be performed by the budding PostgreSQL (pgsql) administrator. Administering any server-based application or service is normally a black art--and pgsql is no different. You should know how to do a number of things after the installation takes place, and this article is a good place to start.
Topics covered include making backups, the purpose of the template database, monitoring server resources, dump and restore operations, and security. Finally, the article will wrap up with a blurb on statistics gathering and pointers to further information.
Unless otherwise specified, all references to OS in this article should be treated as references to RedHat Linux 7.0.
A Blurb About Backups
As an administrator, you will undoubtedly be responsible for ensuring the safety of your data. Should this responsibility fall upon you, you should be aware of a number of items before you do a blanket save of your entire system.
The nature and frequency of backups are based on the type of databases you are housing on your server. For instance, if you are hosting a typical database application that is modified predominately during business hours, then you will most likely perform backups during downtime. A 24/7 operation such as a Web site is more problematic. Depending on site traffic, you may not have time to do a dedicated backup of the databases on the server.
Assuming you are running in a traditional business environment, the need to back up your data goes without saying: The real questions are how, and how often. The how is answered simply by analyzing the volatility of your data. Is the database a line-of-business application in which many hundreds or thousands of records are modified each day? Then that database is a good candidate for daily backups. Is it a database that is only updated once in a great while? Then a weekly or monthly backup schedule is appropriate.
At this point, you should load psql and create a database complete with a few tables and indexes. Then su to the pgsql superuser you created when you installed pgsql.
On the machine I'm using to write this article, my pgsql data directory is housed in /var/lib/pgsql/data. There was nothing special about this directory; it was just the default directory used by the RedHat 7.0 PostgreSQL RPM install.
Now that you are su'ed to the pgsql superuser and resting in the data directory, do an ls and look at what is there. The item you should look for at this point is a directory named base. This directory is the root data directory. Go ahead and cd into the base directory and take a peek. On my workstation config, I have two directories: template1 and damond. Damond is a database for my account on the machine. You should see a directory matching the name of the database you created earlier. Change into that directory and do an ls.
The files starting with pg_ are system files used by pgsql. We're interested in the files that don't start with pg. You'll notice that for each table or index you created, a matching file exists in the directory. Unlike other database systems, which use one large file for everything (I'm afraid to mention any particulars here, but I'm thinking of a few big league systems), pgsql stores each element in its own files. This allows you, as an administrator, to not only position individual files anywhere on your file system, but to selectively back up and restore individual tables in each database.
Now, before we get all giddy over this ability, I need to discuss reasons why you wouldn't want to reach directly into the data directory to do backup/restore processes.
Databases, by their very nature, are atomic: Operations either succeed or fail. Pgsql is no different in this regard. Developers can tell pgsql that they are starting a transaction, perform a unit of work, and then either commit or roll back their changes. Anything can be done during these transactions: records deleted, changed, added, etc. It's quite possible that before a change takes place in Table A, 15 other changes have occurred in Tables B and C. Because of this, it becomes a hairy business to do a full backup/restore on individual tables. But, that said, it is still possible to do so. As an administrator, you will have to work carefully with the application developers in order to plan and implement such a backup strategy.
In the end, it is much easier to do a full backup once a week on each database you care about and to do incremental backups on the remaining days. Many administrators do a full system save every night (which includes data and binaries--basically, whatever is on the disk). Backups such as these take a while, but allow for easy recovery of a system if it should fail completely.
Before leaving the subject of backing up your system, one last topic I'll discuss is testing the backups for validity. I ran into one customer who had been performing backups religiously for years. Turns out the company had been formatting backup tapes for years...it hadn't backed up a single byte of data from the beginning. Verify your tapes. Verify that the data is being saved. And, finally, verify that you can pull the data off the tape in case you need to. Disaster recovery is no joke. It will be made even less funny when you find that your backup tapes contain zero data.
The Purpose of template1
While poking around the pgsql data directory, you will notice a directory (or database) named template1. Template1 is used when users or developers do a CREATE DATABASE sql command, or when you do a createdb from the command line. Put simply, when either of the above commands are executed, a directory will be created named after the database being built; the contents of template1 will be copied to the new directory.
To see this process in action, change to the template1 directory and create a file from the command line (say, using cat). While still at the command line and still su'ed to the postgres superuser, type the following:
You should see the contents of test.txt spill out across your terminal.
$pwd/var/lib/pgsql/data/base/template1 (note: your data directory may be different)$cat > test.txtsample text<ctrl-d>$createdb test1CREATE DATABASE$cd ../test1$cat test.txt
From an administration viewpoint, you may be asking, What good is this? Well, to be honest, it probably isn't a major deal to you; but to the developers on the system, it's a nice place to deposit functions so that any new database they create has these same functions moved over for them.
Another good use for the template would be in the ASP space. Let's say, for example, you have a sales contact management system you are selling to remote users as a service. How do you separate client data from one client to another? One way would be to create fields in each record that hold the ID of the client in question. Although this approach may work, it's better to separate clients physically from one another. To do this, you would import the base contact management schema into template1. When you sign on a new customer, all you have to do is fire off a createdb command, and in a few seconds you have a brand new database for the new client.
Monitor Server Resource Usage
There will come a time during your administrative duties when you will have to assess the impact of adding more users, data, or applications to the database server. When the word performance comes up, you have to be aware of the context in which it is being used. Performance can be measured objectively or subjectively. A performance rating falls within one of two broad categories: user response time and processing time.
Processing time measures how long a series of instructions takes to execute from start to completion. Batch processing is typically measured using this term. It's a basic concept and won't be discussed much further than this.
User response time is a measure of time taken to perform an action initiated by the user at the keyboard. Response times in this category are very subjective--what might be slow to one user may be fine with another. This measure is a critical one, but fortunately it's easy to detect: Your users come to you crying that Step A takes 100 years to complete. When you start hearing comments like this, it's time to take a peek at the server to see if there is anything you can do to help your poor users out.
Unfortunately, no tool is packaged with the pgsql source distribution to help you nail performance problems. This isn't a major deal, in that the host operating system will most likely have at least basic tools to help you out. Under Unix-based systems, you will most likely have a tool like top or ps, which will start to give you clues. Items to look for are memory use (both physical and virtual), disk usage, and number of users hitting the machine at any given time.
The two most important factors when looking at pgsql performance are disk and RAM. Pgsql will serve data as fast as it can read it from the disk, but only if sufficient RAM is available. So the first thing you should consider is the speed of your disk and the location of the actual data. If pgsql is installed on the slowest disk you have on the system, then you should not be surprised if performance suffers. Is your primary swap file on the same partition as the data? This will cause drive head contentions as the system decides to either write to the database or to the swap file. I'm sure you realize which of the two would lose in that kind of contest. Are you sharing the data partition with any other disk-intensive tasks, such as Samba-based file sharing? What about print queues? All of these services take away from the primary task of answering requests for data.
The second item to look at is the amount of RAM you have installed on the machine. Each connection to pgsql requires a new postmaster task to drive the connection between the user and the backend data. These tasks require memory, and the amount of memory required will depend on the type of work you are performing during the connection. As a query is processed, pgsql will start to eat RAM. This isn't a memory leak: pgsql has to keep track of what it has done concerning the query you sent. As RAM slowly disappears, pgsql will start to eat swap space. This isn't a limitation of pgsql--it's a basic tenet of computer life.
Pgsql runs on a variety of systems that can generally be classified as either Windows NT or a Unix-like OS. Under Unix-based systems, there are a number of tools that come with the base system to allow the admin to take a peek at what is happening under the hood. Notable commands are: top and ps.
There will be times that it becomes necessary to move data from one version of pgsql to another, or to another database server altogether. To do this, you use a tool named pg_dump. This program will examine a database and generate a script to rebuild the components of the database. This script contains everything except the CREATE DATABASE command required to restore the database to the same point from which you dumped it. The output of pg_dump is in ASCII, so it's possible to use the output on a totally alien system: from your typical x86 box to a SPARC-based server, as one example.
You could use pg_dump as a poor man's backup utility, but this kind of usage isn't encouraged. Pg_dump works on entire tables or databases and there are no options to specify which records are to be dumped.
Pg_dump is typically used in two cases: transferring data and upgrades.
In the case of doing a transfer, you simply specify the command line options to pg_dump and redirect the output to a file on whatever file system you like. Then, on the other system, you simply run psql with the f command line option.
In the case of upgrades, the pgsql developers will sometimes make changes to the database that will require that you dump the data from the old server, compile and install the new software, and then import the dump back into the new database installation.
Using pg_dump is straightforward. You simply specify a database name on the command redirecting output to a file or another pgsql database. To dump the contents of a database named DAMOND you would type the following command:
pg_dump damond > damond.txt
That example was simple enough--now for a more useful one. Let's say you want to transfer the contents of database, DAMOND, to DAMOND2:
createdb damond2; pg_dump damond |psql d damond2
You can specify table names with pg_dump. The following command will just transfer table test1 from database DAMOND to DAMOND2 using the t (or --table) command line option:
createdb damond2; pg_dump t test1 damond |psql d damond2
You should be aware of one command line option, depending on how your developers create their applications. Every row of data in pgsql contains a hidden field named OID. An OID is a unique value assigned to every record in the database. Some developers will use the OID to locate individual records quickly using indexes. Pg_dump will not retain the OIDs in the current database when it starts to dump the data. To tell pg_dump to retain the OIDs, you must specify o (or oids) on the command line.
Pg_dump works with individual databases within pgsql. To dump all the databases on the server would require a separate pg_dump command for each one. The pgsql developers have provided a wrapper around pg_dump named pg_dumpall that will dump all the databases on a server with one command line.
Security: First Things First: Connecting to the Postmaster
When a user attempts to connect to pgsql, the first security gate they have to pass is defined by the entries in a file named pg_hba.conf. This file, located in /, tells pgsql both which clients can connect and what type of challenge or response to issue when those users attempt to make the connection.
When pgsql is installed (either via RPM or by manually building the source), the default pg_hba.conf file is set to allow trusted connections. Trusted connections, in this case, mean that pgsql will believe that you are whoever you say you are. If you have a new install of pgsql on your system, type the following at the command line:
psql U postgres d template1 User postgres, on my system, is the pgsql super user--able to create and destroy databases. If you have installed RedHat RPMs with the base, then you will have a postgres user defined, as well, in your /etc/passwd file. Otherwise, you should specify whatever user id you decided upon when you compiled and installed pgsql.
If you have an unmodified pgsql installation, you should be presented with the standard psql prompt. This isn't good. Luckily, the file itself is well documented, and the changes are easy to make. Su to the pgsql super user, load the file in an editor, and read on.
The basic line format of the pg_hba.conf file goes like this:
Every time a host tries to connect to pgsql, the pg_hba.conf file is scanned. The first entry found in the file that matches the connection type (LOCAL, HOST, and HOSTSSL) and IP address of the host is used to determine what happens next. Because of this, order of entries is very important. Put your most restrictive entries for any given database first.
|Name ||Description |
|TYPE ||The type of host you are specifying. Generally this item will be HOST or LOCAL. Another type, HOSTSSL, is discussed in the documentation. |
|DB ||The database this entry is for. You are allowed to have several entries per database. |
|IP ||The IP address of the host that is connecting to the server. |
|IP_MASK ||The IP MASK of the host connecting to the server. |
|AUTHENTICATION ||The type of authentication to be made between the client and server. There are several valid options, which are discussed below. |
|AUTHENTICATION_OPTIONS ||Any other options you specify based on the authentication type specified above. |
The best way to learn about pg_hba.conf is to see a few examples. Entries in my configuration file are as follows:
local all trust The above entry allows anyone to connect to the databases on the system as anybody, while logged onto the machine either via telnet or if they're actually sitting at the terminal. A handy entry for development purposes, but not something you would want on a production data server. At a minimum, you'd want to use the following line to enforce standard username/password authentication:
local all password The next entry in my configuration file is:
host all 192.168.99. 255.255.255.0 password The above line tells pgsql to challenge anyone from my local network for a password when they try to connect to the database.
The default connection activity, assuming no other entry describes the connection attempt, is to reject all hosts. The pgsql documentation states that if a situation isn't covered by any of the entries in pg_hba.conf, then the connection will be killed.
There are about a billion different permutations of the above that you can go through. Be aware that when specifying PASSWORD in pg_hba.conf, you are telling pgsql that clear-text usernames and passwords will be shipped over the wire. Instead of specifying PASSWORD, you will want to examine the CRYPT option (which, unsurprisingly, encrypts the password before it sends it to the server).
There are two reasons to run vacuum: to remove dead space and to update index statistics.
When a record is deleted in pgsql, the record is flagged as deleted but not removed from disk. Create a table with 100,000 records in it, and then delete all the data in the table. If you examine the file system, you will notice the file is still quite large. Also, when users modify data, instead of modifying the record as it stands, the record is deleted and a new record is added to the end of the table. Deletes and updates create dead spots, or holes, in your database. Vacuum, when run regularly, removes these dead spaces in your database. Depending on how volatile your database is, you may want to run vacuum weekly or evendaily.
The second reason to run vacuum is to analyze index coverage statistics of the database. These statistics are used to determine the best query plan to use when running queries against the database. Unlike other systems, which manage statistics on the fly, pgsql leaves the statistics gathering up to the administrator. To do this, you run vacuum with the --analyze command line option. This command should be run at least once a day, and possibly more, depending on what you do with your database. (Did you just import a million records? Analyze. Did you just delete 500,000 records? Analyze.)
You will know it's time to analyze your tables when a query suddenly takes a great while to process.
Links to Information
PostgreSQL Web site http://www.postgresql.org
PostgreSQL v7.0 Administration Documentation http://postgresql.readysetnet.com/users-lounge/docs/7.0/admin/
Damond Walker works at Black Oak Computer Services Inc, of Newark, Del., as a Lotus Notes/Domino and SQL Server consultant. When not pushing bytes for his employer, he pushes strollers with his wife and their twin daughters.