The Waiting ... Is the Hardest Part
Please wait ... and wait and wait again. One of the nagging problems many databases can't seem to shake is that at some point a user is going to have to wait while someone else updates the system. Whether the database system is using table-level, page-level, column-level, or row-level locking, the same problem arises: readers (SELECTs) wait for writers (UPDATEs) to finish, and writers (UPDATEs) wait for readers (SELECTs) to finish.
PostgreSQL, the open source database that is often compared to most advanced commercial databases, has a mechanism called MVCC (MultiVersion Concurrency Control). MVCC has the ability to perform row-level locking as well as lock rows for writing in one session while giving access to these rows unaffected in another session. It's been said that MySQL, a lightweight open source database, has a version of MVCC, but the two aren't really comparable since MySQL can do only table locking for reading. This is one of the many instances where PostgreSQL has the upper hand.
With PostgreSQL, readers never wait for writers, and writers never wait for readers. I can already hear the objections to the claim that there is no "no-locking" in PostgreSQL, so let me explain PostgreSQL's MVCC in further detail.
In other database systems, locks maintain concurrency control and data consistency. But PostgreSQL begins with a multi-version model. With PostgreSQL, a version is like a snapshot of the data at a distinct point in time. The current version of the data appears whenever users query a table. Naturally, a new version appears if they run the same query again on the table and any data has changed. Such changes happen in a database through UPDATE, INSERT, or DELETE statements.
The essential difference between traditional row-level locking and PostgreSQL's MVCC lies in what the user sees when he or she selects data from a particular table, as in the following example.
SELECT headlines FROM news_items
In this example, the statement reads data from a table called news_items and displays all the rows in the column called headlines. In data systems that use row-level locking, the SELECT statement will fail and the user will have to wait if another user is concurrently inserting (INSERT) or updating (UPDATE) data in the table news items. The transaction that modifies the data holds a lock on the row(s) and therefore all rows from the table cannot be displayed, forcing users to wait until the lock releases. Users who have encountered frequent locks when trying to read data know all too well the frustration this locking scheme can cause.
In contrast, PostgreSQL would allow all users to view the news_items table concurrently, eliminating the need to wait for a lock to be released. This is always the case, even if multiple users are inserting and updating data in the table at the same time. When a user issues the SELECT query, PostgreSQL displays a snapshot -- a version, actually -- of all the data that users have committed before the query began. Any data updates or inserts that are part of open transactions or that were committed after the query began will not be displayed. Makes a lot of sense, doesn't it?
A Closer Look
Database systems that use row-level locking do not retain old versions of the data, hence the need for locks to maintain data consistency. But a deeper look into how "no-locking" through MVCC works in PostgreSQL reveals how PostrgreSQL gets around this limitation. Each row in PostgreSQL has two transaction IDs. It has a creation transaction ID for the transaction that created the row, and an expiration transaction ID for the transaction that expired the row. When someone performs an UPDATE, PostgreSQL creates a new row and expires the old one. It's the same row, but in different versions. Unlike database systems that don't hold on to the old version, when PostgreSQL creates a new version of the row it also retains the old or expired version.
That's how PostgreSQL creates versions of the data, but how does it know which version to display? It bases its display on several criteria. At the start of a query, PostgreSQL records two things: 1) the current transaction ID and 2) all in-process transaction IDs. When someone accesses data, Postgres issues a query to display all the row versions that match the following criteria: the row's creation transaction ID is a committed transaction and is less than the current transaction counter, and the row lacks an expiration transaction ID or its expiration transaction ID was in process at query start.
And this is where MVCC's power resides. It enables PostgreSQL to keep track of transaction IDs to determine the version of the data, and thereby avoid having to issue any locks. It's a very logical and efficient way of handling transactions. New PostgreSQL users are often pleasantly surprised by the performance boost of MVCC over row-level locking, especially in a large multi-user environment.
Another advantage MVCC offers is hot backups. Many other databases require users to shutdown the database or lock all tables to get a consistent snapshot -- not so with PostreSQL. MVCC allows PostgreSQL to make a full database backup while the database is live. It simply takes a snapshot of the entire database at a point in time and dumps the output even while data is being inserted, updated or deleted.
In conclusion, the real world advantages of using PostgreSQL's MVCC are numerous, especially in any high-transaction environment where efficiency is key. But why take my word for it? To find out if PostgreSQL's multi-version model is truly better than row-level locking, try it out for yourself.