== PostgreSQL Weekly News – July 02 2012 ==
Kevin Grittner and Dan Ports will be presenting their paper on SSI at
VLDB in Istanbul. While they won’t be the first or only people there
whose work is based on PostgreSQL, they will be the first to present
as public representatives of the PostgreSQL community.
PostgreSQL Conference Europe 2012 in Prague, The Czech Republic,
on October 23-26 is now accepting registrations for conference
== PostgreSQL Product News ==
AnySQL Maestro 12.6, an ODBC-based management tool which works with
pg_extractor 1.2.0, a customizing add-on to pg_dump, released.
== PostgreSQL Jobs for July ==
== PostgreSQL Local ==
Everything this week was global.
== PostgreSQL in the News ==
Planet PostgreSQL: http://planet.postgresql.org/
PostgreSQL Weekly News is brought to you this week by David Fetter
Submit news and announcements by Sunday at 3:00pm Pacific time.
Please send English language ones to firstname.lastname@example.org, German language
to email@example.com, Italian language to firstname.lastname@example.org. Spanish language
== Applied Patches ==
Kevin Grittner pushed:
- Fix warning for 64-bit literal on 32-bit build.
Robert Haas pushed:
- Remove sanity test in XRecOffIsValid. Commit
061e7efb1b4c5b8a5d02122b7780531b8d5bf23d changed the rules for
splitting xlog records across pages, but neglected to update this
test. It’s possible that there’s some better action here than just
removing the test completely, but this at least appears to get some
of the things that are currently broken (like initdb on MacOS X)
- Fix typo in DEBUG message, introduced by recent WAL refactoring.
- Unbreak pg_resetxlog -l. Fujii Masao
- Backport fsync queue compaction logic to all supported branches.
This backports commit 7f242d880b5b5d9642675517466d31373961cf98,
except for the counter in pg_stat_bgwriter. The underlying problem
(namely, that a full fsync request queue causes terrible checkpoint
behavior) continues to be reported in the wild, and this code seems
to be safe and robust enough to risk back-porting the fix.
- Reduce use of heavyweight locking inside hash AM. Avoid using
LockPage(rel, 0, lockmode) to protect against changes to the bucket
mapping. Instead, an exclusive buffer content lock is now viewed as
sufficient permission to modify the metapage, and a shared buffer
content lock is used when such modifications need to be prevented.
This more relaxed locking regimen makes it possible that, when we’re
busy getting a heavyweight lock on the bucket we intend to search
or insert into, a bucket split might occur underneath us. To
compensate for that possibility, we use a loop-and-retry system:
release the metapage content lock, acquire the heavyweight lock on
the target bucket, and then reacquire the metapage content lock and
check that the bucket mapping has not changed. Normally it hasn’t,
and we’re done. But if by chance it has, we simply unlock the
metapage, release the heavyweight lock we acquired previously, lock
the new bucket, and loop around again. Even in the worst case we
cannot loop very many times here, since we don’t split the same
bucket again until we’ve split all the other buckets, and 2^N gets
big pretty fast. This results in greatly improved concurrency,
because we’re effectively replacing two lwlock acquire-and-release
cycles in exclusive mode (on one of the lock manager locks) with a
single acquire-and-release cycle in shared mode (on the metapage
buffer content lock). Testing shows that it’s still not quite as
good as btree; for that, we’d probably have to find some way of
getting rid of the heavyweight bucket locks as well, which does not
appear straightforward. Patch by me, review by Jeff Janes.
- Make DROP FUNCTION hint more informative. If you decide you want to
take the hint, this gives you something you can paste right back to
the server. Dean Rasheed
- When LWLOCK_STATS is defined, count spindelays. When LWLOCK_STATS
is *not* defined, the only change is that SpinLockAcquire now
returns the number of delays. Patch by me, review by Jeff Janes.
- Allow pg_terminate_backend() to be used on backends with matching
role. A similar change was made previously for pg_cancel_backend,
so now it all matches again. Dan Farina, reviewed by Fujii Masao,
Noah Misch, and Jeff Davis, with slight kibitzing on the doc changes
- Update release notes for pg_terminate_backend changes.
- Add missing space in event_source GUC description. This has
apparently been wrong since event_source was added. Alexander
- Dramatically reduce System V shared memory consumption. Except when
compiling with EXEC_BACKEND, we’ll now allocate only a tiny amount
of System V shared memory (as an interlock to protect the data
directory) and allocate the rest as anonymous shared memory via
mmap. This will hopefully spare most users the hassle of adjusting
operating system parameters before being able to start PostgreSQL
with a reasonable value for shared_buffers. There are a bunch of
documentation updates needed here, and we might need to adjust some
of the HINT messages related to shared memory as well. But it’s not
100% clear how portable this is, so before we write the
documentation, let’s give it a spin on the buildfarm and see what
- Fix broken mmap failure-detection code, and improve error message.
Per an observation by Thom Brown that my previous commit made an
overly large shmem allocation crash the server, on Linux.
- Make walsender more responsive. Per testing by Andres Freund, this
improves replication performance and reduces replication latency and
latency jitter. I was a bit concerned about moving more work into
XLogInsert, but testing seems to show that it’s not a problem in
practice. Along the way, improve comments for WaitLatchOrSocket.
Andres Freund. Review and stylistic cleanup by me.
- Make commit_delay much smarter. Instead of letting every backend
participating in a group commit wait independently, have the first
one that becomes ready to flush WAL wait for the configured delay,
and let all the others wait just long enough for that first process
to complete its flush. This greatly increases the chances of being
able to configure a commit_delay setting that actually improves
performance. As a side consequence of this change, commit_delay now
affects all WAL flushes, rather than just commits. There was some
discussion on pgsql-hackers about whether to rename the GUC to, say,
wal_flush_delay, but in the absence of consensus I am leaving it
alone for now. Peter Geoghegan, with some changes, mostly to the
documentation, by me.
- Work a little harder on comments for walsender wakeup patch. Per
gripe from Tom Lane.
- Fix position of WalSndWakeupRequest call. This avoids
discriminating against wal_sync_method = open_sync or open_datasync.
Fujii Masao, reviewed by Andres Freund
- Fix a stupid bug I introduced into XLogFlush(). Commit
f11e8be3e812cdbbc139c1b4e49141378b118dee broke this; it was right in
Peter’s original patch, but I messed it up before committing.
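As an illustration of the loop-and-retry locking described in the hash AM item above, here is a minimal sketch in Python (PostgreSQL itself is written in C). Everything in it is hypothetical: the ToyHashIndex class, the modulo bucket mapping, and the before_bucket_lock hook used to force a split at the critical moment are stand-ins, not the hash AM's actual code or data structures. A plain mutex also stands in for the shared/exclusive metapage content lock.

```python
import threading


class ToyHashIndex:
    """Toy model of a hash index: one metapage lock plus per-bucket locks."""

    def __init__(self, nbuckets, max_buckets=64):
        # Stand-in for the metapage buffer content lock (really shared/exclusive).
        self.meta_lock = threading.Lock()
        # Bucket-mapping state that would live in the metapage.
        self.nbuckets = nbuckets
        # Stand-ins for the heavyweight per-bucket locks.
        self.bucket_locks = [threading.Lock() for _ in range(max_buckets)]

    def bucket_for(self, hashvalue):
        # Hypothetical mapping; caller must hold meta_lock.
        return hashvalue % self.nbuckets

    def split(self):
        # Changing the mapping requires the metapage lock.
        with self.meta_lock:
            self.nbuckets *= 2

    def lock_target_bucket(self, hashvalue, before_bucket_lock=None):
        """Return (bucket, retries) with that bucket's lock held."""
        with self.meta_lock:
            bucket = self.bucket_for(hashvalue)
        retries = 0
        while True:
            # The metapage lock is released here, so a split can sneak in.
            if before_bucket_lock is not None:
                before_bucket_lock()
            self.bucket_locks[bucket].acquire()
            # Reacquire the metapage lock and recheck the mapping.
            with self.meta_lock:
                current = self.bucket_for(hashvalue)
            if current == bucket:
                return bucket, retries
            # Mapping changed: drop the stale lock and retry with the new bucket.
            self.bucket_locks[bucket].release()
            bucket = current
            retries += 1


# Usage: force exactly one split inside the unlocked window.
index = ToyHashIndex(nbuckets=4)
pending_splits = [index.split]


def split_once():
    if pending_splits:
        pending_splits.pop()()


bucket, retries = index.lock_target_bucket(5, before_bucket_lock=split_once)
```

The hook makes a split happen in the window where no metapage lock is held, so the first bucket lock turns out to be stale and the loop goes around once more before returning.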
Peter Eisentraut pushed:
- Unify calling conventions for postgres/postmaster sub-main
functions. There was a wild mix of calling conventions: Some were
declared to return void and didn’t return, some returned an int exit
code, some claimed to return an exit code, which the callers
checked, but actually never returned, and so on. Now all of these
functions are declared to return void and decorated with attribute
noreturn and don’t return. That’s easiest, and most code already
worked that way.
- Use system install program when available and usable. In
a3176dac22c4cd14971e35119e245abee7649cb9 we switched to using
install-sh unconditionally, because the configure check
AC_PROG_INSTALL would pick up any random program named install,
which has caused failure reports
Now the configure check is much improved and should avoid false
positives. It has also been shown that using a system install
program can significantly reduce “make install” times, so it’s worth
- Fix install program detection. configure handles INSTALL as a
substitution variable specially, and apparently it gets confused
when it’s set to empty. Use INSTALL_ instead as a workaround to
avoid the issue.
- Further fix install program detection. The $(or) make function was
introduced in GNU make 3.81, so the previous coding didn’t work in
3.80. Write it differently, and improve the variable naming to make
more sense in the new coding.
- Make init-po and update-po recursive make targets. This is for
convenience, now that adding recursive targets is much easier than
it used to be when the NLS stuff was initially added.
- initdb: Update check_need_password for new options. Change things
so that something like initdb --auth-local=peer --auth-host=md5 does
not cause a “must specify a password” error, like initdb -A md5
- Assorted message style improvements
Alvaro Herrera pushed:
- Tighten up includes in sinvaladt.h, twophase.h, proc.h. Remove
proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
- pg_upgrade: fix off-by-one mistake in snprintf. snprintf counts
trailing NUL towards the char limit. Failing to account for that
was causing an invalid value to be passed to pg_resetxlog -l,
aborting the upgrade process.
- Make the pg_upgrade log files contain actual commands. Now the log
file not only contains the output from commands executed by
system(), but also what command it was in the first place. This
arrangement makes debugging a lot simpler.
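The logging arrangement in the last item above can be sketched as follows. This is a Python illustration under assumptions, not pg_upgrade's C code: the run_logged helper, the "command:" prefix, and the scratch log file name are hypothetical choices made for the example.

```python
import os
import subprocess
import tempfile


def run_logged(cmd, log_path):
    """Append the command line to the log, then the command's own output."""
    with open(log_path, "a") as log:
        log.write("command: " + cmd + "\n")
        # Flush so the command line lands in the file before the child's output.
        log.flush()
        result = subprocess.run(
            cmd, shell=True, stdout=log, stderr=subprocess.STDOUT
        )
        # Blank line to separate this command from the next one.
        log.write("\n")
    return result.returncode


# Usage: log a trivial shell command into a scratch file.
log_path = os.path.join(tempfile.mkdtemp(), "upgrade_demo.log")
exit_code = run_logged("echo hello", log_path)
with open(log_path) as f:
    log_text = f.read()
```

With the command recorded next to its output, a failing step can be re-run by hand straight from the log.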
Tom Lane pushed:
- Make pg_dump emit more accurate dependency information. While
pg_dump has included dependency information in archive-format output
ever since 7.3, it never made any large effort to ensure that that
information was actually useful. In particular, in common
situations where dependency chains include objects that aren’t
separately emitted in the dump, the dependencies shown for objects
that were emitted would reference the dump IDs of these un-dumped
objects, leaving no clue about which other objects the visible
objects indirectly depend on. So far, parallel pg_restore has
managed to avoid tripping over this misfeature, but only by dint of
some crude hacks like not trusting dependency information in the
pre-data section of the archive. It seems prudent to do something
about this before it rises up to bite us, so instead of emitting the
“raw” dependencies of each dumped object, recursively search for its
actual dependencies among the subset of objects that are being
dumped. Back-patch to 9.2, since that code hasn’t yet diverged
materially from HEAD. At some point we might need to back-patch
further, but right now there are no known cases where this is
actively necessary. (The one known case, bug #6699, is fixed in a
different way by my previous patch.) Since this patch depends on 9.2
changes that made TOC entries be marked before output commences as
to whether they’ll be dumped, back-patching further would require
additional surgery; and as of now there’s no evidence that it’s
worth the risk.
- Improve pg_dump’s dependency-sorting logic to enforce section dump
order. As of 9.2, with the --section option, it is very important
that the concept of “pre data”, “data”, and “post data” sections of
the output be honored strictly; else a dump divided into separate
sectional files might be unrestorable. However, the
dependency-sorting logic knew nothing of sections and would happily
select output orderings that didn’t fit that structure. Doing so
was mostly harmless before 9.2, but now we need to be sure it
doesn’t do that. To fix, create dummy objects representing the
section boundaries and add dependencies between them and all the
normal objects. (This might sound expensive but it seems to only
add a percent or two to pg_dump’s runtime.) This also fixes a
problem introduced in 9.1 by the feature that allows incomplete
GROUP BY lists when a primary key is given in GROUP BY. That means
that views can depend on primary key constraints. Previously,
pg_dump would deal with that by simply emitting the primary key
constraint before the view definition (and hence before the data
section of the output). That’s bad enough for simple serial
restores, where creating an index before the data is loaded works,
but is undesirable for speed reasons. But it could lead to outright
failure of parallel restores, as seen in bug #6699 from Joe Van Dyk.
That happened because pg_restore would switch into parallel mode as
soon as it reached the constraint, and then very possibly would try
to emit the view definition before the primary key was committed (as
a consequence of another bug that causes the view not to be
correctly marked as depending on the constraint). Adding the
section boundary constraints forces the dependency-sorting code to
break the view into separate table and rule declarations, allowing
the rule, and hence the primary key constraint it depends on, to
revert to their intended location in the post-data section. This
also somewhat accidentally works around the bogus-dependency-marking
problem, because the rule will be correctly shown as depending on
the constraint, so parallel pg_restore will now do the right thing.
(We will fix the bogus-dependency problem for real in a separate
patch, but that patch is not easily back-portable to 9.1, so the
fact that this patch is enough to dodge the only known symptom is
fortunate.) Back-patch to 9.1, except for the hunk that adds
verification that the finished archive TOC list is in correct
section order; the place where it was convenient to add that doesn’t
exist in 9.1.
- Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on
text. The original coding failed miserably for BLCKSZ of 4K or
less, as reported by Josh Kupershmidt. With the present design for
text indexes, a given inner tuple could have up to 256 labels
(requiring either 3K or 4K bytes depending on MAXALIGN), which means
that we can’t positively guarantee no failures for smaller
blocksizes. But we can at least make it behave sanely so long as
there are few enough labels to fit on a page. Considering that
btree is also more prone to “index tuple too large” failures when
BLCKSZ is small, it’s not clear that we should expend more work than
this on this case.
- Make UtilityContainsQuery recurse until it finds a non-utility
Query. The callers of UtilityContainsQuery want it to return a
non-utility Query if it returns anything at all. However, since we
made CREATE TABLE AS/SELECT INTO into a utility
command instead of a variant of SELECT, a command like “EXPLAIN
SELECT INTO” results in two nested utility statements. So what we
need UtilityContainsQuery to do is drill down to the bottom
non-utility Query. I had thought of this possibility in setrefs.c,
and fixed it there by looping around the UtilityContainsQuery call;
but overlooked that the call sites in plancache.c have a similar
issue. In those cases it’s notationally inconvenient to provide an
external loop, so let’s redefine UtilityContainsQuery as recursing
down to a non-utility Query instead. Noted by Rushabh Lathia. This
is a somewhat cleaned-up version of his proposed patch.
- Provide MAP_FAILED if sys/mman.h doesn’t. On old HPUX this has to
be #defined to -1. It might be that other values are required on
other dinosaur systems, but we’ll worry about that when and if we
- Fix NOTIFY to cope with I/O problems, such as out-of-disk-space.
The LISTEN/NOTIFY subsystem got confused if SimpleLruZeroPage
failed, which would typically happen as a result of a write()
failure while attempting to dump a dirty pg_notify page out of
memory. Subsequently, all attempts to send more NOTIFY messages
would fail with messages like “Could not read from file
“pg_notify/nnnn” at offset nnnnn: Success”. Only restarting the
server would clear this condition. Per reports from Kevin Grittner
and Christoph Berg. Back-patch to 9.0, where the problem was
introduced during the LISTEN/NOTIFY rewrite.
- Fix confusion between “size” and “AnonymousShmemSize”. Noted by
Andres Freund. Also improve a couple of comments.
- Prevent CREATE TABLE LIKE/INHERITS from (mis)copying whole-row
Vars. If a CHECK constraint or index definition contained a
whole-row Var (that is, “table.*”), an attempt to copy that
definition via CREATE TABLE LIKE or table inheritance produced
incorrect results: the copied Var still claimed to have the rowtype
of the source table, rather than the created table. For the LIKE
case, it seems reasonable to just throw error for this situation,
since the point of LIKE is that the new table is not permanently
coupled to the old, so there’s no reason to assume its rowtype will
stay compatible. In the inheritance case, we should ideally allow
such constraints, but doing so will require nontrivial refactoring
of CREATE TABLE processing (because we’d need to know the OID of the
new table’s rowtype before we adjust inherited CHECK constraints).
In view of the lack of previous complaints, that doesn’t seem worth
the risk in a back-patched bug fix, so just make it throw error for
the inheritance case as well. Along the way, replace
change_varattnos_of_a_node() with a more robust function
map_variable_attnos(), which is capable of being extended to handle
insertion of ConvertRowtypeExpr whenever we get around to fixing the
inheritance case nicely, and in the meantime it returns a failure
indication to the caller so that a helpful message with some context
can be thrown. Also, this code will do the right thing with
subselects (if we ever allow them in CHECK or indexes), and it
range-checks varattnos before using them to index into the map
array. Per report from Sergey Konoplev. Back-patch to all
- Declare AnonymousShmem pointer as “void *”. The original coding had
it as “PGShmemHeader *”, but that doesn’t offer any notational
benefit because we don’t dereference it. And it was resulting in
compiler warnings on some platforms, notably buildfarm member
castoroides, where mmap() and munmap() are evidently declared to
take and return “char *”.
- Remove inappropriate semicolons after function definitions. Solaris
Studio warns about this, and some compilers might think it’s an
outright syntax error.
- Suppress compiler warnings in readfuncs.c. Commit
7357558fc8866e3a449aa9473c419b593d67b5b6 introduced “(void) token;”
into the READ_TEMP_LOCALS() macro, to suppress complaints from gcc
4.6 when the value of token was not used anywhere in a particular
node-read function. However, this just moved the warning around:
inspection of buildfarm results shows that some compilers are now
complaining that token is being read before it’s set. Revert the
READ_TEMP_LOCALS() macro change and instead put “(void) token;” into
READ_NODE_FIELD(), which is the principal culprit for cases where
the warning might occur. In principle we might need the same in
READ_BITMAPSET_FIELD() and/or READ_LOCATION_FIELD(), but it seems
unlikely that a node would consist only of such fields, so I’ll
leave them alone for now.
- Fix race condition in enum value comparisons. When (re)loading the
typcache comparison cache for an enum type’s values, use an
up-to-date MVCC snapshot, not the transaction’s existing snapshot.
This avoids problems if we encounter an enum OID that was created
since our transaction started. Per report from Andres Freund and
diagnosis by Robert Haas. To ensure this is safe even if enum
comparison manages to get invoked before we’ve set a transaction
snapshot, tweak GetLatestSnapshot to redirect to
GetTransactionSnapshot instead of throwing error when
FirstSnapshotSet is false. The existing uses of GetLatestSnapshot
(in ri_triggers.c) don’t care since they couldn’t be invoked except
in a transaction that’s already done some work — but it seems just
conceivable that this might not be true of enums, especially if we
ever choose to use enums in system catalogs. Note that the
comparable coding in enum_endpoint and enum_range_internal remains
GetTransactionSnapshot; this is perhaps debatable, but if we changed
it those functions would have to be marked volatile, which doesn’t
seem attractive. Back-patch to 9.1 where ALTER TYPE ADD VALUE was
- Fix to_date’s handling of year 519. A thinko in commit
029dfdf1157b6d837a7b7211cd35b00c6bcd767c caused the year 519 to be
handled differently from either adjacent year, which was not the
intention AFAICS. Report and diagnosis by Marc Cousin. In passing,
remove redundant re-tests of year value.
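The recursive search described in the first pg_dump item above (replacing "raw" dependencies with the dependencies that are actually being dumped) can be sketched in Python as below. This is an illustration of the idea only, not pg_dump's C implementation; the visible_dependencies function and the object names in the usage are hypothetical.

```python
def visible_dependencies(obj, raw_deps, dumped):
    """Return the dumped objects that obj depends on, directly or
    indirectly through objects that are not being dumped."""
    result = set()
    seen = set()

    def visit(dep):
        if dep in seen:
            return
        seen.add(dep)
        if dep in dumped:
            # A dumped object is a usable dependency; stop searching here.
            result.add(dep)
        else:
            # Look through the undumped object to whatever it depends on.
            for inner in raw_deps.get(dep, ()):
                visit(inner)

    for dep in raw_deps.get(obj, ()):
        visit(dep)
    result.discard(obj)
    return result


# Usage: the view's only raw dependency is an object that is not dumped.
raw_deps = {
    "view_v": ["rowtype_t"],
    "rowtype_t": ["table_t"],
    "table_t": [],
}
dumped = {"view_v", "table_t"}
deps = visible_dependencies("view_v", raw_deps, dumped)
```

Here the raw dependency list of view_v names only the undumped rowtype_t, while the collapsed result names table_t, which is the information a parallel restore would need.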
Heikki Linnakangas pushed:
- Fix pg_upgrade, broken by the xlogid/segno -> 64-bit int
refactoring. The xlogid + segno representation of a particular WAL
segment doesn’t make much sense in pg_resetxlog anymore, now that we
don’t use that anywhere else. Use the WAL filename instead, since
that’s a convenient way to name a particular WAL segment. I did
this partially for pg_resetxlog in the original xlogid/segno ->
uint64 patch, but I neglected pg_upgrade and the docs. This should
now be more complete.
- I neglected many comments in the log+seg -> 64-bit segno patch. Fix.
Reported by Amit Kapila.
- Fix two more neglected comments, still referring to log/seg. Fujii
- Update outdated comment; xlp_rem_len field is in page header now.
Spotted by Amit Kapila.
- Initialize shared memory copy of ckptXidEpoch correctly when not in
recovery. This bug was introduced by commit
20d98ab6e4110087d1816cd105a40fcc8ce0a307, so backpatch this to
9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi
- Validate xlog record header before enlarging the work area to store
it. If the record header is garbled, we’re now quite likely to
notice it before we try to make a bogus memory allocation and run
out of memory. That can still happen, if the xlog record is split
across pages (we cannot verify the record header until reading the
next page in that scenario), but this reduces the chances. An
out-of-memory is treated as a corrupt record anyway, so this isn’t a
correctness issue, just a case of giving a better error message.
Per Amit Kapila’s suggestion.
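The idea in the last item above, checking a record header before sizing a work area from it, can be sketched in Python as follows. The header layout (a total length followed by a CRC-32 of the payload), the MAX_RECORD_LEN limit, and the read_record and CorruptRecord names are all assumptions made for this example; they do not reflect the real xlog record format or its validation rules.

```python
import struct
import zlib

# Hypothetical header: total record length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")
# Hypothetical sanity limit on record size.
MAX_RECORD_LEN = 1 << 20


class CorruptRecord(Exception):
    pass


def read_record(buf, offset=0):
    """Validate the header first, then allocate and fill the work area."""
    if len(buf) - offset < HEADER.size:
        raise CorruptRecord("truncated header")
    total_len, crc = HEADER.unpack_from(buf, offset)
    # Check the length field before using it to size any allocation.
    if total_len < HEADER.size or total_len > MAX_RECORD_LEN:
        raise CorruptRecord("invalid record length %d" % total_len)
    if offset + total_len > len(buf):
        raise CorruptRecord("record extends past end of data")
    # Only now allocate the work area for the payload.
    work = bytearray(total_len - HEADER.size)
    work[:] = buf[offset + HEADER.size:offset + total_len]
    if zlib.crc32(work) != crc:
        raise CorruptRecord("checksum mismatch")
    return bytes(work)


# Usage: one well-formed record and one with a garbled length field.
payload = b"hello"
good = HEADER.pack(HEADER.size + len(payload), zlib.crc32(payload)) + payload
garbled = HEADER.pack(0xFFFFFFFF, 0) + payload
```

A garbled length is rejected with a specific message instead of triggering an enormous allocation attempt.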
== Rejected Patches (for now) ==
No one was disappointed this week.
== Pending Patches ==
Alvaro Herrera and Kevin Grittner traded patches to implement foreign
Alvaro Herrera and Zoltan Boszormenyi traded patches to implement a
lock_timeout and SIGALRM framework.
Ryan Kelly sent in another revision of the patch to allow breaking out
of hung connection attempts in psql.
KaiGai Kohei and Etsuro Fujita traded new revisions of the patch to
add an option to allow selective binary conversion for CSV foreign
Satoshi Nagayasu sent in two revisions of a patch to add a
pg_stat_lwlocks system view.
Pavel Stehule sent in a PoC patch to see psql client-side variables in
Pavel Stehule sent in another revision of the patch to add a way to
check PL/pgSQL functions.
Fujii Masao sent in another revision of the patch to report the WAL
file containing checkpoint’s REDO record in pg_controldata output.
Nils Goroll sent in three revisions of a patch to replace s_lock
spinlock code with pthread_mutex on Linux.
Andres Freund sent in three revisions of a patch to add an embedded
list to the back-end.
Fujii Masao sent in two revisions of a patch to keep pg_basebackup
from blocking all queries, which resulted in horrible performance.
Magnus Hagander sent in three revisions of a patch to output the part
of the pg_hba.conf that’s erroring out.
Josh Kupershmidt sent in two revisions of a patch to make
pg_signal_backend() symmetric with respect to database
Peter Eisentraut sent in a patch to ensure that initdb only errors out
asking for a password in cases where PostgreSQL would control that
Alex Hunsaker and Marco Nenciarini traded patches to add
array_remove() and array_replace() functions.
Peter Eisentraut sent in a patch to make static code analyzers happier
about elog/ereport’s not returning anything.
Alexander Korotkov sent in another revision of the patch to add
conversion from pg_wchar to multibyte.
Zoltan Boszormenyi sent in a patch to make pg_basebackup configure and
start the standby.
Dean Rasheed sent in a PoC patch to implement updatable views.
Peter Geoghegan sent in two more revisions of a patch to enhance the
data structure on which error messages are based.
Dimitri Fontaine sent in another revision of the patch to add event
triggers. Thom Brown responded with some corrections of the included
Robert Haas sent in a patch to demote “implicit creation” messages,
quieting, at the default logging level, output for the operations
that cause them.
Amit Kapila sent in a patch to unify the parsing of pg_ident.conf and
KaiGai Kohei sent in a patch to track the user ID from when a portal
was started in case of changes.