How times have changed for PostgreSQL

njharman · on March 11, 2014

I was 76% expecting a treatise on how various time and date fields and functions have evolved over PostgreSQL versions.

iagooar · on March 11, 2014

I have been using Postgres since I was a student, but I must recognize that being a better programmer than sysadmin, I always used it after some kind of abstraction layer. Today, this layer is Rails with ActiveRecord.

I want to convince my boss to use Postgres instead of MySQL in a new project we've just started. Does anybody know about a good comparison of both databases, or a nice list of unique Postgres features? I have been googling a little bit but the findings weren't that good.

gdulli · on March 11, 2014

Reasons I like PostgreSQL compared to MySQL:

- Timestamp with time zone

- More robust, fewer crashes, less corruption of data

- More features (JSON data type, partial indexes, function/expression indexes, window functions, CTEs, hstore, ranges/sequences/sets, materialized views, too many to list)

- More disciplined (doesn't do things like auto-truncate input to get it to fit into a column)

- Not owned by Oracle, it's actively developed, regular major release schedule, developers/maintainers are talented and trustworthy, etc.

- Better Python driver (don't know about other languages)

- Choice of languages for database functions/procedures (Python, JS, etc.)

- Better partitioning support

- Better explain output, explain analyze, buffers

- Multiple indexes allowed per table in a query (I hear MySQL has made a little progress here since I last used it)

antjanus · on March 11, 2014

Add "built-in fulltext search" and the fact that you've just replaced Redis, Sphinx/Solr/Lucene, and Mongo with a single application.

THAT, I think, is the biggest sell, at least to me. The fact that PgSQL does so much and so well.

I used Pg for full-text search in the past and the fact that I did not have to bother with setting up Solr or an interface was a wonderful feature. The search index lived right alongside my data.

jlogsdon · on March 11, 2014

Technically mysql has "built-in fulltext search" as well, but the implementation leaves something to be desired.

antjanus · on March 11, 2014

I couldn't find anything nearly as powerful as Pg so I would discount it fully.

mrweasel · on March 11, 2014

How good is the fulltext search? Comparable with Solr, better or does it depend on your needs?

jeltz · on March 12, 2014

It has less features and worse performance than Solr and Sphinx. The worse performance is being worked on and will probably be fixed in 9.4 (the index guys claim that they can beat Sphinx in benchmarks).

The advantage of having your fulltext search inside your database is huge though and can outweigh the current disadvantages. For example you do not have to manage another database, indexing can be transactional and instant, and you can can have both fulltext search and normal SQL stuff in the same query (often reducing latency).

Doctor_Fegg · on March 11, 2014

I'll add one more: PostGIS.

nobleach · on March 11, 2014

Three. Freaking. Cheers for this one. Anyone who's had to suffer through MySQL's abysmal attempt at doing "spatial" should give PostGIS a well deserved look. I've used Oracle Spatial, ESRI's SDE (which rides on top of MSSQL/Oracle/DB2/SyBase) MySQL 'spatial' and PostGIS. I have to say, hands down, PostGIS has been my favorite. We had to implement MySQL spatial as all of our tabular data was already there and the IT folks felt "comfortable" with it as they knew they could dump it to a plain text file full of CREATE statements any time they wanted to. The problem we have seen is that dumping some of our ancient MySQL spatial tables, we could not import them into other versions. And vise-versa appeared to be true as well. (I'd need to go back and look)

ZenoArrow · on March 11, 2014

You and your IT folks may be interested in Spatialite, which sounds like a good use case for what you've described. Spatialite is spatial extensions to SQLite, means you can store spatial data in standalone DB files. Has good level of support in GIS software, plus it's the first spatial database format to support the upcoming GeoPackage standard (due to replace shapefiles): http://www.gaia-gis.it/gaia-sins/ http://www.ogcnetwork.net/geopackage

ucho · on March 11, 2014

> - Better partitioning support

Yeah, right. Partitioning in PostgreSQL is as braindead as setting sequence value in Oracle. IMHO the devs made wrong choice: https://wiki.postgresql.org/wiki/Table_partitioning#Possible... . You want a partition for every month of stored data? Prepare to copy & paste a lot as you need a set of triggers for all child tables. Oh, the table You are spiting was referenced as foreign keys in other tables? Drop the constraint, it isn't supported even if is the same column you are partitioning over.

kermatt · on March 11, 2014

In what partition design do you need a trigger per child table? You only need to add partitions to the trigger on the parent table, or at the branches if you have multiple hierarchy levels.

I run a Postgres instance that partitions daily (for reasons of space management). When creating the partitions in advance, it is easy enough to alter the trigger (or a function called by the trigger) in the same script that creates the tables.

DDL via copy-edit-paste is just asking for problems anyway. Those operations should be automated.

dllthomas · on March 11, 2014

Granting that there are problems with PostgreSQL partitioning, that doesn't invalidate the claim unless MySQL partitioning does these things better and doesn't have other, worse problems. I'm not making any claim as to whether this is the case, but you didn't address it.

nobleach · on March 11, 2014

I've had a problem selling the reliability angle as our instances of MySQL have never crashed. We have drawn the line in the sand that ALL new apps get Postgres by default, but we've been turned down on funding to migrate all old apps. The belief is that they're "doing just fine". I'd love to see every single one of them moved. I know we'd hear squawking from the few folks that know how to use MySQL Workbench as it is a very pretty interface, but it'd be worth it.

Lewisham · on March 11, 2014

Well, if they aren't crashing, and the apps are already built, then aren't they really running just fine? What value do you buy by switching legacy apps that aren't being actively developed to Postgres?

smacktoward · on March 11, 2014

You can cut down on support expenses by not having to have MySQL expertise on staff.

Of course, if your DBAs (or whatever passes for a DBA in your outfit) are conversant in both already -- or if they are equally clueless with both -- this doesn't save you much.

DrJ · on March 12, 2014

I would contest 'better python driver' with SQLAlchemy.

gdulli · on March 12, 2014

I'm not a fan of SQLAlchemy, I was referring to psycopg2. I don't think of SQLAlchemy as a driver, just an optional layer on top of the driver that one may prefer to use, but I don't.

jakobe · on March 11, 2014

The most amazing Postgres feature: Transactional schema changes. Create tables, alter tables, insert data, all inside a transaction. If any step fails, everything is rolled back!

This blew my mind when I switched from MySQL (which doesn't even support transactions for inserts/updates in most engines).

davidw · on March 11, 2014

I had my mind blow the other way, in the past, when I realized that Mysql did not do that. I took it for granted.

sixbrx · on March 11, 2014

Neither can Oracle do this, FWIW.

tragic · on March 11, 2014

A list of reasons not to use MySQL:

http://grimoire.ca/mysql/choose-something-else

matwood · on March 11, 2014

I was just complaining about mysql yesterday. Data truncation and silent failures should not be warnings! They should be errors.

For example, 'str1' + 'str2' results in '0' and a warning in the default mysql config. A real rdbms should fail and tell you + is not for string concatenation in that system instead of running your query and perhaps overwriting your data and then warning you about it after the fact ('That query you just ran is broken, but we overwrote your data anyways - hey, here's a warning.').

Broken, horrible design.

davidw · on March 11, 2014

One that has bitten my colleagues lately:

Auto-increment primary keys get reused when, say, you create a record, then erase it, then pull the power plug. Mysql does not save the highest number used, meaning that you turn on the machine again, add a new record, and it has the same ID as the previous one.

My very subjective impression is that every time I have anything to do with Mysql, sooner or later, some weird thing like that jumps out and bites someone in the ass.

laumars · on March 11, 2014

I dislike MySQL as much as the next guy, but I'm not convinced what you've posted there is true because you can query the auto-increment number - in fact you can even define it in SQL when creating the the table. So MySQL does store the increment number (and it's not even some volatile runtime value that might get lost during power cycles as I've bounced enough dev machines to notice if that was the case).

davidw · on March 11, 2014

The autoincrement number is stored just fine unless you erase the last record and then stop Mysql, in which case that last one is lost and reused when you restart Mysql. This can cause problems if, for instance, you did something with that record between the time it was created and deleted and used that ID...

http://stackoverflow.com/questions/3718229/stop-mysql-reusin...

http://dba.stackexchange.com/questions/16602/prevent-reset-o...

laumars · on March 11, 2014

I've done exactly what you've described and never ran into that problem. Quite the opposite in fact, on some dev servers I've have to manually fudge the auto increment number because it didn't automatically reset to zero after I'd flattened the table.

Like I said before, you can even create an empty table and define the auto increment to be 10000 if you wanted. So there is a persistent value that's stored. However I suspect the issue here is that I'm using a different MySQL engine to yourself.

ericcoleman · on March 11, 2014

This effects InnoDB, but not MyISAM

davidw · on March 11, 2014

Ok, but ... MyISAM is like racing down a hill on a bicycle with no brakes.

smacktoward · on March 11, 2014

That's actually better than MyISAM, because at least experiencing it is thrilling...

matwood · on March 11, 2014

If you are using MyISAM you might as well use a raw file system and assume you will lose your data at some point.

KingMob · on March 11, 2014

Yup. I encountered MySQL's silent truncation behavior back in 2000, and it blew my mind that a tool whose primary job is to safely store my data would ever delete some of it without my explicit command. Been a Postgres person ever since.

sp332 · on March 11, 2014

I've never liked using MySQL (silent errors everywhere!), but it always seemed to be the most practical option. It would be really nice if Postgres is catching up in terms of features.

davidw · on March 11, 2014

Features? I've always thought of things like 'transactions' as the features I'm most interested in in a database.

Granted, Mysql is catching up in terms of features like that, but for a long time, many people were using it without them (MYISAM).

stephen_g · on March 11, 2014

For the most part, that happened long ago... What features are you using that aren't in Postgres?

sp332 · on March 11, 2014

Speed :) On one project we were able to get MySQL onto an embedded system where Postgres kept choking. Also replication is dead-simple with MySQL.

Refefer · on March 11, 2014

It's amazing how fast you can go when you don't care about safety ;)

That said, for all non trivial work loads, the latest postgres is quite the work horse, but of course requires some tuning for performance. We ended up switch to it after MySQL consistently sucked on smaller joins.

Can't argue on the replication deal: it's a work in progress.

nieve · on March 11, 2014

For any reasonable RDBMS deployment resources PostgreSQL has been faster than MySQL for me. Write-heavy loads are a huge win for it, but even on large reads (formerly its weakness) PostgreSQL has the advantage unless you've got no writes at all on the table. I guess if you've got completely static data MVCC isn't a win, but at that point I'm not sure you need to be using an RDBMS in the first place...

ufmace · on March 11, 2014

If it's an embedded system, why not use SQLite instead?

sp332 · on March 11, 2014

Honestly I wasn't part of that conversation on that project. I just heard rumblings from a distance :)

kaoD · on March 11, 2014

> Speed :)

At the cost of ACID. In my experience PostgreSQL is faster than MySQL-InnoDB.

jeltz · on March 12, 2014

Replication is dead simple now, but a few years ago it was not.

copergi · on March 11, 2014

Postgresql is generally faster than mysql. Fitting it onto an embedded system is either memory or storage constraints, but postgresql is equal or better in both those regards so I can't imagine what problem you had. Replication is dead simple with postgresql as well, with the added bonus that it doesn't randomly die and require manually intervention to get it working again.

mgkimsal · on March 11, 2014

select count(*) speed.

Apparently it's decently fast in 9.4, but... that's been 20+ years where it's not been anywhere near as fast as MySQL for most common use cases (blogging, forums, etc where paginating records is common).

olavgg · on March 11, 2014

This select count in PostgreSQL is comparable to MySQL's

select n_live_tup from pg_stat_user_tables where relname = 'mytable';

copergi · on March 11, 2014

Postgresql has long been consistently faster than mysql. Count(*) is slow because it is doing what you ask. You solve this problem the same way you solve any aggregate performance problem, with a computed column.

mgkimsal · on March 11, 2014

Consistently faster?

In my experience over the years, on any moderate sized db (more than, say, 100k rows in a table) select count() has always been faster under mysql - both myisam and innodb (although innodb doesn't claim to be 100% accurate all the time).

Why should I* have to keep a computed column when the core engine has all the data all the time? And it's something pretty fundamental to the data - how much of it is actually in there.

copergi · on March 11, 2014

I think we're misunderstanding each other. I thought you were claiming mysql is faster for "most common uses cases", which is why I pointed out that it is actually the opposite. Yes, mysql is faster at count(), and yes it is because it doesn't actually do a count() and just returns a guess. If your argument that postgresql is bad is that it doesn't return fake data, then it says more about you than about postgresql.

>Why should I have to keep a computed column when the core engine has all the data all the time?

The core engine does not have that information, keeping that information for no reason would be foolish. You should keep a computed column for performance, obviously. That is your complaint remember? How is this any different than "select users.id, users.name, count(photos.id) from users left join photos on photos.user_id = users.id where users.id = ?" being slow? How do you solve that being slow? You use a computed column. The fact that you complain about a complete non-issue because you inexplicably refuse to use the standard solution to the problem in this one particular instance of the general pattern is neither logical nor reasonable.

mcherm · on March 11, 2014

I may be wrong here, but I think the difference is largely around how things are viewed differently between experts and non-experts.

Experts understand that "estimate the number of rows" and "count the number of rows" are two different operations. They know that "count()" is supposed to count, not estimate and that in MySQL it does an estimate instead. They even know a simple way to provide quick estimates.

Non-experts know that "count()" is faster in one tool than another.

copergi · on March 11, 2014

Certainly, mysql's existence has always been based on "we're the db for people who don't know anything about dbs". But in this particular case, the person I responded to does know that mysql isn't doing a count, he pointed it out himself.

collyw · on March 11, 2014

MySQL is like the PHP of databases. Not the best choice, but it is easy to get started, and available almost everywhere, especially on cheap hosting platforms.

davidgerard · on March 11, 2014

The only reason to use MySQL: that you have to run an application heavily optimised for MySQL, where any other DB is an afterthought (e.g. MediaWiki, Wordpress).

marcosdumay · on March 11, 2014

On those cases, I normaly have better luck with the aftertought.

Use MySQL when your application is incompatible with other databases. 99 times out of 100, that means that the database has inconsistencies, and the application depends on them, so be carefull when doing that.

stephenson · on March 11, 2014

Why do you want to use Postgres instead? When you answer that you have the list of reasons to show your boss.

piokuc · on March 11, 2014

One amazing little extension of Postgres I found lately is Multicorn http://multicorn.org/ It lets you write a Foreign Data Wrapper in Python. One Python class with two methods is enough to e.g. expose files in a proprietary data format as normal Postgres tables. That's what I did in about two hours, the users can now query their data with SQL and access it in a programming language of their choice, as long as the language has a Postgres driver (most have). Really cool.

craigkerstiens · on March 11, 2014

It's a bit old but the PostgreSQL wiki has some good details on the differences - https://wiki.postgresql.org/wiki/Why_PostgreSQL_Instead_of_M...

Further I've written a couple of posts that don't directly compare to MySQL, but many of the points are pretty relevant for MySQL vs. Postgres

  http://www.craigkerstiens.com/2012/04/30/why-postgres/
  http://www.craigkerstiens.com/2012/05/07/why-postgres-part-2/

jfb · on March 11, 2014

Personally, the cavalier attitude towards data integrity displayed by the MySQL folks circa the last time I used it in anger (~1998?) made Postgres a simple choice for me. The specific problems I had then are undoubtably fixed, but the initial culture of "return possibly bad values, fast!" seems like it is still causing problems.

I've had issues with Postgres in the past, but it's trustworthy software developed by a trustworthy team.

neverminder · on March 11, 2014

I'm a prisoner of MySQL. Of the top of my head at least from my own experience I can name these features that are present on Postgres and not on MySQL: PL/Language (stored procedures written in "3rd party" languages), custom data types, Arrays, JSON, dynamic SQL in triggers, better regular expression support, etc.

JanezStupar · on March 11, 2014

Tell him MySQL is PHP/MongoDB of relational world.

steveklabnik · on March 11, 2014

Rails 4 and 4.1 have added a ton of Postgres-specific features to ActiveRecord, too.

copergi · on March 11, 2014

If you are using activerecord it really doesn't matter. You have a "the database is stupid and shouldn't do anything so we'll write a buggy ad-hoc database in ruby on top of it" layer on top of the database, which pretty much precludes using the database effectively. If you want to actually use the database, here's some reasons postgresql is a better choice than mysql:

strict treatment of invalid data

mysql views with aggregates DISTINCT GROUP BY HAVING LIMIT UNION or UNION ALL Subquery in the select list are too slow to be useable

mysql has no check constraints

mysql has no conditional/expression indexes

mysql only has broken geospatial support

mysql has no full text indexing on innodb tables

postgresql has an extensible type system (and booleans!)

postgresql has sequences

postgresql has table inheritance

mysql triggers don't fire on cascaded actions or replication

mysql has no window functions

mysql can't set default values to be a function (there is a workaround for only dates, and only one per table)

postgresql has schemas

postgresql has a transactional DDL (you can roll back an alter table)

rollbacks in mysql are incredibly slow, and if you cancel one it can corrupt the table

mysql limit can't accept variables!?

mysql subqueries are limited: can't modify and select from the same table

mysql stored procedures can't have default args

mysql functions (and thus triggers) can't use prepare/execute, so no dynamic sql

mysql functions can't be called recursively

mysql triggers can't alter the table they are being called against

mysql stored procedures can't be called from dynamic sql (prepare)

mysql can't log error producing queries

mysql slow query log has a resolution of seconds

dijit · on March 11, 2014

>mysql has no full text indexing on innodb tables

it has this in recent versions, but it's shockingly bad at matching.

DrJokepu · on March 11, 2014

If you want job security and fat paychecks for the next decade, this is a very good time to learn PostgreSQL well and specialise as a PostgreSQL consultant.

iagooar · on March 11, 2014

The question I always ask myself is: how does one learn specialized stuff like managing a DB without working with it professionally? This also applies to Big Data, scaling, etc.

I would love to work building highly scalable systems, but I don't get to do it at my current position, and all the job offers out there require having experience doing it. Looks like a chicken-and-egg problem to me.

marcosdumay · on March 11, 2014

Just do it.

Normaly a job is not so specific that you can't start into some path like that, and once you are on the path, you can change positions.

But, if your job is that specific, well, that sucks. You'll have to build something on your free time.

DrJokepu · on March 11, 2014

Contribute to open source software, attend meetups, talk at meetups.

iagooar · on March 11, 2014

Can you recommend some interesting open source projects for an experienced Ruby / Rails developer, but with very little open source experience?

rch · on March 11, 2014

Start by tuning single-server instances for performance. There is quite a lot to learn from that before worrying about replication, for example.

ris · on March 11, 2014

Start working with OpenStreetMap data.

iagooar · on March 11, 2014

Do you mean processing the data and getting meaningful information out of it?

ris · on March 11, 2014

Yes. There is a lot of data. Managing tile servers can be quite involved.

dragansah · on March 11, 2014

What extra knowledge would you advice that one should combine with postgres. Web development vs OS knowledge for example?

delluminatus · on March 11, 2014

Heck, you don't need anything else. I've worked with SQL consultants who are paid big bucks to come in and optimize a company's "enterprise analytic reporting" (i.e. HUGE queries run on non-production databases). All these guys do is optimize SQL queries. They're valuable because they can reduce the runtime of a naively-written report from 100 hours to 1 hour.

danielweber · on March 11, 2014

How does "big bucks" compare to the "big bucks" in other computer specialties, like security?

There's this thing I wish existed but AFAIK doesn't: a guide that shows computer people the relative salaries of a bunch of different specialties. So people who have one specialty but would enjoy something else can see what the salaries are like.

Glassdoor is a very rough start on this.

delluminatus · on March 11, 2014

I have no idea. All I meant to imply there was that they are highly valued by the companies that contract them, because they typically are called in as a last resort when the organization's reporting situation has degenerated to being basically unusable.

dragansah · on March 11, 2014

Are you talking about SQL Server analytic environments like MS Business Intelligence suite? Yes that's a lot more tooling than postgres, probably 10x of times bigger than all the postgres tools out there. Most of postgres tools I've used are command line tools like pgtop or pgbadger.

delluminatus · on March 11, 2014

Mmm, the company I met them at was using Business Objects (Clarity/Crystal, etc.) which was where the inefficiency arose. But the consultants would come in and write optimized queries against the underlying SQL database because the Business Objects reporting was so inefficient. Basically they dealt with optimizing some very long, complex queries.

And yes, I think typically orgs. today are using SQL Server or Oracle for these databases.

doppel · on March 11, 2014

From my experience tackling a postgres server that is under-specced, a good knowledge of file systems (block devices, NFS, SANs, etc.), OS caching and kernel settings for maximum memory and such (check the postgres docs). Ofcourse, this only becomes an issue once you have a decent amount of data that does not fit into memory - before that, postgres will keep everything running relatively smooth without much tinkering.

For postgres itself, make sure you know what all the settings in the configuration means, why they make sense 90% of the time and definitely do not make sense in 10% of the time (such as a low memory server with super-fast SSD disk arrays). And in general ofcourse a good knowledge of SQL, index usage/performance. Postgres extensions (arrays, JSON, etc.) are decent, but in my experience it's something you can get into relatively fast if you are solid on the rest.

nieve · on March 11, 2014

As someone who's used PostgreSQL for 15 years and had to evangelize it almost everywhere I've gone that's a bit surreal to hear. Good, very good, but a little surreal. I've found that if you've got Ruby & Perl skills and you've worked on large-ish systems you'll get pigeon-holed as a non-database dev and it's hard to escape that without seriously warping your resume.

iblaine · on March 11, 2014

It is worth pointing out that AsterData and Redshift are built on pgSQL. From what I have read, this is due to licensing issues. MySQL is ok but I'm unconvinced that MySQL is a good option for single instance terabyte size databases. I'm not sure if pgSQL is any better but you hardly see large unwieldy pgSQL instances. Large slow MySQL instances are everywhere. And it's frustrating.

PublicEnemy111 · on March 11, 2014

What are the best resources for one to learn PostgreSQL?

captainmojo · on March 11, 2014

I'd start here: http://www.postgresql.org/docs/9.3/static/index.html

I've always found this documentation to be great.

If you've worked with RDBMS' before, you can skip around the chapters. If not, I'd read up through chapter 14 and go from there.

You should be able to easily install it on whatever OS you're running.

abstrct · on March 11, 2014

The Schemaverse (http://schemaverse.com)

Play a space battle MMO, but using SQL. The game is also completely open source so if you want to find out how to write an entire application layer in PostgreSQL, it is a pretty good example of that.

craigkerstiens · on March 11, 2014

This is a shameless plug, but hopefully all some good resources:

  http://www.craigkerstiens.com/categories/postgres/ 
  http://postgresguide.com/
  http://www.postgresweekly.com

new_test · on March 11, 2014

Since we are on the subject of PostgreSQL, does anyone know of a good resource that would teach me how to create webapps where each "customer" would get their own "database"? What are the best practices for that? Also, what are the best practices for ALTER'ing tables in production (e.g. adding/deleting columns).

Here's a concrete (made-up) example: I want to write a webapp where people would sign up to do some personal tracking. They create variables they are interested in (weight, mood, calory intake, etc.) And then enter their data daily. So for each customer I need to have a different database with different columns, and I want them to be able to add/delete variables "on the fly". Is it very straightforward, and hard to get wrong? Or are there "best practices" for this sort of thing? Thanks.

dragansah · on March 11, 2014

You can use a generic data model (example [1]) with appropriate metadata etc. What you are proposing seems more difficult to maintain (separate databases for different customers). If this are separate deployments than different databases are an option with some core module and than specific databases for each customer. I don't recommend doing DDL updates in realtime because of other issues for example: logging of data in some log tables, the data structures in your application (consider what changes would be implied to your ORM entities or other domain entities) etc.

[1] Generic Data Model: http://c2.com/cgi/wiki?GenericDataModel

bananas · on March 11, 2014

Look up the following terms

1. SaaS tenancy models. This is the data separation.

2. Custom fields would be usually represented as an EAV model (Entity-Attribute-Value).

3. Don't ALTER in production if you can help it. If you're going to do it, use migrations which are scripts which first add columns, then transform data, then reapply constraints.

arethuza · on March 11, 2014

I've also seen applications basically just assign a fixed number of additional columns on the relevant tables for custom fields - although this approach doesn't win any awards for elegance it can be pretty straightforward.

From what I've seen, applications that use EAV tend to evolve to suffer from bad cases of the "inner platform effect":

http://en.wikipedia.org/wiki/Inner-platform_effect

NB There is nothing "wrong" with using EAV - just that it seems prone to misuse (a bit like XML).

bananas · on March 11, 2014

The only problem with a limited number of custom fields attached to the table is that they aren't necessarily optimised for sorting, might extend past the row size limit of database tables (this is a pain!), can't be reliably typed and a client will always want one more.

EAV is a sort of inner platform thing I agree, but the correct solution i.e a document store with full field level indexing that works with enterprise loads doesn't exist (yet). CouchDB was promising on that front but didn't go all the way.

XML is fine. Just don't stick it in database columns (my favourite chunk of pain!)

wiredfool · on March 11, 2014

I like HStore a lot better than XML for postgres. And it's Waaaaay better than the wide table model.

Way back when, I did a data model that added columns for arbitrary data fields that the users wanted in PG, and wound up with a wide table model. PG can store a surprisingly large number of fields, I think it got up to the mid to high hundreds after years of this. At the time, hstore wasn't there, xml was either not there yet, or just recently added. And the queries for EAV looked surprisingly awful, especially when added to the not exactly straightforward queries we were doing on the events.

It wound up being an extremely large, extremely sparse table, with some fields having 100% usage, most having >>.01%, and a few getting used in the 1% range. On the plus side, it was possible to index any of the fields, which was especially useful with functional or 2 column indexes.

If I had to do it again, I'd be on hstore, or maybe hstore/json. It wasn't pretty but it wasn't the fatal flaw in that startup.

arethuza · on March 11, 2014

"XML is fine. Just don't stick it in database columns"

I've seen systems that used XML in a database column and were perfectly sensible, I've also seen systems that did terrible things with XML in a database (e.g. using complex XML string as part of a query).

_979m · on March 11, 2014

Salesforce does this: http://www.dbms2.com/2011/09/15/database-architecture-salesf...

bananas · on March 11, 2014

That they do. In fact for a recent company I worked for, Salesforce was pretty much the inspiration for the entire company with some added sector-specific stuff.

So they built Salesforce basically...

Totally insane but it sort of works.

bennylope · on March 11, 2014

In addition to the previous answers, consider using hstore for custom variables. I've found it helpful for things like settings on customer accounts where these might differ quite a bit between accounts and change quite a bit over time. Otherwise the table layout applies to all accounts.

If you really need to separate data by customer then consider using PostgreSQL schemas. But I'd steer away from any solution that involves adding and dropping columns for different customers.

_979m · on March 11, 2014

You could use a Postgres JSON or hstore column for the custom variables. If you use a JSON column, and want to do queries on the custom data, try out this tool I made for creating views of properties within a JSON column: https://github.com/mwhite/JSONAlchemy

juretriglav · on March 11, 2014

Yes PostgreSQL is awesome. From my experience with a large installation, we had many more issues with the actual hardware on EC2, than we had with postgres which ran on said hardware. However, I was pleasantly surprised Amazon's RDS now supports PostgreSQL too, so you can abstract away most of the hardware considerations, assuming they've optimized their stack: https://aws.amazon.com/rds/postgresql/

tuxychandru · on March 11, 2014

If I have a bunch of records that need to be fetched together often and can't be dumped in a hstore of JSON column, how can I ensure data locality on disk in postgres?

Since primary key is clustered in MySQL, I choose an appropriate key. Is there an equivalent mechanism or is periodically running the cluster command the only option?

jeltz · on March 12, 2014

You cannot do this other than relying on covering indexes and index only scans. PostgreSQL does not support index organized tables.

_euvw · on March 11, 2014

" I see experienced enterprise software professionals who are versed in operating systems and Windows, for example, taking the training to advance their careers. "

That had me laughing.