Revolution Systems specializes in performance and scalability, Django web application development, Systems Administration services and commercial support of many Open Source Software systems.

Three things you should never put in your database

As I've said in a few talks, the best way to improve your systems is by first not doing "dumb things". I don't mean you or your development staff is "dumb", it's easy to overlook the implications of these types of decisions and not realize how bad they are for maintainability let alone scaling. As a consultant I see this stuff all of the time and I have yet to ever see it work out well for anyone.

Images, files, and binary data

Your database supports BLOBs so it must be a good idea to shove your files in there right? No it isn't! Hell it isn't even very convenient to use with many DB language bindings.

There are a few of problems with storing files in your database:

  • read/write to a DB is always slower than a filesystem
  • your DB backups grow to be huge and more time consuming
  • access to the files now requires going through your app and DB layers

The last two are the real killers. Storing your thumbnail images in your database? Great now you can't use nginx or another lightweight web server to serve them up.

Do yourself a favor and store a simple relative path to your files on disk in the database or use something like S3 or any CDN instead.

Ephemeral data

Usage statistics, metrics, GPS locations, session data anything that is only useful to you for a short period of time or frequently changes. If you find yourself DELETEing an hour, day, or weeks worth of some table with a cron job, you're using the wrong tool for the job.

Use redis, statsd/graphite, Riak anything else that is better suited to that type of work load. The same advice goes for aggregations of ephemeral data that doesn't live for very long.

Sure it's possible to use a backhoe to plant some tomatoes in the garden, but it's far faster to grab the shovel in the garage than schedule time with a backhoe and have it arrive at your place and dig. Use the right tool(s) for the job at hand.

Logs

This one seems ok on the surface and the "I might need to use a complex query on them at some point in the future" argument seems to win people over. Storing your logs in a database isn't a HORRIBLE idea, but storing them in the same database as your other production data is.

Maybe you're conservative with your logging and only emit one log line per web request normally. That is still generating a log INSERT for every action on your site that is competing for resources that your users could be using. Turn up your logging to a verbose or debug level and watch your production database catch on fire!

Instead use something like Splunk, Loggly or plain old rotating flat files for your logs. The few times you need to inspect them in odd ways, even to the point of having to write a bit of code to find your answers, is easily outweighed by the constant resources it puts on your system.

But wait, you're a unique snowflake and your problem is SO different that it's ok for you to do one of these three. No you aren't and no it really isn't. Trust me.

Comments

There are valid reasons to do what you are saying not to, particularly the first and last items.

Storing blobs in the database is good because they benefit from the transactions and other integrity features of the database. As for performance issues, the trick here is to *cache* the blobs in the filesystem, but the *canonical* copies are still in the database.

As for logs, there are common scenarios where one would want to access those on a frequent basis as part of regular operations, and cross-referenced with lots of other regular data.

I agree with you about the transient data though, such as sessions, probably best not being in the same database as regular data, although storing that in temporary tables can mitigate this as the database can treat the session data differently on the assumption it is transient and losable.

  Posted by Darren Duncan on May 2, 2012 at 1:55 PM

You forgot to mention plain text passwords. Far too many people store them in their databases.

  Posted by Leon Cullens on May 2, 2012 at 4:08 PM

Excellent article, Frank. Thank you for taking the time to share this information.

I store web session data in the database to permit node-balanced access to the respective web sites. I've often considered this to be somewhat inefficient use of the database, but since it hasn't produced any noticeable degradation in performance, I haven't sought an alternate solution. Is there
some method or product that you recommend to provide that functionality? Thank you.

  Posted by Gary C. on May 2, 2012 at 10:01 PM

The article should be titled: Three things you should never put in your RELATIONAL database.
NoSQL databases are very popular these days. "database" by itself is not synonym for rdbms anymore.

regarding the topic:
1) document-based databases are designed for the purpose of storing binary files
2) hadoop was designed for the purpose of statistical analysis of huge amount of data

  Posted by Sharon on May 3, 2012 at 12:28 AM

There are good reasons why you might want to place binaries in a database. Not all solutions involve web servers and access to the internet - or involve high transaction rates.

Keeping all the data in one, easy to backup and remote site may be a very good reason to store blobs in a database. Applications with a high security requirement may also need the binaries to be in one highly secure place with the data.

  Posted by Rob on May 3, 2012 at 5:29 AM

Excellent article, well done, more please! :-)

  Posted by Michael on May 3, 2012 at 5:42 AM

Well, I know not to hire you based on this article.

SharePoint, one of Microsoft's flagship products, is entirely based around the concept of storing BLOBs in a database. And it works really well for that purpose. It's fast, easily provides versioning, histories, tons of metadata, etc. And backups are a cinch because it's just a couple files to back up.

Logs are also perfectly suited for databases! Not in the same database as production data? DUH! I mean seriously? But it doesn't take an advanced query to make this useful. Simple queries on a log table is far easier, faster, and more productive than grepping or doing other types of simple text search. You are completely wrong on this one.

Ephemeral data is the only one where you might be right, and even then, it hardly warrants a rant on the internet. Where you might be wrong in this category is simple: server farms. While a specialized distributed caching service is best for key-value pairs, especially session state, there are reason why it might not be practical or it might not make financial sense, or there might be a deadline, etc. And SQL is already there and can do the job. Maybe not the best solution in most cases, but so what?

  Posted by Timothy on May 3, 2012 at 8:02 AM

Storing some binary data in the database has the benefit of built in security.

  Posted by qq on May 3, 2012 at 8:25 AM

As Timothy said it. Your rant is not founded on any data (we have data which show that querying BLOBS of up to 200KB IS faster for SQL Server for example, and there are also types like FILESTREAM etc which provide a mix of both worlds). I am almost sure Oracle can provide similar functionality.
Logging shouldn't be in the production database. Well, it depends on the amounts of logging, and the complexity of the solution. This is not something to rant on. It also seems you've never used an OLAP database or SSRS or something like that. You probably have never used any serious database.
Mind, I am a developer, not a DBA. I've worked as both, I know what I'm talking about.

  Posted by tec-goblin on May 3, 2012 at 8:27 AM

Within the Filestream data types in Ms Sql Server 2012, saving a file within the DB does not stored as a BLOB.
A Filestream Table actually defines a directory on a hard drive, and saving a file into a Filestream Table actually saves it in the directory. The tables actually contain all the common file attributes. Editing a bitmap for instance, can be done via an Update statement or by using MS Paint. Any changes via the file system are atomic with the Filestream table.

  Posted by Mad Myche on May 3, 2012 at 9:35 AM

I cannot believe anyone who has seriously tried managing blob data in files and compared that with persisting those in a database will ever resort to the former method without a very good reason. Managing the added complexity in the architecture, security, and general file management is not paid back by enough performance improvement to warrant its use.

  Posted by Mad Max on May 3, 2012 at 9:42 AM

Stupid idea to keep Blobs in databases. That is where they SHOULD be. Oracle servers blobs quicker than a file server and it is where it belongs with the rest of the data not floating around on a server to get trashed, renamed or otherwise tampered with. As for backing up, filesystem, that has to be backed up too! What a short sighted article.

  Posted by Paul Edwards on May 3, 2012 at 10:08 AM

"As a consultant I see this stuff all of the time and I have yet to ever see it work out well for anyone."

Could it be that they *only* show it to you when it *doesn't* work out well?

  Posted by martin on May 3, 2012 at 10:16 AM

Shame there's no "Like" button here. I'm an aging and experienced DB programmer and I often store BLOBs in databases for many of the reasons covered above. I have also stored logs in the same database as my production data - it depends what you're logging and why. All tools have strengths and weaknesses and ALWAYS and NEVER are hugely presumptive when talking about programming techniques. The advice should be "think through the implications."

  Posted by PhilbertNutt on May 3, 2012 at 10:16 AM

While I do agree with some of the things you said, I really think that we should stop thinking in terms of "things that you should never do", or "[WHATEVER] considered harmful". I believe that this would be a great article if it would be written as "disadvantages of 3 common things we put in our relational databases".

I mean, your points are valid and quite right, for the most of it. However, whether it is a good or bad idea depends entirely on the problem at hand. Quite to the contrary of what you said, each problem space is unique, each has its own constraints, and suddenly the drawbacks (or benefits) of doing certain things in certain ways do no longer apply.

Right, storing a binary file as a blob inside a database may be a bad idea, if you are storing thumbnails of product images that get served to the users thousands of times per day. You will be putting a lot of stress to the database and web servers, which could be easily relieved by using a CDN to serve those same files, and storing a relative path in the related table. But what if they are not? What if, as some other commenters pointed out, you needed transactionality and / or versioning in the image operations, and your filesystem does not offer such features? What if the images are not served often enough to justify using a CDN? What if your clients don't want to pay for a CDN? Are you going to mantain and monitor another server just for displaying images that get served very few times per week? How are you going to handle backups? And environment replication for debugging / testing / staging purposes? What if you want to force your users through the application layer to get to the binary data, for security purposes?

I believe it's always very dangerous to follow patterns and instructions blindly, and most of our industry is doing just that.

  Posted by Andres Arana on May 3, 2012 at 10:20 AM

This article is useful if you are in the right context....

All the point of view of the post are right based on a context.

For me the article applies without problem, I'm facing this kind of issues in my current project.. big Backups files, slow access due the amount of data... for me was a wrong architecture from the original DB plan.

But as other post, if I require security and other functionality it's safe to use the DB.

So please analyze the article and do be so close mind... this article can help you a lot in the future! (also the post, they have good info for when you must do what the article said to don't do.. always analyze your requirement, do add complexity where not required)

  Posted by Anonimos on May 3, 2012 at 10:20 AM

Great points from a couple of you guys!

When it comes to storing logs and sessions, there are a number of disadvantages when not containing this data in the main DB. Firstly, the setup is more complex and will require configuration to allow cross database queries (this is not an ideal design for software that you provide to third parties). Second, you loose all referential integrity unless you implement your own checking mechanism (yuk). Third, lets say for the sake of robustness we have triggers on tables that log changes to critical pieces of data. Do you really want to pay the penalty for pushing that into another separate DB?

In my experience, it is good practice to purge the system of data on a periodic basis, regardless of the architecture, to a data warehouse and keep the main database lean and snappy. In doing so you are improving performance and overcoming any issues presented by that session, log or other diagnostic data being there in the first place.

  Posted by Ozark on May 3, 2012 at 10:22 AM

hahah, S3 - are you serious!

  Posted by Geoff Mather on May 3, 2012 at 10:23 AM

If people want to use a database like a filesystem, then so be it. Personally, I prefer using a filesystem for storing files (um, that's why it's called a "filesystem"). Filesystem design is pretty advanced and understood (understatement), and there are a wealth of tools for incremental backups, etc.

Having everything in "one neat package" sounds good (to some people), but to me that sounds like one gigantic mess. OK for something trivial, maybe, but much less portable, and as it grows, backups become a real issue. Huge databases take non-trivial wall time to backup, and create storage challenges -- or at least hassles.

And no, I don't want to hear about some special-purpose incremental database backup utility that saves stuff in some proprietary format blah-blah-blah. The goal is NOT get yourself bound to a bunch of special utilities and formats. Oh wait, that's Microsoft's goal. Oops, my bad.

Store files in the filesystem. Leverage the decades of design and development there. Filesystems are robust. Use them.

  Posted by John on May 3, 2012 at 10:29 AM

Good blog! I am embarrassed to say that I tried storing image data in a database before, with poor results. :) Before the application made it to production, we changed to a file pointer index stored in the database that pointed to the image location. Much, much better performance.

Of course, as everyone else has said - one solution doesn't work for every problem. But the point of this blog is solid. Don't always assume that the DB is the best place to store all types of data.

  Posted by Paul Bourdeaux on May 3, 2012 at 10:53 AM

I agree that there aren't a lot of use cases where it makes sense to store documents in a database.

However, many document management systems (like Sharepoint and FileNet) give you the option of storing documents in a database as a blob, and I've seen multiple use cases where it makes a lot of sense.

If you're working with a manageable amount of data (< 250 GB) and the averaged file size is small, then putting all of your content and metadata in the same database can make your system much easier to maintain, backup, and fail over.

  Posted by Tom Purl on May 3, 2012 at 12:30 PM

another good reason for storing files in a database:
suppose you have a program that stores data on a database will be accessed from many different computers. one user can scan in an image, go things with it, and save it. if its saved locally then the file is only available to that user and only on that machine, to save it to a server requires a server be made available and be set up, as the maker of the application you dont have this ability...but you do know that they will have a sql server available. by saving the files to blobs in the database they become available (along with data about it) to everyone with access to that database.

that's a real world example we had to figure out some time ago where i work.

  Posted by austin on May 3, 2012 at 12:51 PM

We have learned this hard way, but you are so right about the logs. We ended up cleaning them every hour. Also it makes more sense when you are using services like Google App Engine which charges you for every freakin thing.

  Posted by Java Developer on May 3, 2012 at 3:25 PM

I totally agree with the points about the image data and the logs.

Every shortcut know to man works well for simple applications because they are too small to push any architectural decision to its limits. However, when you application starts to extend in size and scope, its those shortcuts that bite you in the ass in ways that are most times extremely difficult to fix before a deadline.

Storing a file index pointer in the database is by far a better way when scalability becomes an issue. Filesystems are also better at utilising OS buffer caches when dealing with files. As for backups and maintenance, this is what automated scripts are for. Anybody doing manual backups of any kind in 2012 should, frankly, find another career path.

The only thing that does make sense to store in a database is session data. When you have 10 webservers all serving clients in a round robin fashion, you need that session data in one place. Unless you setup HAProxy for session tracking.

  Posted by Julian Kennedy on May 4, 2012 at 4:42 AM

For imagas and the like this is quite true for version prior to 2012. Now the new objects make that fine as it properly uses the file system.

Most people forget that NTFS is a very specialized database.

  Posted by Charles Kincaid on May 4, 2012 at 11:15 AM

I can see someone making the argument "don't store ephemeral data that you don't intend to query in the database", but that is an entirely different argument.

Just because it is ephemeral doesn't mean I don't need to look at it in a variety of ways. I store intra-day bond prices in the database, only to be deleted each night after the market closes. Were I to move them to a key-value store like you suggest, then I would lose the ability to quickly generate reports using tools like SSRS and Excel.

  Posted by Jonathan Allen on May 4, 2012 at 2:09 PM

If you're only doing odd queries over your logs once in a while, you're not exploiting having a logs database. I do somewhere between fifty and a hundred manual queries a week over my logs database; it's very useful for all manner of things.

I defintively agree with having a separate DB from regular app data storage, though - avoid interaction where you can.

  Posted by Eivind Eklund on May 4, 2012 at 2:55 PM

Another thing you shouldn't put in your database: Code!
Yes, I've already seen this in the real world! Total nightmare!

  Posted by Ricardo Soares de Lima on May 4, 2012 at 5:49 PM

There is nothing wrong with logging critical business model events in a log table. You really need to record events such logins, logouts, purchase attempts, refund attempts, user information changes, system startup and shutdowns. This information is a gold mine which accessed by a database query.

  Posted by Mr Bottom on May 5, 2012 at 1:59 AM

I agree on the logs part. I have a tendency to be very liberal with logging, as I always want to know what's happening with my systems. I've always store all logs in a single log table, but I may now just move them to a separate database. I haven't had a heavy load on my systems to really test the benefits.

  Posted by Matthew on May 5, 2012 at 5:58 AM

Frank, thanks for the Riak mention. Nice work on the article. Basho (makers of Riak) also has a new cloud storage offering, RiakCS, with an S3 compatible interface.

  Posted by Randy Hamilton on May 5, 2012 at 12:54 PM

Nice post. I agree with you for the most part. But the last item probably needs more details. I think it really depends on the type of logging you're doing. The example you suggested, of logging http requests in the same DB is going to negatively impact performance. But if you're logging for data integrity, i.e. what user change what value, when? Is probably negligible (in most circumstances).

  Posted by Nicholas Quirk on May 7, 2012 at 12:15 PM

Woah! I’m really loving the template/theme of this site. It’s simple, yet effective. A lot of times it’s very hard to get that “perfect balance” between user friendliness and appearance. I must say you have done a amazing job with this. In addition, the blog loads super quick for me on Firefox. Exceptional Blog!

  Posted by Olli Warelius on May 25, 2012 at 10:06 PM