Why not normalized relational database?

  • Andrew

    (@afedericojr)


    For small projects, WordPress is great to implement, but as projects become more involved, it seems to always come back to the core implementation of the database schema; utilizing wp_posts for so much, rather than normalized tables in a relational schema.

    Now, while page revisions, multiple post types, and more are all saved to wp_posts, I can still develop plugins that use their own tables and make the best of the situation. But why do so many plugin developers not follow this convention?

    I was surprised to see WCK (WordPress Creation Kit) use serialized data at all. They eventually implemented an unserialized-data option as a workaround, only for me to find that it, too, was neither normalized nor properly relational. On top of that, likely for the sake of backward compatibility, it ALSO still stores duplicate serialized data.
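    To illustrate the serialization complaint, here is a minimal sketch in Python with SQLite (a stand-in for illustration only, not WCK’s actual schema or WordPress code): normalized meta rows can be filtered by the database itself, while a serialized blob has to be loaded and decoded in application code first.

```python
import json
import sqlite3

# Hypothetical demo schema, not the real WCK/WordPress tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta_serialized (post_id INTEGER, blob TEXT)")
db.execute("CREATE TABLE meta_rows (post_id INTEGER, key TEXT, value TEXT)")

attrs = {"color": "red", "size": "L"}
db.execute("INSERT INTO meta_serialized VALUES (1, ?)", (json.dumps(attrs),))
db.executemany("INSERT INTO meta_rows VALUES (1, ?, ?)", attrs.items())

# Normalized rows can be filtered by the database itself...
hits = db.execute(
    "SELECT post_id FROM meta_rows WHERE key = 'color' AND value = 'red'"
).fetchall()

# ...while the serialized blob must be pulled into memory and decoded
# before the application can inspect any field inside it.
rows = db.execute("SELECT blob FROM meta_serialized").fetchall()
decoded = [json.loads(b) for (b,) in rows]
```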

    When I compare WordPress to other, cleaner frameworks, I really hope to see it take the same approach.

    I just can’t see a downside to cleaner, structured data: more refined queries, less risk of serialization errors, and ultimately a more proper schema to develop against.

    Are there any rumors of this becoming a reality?

Viewing 6 replies - 1 through 6 (of 6 total)
  • 10 years ago, I would have agreed with you. Now, not so much.

    The question isn’t whether the database should be normalized, but rather what would the benefits of such normalization be? Being normalized doesn’t make a database “faster” unless the queries you’re running against that data are optimized for the purpose.

    WordPress is ultimately pulling data to build a webpage. This generally does not require a lot of joins, it’s mostly just pulling large amounts of relatively flat data on a few known and easily indexed fields. You could just as easily store things like post data in flat files, or non-SQL databases like key/value stores.

    The reason WordPress uses MySQL is simple: MySQL is ubiquitous. It’s pretty much a given on every hosting service. On most of them, it’s generally overloaded as well, so making a fully 3NF database schema wouldn’t generally gain you a lot. The main bottleneck is not actually in the queries themselves, most of the time, but in the data transfer between the database server and the web server that’s building the pages from that data.

    If you have a special case where storing data in a relational manner is needed, the option to create new tables for it is there for you and can be expanded to larger things like multisite as well. But for most cases of “building a web page”, storing relatively flat data as “posts” with optional key/value pairs in the form of post-meta makes a reasonably good fit to the purpose. It doesn’t have to be normalized, because it’s not data that you’re doing complex relational queries on. You get the posts, you display the posts. That’s more or less that.
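    As a rough sketch of that flat shape (Python/SQLite used purely as a stand-in, not the real wp_posts schema), building a page is a single indexed scan on a couple of known fields, with optional key/value meta alongside:

```python
import sqlite3

# Simplified stand-in for the posts-plus-postmeta shape described above.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, post_type TEXT, "
    "post_status TEXT, post_title TEXT, post_content TEXT)"
)
db.execute("CREATE TABLE postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT)")
db.execute("CREATE INDEX type_status ON posts (post_type, post_status)")

db.execute("INSERT INTO posts VALUES (1, 'post', 'publish', 'Hello', 'First post')")
db.execute("INSERT INTO postmeta VALUES (1, 'subtitle', 'A greeting')")

# Building a page is one indexed scan on known fields; no joins required.
page = db.execute(
    "SELECT id, post_title FROM posts "
    "WHERE post_type = 'post' AND post_status = 'publish'"
).fetchall()
```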

    Andrew

    (@afedericojr)

    Thank you for your reply.
    As mentioned, I think it will do for small projects, but it is highly inefficient, wastes memory, and leaves more room for error.

    MySQL can handle tables of over a billion records, and to many that may seem like plenty, but practical limits do exist. My project has just begun and already holds several hundred thousand records. Combine that with version history and many custom post types, and within 10 years there will be a serious issue. That is not planning or scaling appropriately for the future.

    Aside from that, every select loads into memory, and it is “WRONG” to load data that is not needed.
    If you have 10 post types with a common author and need to filter through all of them to display the posts on an author’s bio, all of that post data is loaded into memory. Caching, CDNs, etc. only go so far; the proper way is to query ONLY the table containing the data needed, so that memory utilization stays efficient.

    Another consideration is table corruption. If my schema is normalized and data is corrupted in a table, only that table is unavailable and the rest of the site will function. This means that the base web application should operate independently, with all auxiliary calls separate, so that the functionality of the site is not impacted when a query fails to return.

    When considering penetration testing, not only is the size of the data once again an issue, but having all the data in a single place, or even just a few (wp_posts, wp_postmeta, etc.), is not wise. Traversal should be caught at the application level, and an attacker would need to know the table names to attempt blind SQL injection.

    Lastly, WP_Query is also inefficient in its current implementation, causing multiple unnecessary queries. When a page is loaded, it should follow MVC methodology: query the data once and return it. My first implementation of this project brought in WCK to help fast-track progress, but with the complexity involved, we initially had pages issuing 3,000+ queries, taking anywhere from over a minute to load to simply timing out.
    Once we discovered the limitation of the structure and hacked the core, we now have a single query and an average page load of 700 milliseconds.
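    The difference described above is essentially the classic N+1 query pattern versus a single JOIN. A hypothetical Python/SQLite sketch, not the actual WCK-based code:

```python
import sqlite3

# Illustrative stand-in: one query per post for its meta (N+1),
# versus everything the page needs in one round trip.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT)")
for i in range(1, 4):
    db.execute("INSERT INTO posts VALUES (?, ?)", (i, f"Post {i}"))
    db.execute("INSERT INTO postmeta VALUES (?, 'views', ?)", (i, str(i * 10)))

# N+1: one query to list posts, then one more per post.
queries = 0
posts = db.execute("SELECT id, title FROM posts").fetchall()
queries += 1
for post_id, _ in posts:
    db.execute(
        "SELECT meta_value FROM postmeta WHERE post_id = ?", (post_id,)
    ).fetchall()
    queries += 1

# Single JOIN: one query, one round trip.
joined = db.execute(
    "SELECT p.id, p.title, m.meta_value FROM posts p "
    "JOIN postmeta m ON m.post_id = p.id WHERE m.meta_key = 'views'"
).fetchall()
```

    With three posts the N+1 path already costs four queries; the JOIN costs one regardless of post count.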

    To take it a step further, we have now developed our own API and push the data from the database out as JSON for the site to digest. The front-end site no longer touches the WordPress database, and the flow is unidirectional. This greatly reduces the possibility of being hacked and keeps site efficiency at a maximum.
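    A minimal sketch of that push-to-JSON idea (Python/SQLite as a stand-in, with hypothetical names): rows are exported once at publish time into a static JSON document, so the front end never opens a database connection.

```python
import json
import sqlite3

# Stand-in source database; in the described setup this would be the
# WordPress database behind the API.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO posts VALUES (1, 'Hello')")

# Export once; the front end consumes `payload` (e.g. served as a
# static file), keeping the flow strictly one-directional.
payload = json.dumps(
    [{"id": i, "title": t} for i, t in db.execute("SELECT id, title FROM posts")]
)
```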

    The point is, other frameworks already do this. Drupal, Laravel, etc. are already much more mature than WordPress in this respect, and I am only recommending that the community learn from that.

    Andrew

    (@afedericojr)

    One final point. WordPress is amazing, and it has the potential to be so much more. Having begun as a blogging platform, it would be unfortunate to see the vision stay small when some basic core changes (hooks into core functions, WP_Query enhancements, and data normalization) could make WordPress an enterprise go-to.

    every select loads into memory, and it is “WRONG” to load data that is not needed.

    WordPress does not generally load data that isn’t needed. Most sites build a single user-facing page using less than 20 queries total. That’s lower than pretty much every other CMS out there.

    Another consideration is table corruption. If my schema is normalized, and data is corrupted in a table, only that table is unavailable and the rest of the site will function.

    Table locking and corruption handling has nothing to do with normalization. In fact, a database in Third-Normal-Form is more prone to table corruption errors, because a corrupted table used in many JOINs breaks the relational restrictions. Meaning that one table failure in a 3NF schema actually can prevent all insertion of any kind, just because the table is inaccessible.

    You’re misunderstanding the purpose of database normalization. It is not to ensure speed, or for error-proofing. Normalization’s sole goal is simply one of data consistency. That is, if the database is properly normalized, then it is always consistent. It becomes impossible to INSERT inconsistent data because of the relational nature of the tables and their relationships between keys. It becomes impossible to DELETE data and produce an inconsistent result for the same reason. That’s what Normalization does.
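    That consistency guarantee can be shown in a few lines (Python/SQLite used for illustration only): with referential constraints enforced, the database itself refuses an orphaned row.

```python
import sqlite3

# Illustrative schema: posts must reference an existing author.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
db.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, "
    "author_id INTEGER REFERENCES authors(id))"
)
db.execute("INSERT INTO authors VALUES (1, 'Andrew')")
db.execute("INSERT INTO posts VALUES (1, 1)")  # consistent: author 1 exists

# An inconsistent INSERT is impossible; the constraint rejects it.
try:
    db.execute("INSERT INTO posts VALUES (2, 99)")  # no author 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```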

    Traversal should be caught functionally and they would need to know what the table names are for blind sql injection.

    Blind injection is an uncommon issue, and discovering table names, or even fuzzing the most common ones, is not difficult either. You don’t secure things by obscuring them. Yes, you can limit damage through those practices, but it’s much better to simply prevent the injection vectors in the first place.
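    As a sketch of preventing the vector itself (Python/SQLite for illustration; WordPress’s own equivalent is prepared statements in its database layer), a parameterized query binds attacker input as data, so knowing table names buys the attacker nothing:

```python
import sqlite3

# Minimal table with one legitimate row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.execute("INSERT INTO users VALUES ('alice')")

# A classic injection attempt against a naive string-built query.
evil = "alice' OR '1'='1"

# The placeholder binds the whole string as a single value; it cannot
# terminate the quote and rewrite the query, so nothing matches.
rows = db.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchall()
```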

    The point is, other frameworks already do this. Drupal, Laravel, etc are already much more mature than WordPress, and I am only recommending that the community learn from that.

    Other frameworks like these are not really our competitors. You’re describing developer ideas and systems. Tools for programmers. WordPress is not a programming language or tool, it’s a publishing tool. For writing words on websites.

    Yes, practices like these you mention are important to understand and follow, but they don’t drive growth or adoption rates. That’s not what makes a better publishing tool. The best programmed thing is not always the best user-friendly thing.

    BTW: This forum, where you’re reading this post? WordPress. Running the bbPress 2 plugin.

    Number of SQL queries executed to build this page: 6.

    Andrew

    (@afedericojr)

    Thanks for the reply Samuel.

    If you speak for the majority community, that is fine, but I do disagree, which should also be fine.

    Table locking and corruption handling has nothing to do with normalization. In fact, a database in Third-Normal-Form is more prone to table corruption errors, because a corrupted table used in many JOINs breaks the relational restrictions. Meaning that one table failure in a 3NF schema actually can prevent all insertion of any kind, just because the table is inaccessible.

    It does, though. A single table with many columns, utilized far more heavily than it would be in a normalized relational schema, is far more likely to experience locks under concurrency. Should data become corrupt, the entire table is affected for every query, rather than perhaps a single post type being unavailable; in that case the rest of my site will still work, while yours may not.

    Regarding your mention of many JOINs, I am not saying to over-normalize here, as that would have its own negative effect on queries, but reducing the amount of data that needs to be returned to memory, filtered, sorted, and displayed is much more efficient when done in smaller logical sets.

    Consider even Microsoft’s SharePoint, where they have made huge strides in this area. They originally imposed a best-practice soft limit of keeping list queries below 5,000 items for this exact reason, as sites became slower, even as intranets.

    Blind injection is an uncommon issue, and discovering table names, or even fuzzing the most common ones, is not difficult either. You don’t secure things by obscuring them. Yes, you can limit damage through those practices, but it’s much better to simply prevent the injection vectors in the first place.

    Obscurity is a part of security. Let’s not forget that your installation allows renaming the table prefix. When someone Google-hacks a community framework, they are looking for common known exploits. When someone runs WPScan in Kali Linux, it too works through a library of common attacks. I can enact each of these manually in my penetration testing, but packaged attacks are much more likely, and that may be one reason people rename wp-admin. Does that make it secure? Of course not, and you should implement access control so it is only reachable from approved IPs. Does that secure it? No, addresses can be spoofed, but it is certainly part of the process. The point is, the more obscure you are, the less likely they are to even know you’re there. On top of that, secure the application, and on top of that, feed them incorrect data during the attack.

    Yes, practices like these you mention are important to understand and follow, but they don’t drive growth or adoption rates. That’s not what makes a better publishing tool. The best programmed thing is not always the best user-friendly thing.

    If that is the consensus and vision of the community, I am saddened to hear it, as WordPress is quite close to reaching the next level. I did not come here to pick a fight, but to present a few concerns and take a peek at your vision.

    Regards
