Saturday, March 28, 2015

“Database is wrong for you” and all that FUD

Checksum-based storage. It’s one of the key features that make Artifactory better than the competition. Here is a typical false claim made by Sonatype (creator of Nexus, which uses a plain Maven2 filesystem for its repository) about Artifactory storage:

“Artifactory takes the polar opposite approach and stores the metadata and the artifacts themselves in a huge database. The reason they claim it’s needed is for transactional behavior. Using a database doesn’t guarantee transactionality and it certainly isn’t the only way to get transactional behavior.”
Not only are these claims and vague reasoning wrong (we didn’t do it just for the sake of transactional behavior, and we recommend storing the artifacts on disk), but so is the main claim that using a database is bad for you. Here are the real reasons that Artifactory uses checksum-based storage:
  • Deduplication- By referencing binaries by their checksum, pretty much like Git or Dropbox do, and not relying on filesystem paths, same-content files are never stored more than once. This is one of the few ways you can optimize the storage of binaries.
  • Free copy and move- Artifact promotion (i.e. moving between repositories with different visibility rules) is critical for building continuous delivery pipelines. With checksum-based storage, those operations don’t involve any filesystem activity at all: they only add and remove file references in the database.
  • Efficient uploads, downloads and replication- Optimize network operations by first sending checksum headers. Files that already exist in storage are skipped, even when they appear under a different path.
  • Filesystem performance- Since files are never overwritten, and are only deleted during storage GC, normal operation never requires a write lock on the FS. Expect significant performance gains compared to plain FS storage.
  • Search stability and performance- Misusing frameworks initially meant for full-text indexing to search through a filesystem is fragile and slow. With checksum-based storage, all repository information and artifact metadata are stored in optimized database indexes: always up to date, blazingly fast and never broken.
  • Freedom of layout- When metadata is extracted only from the filesystem and the files are locked in a Maven2 (or any other) layout, it’s hard, if not impossible, to support other layouts. Since Artifactory uses the database as a layer of indirection between the actual storage and the displayed layout, it natively supports all layouts including Maven2, Maven1, Ivy, Gradle, NuGet, YUM, npm, Debian, Docker, PyPI or any other custom layout.
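
To make the list above concrete, here is a toy sketch of the idea in Python. It is not Artifactory's code: the `ChecksumStore` class, its method names, the in-memory dictionaries standing in for the binary store and the database, and the choice of SHA-256 are all assumptions made purely for illustration.

```python
import hashlib


class ChecksumStore:
    """Toy model (hypothetical): blobs keyed by checksum, paths are references."""

    def __init__(self):
        self.blobs = {}  # checksum -> bytes; stands in for the binary store
        self.paths = {}  # (repo, path) -> checksum; stands in for the database

    @staticmethod
    def checksum(data: bytes) -> str:
        # SHA-256 is an illustrative choice, not a statement about any product.
        return hashlib.sha256(data).hexdigest()

    def deploy(self, repo: str, path: str, data: bytes) -> str:
        """Store the bytes once per checksum, then record a path reference."""
        digest = self.checksum(data)
        self.blobs.setdefault(digest, data)  # no-op if the content already exists
        self.paths[(repo, path)] = digest
        return digest

    def deploy_by_checksum(self, repo: str, path: str, digest: str) -> bool:
        """Checksum-first upload: succeeds without a body if the content is known."""
        if digest not in self.blobs:
            return False  # the client would have to send the actual bytes
        self.paths[(repo, path)] = digest
        return True

    def move(self, src_repo: str, dst_repo: str, path: str) -> None:
        """Promotion: only the reference changes, no blob is touched."""
        self.paths[(dst_repo, path)] = self.paths.pop((src_repo, path))

    def read(self, repo: str, path: str) -> bytes:
        return self.blobs[self.paths[(repo, path)]]

    def gc(self) -> int:
        """Delete blobs that no path references any more; return how many."""
        live = set(self.paths.values())
        dead = [d for d in self.blobs if d not in live]
        for d in dead:
            del self.blobs[d]
        return len(dead)


store = ChecksumStore()
jar = b"pretend these are the bytes of a jar"
digest = store.deploy("staging", "org/acme/app/1.0/app-1.0.jar", jar)
store.deploy("staging", "mirror/app.jar", jar)              # same content, new path
store.deploy_by_checksum("staging", "third/app.jar", digest)  # no bytes sent
store.move("staging", "releases", "org/acme/app/1.0/app-1.0.jar")
print(len(store.blobs), len(store.paths))
```

In this sketch, three paths point at a single stored blob, and the promotion from `staging` to `releases` rewrites one dictionary entry without copying any bytes. Blobs are removed only by `gc()`, which mirrors the "never overwritten, only deleted during storage GC" point above.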

Get the picture?

Smart checksum-based storage == powerful features and better performance.

Naturally, Artifactory supports all the major relational databases including Oracle, MySQL, MS SQL Server and PostgreSQL. We could have used any database, including NoSQL ones; however, we chose a relational database as a bullet-proof technology that many of our customers, enterprise and startups alike, feel at home with. Out of the box, Artifactory comes with a bundled Apache Derby DB, and customers switch to their production DB by modifying one line of configuration.

And look who’s crashing the party!

It took a while, but it looks like Sonatype finally got the picture as well.
The Maven2 layout is burned into Nexus repository information, which is a mixture of a Maven2-inflicted Nexus index in Lucene format and the filesystem. This effectively blocks Nexus from being flexible in supporting other systems and layouts.
To solve this problem, Sonatype mimicked the Artifactory approach by using H2 for their NuGet support. It didn’t help much; Nexus support for NuGet is “suboptimal” to say the least, but at least it shows someone finally understood that relying purely on the filesystem for metadata and indexes cannot scale.

“H2” you said?

So what’s that H2 database that Sonatype uses for Nexus? Well, it’s an embedded in-process database which is supposed to give Nexus an escape route from its Maven2 layout prison. Is it any good? It is! Does it have commercial support that you might transitively need when using Nexus for production? Judge for yourself! And I’ll quote:
“Commercial support for H2 is available from Steve McLeod (steve dot mcleod at gmail dot com). Please note he is not one of the main developers of H2.”
All in all, it’s definitely a step in the right direction. Unfortunately, it leaves Nexus users in pretty much the same place: a repository locked down to the Maven2 layout filesystem, effectively making anything other than Maven artifacts a second-class citizen; and optimized metadata locked down to a production-questionable in-process database.



