Thursday, October 01, 2009

Terracotta as distributed DBMS? Bad idea!

Some months ago I have had a fucking genius idea about a new application. In order to start working to my idea I need to distributed at large tons of objects in several nodes around the Internet. After a deep research about all the best solutions available now I decided to try Terracotta.

Basically I need a real distributed Object repository so why don't create a simple Object DBMS virtually all in RAM and let to Terracotta most of the hard work?
  1. Handle the cache on nodes
  2. Manage the object/page fault in a transparent way for the application
  3. Share the load among nodes
  4. And obviously: read/write objects in persistent way
So I've developed a very simple library that uses the JDK 1.6+ TreeMap to collect data and to query them. I was surprised to discover how much terracotta makes a good job in hiding the dirty work and complexity of data distribution.

But before to think to use this piece of code in a production system I need to be sure that the objects became really persistent in ACID way. For this reason I developed some Test Cases as microbenchmarks to see the real throughtput and if data are written in synchronous way to the disk. This was my test using the library I wrote:

UserDatabase db = ServerNode.getInstance().connect("petshop", null, null);
clazz = db.getClazz("Animal");

for( int i = 0; i < 100000; ++i ){
UserObjectTransient object = clazz.createObject();


object.setValue("name", "Gaudi");
object.setValue("description", "My crazy ferret!");
object.setValue("from", "Barcelona, Spain");
object.create();
}


Pretty nice, don't you? The Terracotta server instance was on a remote server with 1GB of heap and enought bandwitch available and the client runned on my laptop. Yes I know, this is a microbenchmark and can't tell to me the performance in all the scenarios. But to go on on my experiments I needed to know if I was totally crazy or if there was a way to build a robust & scalable solution to satisfy my needs.

Well, in this test my library + Terracotta (after some tuning by reading the documentation and the forum) are able to write around 190 objects per second. Not really bad for the first release and in comparison to a RDBMS solution. But I've stored just a dynabean with two properties... I remember in the past (about 8 years ago!) when I tried a real ODBMS (Orient ODBMS) it was able to store 100x of this solution using a 8-years-old hardware!

But the really bad news is another one: When the "for cycle 100.000" was finished and the application seemed to be succesfully ended I noted that the console had no control. The test was still running! After 10 seconds I killed the java process and counted the objects created (using the Terracotta console): 99.345!

Where are the 655 objects missed? Ok, probably I got wrong about ACID features. No, the Terracotta documentation tells ACID! But how it can be ACID if the client sends the objects to the server in asynchronous way?

I'm pretty confident that it would be exist in any point of configuration the real synchronous mode but even though it exists what about the performance? If this ASYNCHRONOUS solution was able for 190 obejcts/sec as throughtput the synchronous one must be much more slow!

These are the reasons why I abandoned this path. Terracotta surely fits well in traditional replication contexts and the product is really good on documentation, presentation, APIs, etc. But for my needs I have to continue in searching...

21 comments:

Taylor said...
This comment has been removed by the author.
Ari said...

Interesting. You _might_ consider asking for a bit of help because you are getting results about 100X below what Terracotta can do. Perhaps you are not doing what it is you think you are doing.

Its hard to tell w/o your code or harness or anything. What confuses me is, of the 3 or 4 people who I have seen get this poor a result, they all seemed to post on our forums and assume they were doing something wrong. In fact, we were able to tune up their app or show them a small change that sped things up right away. Why not take the approach of assuming something is wrong in this case as well?

BTW, When using Terracotta's new Ehcache product we are reliably getting 150K transactions per second where MySQL is getting 1900. Please feel free to post your use case to our forums and we'll see what we can do.

Cheers,

--Ari
CTO, Terracotta, Inc.

Taylor said...

Luca,

You're right, micro-benchmarks are a bad idea, but your performance test seems off by at least an order of magnitude.

I just wrote a simple test to load 100,000 objects into a Map. On my recent model MacBook Pro (2.2 GHZ) I see ~3,000 objects created / sec.

It seems to me there must be something simple that you've overlooked, not sure what it could be though.

I wrote a response on my blog to show you the code I used - maybe we can figure out together what is going wrong in your test?

http://javathink.blogspot.com/2009/10/simple-load-test-in-terracotta.html

Luca Garulli said...

Hi Taylor,
thank you for your prompt response (Google Alerts? :-))

I tried the exactly the example of your blog with the only difference in setting the server persistence mode as "permanent-store" (but probably is by default).

You're right. The first 100.000 objects took only 23 seconds: 4.300 objects/sec. But repeating the same test once again (so starting with 100.000 objects in terracotta) tooks 158 seconds! Now we are at 644 objects/sec. 7 times bad. Summary:
- 1: 4.300 obj/sec
- 2: 644 obj/sec

And please consider that this test runs in local. The same test against a remote server (cleaning all previous objects) is much slower:
- 1: 1.960 obj/sec
- 2: 25 obj/sec

So try to execute your microbenchmark at least 2 or 3 times and collect the result.

Furthermore this results are not valid since the objects are not REALLY written when the "elapsed" is print.

What about this asynchronous problem? Running in localhost the time between the application prompts the result and the end of the same app is less then 1 second, but in the remote test this time is (obviously) larger. What happen if I kill the client just when it gives me the "elapsed"? All my objects are definitely lost...

bye,
Luca

Taylor said...

Luca,

1) Something like that, yes. I use Yahoo! Pipes. Here's something similar I blogged about: Terracotta Buzz

2) Running the test twice doesn't make sense. You're overwriting the keys, so the second run is a load and overwrite, where the first test is just a load. A real-world app wouldn't be starting cold like that, so it would be better to run the load test twice inside the same client overwriting keys on the second run to test overwrite performance. That said, if startup and load times from cold are an issue for you (and for some people it is depending on data-set size and SLAs) then there are many many approaches to take. The easiest of which is to set the fault-count parameter in the tc-config.xml which increases the batch size of objects loaded.

3) The reality is a bit more complex than that. Yes, the changes are being written asynchronously, but they can never be seen by any other client in an out of sync state, so for all intents and purposes in a healthy running system the system never exposes out of sync data. However you must understand this is the DEFAULT mode, and you can change that to get more or less strict semantics. If you really care about changes being written synchronously you can change the lock setting to be synchronous, instead of asyncronous. This is controlled in the tc-config.xml file. Actually, the algorithm used is precisely the same used modern JVMs to synchronize memory between threads on multi-core processors, if you just replace "processor" with "TC client" and "system memory" with "TC Server" and "system bus" with network. Of course in the case of TC the latencies are speeds are all orders of magnitude (several) slower since processors and the system bus operate on nanosecond and microsecond scales where network operates at millisecond scale.

Does that help? I'll have to go with Ari on this one, I don't think you're actually observing what you think you are observing. The reality is a bit more involved. As you state yourself, micro-benchmarks are never a good idea :) But if I had been in your shoes I certainly would have been surprised by such poor numbers. Glad you posted about it anyway instead of just jumping ship alltogether.

Taylor said...

I should add that the death of any distributed system is to operate synchronously. If you start moving in that direction, and do it without careful design, you're going to end up with a very slow system. If you've ever benchmarked 2PC (XA) systems, you'll know what I mean.

What you want to do is examine your problem domain very carefully and understand exactly where things must be synchronous, and where they can be asynchronous. Then do everything you can async, and do as little as absolutely possible sync.

In other words, a blanket approach switching everything sync is going to be slow, and a microbenchmark will probably show that.

Furthermore, one thing you need to be very careful of here is understanding that Terracotta can go several times as fast as you (and I) are "benchmarking" because these tests are single-threaded. Terracotta is designed to be very concurrent (distributed systems are very concurrent) and as such it's impossible to gauge the full throughput using only a single thread of execution.

Luca Garulli said...

Hi Taylor,
the 2nd time the test was executed it could be a sort of simulation of what happen in read/update transactions as comparison of a DBMS.

About the asynchronous issue it's not just a matter of Isolation but of consistency. If I tell to the user "Ok, your data was saved", I can't risk to loose the user's data if something goes wrong in the network or the process dump.

This is the reason why using Terracotta as distributed DBMS can't be a real solution. Maybe setting the synchronus mode it works but it's unacceptable as throughtput IMHO.

Terracotta remains a good product, but for what it has been designed ;-)

Bye,
Lvc@

Taylor said...

The second test is not a good simulation at all of a read/update scenario, but I don't want to get into that discussion on the comments.

Regarding sync vs. async, every modern DB has the ability to be ACID or not ACID. The reason for this is the same reason Terracotta also gives you that option - trade durability for performance.

As I said, you can design your object model to have sync and async operations, and limit the amount of sync operations that need to be performed by intelligently combining the two.

kirk said...

jee, I hope this isn't the only bit of code you've used for the benchmark because it is very very deficient in any aspect of benchmarking that would lead me to believe that these numbers have any meaning at all.

Have you validated that any or all of the machinery is working as you believe it to be working? Can you account for the time?

I really believe Ari was being polite when he asked for your harness. You've slammed his product without any evidence. Very unprofessional and almost slanderous. Microbenchmarking can be achieved and useful but not if you're not willing to invest the time into creating a good benchmark.

Regards,
Kirk

Mickey Alon said...

Hey Luca,
2 things I wanted to comments
1. reliability
2. durability & performance

first, the default with Terracotta seems to give you fare performance numbers, BUT you didn't know (as it is not clear to many) you are running in un-reliable mode, which I find VERY disturbing.

second, it is not true you need to scarify durability in order to have better performance if you use clustered memory solution, the backup is expected to be in memory and failover should take place in your scenario

you can check-out GigaSpaces: www.gigaspaces.com
We have a community edition that might fit your needs.

Good post
Mickey

Sergio Bossa said...

Hi Luca,

I'm currently working at a Terracotta-based key/value store, and I'm experiencing far better performance results than yours: so feel free to contact me for any question, you already have my contact details ;)

Cheers,

Sergio B.

Mickey Alon said...

A simple fact is memory to memory is by far faster than memory to disk, if you are getting extreme TPS and using a disk for durability look for the holes as Luca did...

peace,
Mickey

Ari said...

Mickey,

Please be quiet. Your product is the most dangerous on the market. It has no true synchronous replication. From your docs:

---------
In a synchronous replication scheme, the client receives acknowledgement for destructive operations only after both sets of the spaces (source and target - that is, master and backup spaces) have performed the operation.
When the master is active and the backup is not active, the client receives acknowledgement from the master space; the operation on the backup space is performed only when the backup space is restarted. The master logs all operations until the backup is started, the same as when working in asynchronous mode.
------------

Are you serious: the master ACKS a transaition when the slave is gone and queues it till later? IN SYNCHRONOUS REPLICATION MODE? WHAT?!?!?!?!

--Ari

Jerry Lam said...

I came across this blog and I would like to ask the following:
- How does Terracotta handle ACID if it is possible at all?
- What is the scalability of Terracotta when ACID is required?
- If Terracotta uses disk to maintain ACIDity, what is the difference between Terracotta and a DB? Will I/O becomes Terracotta bottleneck at the end?

The reason I ask these questions is because ACID is a must property for DBMS.

What I understand so far is that if you need 100% durability, disk will provide the highest guarantee. If performance is the key metric then you might go for in-memory solutions which trade performance with A.C.I.D.. A good in-memory solution should give you many gray levels of optimization which balances well with your complex requirements.

Ari said...

Jerry,

Terracotta (TC for short) is ACID and it puts its data on disk and it does so at 10X the throughtput and an order of magnitude lower latency than that of the other IMDGs. In fact, extensive testing we have completed shows the IMDGs are same performance as MySQL! They are just more scalable. Terracotta is both faster _and_ more scalable at the same time. I am sure this sounds like hype so let's break it down here:

1. Terracotta offers causal or strict ordering. No one else does. Causal == you are always guaranteed to see everything that happened before you in other JVMs while still running asynchronously with those JVMs. This is _hard_ to figure out but we have done it. Know who else does this? The JVM itself! Threads are guaranteed to see what happened to main memory before they got to that page to read it. We do the same--this is causal ordering (whatever causes the memory I care about to be in the state it is when I get there is guaranteed to execute before I can see it). Strict ordering means that everything happens in the time-ordered sequence it occurs in reality. If JVM1 changes something before JVM2, even if JVM2 doesn't care, the 2 will guaranteed fly in that order to the cluster. (BTW, this is what Mickey is upset about with Terracotta. He doesn't understand causal ordering and thinks it is async and wants to say this is how TC can go so fast. It is not cheating. It is just not strict ordering--just like the JVM--and thus WAY faster and lower contention than his product).

2. We leverage causal ordering technology plus asynchronous programming plus our networked high availability to give you ACID to disk at in-memory speeds. Can the disk bottleneck Terracotta, yes? Have we cheated the laws of physics? No. But we decoupled having the disk write as part of the transaction. Everything your JVM does gets written to our server and its partner. The 2 servers have your data in memory and if one dies, the other will return the data. If your JVM doesn't hear from both our servers that your data is in memory, it will keep resending the data till we have it safe on our side. Next step, we look at all the data changing per unit time, organize it and write it to disk in an efficient way. So envision our server has having 4 copies of your data. One in one TC instance, the other in the other TC instance, both in memory at in-memory speed PLUS both servers write the data to disk as fast as possible. If they can't keep up with the rate of write in the app (which has to be very high) then you can start 3, 4, 5 or n copies of TC each writing 1/nth of the data to their disks (partitioning in the caching layer, well above the disks, just like Memcache, but with disk durability built in).

Now, it is worth noting that Terracotta is taking the time in memory to pack and organize what it writes to disk so we get full streaming throughput of the disks at our disposal--just like a good DB. We can write 80MB/sec to disk. And if you stripe that data across 2 or 4 disks, we can get 300+MB/sec to disk. Per Terracotta Server. Or, you can just run 1 disk per Terracotta server and run 4 Terracotta servers each on its own box and they can get the same 300+MB/sec to disk with zero bottlenecks.

There you have it--a recipe to piss off the competition and provide ACID to disk at in-memory speed all in one box.
a) they are pissed because they do everything with strict synchronous ordering and we don't so they are going very slow
b) we are writing everything to disk on 2 boxes simultaneously and we write things that logically must happen before others in the correct order and we guarantee no one misses changes he needs so we are ACID (atomic, consistent, durable are most of what I explained here)
c) We are fast because we detect causality (not making one thread wait for any others if he doesn't have to) and we then go asynchronous allowing threads unrelated to each other to not block in any way shape or form.

Cheers,

--Ari

Luca Garulli said...

Hi Jerry,
theorically you don't need to write data to disk. If you have a cluster with 10 nodes distributed around the world the probabilities that all 10 nodes crash at the same time is really remote.

This is the theory. In the practice you need to write data in a persistent way for many other reasons:
- mantainance: sometime your objects/data changes and migrate them while the servers are running could be hard (or impossible)
- reporting: if you store data in a common RDBMS you can use the most famous reporting tool that work on top of a ODBC/JDBC driver
- psycologic? :-)

bye,
Luca Garulli

Jerry Lam said...

Hi Ari,

Thanks for your sharing.Based on what you said and if I understand correctly, Terracotta provides the scalability, the performance and ACID without compromising availability.

One question remained is that how can Terracotta provides all these nice properties at the same time? I mean the CAP theorem states that you can only have 2 of them (CAP: consistency, availability and partition tolerant) for a reliable distributed system. How can Terracotta provides all three properties at no cost (if there are some hidden costs)?

Cheers~

Jerry

Ari said...

Jerry: We provide CA - Consistency and availability. Partition-tolerance means that you can survive when the network is down and run autonomously, resolving conflicts later. We cannot.

--Ari

Mickey Alon said...

Ari,
thank you for changing the "shut-up" to "Please be quiet", let's try to keep the discussion on a professional level.
Now to your point (@see WHAT!?!?!?!)
Yes we don't block the user when a back-node fails instead we start a new one immediately. You can have more then one backup to reduce the chances for failure in that case as well.

With in-memory technology using any synchronous calls to disk is a bad idea. Relying on disk for high availability makes things even worse. We assume that machines and disks! can crash completely, we rely on pure commodity hardware setup and we don't rely on third party database (+the extra cost associated with its support) to maintain our in-memory reliability. This enables us to be elastic. In our case a backup can start on completely new machines without relying on any shared disk storage.

As a side note in earlier versions we used to throw an exception in such an event but through the experience with our users we learned that non of them turned this feature on from the reason I mentioned earlier and decided to drop it entirely. If you would look at all other in-memory data-grid solution you'll find that they do pretty much the same thing.

I would doubt that Any of your users choose to turn synchronous write to disk on as the performance penalty diminish the value of having solution like Terracotta in the first place.

In other words I would argue that having more options to deal with pure memory based reliability fits better in the realm of large scale deployment then relying completely on synchronous filesystem.

and lastly, with GigaSpaces you can register to cluster state change notification + you have the "admin API" that gives you the power to completely handle the cluster in 2 lines of code...

Anonymous said...

It was extremely interesting for me to read the post. Thanks for it. I like such topics and anything that is connected to them. I would like to read more soon.

Ari said...

Mickey,

Pretty much everything you claim about our product in your follow up post is wrong. BTW, please please please keep telling your prospects during sales meetings that we require shared storage. Great tactic to lie to people there ;)

BTW, I wonder what all these misleading statements suggest about your claims for your own product ?!?