Tuesday, September 28, 2010

Working on large datasets in OrientDB

I have been really busy over the last 5 days because I was working on a major rewrite of the RB+Tree algorithm behind the OrientDB index system. It all started a few days ago when, in the OrientDB Group, the user Mark Harwood shared his first experiences importing the entire Internet Movie Database (IMDB) into an OrientDB instance. We're talking about 13 million vertices and about 8 million edges.

Well, performance seemed good, especially in comparison with Neo4J, but the insertion time was not linear, and after a few million records the engine slowed down a lot. Furthermore, a nasty bug in the management of multiple segments broke the import job after a few minutes on his machine, while on my PC everything ran correctly.

This is why I stopped development of the Cluster feature in favor of a thorough review of the RB+Tree algorithm. Indexing has always been fast, but Mark Harwood's case stressed the engine as no one had until now. The reason lies in the import process itself, which executes random reads against the index, while the RB+Tree seemed optimized for sequential insertion and retrieval.

I'm still fixing all the pending issues, but the first benchmarks show a speed improvement in indexing of between 100% and 800%! Furthermore, times are now really linear: inserting the first million vertices takes the same time as inserting the last million. This is a big result for the already super-fast OrientDB engine.
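
For anyone who wants to check the "linear times" claim on their own machine, here is a minimal sketch of how such a measurement could look. It is not OrientDB code: it uses java.util.TreeMap (a red-black tree from the JDK) as a stand-in for the index, and the class name, method names, batch sizes and seed are all hypothetical choices made for illustration. It inserts keys in fixed-size batches, either sequentially or at random, and records how long each batch takes, so the first batch can be compared with the last one.

```java
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper, not part of OrientDB: times batched inserts into a
// JDK TreeMap used here only as a stand-in for a tree-based index.
final class InsertTimingSketch {

    /** Inserts batches * batchSize keys and returns the nanoseconds spent on each batch. */
    static long[] timeBatches(int batches, int batchSize, boolean randomKeys, long seed) {
        TreeMap<Long, Long> index = new TreeMap<>();
        Random rnd = new Random(seed);
        long[] nanosPerBatch = new long[batches];
        long nextSequentialKey = 0;
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                // Random keys mimic an import that touches the index in no particular order.
                long key = randomKeys ? rnd.nextLong() : nextSequentialKey++;
                index.put(key, key);
            }
            nanosPerBatch[b] = System.nanoTime() - start;
        }
        return nanosPerBatch;
    }

    /** Last batch time divided by first batch time; a value near 1.0 suggests flat per-batch cost. */
    static double lastToFirstRatio(long[] nanosPerBatch) {
        long first = Math.max(1L, nanosPerBatch[0]); // guard against a zero-length first measurement
        return (double) nanosPerBatch[nanosPerBatch.length - 1] / first;
    }

    public static void main(String[] args) {
        for (boolean random : new boolean[] {false, true}) {
            long[] timings = timeBatches(10, 100_000, random, 42L);
            System.out.printf("%s keys: last/first batch ratio = %.2f%n",
                    random ? "random" : "sequential", lastToFirstRatio(timings));
        }
    }
}
```

Keep in mind that the first batch also pays for JVM warm-up, so a serious comparison would discard the initial batches and repeat the run several times.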

Once the test phase is finished (with new unit tests, of course), I'll commit all the changes to SVN to be reviewed by the community before the next release.

4 comments:

valerio said...

Hello Luca,
good job!
It would be useful to the community, especially to produce fair benchmarks against other platforms, if you could share the full IMDB dataset, or provide a link to download it (or how to retrieve it from the source).

Thanks,
valerio

Luca Garulli said...

IMDB snapshots are available directly from the official site:
http://www.imdb.com/interfaces

I'd like to compare OrientDB's performance with that of other products on the same use case.

seralf said...

Hi, is it possible to run, or do you plan, some benchmark on OrientDB as a triple store? I mean, in order to have a comparison on the number of triples instead of the number of edges?

Alfredo

Luca Garulli said...

Nothing official. Why don't you start some benchmarks on your own? What RDF triple store are you using now?

Lvc@