
∞ /notes/scala-object-persistence-and-the-original-nosql-xml | 2010-08-30 | programming eclipse java xml scala

Scala, object persistence, and the original NoSQL: XML

Cross-posted to: https://fsteeg.wordpress.com/2010/08/30/scala-object-persistence-and-the-original-nosql-xml/

In our digitization wiki project based on Eclipse 4 and Scala it was time to find a persistence solution. As a quick overview of what it's about, our app will basically allow a user to edit a digitized book page, showing the original scan as an image, with the selected word highlighted. It currently looks like this:

A digitization wiki using Scala and Eclipse 4

For the initial prototype implementation, we used XML files in zips to store the page content, combined with TrueZIP, a library which allows access to files in zips the way it should be: like files in a folder. So we could basically create and use the zip entries like this:

val zipEntry = new File("PPN345572629_0004.zip/0001.xml")

The XML is created using Scala's XML support, e.g. the top class in our domain model (Page) is serialized by specifying the root element and calling toXml on each containing element (the words), like this:

def toXml:Elem = <page> { words.map(_.toXml) } </page>

The XML is deserialized with the reverse action, by passing the deserialized word elements contained in the root element to the Page factory method:

def fromXml(page:Elem) = Page( for(word <- (page \ "word"))
  yield Word.fromXml(word) )
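
To make the round trip concrete, here is a minimal, self-contained sketch. The Word class with a single original field and the word element layout are assumptions for illustration; the real domain model likely carries more. On Scala 2.11 and later, scala.xml lives in the separate scala-xml module, which needs to be on the classpath.

```scala
import scala.xml.{Elem, Node}

// Hypothetical minimal Word: only the original form, stored as an attribute.
case class Word(original: String) {
  def toXml: Elem = <word original={ original }/>
}

object Word {
  // Read the word back from the "original" attribute of its element.
  def fromXml(word: Node): Word = Word((word \ "@original").text)
}

case class Page(words: List[Word]) {
  // Root element, with one child element per word.
  def toXml: Elem = <page> { words.map(_.toXml) } </page>
}

object Page {
  // Reverse action: deserialize every word element below the root.
  def fromXml(page: Elem): Page =
    Page((for (word <- page \ "word") yield Word.fromXml(word)).toList)
}

object RoundTripDemo {
  def main(args: Array[String]): Unit = {
    val page = Page(List(Word("Bonifaci"), Word("et")))
    // Serializing and then deserializing should yield an equal page.
    assert(Page.fromXml(page.toXml) == page)
  }
}
```

Since both are case classes, comparing the deserialized page with the original is a plain == check.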

So that was basically our prototypical persistence mechanism. Of course, to actually collaboratively correct our digitized texts, we needed a central DB instance on some server. Or maybe not exactly: having heard good things about CouchDB (in German), it was the first option I looked into. The distributed nature of CouchDB sounded very interesting and seemed appropriate for our project - not for performance reasons, but to provide offline or decentralized editing as an option.

So I looked into CouchDB with scouchdb, a Scala API to access it (and a very welcoming project). My colleague then pointed out that we'd lose quite a bit of installation ease if we picked a non-Java DB. That ruled CouchDB out as a client-side option for offline storage, and with it our strongest reason to consider it.

Next I considered an embedded relational Java DB with some ORM or ORM-like access. A JPA solution like EclipseLink or Hibernate seemed overkill and not ideal for Scala. Instead I took a closer look at ScalaQuery, which uses Scala's for expressions for DB queries. That seems like the thing you want if you happen to use a language that basically has a query language built in.

For instance, in ScalaQuery, we could write a query to get all pairs of pages and words where the page contains the word and the original form of the word is Bonifaci like this:

val bonifaci = for (page <- Pages; word <- Words;
  if page.id === word.pageId && word.original === "Bonifaci")
  yield (page, word)

To allow this kind of query, we need to define tables named Pages and Words. Defining the tables with ScalaQuery is quite elegant, e.g. our main table for pages could be defined like this:

object Pages extends Table[(String, java.sql.Blob)]("pages") {
  def id = column[String]("id", O PrimaryKey)
  def image = column[java.sql.Blob]("image")
  def * = id ~ image
}

However, this leads to either a duplication of our domain objects (have a table object as above and the original class) or a complete rewrite (use table objects like above as the domain objects), which both seemed wrong.

Also, thinking about alternatives to our XML representation, I came to realize that we would want some form of XML export anyway to ensure long-term access to our data, which is in the public domain and a cultural asset. Of course we could implement this as a separate export step, but it would be just perfect if a single persistence mechanism covered both.

I did remember there are some XML DBs around, but initially abandoned that idea since XML is only half of what we store (the other thing being the images used to correct the text, see screenshot above). But after being unhappy with these other solutions, I took a closer look at eXist-db, an XML DB licensed under the LGPL which by default runs on Jetty.

And I learned that eXist actually supports both XML and binary data, which was just what we needed. With eXist, we also get XQuery support and various interfaces to the data. In our case, this is less of an immediate concern, given that Scala's XML support and for expressions are a bit like built-in XQuery capabilities. For instance, a query on XML that is equivalent to the ScalaQuery from above could look like this:

val bonifaci = for (page <- xml\"page"; word <- page\"word";
  if (word\"@original").text == "Bonifaci") yield (page, word)

This is not only about as concise as the query above, but also lets us express the hierarchical nature of our data in the query ('each word in the page'), instead of having to map the relational representation ('all words with a certain page ID').
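
As a sketch of how such a hierarchical query behaves on actual data, the following runs it against a small in-memory document. The pages wrapper element, the id attributes and the sample words are made up for illustration, and the scala-xml module is assumed to be available.

```scala
import scala.xml.Elem

object XmlQuery {
  // Hypothetical sample data: a wrapper element holding three pages.
  val xml: Elem =
    <pages>
      <page id="0001"><word original="Bonifaci"/><word original="et"/></page>
      <page id="0002"><word original="Bonifaci"/></page>
      <page id="0003"><word original="et"/></page>
    </pages>

  // Each page is paired only with words nested inside it, so no page ID
  // comparison is needed to relate the two.
  val bonifaci = for (page <- xml \ "page"; word <- page \ "word";
    if (word \ "@original").text == "Bonifaci") yield (page, word)
}
```

The nesting in the XML does the work that the join condition on the page ID does in the relational version.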

As we already deserialize the XML to page objects, what we actually do is query on these. To switch the serialization from entries in a zip file to the eXist DB and allow these queries, the only conceptual change was to store the XML and image files in the DB instead of using a File object, which made for a really smooth transition.

And with Scala's rich semantics and XML support, our DB wrapper in Scala can be very simple and precise about what it offers in its API, namely XML documents for a collection ID (if found in the DB):

val xmls:Option[List[Elem]] = Db.xml(collection)

Or binary data:

val imgs:Option[List[Array[Byte]]] = Db.bin(collection)
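
To illustrate what the Option in these signatures buys us, here is a hypothetical in-memory stand-in for the wrapper. It is not our actual implementation, which talks to eXist; the Map-based stores, the DbDemo helper and the use of the zip name from above as a collection ID are assumptions for the sketch. The scala-xml module is assumed to be available.

```scala
import scala.xml.Elem

// Hypothetical in-memory stand-in for the DB wrapper (the real one uses eXist).
object Db {
  private val xmlStore: Map[String, List[Elem]] = Map(
    "PPN345572629_0004" -> List(<page><word original="Bonifaci"/></page>))
  private val binStore: Map[String, List[Array[Byte]]] = Map(
    "PPN345572629_0004" -> List(Array[Byte](1, 2, 3)))

  // None signals that the collection ID was not found.
  def xml(collection: String): Option[List[Elem]] = xmlStore.get(collection)
  def bin(collection: String): Option[List[Array[Byte]]] = binStore.get(collection)
}

object DbDemo {
  // Callers have to decide what a missing collection means; here it is
  // treated as "no documents".
  def documents(collection: String): List[Elem] =
    Db.xml(collection).getOrElse(Nil)
}
```

With this, DbDemo.documents("unknown") gives an empty list instead of a null or an exception, and the compiler forces every caller to handle the not-found case somehow.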

The elements provided by the DB in this way can be passed to the original deserialization method of Page:

val pages:List[Page] = for(page <- Db.xml(collection).getOrElse(Nil))
  yield Page.fromXml(page)

Given this, we can now use a plain Scala for expression on the deserialized objects to query our data like this:

val bonifaci = for (page <- pages; word <- page.words;
  if word.original == "Bonifaci") yield (page, word)
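
For readers less familiar with for expressions, here is a self-contained sketch of that query on plain objects, next to roughly what the compiler turns it into. The two case classes are a stripped-down, hypothetical version of the domain model (no XML parts, and an id field added so the result is easy to inspect).

```scala
// Hypothetical minimal domain model without the XML parts.
case class Word(original: String)
case class Page(id: String, words: List[Word])

object DomainQuery {
  val pages = List(
    Page("0001", List(Word("Bonifaci"), Word("et"))),
    Page("0002", List(Word("Bonifaci"))),
    Page("0003", List(Word("et"))))

  // The query from above, unchanged.
  val bonifaci = for (page <- pages; word <- page.words;
    if word.original == "Bonifaci") yield (page, word)

  // Roughly the desugared form: nested flatMap, withFilter and map calls.
  val desugared = pages.flatMap(page =>
    page.words.withFilter(_.original == "Bonifaci").map(word => (page, word)))
}
```

Both values hold the same list of (page, word) pairs, so the for expression is just a more readable way to write the collection calls.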

At this point we have replaced the file storage with the DB, but are still creating the XML manually, using Scala's XML literals. To avoid any duplication (each class parameter or field is represented again in the serialization logic), and basically eliminate all direct XML manipulation, we could use an XML binding library like scalaxb, JAXB-RI or MOXy.

On the other hand, the type safety of the Scala XML serialization code makes the duplication feel much less dangerous than traditional, string-based XML serialization (which can easily break without the compiler noticing). And since it is about as concise as some JAXB annotations and marshalling code would be, the disadvantages seem almost entirely theoretical. But we'll see - swapping the serialization logic should be as easy as swapping the storage implementation was.