The Zope Object Database provides an object-oriented database for Python
 that provides a high-degree of transparency.  Applications can take 
advantage of object database features with few, if any, changes to 
application logic.  Usage of the database is described and illustrated 
with an example.  Features such as a plug-able storage interface, rich 
transaction support, undo, and a powerful object cache are described.
1. Introduction
 
Many applications need to store data for use over multiple application 
executions, or to use more data than can practically be stored in 
memory.  A number of approaches can be used to manage large amounts of 
persistent data. Perhaps the most common approach is to use relational 
database systems.  Relational database systems provide a simple model 
for organizing data into tables, and most can handle large amounts of 
data effectively.  Because of their simple data model, relational 
databases are easy to understand, at least for small problems.  
Unfortunately, relational databases can become quite cumbersome when the
 problem domain does not fit a simple tabular organization.
 
An advantage of relational database systems is their 
programming-language neutrality. Data are stored in tables, which are 
language independent.  An application must read data from tables into 
program variables before use and must write modified data back to tables
 when necessary.  This puts a significant burden on the application 
developer.  A significant amount of application logic is devoted to 
translation of data to and from the relational model.
 
An alternative is to retain the tabular structure in the program.  For 
example, rather than populating objects from tables, simply create and 
use table objects within the application.  In this case, high-level 
tools can be used to load tables from the relational database.  With 
sufficient knowledge of database keys, tools could automate saving data 
when tables are changed.  A disadvantage of this approach is that it 
forces the application to be written to the relational model, rather 
than in an object-oriented fashion.  The benefits of object orientation,
 such as encapsulation and association of logic with data are lost.
 
Object databases provide a tighter integration between an applications 
object model and data storage.  Data are not stored in tables, but in 
ways that reflect the organization of the information in the problem 
domain.  Application developers are freed from writing logic for moving 
data to and from storage
1
.
 
The purpose of this paper is to present an object database for Python, 
the Zope Object Database (ZODB).  The goals of the paper are to describe
 the use and benefits of the ZODB, provide a high-level architectural 
view, to highlight interesting technical issues, and to describe recent 
and future developments.
2. Application development
2.1 Example: an issue tracking system
 
This section will present a simple issue tracking system as a means for showing how the Zope object database can be used.
 
Consider an application that manages a collection of issues.  The data 
for this application might be implemented in an 'Issue' module, as shown
 in Example 
See A simple Issue module..
:
| from TextIndex import TextIndex
class Issues:
  def __init__(self):
    self._index=TextIndex()
    self._issues=[]
  def addIssue(self, issue):
    issue.setId(len(self._issues))
    self._issues.append(issue)
  def __getitem__(self, i): 
    return self._issues[i]
  def search(self, text):
    return map(
       self.__getitem__,
       self._index.search(text))
class Comment:
  _text=''
    
  def __init__(self, text, parent):
    self._parent=parent
    self.edit(text)
    self._comments=[]
  def text(self):
    return self.text
  def edit(self, text):
    self._unindex(self._text)
    self._text=text
    self._index(self._text)
  def _index(self, text):
    self._parent._index(text)
  def _unindex(self, text):
    self._parent._unindex(text)
  def __getitem__(self, i): 
    return self._comments[i]
  def comment(self, text):
    self._comments.append(
        Comment(text, self))
class Issue(Comment):
  _id=None
  def __init__(self, title, text, 
               parent):
     Comment.__init__(self, text, 
        parent)
     self._title = title
  def setId(self, id):
    self._id=id
    self._index(self._text)
  def title(self): return self._title
    
  def _index(self, text):
    if self._id is not None:         
       self._parent._index(
         text, self._id)
    
  def _unindex(self, text):
    if self._id is not None:
        self._parent._index(
          text, self._id) | 
 
There is an 
Issues
 class that manages a collection of issues and a text index to support 
full-text search for issues.  An issue may have comments, which may have
 comments, and so on, recursively.  The text for an issue and it's 
comments is indexed so that issues can be searched for based on issue 
and comment text.
 
An application for managing issues will typically be some sort of server
 or long-running application, like a web application or an interactive 
graphical application.  For brevity, the application will be presented 
here as a collection of scripts that operate on issues data.
 
A script for adding issues might be along the lines of that shown in Example 
See A script for adding an issue.
 
| import Issue, sys issues=Issue.Issues() issue=Issue.Issue( sys.argv[1], sys.argv[2], issues) issues.addIssue(issue) | 
 
An obvious problem with this script is that it recreates the issue 
database each time. Obviously, some logic needs to be added to make data
 persistent between script invocations.  The data could be stored in a 
relational database, but it would be  cumbersome to map the hierarchical
 issue data to a relational model, let alone the text index, which is a 
"black box" from the point of view of the issue application.
 
A simple way to add persistence is to save the data in a file in Python pickle format (Example
See A script for adding an issue and saving the issue data in pickler format.
 )
| import Issue, sys, pickle, os issues=pickle.Unpickler( open(`issues.pickle')).load() issue=Issue.Issue( sys.argv[1], sys.argv[2], issues) issues.addIssue(issue) pickle.Pickler( open(`issues.pickle','w') ).dump(issues) | 
 
The data are stored in the file, 
issues.pickle
.  When we start the application, the data are read by opening the file, creating an unpickler on it, and calling the 
load
 method on the unpickler to load the data.  After adding the issue, the 
data must be written to the file by opening the file for writing, 
creating a pickler on it, and calling the pickler's 
dump
 method to save the data.
 
Before calling the add script, the data file must be created with an initialization script (Example 
See A script for initializing a pickle file with an empty issues collection..
). 
| import Issue, pickle pickle.Pickler( open(`issues.pickle','w') ).dump(Issue.Issues()) | 
 
This approach is very simple, but does not scale very well.  The entire 
database is read or written every time an issue is read or saved. A 
better approach is to use the ZODB.  To do this, there are a few changes
 that need to be made to the application.  First, the application 
classes must be changed to mix-in a special persistence class (Example 
See A simple issue module modified to use the ZODB.
)
| import PTextIndex, Persistence
class Issues(Persistence.Persistent):
  def __init__(self): 
    self._index=TextIndex()
    self._issues=[]
  def addIssue(self, issue):
    issue.setId(len(self._issues))
    self._issues.append(issue)
    self._p_changed=1
  def __getitem__(self, i): 
    return self._issues[i]
  def search(self, text):
    return map(
      self.__getitem__, 
      self._index.search(text))
class Comment(Persistence.Persistent):
  _text=''
    
  def __init__(self, text, parent):
    self._parent=parent
    self.edit(text)
  def text(self): return self.text
  def edit(self, text):
    self._unindex(self._text)
    self._text=text
    self._index(self._text)
  def _index(self, text):
    self._parent._index(text)
    
  def _unindex(self, text):
    self._parent._unindex(text)
  def __getitem__(self, i): 
    return self._comments[i]
  def comment(self, text):
     self._comments.append(
       Comment(text, self))
    self._p_changed=1
class Issue(Comment):
  _id=None
  def __init__(self, title, text, parent):
    Comment.__init__(self, 
                     text, parent)
    self._title = title
  def setId(self, id):
    self._id=id
    self._index(self._text)
  def title(self): return self._title
    
  def _index(self, text):
    if self._id is not None:
      self._parent._index(
        text, self._id)
    
    def _unindex(self, text):
      if self._id is not None:
        self._parent._index(
           text, self._id) | 
 
Changing the application classes is straightforward. The first change needed is to add the 
Persistence.Persistent
 base class.
 
We need to add a line to the 
addIssue
 and 
comment
 methods  to notify the persistence system that objects have changed:
self._p_changed=1
 
This change is necessary because we have modified a list sub-object that
 doesn't participate in persistence.  The normal automatic detection of 
object changed doesn't work in this case. See 
See The rules of persistence.
later in this paper for further discussion of this change.
 
The text index is a bit more problematic.  We need a modified version of
 the text index that mixes in the persistent base class as well. This is
 shown by using a different version of the text index. Modifying the 
text index  is problematic because the text-index is outside the 
application and it would be preferable if the text index did not have to
 be changed. 
 
Finally, the application scripts must be modified. The new add script is shown in example 
See A script for adding an issue by updating an issues collection on a ZODB..
.
| import sys, ZODB, ZODB.FileStorage
import Issue
db=ZODB.DB(
       ZODB.FileStorage.FileStorage(
         `issues.fs'))
issues=db.open().root()[`issues']
issue=Issue.Issue(
         sys.argv[1], sys.argv[2], 
        issues)
issues.addIssue(issue)
get_transaction().commit() | 
 
This add script is similar to the previous one except for a few details.
  First, note the order of the imports. In particular, the application 
module, 
Issue
, is loaded after 
ZODB
. In this case, the import order is important.  The 
Issue
 module imports the 
Persistence
 module. This module is initially empty. When 
ZODB
 is imported, it populates the 
Persistence
 module with classes, like 
Persistent
, that depend on ZODB.  This sequence of imports may seem odd, but it 
allows ZODB to be renamed without affecting much application code. This 
was very useful when switching from the older version of the object 
database, 
BoboPOS
, to 
ZODB
.
 
Rather than loading all of the data from a pickle file, we open the 
object database, open a connection to the database, and get the root 
object, named "issues" from the database.  The Zope object database 
allows a number of different kinds of low-level storage managers to be 
used.  We must first create a storage object, and then create a database
 object using the storage object.  In this example, we used a "file" 
storage, which is a ZODB storage that stores data in a single file.  
Other storages are or soon will be available, such as dbm file storages 
and storages that use relational databases.
 
It's important to note in this example, that we're only loading a small 
part of the database into memory.  Essentially, only the issue container
 and issue place-holders are loaded into memory.  Issue state and issue 
comments are not loaded.
 
Rather than dumping the entire database as a single pickle, we simply 
commit a transaction.  This is an important feature of the ZODB.  The 
application programmer does not have to be aware of the objects that 
were changed in a computation.  The application programmer simply needs 
to define when work should be saved.  This is especially important in 
object-oriented applications.  For an application programmer to control 
what objects need to be saved would require knowledge of object 
internals.  For example, a user of issues would need to know that a 
Issues
 objects contain indexes that needed to be saved when an issue was added or modified.
 
ZODB installs a function, 
get_transaction
 function in the Python 
__builtins__
 module. This is done so that transaction-aware tools can use a 
transaction manager without depending on specific database 
implementations.  To commit the current transaction, call 
get_transaction
 to get the current transaction, and then call the transaction's 
commit
 method to commit the transaction, as shown in  example 
See A script for adding an issue by updating an issues collection on a ZODB..
.
 
As when storing data in a pickle file, we need a script that initializes the database (example 
See A script for initializing a ZODB with an issues collection..
).
| import ZODB, ZODB.FileStorage
import Issue
db=ZODB.DB(
   ZODB.FileStorage.FileStorage(
      `issues.fs', create=1))
root=db.open().root()
root[`issues']=Issue.Issues()
get_transaction().commit() | 
 
In a long running application, such as a web application or a graphical 
application, database open and creation are typically performed during 
application start-up, so this code is not required in every part of the 
application that modifies data.  Further, transaction boundaries are 
usually defined outside of the ordinary application code.  In a web 
application, a transaction might be committed at the end of a web 
request, as is done in Zope. In a graphical application, there might be 
menu options  for "saving work" that commits a transaction.  Typically, 
application code doesn't need to define transaction boundaries.
 
Usually, business logic doesn't contain any database related code, with the exception of mixing in the 
Persistent
 base class in class statements.  There are some cases when the 
application developer does have to be aware of persistence issues.  
These cases will be discussed in later sections of the paper.
2.2 Database organization
 
The ZODB database spreads object storage over multiple records.  Each 
stored persistent object has it's own database record.  When an object 
is modified and saved to the database, only the object's record is 
affected.  Records for unchanged persistent sub-objects are unaffected. 
 Each object has a persistent object id that uniquely identifies the 
object within the database and is used to lookup object data in the 
database.
 
The database has a designated "root" object, which provides access to 
application root objects by name.  An application typically provides a 
single root object as in the example given earlier in this paper.  All 
other objects are accessed through object traversal from the root, where
 object traversal might be performed by attribute access, item access, 
or method call.
 
There is no application level organization imposed by the ZODB.  There 
is no database-imposed notion of tables or indexes.  Applications are 
free to impose any organization on the object database.  One could 
implement a relational database on top of the ZODB.  Indexes are readily
 implemented on top of ZODB.  Zope includes a number of high- and 
low-level indexing facilities built on the ZODB.
2.3 The rules of persistence
 
Most applications require few changes to use the ZODB.  There are, 
however, a few rules that must be followed.  This section details what 
these rules are and the reasons behind them.
 
A major goal of the ZODB is to make persistence as automatic as possible.  Infrastructure exists to automate two critical tasks:
- Notifying the persistence system when an object has changed
- Notifying the persistence system when an object has been accessed
 
The persistence system keeps track of changes to objects so that only 
changed objects are saved when a transaction is committed and so that 
old state can be restored when a transaction is aborted.
- Persistent object classes must subclass persistent object classes.
- All sub-objects of persistent objects must be persistent or immutable.
 
There is a standard persistent base class 
Persistence.Persistent
, that is typically subclassed, directly or indirectly.  This class provides implementations of the special Python methods 
__getattr__
 and 
__setattr__
 that notify the persistence system when an object is accessed or 
modified.  This is the key mechanism by which the tasks described above 
are automated.
 
For standard Python class instances, the special method 
__getattr__
  is called only when a normal attribute look-up fails.  To know when an
 object can be removed from memory, it is necessary to execute logic on 
every attribute access.  For this reason, the persistent base class is 
not an ordinary Python class.  It is, instead, an ExtensionClass 
[Fulton96]. Extension classes are not technically Python classes, but 
are class-like objects that provide features found both in Python 
classes and built-in types.  Any sub-class of an extension class is an 
extension class, so all persistent object classes are extension classes.
 
This rule is necessary because, without it, the persistence system would not be notified of persistent object state changes.
 
Like most rules, this rule can be broken with care, as is done in the 
issue tracking system.  A persistent object can use mutable 
non-persistent sub-objects if it notifies the persistence system that 
the sub-object has changed. It can do this in two ways.  It can notify 
the persistence system directly by assigning a true value to the 
attribute _p_changed, as in:
  def addIssue(self, issue):
    issue.setId(i=len(self._issues)
    self._issues.append(issue)
    self._p_changed=1
  def addIssue(self, issue):
    issue.setId(len(self._issues))
    self._issues.append(issue)
    self._issues=self._issues
- A persistent object must not implement __getattr__ or __setattr__ .
- Persistent objects must be pickle-able.
 
These special methods are already implemented by the persistence system.
 Overriding them correctly, while possible, is extremely difficult.
 
The ZODB stores objects in Python pickle format [van Rossum99].  All of 
the rules for pickling objects apply.  See the documentation for the 
Python 
pickle
 module for more details.
 
Sometimes, a persistent object may temporarily contain unpickleable 
sub-objects.  This is possible as long as the unpickleable objects are 
not included in the object's pickled state.  The object's pickled state 
is obtained during pickling by calling the object's 
__getstate__
 method with no arguments.  The persistent base class, 
Persistence.Persistent
, provides an implementation of 
__getstate__
 that returns the items in an object's instance dictionary excluding items with keys that start with the prefix "
_v_
" or "
_p_
".  The easiest way to prevent data from being pickled is to assign it to an attribute with a name beginning with "
_v_
"
2
.
 
An object's state may be freed at any time by the ZODB to conserve 
memory usage. For this reason, an object must be prepared to recompute 
sub-objects that are not included in the pickled state.  A convenient 
place to do this is in the 
__setstate__
 method, which is called when an object's state is loaded from a 
database.  For example, one might have a persistent object that provides
 an interface to an external file. The persistent object stores the file
 name in it's persistent state and uses a "volatile" variable to hold 
the open file:
  class pfile(Persistence.Persistent):
 def __init__(self, file_name): 
  self._file_name=file_name
  self._v_file=open(file_name)
 def __setstate__(self, state): 
 
   Persistence.Persistent. \
          __setstate__(
             self, state)
   self._v_file=open(
          self._file_name)
 
In addition to the rules of persistence above, the following advice is worth heeding by authors of any pickleable objects:
- Never implement the obsolete __getinitargs__ pickling method. This method introduces significant backward compatibility problems.
- Avoid implementing custom pickle state by overriding the pickling methods __getstate__ and __setstate__ . Overriding these methods provides greater control and can allow significant optimizations, however, experience has shown that using custom pickle state formats introduces brittleness to an application that is rarely justified by the optimization benefits.
2.4 Object copies and states
 
With regard to persistence, Python objects have one state, which is 
existence.  They enter existence when they are created, and leave 
existence when they are destroyed.
 
Objects that are made persistent with the standard pickle module can be 
in two states, in memory, and pickled, and can have multiple copies, any
 of which are in one of the two states.  Objects are created only once, 
even though they may be copied to and from storage many times.  The 
constructor is called only when an object is created initially
3
.
 
ZODB persistent objects can have additional states. Like ordinary 
pickled objects, persistent objects can have copies that are stored 
somewhere as pickles.  Persistent objects are created only once, but may
 be copied two and from storage many times.  When in memory, ZODB 
persistent objects may be in one of several states. The object states 
and transactions are summarized in figure 
See State diagram showing in-memory persistent object states and transitions..
.  The states are described below.
| Figure 1. State diagram showing in-memory persistent object states and transitions. | 
2.5 Error recovery
2.6 Object evolution
 
Lifetimes for persistent objects are typically very long.  It is likely 
that the implementation of an object's behavior or data structures will 
change over time.  Change is accommodated by the ZODB
4
 in a number of ways.  Changes in object methods are easily accommodated
 because classes are, for the most part, not stored in the object 
database.  Changes to class implementation are reflected in instances 
the next time an application is executed.  
 
Changes in data structures require some care.  Adding attributes to 
instances is straightforward if a default value can be provided in a 
class definition.  More complex data structure changes must be handled 
in 
__setstate__
 methods.  A 
__setstate__
 method can check for old state structures and convert them to new 
structures when an object's state is loaded from the database.
3. Architecture and features
 
This section presents a high-level architectural view of the ZODB and 
discusses several important features with their architectural impacts. A
 detailed UML model of the ZODB is provided by [Fulton99]. The 
architecture is shown in a layered representation in figure 
See Layered view of the ZODB architecture.
.
| Figure 2. Layered view of the ZODB architecture | 
 
Database connections are responsible for moving data to and from 
storage.  Transactions keep track of objects that have changed and 
coordinate commit and rollback of object changes.
 
A well-defined storage interface allows different storage managers, with
 varying levels of service, to be used to manage low-level object 
storage. The plug-able storage interface affords a great deal of 
flexibility for managing object data.  A basic file storage is provided 
with ZODB, but other storages are available or planned, including 
relational-database-based storages, dbm-file-based storages, and 
Berkely-DB-based storages.
 
Database, or DB, objects coordinate management of storages and database 
connections.  Applications use DB objects to define the storage to be 
used, to obtain database connections, and to perform administrative 
tasks, such as database maintenance.
3.1 Transactions and concurrency
 
A critical feature of the ZODB is transactions.  Transactions can be 
thought of as small programs that have two important features:
 
The ZODB supports multiple threads in an application that access the 
same persistent objects.  Each thread uses one or more database 
connections to access the database.  Each database connection has it's 
own copies of persistent objects. Application logic is expressed in 
object methods. Because each thread has it's own copies of persistent 
objects, access to an object's methods
5
 is limited to a single thread, and application logic can be written without concern for concurrent access.
 
The ZODB uses an optimistic time-stamp protocol.  Changes to individual 
object copies are made independently, so individual (copies of) objects 
do not need to be locked. Changes are synchronized when transactions are
 committed.
 
Only one transaction is permitted to commit to a storage at a time. If 
two threads modify the same object in multiple connections, one thread 
is guaranteed to commit first.  When the second thread commits, a 
ConflictError
 exception will be raised.  The application should catch conflict errors and re-execute transactions
6
.  When the transaction is re-executed, the states of the affected objects reflect changes made by the committed transactions.
 
Atomicity  greatly simplifies error handling, and is especially 
important for object-oriented applications because it enables 
information hiding. Without atomicity, application error recovery logic 
would need visibility to state of any objects with state that needs to 
be recovered.
 
The transaction manager in the ZODB implements a two-phase commit 
protocol that allows multiple databases to be used in the same 
application. This could include multiple ZODB databases and multiple 
relational databases.  In Zope, a transaction can effect data in the 
ZODB and data in one or more relational databases. For example, a 
transaction might update a Zope object and a row in an Oracle table.  If
 an error occurs, changes made to the ZODB and to the Oracle table will 
be rolled back.
3.1.1 Sub-transactions
 
The ZODB provides two levels of nested transactions.  Transactions may 
be subdivided into sub-transactions.  Sub-transactions can be committed 
and aborted without affecting the containing transaction.  For example, a
 transaction may abort a sub-transaction and continue execution.  Any 
changes made in the sub-transaction are undone before execution 
proceeds.  Thus sub-transactions provide fine-grained error recovery.
 
Sub-transactions are commonly used to reduce memory consumption in 
transactions that modify many objects.  Changed objects cannot be 
deactivated and remain in memory until a transaction commits. With 
sub-transactions, objects can be committed and removed from memory 
without making the changes final, since the enclosing transaction may 
still be aborted.
3.1.2 Versions
 
Transactions can also participate in "versions".  Versions are similar 
to long-running transactions.  Changes can be committed to a version 
within the database.  Only users of that version see changes made in the
 version.  A version can be committed to the main database, or can be 
committed to other versions.
 
Versions provide a mechanism for making changes over a long period of 
time and to many objects.  Changes made in the version are not visible 
until they are committed and the changes made in a version can be easily
 discarded.  In Zope, this feature allows significant changes to be made
 to live web sites without effecting users of the sites.
3.2 Cache management
 
Each ZODB connection has an object cache that holds references to 
objects loaded into memory through the connection.  At various times, 
objects in the cache are inspected to see if they are referenced only by
 the cache, or haven't been accessed for a  period of time.  Objects 
that haven't been accessed in a long time are deactivated so that their 
state is freed.  Objects referenced only by the cache are removed from 
memory. Cache parameters can be set to control how aggressively objects 
are inspected and to control how recently objects must be accessed 
before they are deactivated.
3.3 Undo
 
Transactions may be undone, or rolled back after they are committed if 
the underlying storage supports "undo" by storing multiple object 
revisions.  The file storage provided with ZODB is an example of a 
storage that supports undo. When an object is modified, a new object 
record is appended to the data file and old object revisions are 
retained.
4. Status
 
ZODB 3.0 was released as part of the Zope 2.0 release in September of 
1999.  ZODB 3.0 added a number of significant features over earlier ZODB
 releases, most notably:
- Support for concurrent threads of execution 7 ,
- Well-defined storage interface with integrated transaction support,
- Two-phase commit,
- Integrated versions and sub-transactions,
5. Summary
 
The ZODB provides an object-oriented database for Python that provides a
 high-degree of transparency.  Applications can take advantage of object
 database features with few, if any, changes to application logic.  With
 the exception of "root" objects, it isn't necessary to query or update 
objects through database interactions.  Objects are obtained and updated
 through normal object interactions.  A plug-able storage interface 
provides a great deal of flexibility for managing data. Transactions can
 be undone in a way that maintains transaction integrity.  An object 
cache provides high-performance, efficient memory usage, and protection 
from memory leaks due to circular references.
6. References
 
Fulton96, Fulton, James L., 1996, Extension Classes, Python Extension 
Types Become Classes, http://www.digicool.com/releases/ExtensionClass.
 
Fulton99, Fulton, James L., 1999, Zope Object Database Version 3 UML model, http://www.zope.org/Documentation/Models/ZODB.
 
No comments:
Post a Comment