6 September 2012

Modelling dates in Neo4j

Whether you're dealing with people, products or places, it's generally reasonable to assume that the information you store in your database in some way involves dates and times. Since no native date or time data types are supported by Neo4j, the question of how best to store this information has come up several times in the community mailing list. The general answer to this question has been to use a numeric or textual property value and each of these styles obviously has individual pros and cons. Focusing simply on date information, this article discusses some of the options which can be used, their relative advantages and disadvantages, and introduces a uniquely graph-based solution.

Arguably the simplest approach is to use a textual property value to hold a date. There are of course numerous ways to format a date internationally but the widely accepted common format for computer systems is ISO 8601. For those unfamiliar with this format, the date components are represented in descending order of magnitude, separated by dashes. This means that Christmas Day, 2000 becomes "2000-12-25". This method has the distinct advantage of being both human-readable and incredibly portable although is subject to performance concerns when comparisons and sorting are involved: string comparisons are notoriously more resource intensive than numeric comparisons. In addition, some level of validation is required to ensure that erroneous values are not stored, be they completely invalidly formatted, such as "2000/12/25th" or simply outside of permitted ranges, such as "2000-12-47".

A natural twist on the ISO 8601 format is to simply represent the same date components within an eight digit integer. Here, "2000-12-25" becomes 20001225 and this of course brings benefits in both sorting and validation. While invalid formats can no longer be represented, out of range values still can (20001247). There is also a temptation to try to perform arithmetic on integer values but clearly 20010131 – 20001225 does not represent the number of days between Christmas Day 2000 and the end of January the following year. This model is still of limited use.

An alternative numeric representation is the Unix timestamp. This value stores the total amount of elapsed time between 1970-01-01 (the Unix epoch) and the date represented and, by doing so, allows date arithmetic to be carried out correctly. The only real downside to this representation is a lack of human readabilty and potential portability issues (although since most computer systems have recognised Unix timestamps for a long time, this issue is probably completely academic).

Within Neo4j however, we have another option for storing date information. We can give each individual date value its own node and hold these together in a hierarchical subgraph to give them meaning. As a very small example, the three days between 24th and 26th December inclusive could be held in the following tree:


Not only does this allow us to give meaningful structure to the dates themselves but it provides a facility by which we can represent date-bound events using a relationship instead of a property. We could indicate that a person was born on a particular day using a "BORN_ON" type relationship or show when an order was completed with a "COMPLETED_ON" relationship. The temporal information becomes part of the shape and structure of the graph rather than simply being a flat value.

With this mechanism we can of course also store extra information within the calendar tree. We may choose to store Unix timestamps as properties within the "day" nodes or assign names to days or months. We could also link certain days to a dedicated "weekend" or "bank holiday" node so that such days become easier to find and for sparsely populated calendars (such as those within a family tree) we only have to store the dates in which we are interested.

Building and using a calendar tree of this type in Neo4j has become somewhat easier with the introduction of Cypher's CREATE UNIQUE (née RELATE) clauses. This can ensure that we are reusing entities whenever possible and, with the new wrapper function get_or_create_path now available in py2neo 1.3.2, implementation becomes trivial.

The code below describes a simple GregorianCalendar class containing a method to select individual date nodes. These nodes can then be used as relationship endpoints for any date information.
class GregorianCalendar(object):

    def __init__(self, graph_db, calendar_index_name="CALENDARS"):
        # store a handle to the graph database we are working on
        self._graph_db = graph_db
        # retain a handle to the root node for this calendar
        self._calendar = self._graph_db.get_or_create_indexed_node(
            calendar_index_name, "scheme", "Gregorian",
        )
    
    def day(self, year, month, day):
        # fetch the full date path from the calendar root through the YEAR,
        # MONTH and DAY relationships, building this path if necessary; for
        # example:
        #   (CAL)-[:YEAR]->(2000)-[:MONTH]->(12)-[:DAY]->(25)
        date_path = self._calendar.get_or_create_path(
            ("YEAR",  {"number": year}),
            ("MONTH", {"number": month}),
            ("DAY",   {"number": day}),
        )
        # return the last node of the date path, i.e. the node
        # representing the "day"
        return date_path.nodes[-1]

While a calendar subgraph may not be the optimal approach for all date storage, it can certainly be useful in many situations and offers a lot of potential for expansion both with the properties stored against each date and with options to represent further date-related concepts such as weeks or quarters.

3 comments:

KC said...

Good job !

Stephen Souness said...

Do you think it would be a good approach to use the node approach and include a unix style timestamp as a property of the day node?

Nigel Small said...

Absolutely. If you use unix timestamps within your application then this is exactly the right place to put it. Presumably you would store the "midnight at start" value in there.