Overview
This series of posts is my attempt to familiarize myself with some of the databases that originated in the Clojure ecosystem over the last several years. The first will be Datomic.
The dataset I'll be using for investigation is the daily energy offers published by the Electricity Authority in New Zealand. To describe the set in short, it allows sellers of energy and reserves to make offers to supply the NZ wholesale market.
The process I have used is to load the data first into a local table-based relational store that I can use as a point of comparison. After that, I built up a Datomic model, loaded the data, and tried to produce the same query results as I could with SQL.
Datomic also describes itself as a relational store, just not a table-based one. It stores data in Entity, Attribute, Value (EAV) triples.
- An entity represents something you describe in the database—a person, a product, an event, etc.
- An attribute is a property or characteristic of the entity. Attributes describe specific facts about the entity, such as "name," "age," or "location."
- The value is the specific piece of data associated with the attribute for that entity. It can be any data type, such as strings, numbers, dates, or even references to other entities.
The Offers CSV Data Set
For this example, we'll use the daily dataset from 2024-06-10, it's 41 megs and 271k lines.
I prefer to use DuckDB for my local SQL-based analysis it's easy to create and load a table from a CSV set.
CREATE TABLE offers AS SELECT * FROM '/home/jonesn/dev/arachnid/energynz/resources/20240610_Offers.csv';
DuckDB loads and interprets the dataset as defined by the CREATE TABLE statement below, this is a pretty good effort for a 1 liner. The ramp-up attributes should be numeric as well but this is enough to execute the queries we'll want to compare with Datomic.
CREATE TABLE offers
(
TradingDate DATE,
TradingPeriod BIGINT,
ParticipantCode VARCHAR,
PointOfConnection VARCHAR,
Unit VARCHAR,
ProductType VARCHAR,
ProductClass VARCHAR,
ReserveType VARCHAR,
ProductDescription VARCHAR,
UTCSubmissionDate DATE,
UTCSubmissionTime TIME,
SubmissionOrder BIGINT,
IsLatestYesNo VARCHAR,
Tranche BIGINT,
MaximumRampUpMegawattsPerHour VARCHAR,
MaximumRampDownMegawattsPerHour VARCHAR,
PartiallyLoadedSpinningReservePercent VARCHAR,
MaximumOutputMegawatts VARCHAR,
ForecastOfGenerationPotentialMegawatts VARCHAR,
Megawatts DOUBLE,
DollarsPerMegawattHour DOUBLE
);
Datomic Setup
I added this section because I downloaded the pro bundle instead of using the embedded local option described in the official doco. The main reason for this was it gave me access to the console app that is part of the pro bundle. I found that was useful for insight into Datomic's structure when starting from zero knowledge.
The tmux config I use to start a transactor and the console are listed below.
windows:
- editor:
layout: main-vertical
panes:
- /home/jonesn/dev/tools/idea/bin/idea.sh /home/jonesn/dev/arachnid/energynz
- db:
layout: even-horizontal
panes:
- /home/jonesn/dev/tools/datomic/bin/transactor config/dev-transactor-template.properties
- /home/jonesn/dev/tools/datomic/bin/console -p 8080 local datomic:dev://localhost:4334/
The Schema
Datomic requires a schema before data is loaded. I enjoyed this step as I frequently find I'd like the structure of my domain defined somewhere when I'm using Clojure. I've tried the code-only models like Spec but this feels more scalable with a natural delineation between the formal data model in the database versus the malleable operations on that model in code.
One small issue I have is being forced to serialize through the legacy Java date API (:db.type/instant). I'm always going to want to convert to the java.time variants so every project will require conversion helper functions.
So the above table maps to the Datomic "offers" entity below.
(def offers-schema [{:db/ident :offers/trading-date
:db/valueType :db.type/instant
:db/cardinality :db.cardinality/one
:db/doc "The trading date of a transaction. NZT (Pacific/Auckland)."}
{:db/ident :offers/trading-period
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one
:db/doc "The trading period, with a value from 1 to 48. (50 on Fall Back, 46 on Spring Forward)"}
{:db/ident :offers/tranche
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one
:db/doc "The tranche or segment of the offer. Each tranche represents a portion of the total offer at a specific price point, allowing participants to offer varying quantities of electricity at different prices."}
{:db/ident :offers/participant
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The market participant making the offer."}
{:db/ident :offers/point-of-connection
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The grid connection point making the offer."}
{:db/ident :offers/submission-order
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one
:db/doc "The order in which the offers were submitted."}
{:db/ident :offers/product-type
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The product type, for example 'ENERGY' or 'RESERVE_FIR'."}
{:db/ident :offers/unit
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The generation unit."}
{:db/ident :offers/product-description
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The description of the product. Like 'Fast instantaneous reserves interruptible load'."}
{:db/ident :offers/product-class
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The classification of the product. 'Injection' or 'Off Take'."}
{:db/ident :offers/reserve-type
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "The reserve type being offered, maybe null, Fast Instantaneous Reserve (FIR)
and Sustained Instantaneous Reserve (SIR). Specifically, FIR is required to respond
within one second of the frequency falling to 49.2 Hz and must be sustained for a
minimum of 60 seconds. SIR must be provided within 60 seconds after the onset
of an under-frequency event and can last for several minutes to stabilize the grid."}
{:db/ident :offers/utc-submission-date
:db/valueType :db.type/instant
:db/cardinality :db.cardinality/one
:db/doc "The submission date of the trade in UTC."}
{:db/ident :offers/utc-submission-time
:db/valueType :db.type/instant
:db/cardinality :db.cardinality/one
:db/doc "The submission time of the trade in UTC."}
{:db/ident :offers/is-latest
:db/valueType :db.type/boolean
:db/cardinality :db.cardinality/one
:db/doc "Indicates if the submission is the latest one (Yes/No)."}
{:db/ident :offers/tranche
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one
:db/doc "The tranche or segment of the offer. Each tranche represents a portion of the total offer at a specific price point, allowing participants to offer varying quantities of electricity at different prices."}
{:db/ident :offers/max-ramp-up-mw-per-hour
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The maximum rate at which the generation can ramp up, in megawatts per hour."}
{:db/ident :offers/max-ramp-down-mw-per-hour
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The maximum rate at which the generation can ramp down, in megawatts per hour."}
{:db/ident :offers/plsr-percent
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The percentage of the unit's capacity that is reserved for partially loaded spinning reserve."}
{:db/ident :offers/max-output-mw
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The maximum output capacity of the generating unit, in megawatts."}
{:db/ident :offers/forecast-generation-potential-mw
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The forecasted potential generation, in megawatts."}
{:db/ident :offers/megawatts
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The actual amount of power offered or generated, in megawatts."}
{:db/ident :offers/dollars-per-mwh
:db/valueType :db.type/bigdec
:db/cardinality :db.cardinality/one
:db/doc "The price in dollars per megawatt-hour for the offered energy or reserve."}])
Once the schema is loaded it can be viewed via the Datomic console in the schemas window (localhost:8080/browse)
Unique Keys
I wanted to define what made a row unique, this ended up being a large composite key. To enforce the uniqueness I tried to use the tuple type, this was working quite well but after 8 attributes in the key I got the error below saying that was the maximum the tuple could hold. At this point, I stopped and decided the dataset wasn't well suited to a composite key. I wouldn't use one this large in SQL either. This reflects more on how the data is presented in the CSV file than any limitation with Datomic.
{:db/ident :offers/trading-date-period
:db/valueType :db.type/tuple
:db/tupleAttrs [:offers/trading-date :offers/trading-period :offers/tranche :offers/participant :offers/point-of-connection :offers/product-type :offers/product-description :offers/unit :offers/submission-order]
:db/cardinality :db.cardinality/one
:db/unique :db.unique/identity
:db/doc "Composite key of trading date, trading period, participant, connection point, tranche and submission order."}
;; On Transact
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/deserialize-exception (error.clj:167).
:db.error/datoms-conflict Two datoms in the same transaction conflict
{:d1 [17592186046752 :offers/tranche 4 13194139534313 true],
:d2 [17592186046752 :offers/tranche 5 13194139534313 true]}
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/deserialize-exception (error.clj:167).
:db.error/invalid-tuple-value Invalid tuple value
Populating the Data
As mentioned above the create table statement in DuckDB both created the table and loaded the CSV data. So to catch up the next step is to load the data into Datomic and then compare the number of records with SQL.
Most of the work is converting from a CSV record to a Datomic fact matching our schema.
(defn csv-row->datomic-fact
[csv-row]
{:offers/trading-date (legacy-instant-from-zdt (nzd-to-ztd (:TradingDate csv-row))) ;; This needs to be an instant
:offers/trading-period (as-int (:TradingPeriod csv-row))
:offers/participant (:ParticipantCode csv-row)
:offers/point-of-connection (:PointOfConnection csv-row)
:offers/unit (:Unit csv-row)
:offers/product-type (product-type (:ProductType csv-row) (:ReserveType csv-row))
:offers/product-class (:ProductClass csv-row)
:offers/reserve-type (:ReserveType csv-row)
:offers/product-description (reserves-description (:ProductDescription csv-row))
:offers/submission-order (as-int (:SubmissionOrder csv-row))
:offers/is-latest (as-bool (:IsLatestYesNo csv-row))
:offers/tranche (as-int (:Tranche csv-row))
:offers/max-ramp-up-mw-per-hour (as-bigdec (:MaximumRampUpMegawattsPerHour csv-row))
:offers/max-ramp-down-mw-per-hour (as-bigdec (:MaximumRampDownMegawattsPerHour csv-row))
:offers/plsr-percent (as-bigdec (:PartiallyLoadedSpinningReservePercent csv-row))
:offers/max-output-mw (as-bigdec (:MaximumOutputMegawatts csv-row))
:offers/forecast-generation-potential-mw (as-bigdec (:ForecastOfGenerationPotentialMegawatts csv-row))
:offers/megawatts (as-bigdec (:Megawatts csv-row))
:offers/dollars-per-mwh (as-bigdec (:DollarsPerMegawattHour csv-row))})
As a single transaction, the 271k facts were too much for the dev-local storage and the transaction timed out. I've collected the error messages here for reference. To work around this I created a helper function that would chunk the data into blocks of 10k facts.
Execution error (ExceptionInfo) at datomic.error/raise (error.clj:70).
:db.error/transaction-timeout Transaction timed out.
(defn transact-block
[conn s-m-realized-datomic-facts]
(t/log! :info ["Transacting sequence of [" (count s-m-realized-datomic-facts) "] facts."])
@(d/transact-async conn s-m-realized-datomic-facts))
(defn transact-facts
[conn s-realized-datomic-facts]
(let [transaction-blocks (partition-all 10000 s-realized-datomic-facts)]
(t/log! :info ["Splitting [" (count s-realized-datomic-facts) "] and Processing [" (count transaction-blocks) "] blocks of transactions."])
(doseq [s transaction-blocks]
(transact-block conn s))))
Equivalent Queries
This section is a set of comparisons between SQL and the equivalent datalog queries. This proved more fiddly than I was expecting, the AI gods struggled with conversions from SQL to Datalog. To the point where it wasn't worth using.
Two points I want to call out when parsing below:
- The SQL "O.ReserveType IN NULL" filter is replaced by the Datomic filter condition: "[?e :offers/product-type "ENERGY"]". This is due to an enrichment at load time in the Datomic version of the loader because it doesn't allow NULL values.
- The other is the importance of the ":with" clause in the aggregate query. It seemed to force Datomic to consider all the entities (Think rows).
;; COUNT(*)
;; ========
;; SQL
SELECT COUNT(*) FROM offers;
271734
;; Datomic - Datalog query that finds the count of all entities with the attribute :offers/trading-period.
[:find (count ?e)
:where
[?e :offers/trading-period]]
271734
;; Select a Single Row/Entity
;; ==========================
;; SQL
SELECT ParticipantCode,
PointOfConnection,
TradingDate,
TradingPeriod,
Tranche,
IsLatestYesNo,
Megawatts,
DollarsPerMegawattHour
FROM MAIN.OFFERS o
WHERE IsLatestYesNo = 'Y'
AND ParticipantCode = 'CTCT'
AND PointOfConnection = 'CYD2201'
AND o.ReserveType IS NULL
AND TradingPeriod = 24
AND tranche = 1;
Name |Value |
----------------------+----------+
ParticipantCode |CTCT |
PointOfConnection |CYD2201 |
TradingDate |2024-06-10|
TradingPeriod |24 |
Tranche |1 |
IsLatestYesNo |Y |
Megawatts |185.0 |
DollarsPerMegawattHour|0.01 |
;; Datomic
[:find ?participant
?point-of-connection
?trading-date
?trading-period
?tranche
?is-latest
?megawatts
?dollars-per-mwh
:where
[?e :offers/participant "CTCT"]
[?e :offers/point-of-connection "CYD2201"]
[?e :offers/trading-period 24]
[?e :offers/tranche 1]
[?e :offers/is-latest true]
[?e :offers/product-type "ENERGY"]
[?e :offers/participant ?participant]
[?e :offers/point-of-connection ?point-of-connection]
[?e :offers/trading-date ?trading-date]
[?e :offers/trading-period ?trading-period]
[?e :offers/tranche ?tranche]
[?e :offers/is-latest ?is-latest]
[?e :offers/megawatts ?megawatts]
[?e :offers/dollars-per-mwh ?dollars-per-mwh]]
;; By default it returns a vector
;; #{["CTCT" "CYD2201" #inst"2024-06-09T12:00:00.000-00:00" 24 1 true 185.000M 0.01M]}
;; Use Pull syntax to get a map
'[:find (pull ?e [:offers/participant
:offers/point-of-connection
:offers/trading-date
:offers/trading-period
:offers/tranche
:offers/is-latest
:offers/megawatts
:offers/dollars-per-mwh])
:where
[?e :offers/participant "CTCT"]
[?e :offers/point-of-connection "CYD2201"]
[?e :offers/trading-period 24]
[?e :offers/tranche 1]
[?e :offers/is-latest true]
[?e :offers/product-type "ENERGY"]]
[[#:offers{:participant "CTCT",
:point-of-connection "CYD2201",
:trading-date #inst"2024-06-09T12:00:00.000-00:00",
:trading-period 24,
:tranche 1,
:is-latest true,
:megawatts 185.000M,
:dollars-per-mwh 0.01M}]]
;; ===============
;; Aggregate Query
;; ===============
SELECT o.ParticipantCode,
ROUND(SUM(o.Megawatts), 2) megawatts_offered,
MIN(o.DollarsPerMegawattHour) min_offer_dollar,
MAX(o.DollarsPerMegawattHour) max_offer_dollar,
ROUND(AVG(o.DollarsPerMegawattHour), 2) avg_offer_dollar
FROM offers o
WHERE o.IsLatestYesNo = 'Y'
AND o.ReserveType IS NULL
GROUP BY o.ParticipantCode
ORDER BY 1
participant_code|megawatts_offered|min_offer_dollar|max_offer_dollar|avg_offer_dollar|
----------------+-----------------+----------------+----------------+----------------+
ALNT | 3072.0| 0.0| 0.0| 0.0|
CHHE | 2016.0| 0.0| 5.0| 2.8|
CNIR | 22519.0| 0.0| 5005.0| 775.38|
CTCT | 86208.58| 0.0| 6500.0| 1822.4|
GENE | 74258.4| 0.0| 4999.07| 1118.52|
KING | 468.0| 0.0| 325.0| 61.21|
LODE | 2304.0| 0.0| 0.01| 0.0|
MERI | 132192.0| 0.0| 500.0| 107.6|
MRPL | 71734.2| 0.0| 5000.03| 1047.66|
NAPJ | 6603.0| 0.0| 5000.03| 4000.01|
NGAG | 1488.0| 0.01| 1000.0| 217.0|
NPEL | 15.34| 0.0| 1000.0| 109.81|
NZWF | 2328.0| 0.0| 0.01| 0.0|
PUNZ | 616.28| 0.0| 0.0| 0.0|
TAOM | 983.0| 0.0| 200.0| 46.4|
TODD | 7456.5| 0.0| 1500.0| 300.0|
TUAR | 3883.0| 0.0| 5000.03| 4000.01|
;; Datomic
(def aggregate-participant-q
'[:find ?participant_code
(sum ?megawatts_offered)
(min ?dollars_per_mwh)
(max ?dollars_per_mwh)
(avg ?dollars_per_mwh)
:with ?offer
:where
[?offer :offers/participant ?participant_code]
[?offer :offers/megawatts ?megawatts_offered]
[?offer :offers/dollars-per-mwh ?dollars_per_mwh]
[?offer :offers/is-latest true]
[?offer :offers/product-type "ENERGY"]])
[["ALNT" 3072.000M 0.00M 0.00M 0.0]
["CHHE" 2016.000M 0.00M 5.00M 2.8]
["CNIR" 22519.000M 0.00M 5005.00M 775.3752719298246]
["CTCT" 86208.579M 0.00M 6500.00M 1822.3974270833335]
["GENE" 74258.400M 0.00M 4999.07M 1118.5210448717949]
["KING" 468.000M 0.00M 325.00M 61.208333333333336]
["LODE" 2304.000M 0.00M 0.01M 0.002]
["MERI" 132192.000M 0.00M 500.00M 107.60357142857143]
["MRPL" 71734.200M 0.00M 5000.03M 1047.6626517857142]
["NAPJ" 6603.000M 0.00M 5000.03M 4000.0133333333333]
["NGAG" 1488.000M 0.01M 1000.00M 217.002]
["NPEL" 15.344M 0.00M 1000.00M 109.80920833333333]
["NZWF" 2328.000M 0.00M 0.01M 0.002]
["PUNZ" 616.280M 0.00M 0.00M 0.0]
["TAOM" 983.000M 0.00M 200.00M 46.4]
["TODD" 7456.500M 0.00M 1500.00M 300.004]
["TUAR" 3883.000M 0.00M 5000.03M 4000.012]]
Inequivalent Queries
As I was exploring Datomic and building some of the queries I noticed it was set orientated by default and must be optimized for it via indexes on each of the elements in its triple. I will demonstrate with a couple of queries but basically in SQL a SELECT a operation will return the value a of all rows in the database. Datomic operates like SELECT DISTINCT i.
SELECT o.ParticipantCode participant_code FROM offers o
/
participant_code|
----------------+
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
WNET |
.... to 271k rows
SELECT DISTINCT o.ParticipantCode participant_code
FROM offers o
ORDER BY 1
/
participant_code|
----------------+
ALNT |
CHHE |
CNIR |
CNTP |
CTCT |
ERGY |
GENE |
KING |
LODE |
MERI |
MRPL |
NAPJ |
NGAG |
NPEL |
NTHP |
NZWF |
PUNZ |
SZRO |
TAOM |
TODD |
TUAR |
VCTR |
VWEN |
WNET |
;; Datomic Default - Participant
[:find ?participant_code
:where
[?e :offers/participant ?participant_code]]
#{["LODE"]
["VWEN"]
["VCTR"]
["MRPL"]
["SZRO"]
["NAPJ"]
["NTHP"]
["MERI"]
["CNIR"]
["NPEL"]
["WNET"]
["TAOM"]
["ALNT"]
["GENE"]
["PUNZ"]
["KING"]
["TUAR"]
["CNTP"]
["NGAG"]
["TODD"]
["CTCT"]
["ERGY"]
["NZWF"]
["CHHE"]}
Summary
It was a long-winded process to get up and running with Datomic, but I appreciated some of what it offered. It is not a quick hack tool, it feels orientated to well-thought-out, long-life data storage. Which I guess is what you want a database for :-).
Pros
- I like that transactions are first class. I guess they are in traditional relational databases but it's really in your face here.
- I like it has an upfront schema, you can't just toss in whatever you like.
- No nulls.
- The EAV triple model seems interesting but I don't fully grasp the implications yet.
- Database as at a time.
Cons
- The tooling is rough compared to traditional relational stores.
- Still blessed with some opaque Clojure error messages. Especially when parsing queries.
- Building queries with sets and lists. I guess that makes it nice for building dynamic queries but I don't find that is something I need to do often.