---
title: User's Guide
---

If you are new to [HDF5], begin by reading about the [data model and structure]
of an HDF5 file in the [HDF5 User's Guide].

If you are new to [LuaJIT], read about [C data structures] that are used to
pass data between a program and the HDF5 library.

[HDF5]: http://www.hdfgroup.org/HDF5/doc/
[HDF5 User's Guide]: http://www.hdfgroup.org/HDF5/doc/UG/
[LuaJIT]: http://luajit.org/
[C data structures]: http://luajit.org/ext_ffi.html#cdata
[data model and structure]: http://www.hdfgroup.org/HDF5/doc/UG/03_DataModel.html
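
The code examples in this guide assume that the LuaJIT FFI library and the
HDF5 binding have been loaded; a minimal preamble, assuming the binding is
loaded with `require("hdf5")`, might look as follows:

~~~ {.lua}
-- load the LuaJIT FFI library and the HDF5 binding
local ffi = require("ffi")
local hdf5 = require("hdf5")
~~~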


Datasets
--------

In this example we store particle positions in a dataset.

We begin by filling an array of N vectors, each with 3 components:

~~~ {.lua}
local N = 1000
local pos = ffi.new("struct { double x, y, z; }[?]", N)
math.randomseed(42)
for i = 0, N - 1 do
    pos[i].x = math.random()
    pos[i].y = math.random()
    pos[i].z = math.random()
end
~~~

Next we create an empty HDF5 file:

~~~ {.lua}
local file = hdf5.create_file("dataset.h5")
~~~

An HDF5 file has a hierarchical structure analogous to that of a filesystem.
Where a filesystem consists of directories that contain subdirectories and
files, an HDF5 file consists of groups that contain subgroups and datasets.
Within an HDF5 file, data is stored in datasets and attributes. Datasets hold
large amounts of data and may be read and written partially. Attributes hold
small pieces of metadata attached to groups or datasets and may be read and
written only as a whole.
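
As a brief aside, not part of the running example, a group may also be
created explicitly, much like a directory; the group name `observables`
below is chosen only for illustration:

~~~ {.lua}
-- aside: create a group explicitly, analogous to creating a directory
local group = file:create_group("observables")
group:close()
~~~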

A dataset is described by a location in the file, a datatype and a dataspace:

~~~ {.lua}
local datatype = hdf5.double
local dataspace = hdf5.create_simple_space({N, 3})
local dataset = file:create_dataset("particles/solvent/position", datatype, dataspace)
~~~

The location of the dataset is specified as a path relative to the file root;
its absolute path is `/particles/solvent/position`. The group `/particles` and
the subgroup `/particles/solvent` are implicitly created since they do not
exist yet. We pass the datatype `hdf5.double` to store data in double-precision
floating point format. The dataspace describes the dimensions of the dataset.
A dataset may be created with fixed or variable-size dimensions. We choose a
dataspace of fixed dimensions N×3 equal to the dimensions of the array.


Next we write the contents of the `pos` array to the dataset:

~~~ {.lua}
dataset:write(pos, datatype, dataspace)
~~~

Note that datatype and dataspace are also passed to the `write` function.
These tell the HDF5 library about the memory layout of the `pos` array, which
happens to match the file layout of the dataset. For advanced use cases file
and memory layouts may differ. For example, the data could be stored in memory
with double precision and stored in the file with single precision to halve
disk I/O, in which case the HDF5 library would automatically convert between
double and single precision when writing the data.
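
As a sketch of that last point, not part of the running example and with the
made-up path `position_single`, the dataset could be created with a
single-precision file datatype while the `pos` array keeps double precision
in memory:

~~~ {.lua}
-- sketch: the file datatype is single precision, while the memory datatype
-- describing `pos` remains double precision; the HDF5 library converts the
-- values while writing
local dataset_single = file:create_dataset("particles/solvent/position_single",
    hdf5.float, dataspace)
dataset_single:write(pos, hdf5.double, dataspace)
dataset_single:close()
~~~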

We close the HDF5 file, which flushes the written data to disk:

~~~ {.lua}
dataset:close()
file:close()
~~~

Any open objects in the file, which include groups, datasets, attributes
and *committed* datatypes, must be closed before closing the file itself;
otherwise, the HDF5 library will raise an error when attempting to close
the file.

To inspect the HDF5 file, we use the `h5dump` tool:

~~~
h5dump dataset.h5
~~~

To show only the metadata of the file:

~~~
h5dump -A dataset.h5
~~~

~~~
HDF5 "dataset.h5" {
GROUP "/" {
GROUP "particles" {
GROUP "solvent" {
DATASET "position" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 1000, 3 ) / ( 1000, 3 ) }
}
}
}
}
}
~~~

The dimensions of the dataset are shown twice, as the current dimensions and
the maximum dimensions. A dataset may be resized at any point in time, as
long as the new dimensions fit within the maximum dimensions specified upon
creation. The maximum dimensions may further be specified as variable, in
which case the dataset may be resized to arbitrary dimensions.
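
The following sketch, not part of this example, illustrates a resizable
dataset; the path `position_resizable` and the chunk dimensions are chosen
only for illustration, and a chunked layout is required, as explained in
the Dataspaces section below:

~~~ {.lua}
-- sketch: the dataset starts with N rows and may grow without bound in the
-- first dimension (nil marks a variable maximum size); resizable datasets
-- require a chunked layout
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({100, 3})
local space = hdf5.create_simple_space({N, 3}, {nil, 3})
local dset = file:create_dataset("particles/solvent/position_resizable",
    hdf5.double, space, nil, dcpl)
dset:set_extent({2 * N, 3})
dset:close()
~~~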

Next we open the HDF5 file and the dataset for reading:

~~~ {.lua}
local file = hdf5.open_file("dataset.h5")
local dataset = file:open_dataset("particles/solvent/position")
~~~

We retrieve and verify the dimensions of the dataset:

~~~ {.lua}
local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local N, D = dims[1], dims[2]
assert(N == 1000)
assert(D == 3)
~~~

We read the data from the dataset into a newly allocated array:

~~~ {.lua}
local pos = ffi.new("struct { float x, y, z; }[?]", N)
local memtype = hdf5.float
local memspace = hdf5.create_simple_space({N, 3})
dataset:read(pos, memtype, memspace)
dataset:close()
file:close()
~~~

The array stores elements with single precision, while we created the dataset
with double precision. By passing a different memory datatype to the `read`
function, we instruct the HDF5 library to convert from double precision to
single precision when reading.

Finally we verify the data read from the dataset:

~~~ {.lua}
math.randomseed(42)
for i = 0, N - 1 do
    assert(math.abs(pos[i].x - math.random()) < 1e-7)
    assert(math.abs(pos[i].y - math.random()) < 1e-7)
    assert(math.abs(pos[i].z - math.random()) < 1e-7)
end
~~~

The values are compared with a tolerance corresponding to single precision.


Attributes
----------

In this example we store a sequence of strings in an attribute.

We create an empty HDF5 file with a group `/molecules`:

~~~ {.lua}
local file = hdf5.create_file("attribute.h5")
local group = file:create_group("molecules")
~~~

To the group we attach an attribute for a given set of strings:

~~~ {.lua}
local species = {"H2O", "C8H10N4O2", "C6H12O6"}
local dataspace = hdf5.create_simple_space({#species})
local datatype = hdf5.c_s1:copy()
datatype:set_size(100)
local attr = group:create_attribute("species", datatype, dataspace)
~~~

Like a dataset, an attribute is described by a datatype and a dataspace.
The HDF5 library defines a special datatype `hdf5.c_s1` for C strings, which
needs to be copied and adjusted for the length of the strings to be written.
The size of a string datatype must be equal to or greater than the number of
characters of the longest string plus a terminating [null character].
We choose a size of 100 characters, including the null terminator, which
avoids having to determine the exact size at the expense of padding with
null characters.

[null character]: http://en.wikipedia.org/wiki/Null_character
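
Alternatively, the exact size could be computed from the strings themselves;
a small sketch, with the helper variable `maxlen` introduced only for
illustration:

~~~ {.lua}
-- sketch: size the string datatype to the longest string plus the
-- terminating null character, instead of the fixed size of 100
local maxlen = 0
for _, s in ipairs(species) do
    maxlen = math.max(maxlen, #s)
end
local exact_datatype = hdf5.c_s1:copy()
exact_datatype:set_size(maxlen + 1)
~~~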

Next we write the strings to the attribute:

~~~ {.lua}
attr:write(ffi.new("char[?][100]", #species, species), datatype)
attr:close()
group:close()
file:close()
~~~

For writing, the Lua strings in the `species` table are converted to an array
of C strings that match the string datatype.

We inspect the contents of the HDF5 file:

~~~
h5dump -A attribute.h5
~~~

~~~
HDF5 "attribute.h5" {
GROUP "/" {
GROUP "molecules" {
ATTRIBUTE "species" {
DATATYPE H5T_STRING {
STRSIZE 100;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "H2O", "C8H10N4O2", "C6H12O6"
}
}
}
}
}
~~~

The strings are shown with their actual lengths, and the padding null
characters are discarded.

Next we open the file, the group, and the attached attribute:

~~~ {.lua}
local file = hdf5.open_file("attribute.h5")
local group = file:open_group("molecules")
local attr = group:open_attribute("species")
~~~

We retrieve and verify the dimensions of the attribute:

~~~ {.lua}
local filespace = attr:get_space()
local dims = filespace:get_simple_extent_dims()
local N = dims[1]
assert(N == #species)
~~~

We allocate an array for the strings and read the attribute:

~~~ {.lua}
local buf = ffi.new("char[?][50]", N)
local memtype = hdf5.c_s1:copy()
memtype:set_size(50)
attr:read(buf, memtype)
attr:close()
group:close()
file:close()
~~~

The size of the memory datatype (50 characters) need not match the size of the
file datatype (100 characters). If a string exceeds the size of the memory
datatype, the HDF5 library truncates and terminates the string safely with a
null character.

Finally we verify the data read from the attribute:

~~~ {.lua}
for i = 0, N - 1 do
    assert(ffi.string(buf[i]) == species[i + 1])
end
~~~

Using [ffi.string], the C strings are converted to Lua strings for comparison.

*Array indices begin at 0, and table indices begin at 1.*

[ffi.string]: http://luajit.org/ext_ffi_api.html#ffi_string


Dataspaces
----------

In this example we use hyperslab selections to store time-dependent data in a dataset.

We simulate a system of particles over a number of steps in which the particles
move randomly in 3‑dimensional space.

~~~ {.lua}
local N = 4200
local nstep = 100
~~~

We begin by creating a dataset for the particle positions:

~~~ {.lua}
local file = hdf5.create_file("dataspace.h5")
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({1, 1000, 3})
dcpl:set_deflate(6)
local dataspace = hdf5.create_simple_space({0, N, 3}, {nil, N, 3})
local dataset = file:create_dataset("particles/solvent/position/value", hdf5.double, dataspace, nil, dcpl)
~~~

Here we specify the dataspace of the dataset using two sets of dimensions.
The dimensions `{0, N, 3}` specify the sizes of the dataset after creation.
The dimensions `{nil, N, 3}` specify the maximum sizes to which the dataset
may be resized, where **nil** stands for a variable size in that dimension.
Here the dataset is empty after creation and resizable to an arbitrary number
of steps. When a dataset is resizable, i.e. when the current and maximum
dimensions of a dataset differ upon creation, the data is stored using a
chunked layout, rather than a contiguous layout as for a fixed-size dataset.
The size of the chunks needs to be specified upon creation of the dataset
using a [dataset creation property list].

[dataset creation property list]: reference.html#dataset-creation-properties

The choice of chunk size affects write and read performance, since each chunk
is written or read as a whole. Here we choose a chunk size of `{1, 1000, 3}`.
To write or read the positions of all 4200 particles at a given step, the HDF5
library needs to write or read 5 chunks altogether, of which the fifth chunk is
partially filled. Tracing the position of a single particle through all 100
steps, however, would require the HDF5 library to fetch 100 chunks.
As a general guideline, the chunk size in bytes should be well above 1 kB and
below 1 MB, and the chunk dimensions should match the write and read access
patterns as closely as possible.
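
As a quick check against that guideline, the chunk chosen here holds
1 × 1000 × 3 double-precision values:

~~~ {.lua}
-- size in bytes of one chunk of dimensions {1, 1000, 3} holding doubles
local chunk_bytes = 1 * 1000 * 3 * ffi.sizeof("double")
print(chunk_bytes)  -- 24000, i.e. about 24 kB
~~~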

Chunks may be individually compressed by specifying a compression algorithm and
level in the dataset creation property list. Here we use the [deflate]
algorithm with a compression level of 6, which is the default level of the gzip
program. Note that for this example compression is pointless, since we store
random data. Compression is sensible when the data within each chunk is
correlated.

[deflate]: https://en.wikipedia.org/wiki/DEFLATE
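
After running the example, the chunked layout and the deflate filter can be
inspected with h5dump, assuming the installed version supports the `-p`
option for printing dataset properties:

~~~
h5dump -p -H dataspace.h5
~~~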


After creating the dataset, we prepare a file dataspace for writing:

~~~ {.lua}
local filespace = hdf5.create_simple_space({nstep, N, 3})
filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})
~~~

The file dataspace specifies a selection within the dataset that data is
written to or read from. We select a 3‑dimensional slab from the file
dataspace that has its origin at `{0, 0, 0}` and a size of `{1, N, 3}`.
As shown in the figure below, at each step we extend the dimensions of the
dataset by one slab, and move the hyperslab selection by one in the dimension
of steps.

![Dataset dimensions with hyperslab selection at steps 1, 2, 3.](hyperslab.svg)

Similarly we prepare a memory dataspace for writing:

~~~ {.lua}
local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})
~~~

The memory dataspace specifies the dimensions of and a selection within the
array of particle positions. Each 3‑dimensional vector array element is padded
with an unused fourth component to ensure properly aligned memory access. By
selecting a hyperslab of dimensions `{N, 3}` from the memory dataspace of
dimensions `{N, 4}`, we tell the HDF5 library to discard the unused fourth
component of each vector.
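
For comparison, here is a sketch of the alternative without padding, which is
not used in this example: a tightly packed array whose memory dataspace
matches the data exactly and therefore needs no hyperslab selection.

~~~ {.lua}
-- sketch: tightly packed vectors; the memory dataspace matches the array
-- exactly, so no hyperslab selection is needed
local pos_packed = ffi.new("struct { double x, y, z; }[?]", N)
local memspace_packed = hdf5.create_simple_space({N, 3})
~~~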

We conclude with a loop that writes the particle positions at each step:

~~~ {.lua}
math.randomseed(42)
for step = 1, nstep do
    dataset:set_extent({step, N, 3})
    for i = 0, N - 1 do
        pos[i].x = math.random()
        pos[i].y = math.random()
        pos[i].z = math.random()
    end
    dataset:write(pos, hdf5.double, memspace, filespace)
    filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()
~~~

Before writing the positions, we append an additional slab of dimensions
`{1, N, 3}` to the dataset by extending its dimensions to `{step, N, 3}`.
We pass the memory dataspace to select the region for reading from the array
and the file dataspace to select the region for writing to the dataset.
Recall that the file dataspace has rank 3 and the memory dataspace has rank 2;
the ranks of the file and memory dataspaces may differ as long as the number
of selected elements is equal. After writing the positions, we move the
origin of the hyperslab selection within the file dataspace to offset
`{step, 0, 0}` to prepare for the next step.

At the end of the simulation, we inspect the metadata of the HDF5 file:

~~~
h5dump -A dataspace.h5
~~~

~~~
HDF5 "dataspace.h5" {
GROUP "/" {
GROUP "particles" {
GROUP "solvent" {
GROUP "position" {
DATASET "value" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 100, 4200, 3 ) / ( H5S_UNLIMITED, 4200, 3 ) }
}
}
}
}
}
}
~~~

To check the stored particle positions, we open the dataset for reading:

~~~ {.lua}
local file = hdf5.open_file("dataspace.h5")
local dataset = file:open_dataset("particles/solvent/position/value")
~~~

We retrieve and verify the dimensions of the dataset:

~~~ {.lua}
local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local nstep, N, D = dims[1], dims[2], dims[3]
assert(nstep == 100)
assert(N == 4200)
assert(D == 3)
~~~

For reading individual slabs, we can use the dataspace obtained from the dataset:

~~~ {.lua}
filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})
~~~

As above we prepare a memory dataspace that describes the array:

~~~ {.lua}
local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})
~~~

We read the slabs for each step and compare the positions for equality:

~~~ {.lua}
math.randomseed(42)
for step = 1, nstep do
    dataset:read(pos, hdf5.double, memspace, filespace)
    for i = 0, N - 1 do
        assert(pos[i].x == math.random())
        assert(pos[i].y == math.random())
        assert(pos[i].z == math.random())
    end
    filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()
~~~