mirror of https://github.com/TREX-CoE/trexio.git synced 2024-12-21 20:04:05 +01:00

Improved documentation

This commit is contained in:
Anthony Scemama 2023-02-17 17:41:36 +01:00
parent e321abc6f4
commit d37f4d6610
11 changed files with 416 additions and 63 deletions


@ -1,6 +1,6 @@
#+TITLE: Examples
#+STARTUP: latexpreview
#+SETUPFILE: docs/theme.setup
#+SETUPFILE: ./theme.setup
* Writing nuclear coordinates

60
docs/intro.org Normal file

@ -0,0 +1,60 @@
#+TITLE: Motivation
#+STARTUP: latexpreview
#+SETUPFILE: ./theme.setup
#+BEGIN_EXPORT html
</td>
<td>
<img src="trexio_logo.png" alt="TREXIO logo"
align="left" width="200" vspace="20" hspace="20" />
</td></tr>
</table>
#+END_EXPORT
Quantum chemistry relies on quantum mechanics to explain and predict
the properties and behaviors of atoms, molecules, and materials.
Although density functional theory (DFT) is one of the most widely
used approaches thanks to its excellent ratio between computational
cost and accuracy, another important tool is wave function theory
(WFT), which describes the behavior of a quantum system in terms of
its wave function.
In order to perform WFT calculations, it is necessary to manipulate a
large number of parameters, such as the expansion coefficients of the
wave function and the matrix elements of the Hamiltonian operator.
These parameters can be numerous and difficult to handle, making it
important to have a robust and efficient method for storing and
accessing them.
Reproducible research remains a challenging topic, despite recent
advances such as the introduction of the FAIR (findable, accessible,
interoperable, reusable) data principles. A key
aspect of reproducibility is software interoperability, which refers
to the ability of different programs to work together and exchange
data so that they function as a cohesive whole.
Interoperable software is prevalent nowadays and is a key component of
the Unix philosophy. In Unix shells, the most
straightforward application of software interoperability is made
through the use of the /pipe/ operator, where the output of a
program is the input of another program.
Similarly, shell scripts are created through the composition of
smaller programs, exchanging data through files or pipes.
A major challenge of reproducible research is the unified input/output
(I/O) of data within a particular research domain. The Unix
philosophy recommends the use of text files because they are
architecture-independent, readable in any language, and can be read as
a stream, which is useful for making programs communicate over a
network.
However, storing data in a text format can result in larger file sizes
and conversion from ASCII to binary format can be computationally
expensive for large data sets. To address this concern,
domain-specific binary formats have been developed, such as the Joint
Photographic Experts Group (JPEG) format for digital images
and the Moving Picture Experts Group (MPEG) format for videos.
These binary formats are utilized through a standardized application
programming interface (API).
In the field of wave function theory, such a standard format and API
are still lacking, and the purpose of the TREXIO library is to fill
this gap.

311
docs/lib.org Normal file

@ -0,0 +1,311 @@
#+TITLE: The TREXIO library
#+STARTUP: latexpreview
#+SETUPFILE: ./theme.setup
* Format specification
#+BEGIN_EXPORT html
</td>
<td>
<img src="trex_specs.png" alt="TREX in a library"
align="right" width="300" vspace="20" hspace="20" />
</td></tr>
</table>
#+END_EXPORT
#
The TREXIO format is designed to store all the necessary information
to represent a wave function.
One notable feature of TREXIO is that it is self-contained, meaning
that all the parameters needed to recreate the wave function are
explicitly stored within the file, eliminating the need for external
databases. For example, instead of storing the name of a basis set
(such as cc-pVDZ), the actual basis set parameters used in the
calculation are stored.
** Organization of the data
The data in TREXIO are organized into *groups*, each containing
multiple *attributes* defined by their *type* and *dimensions*. Each
attribute within a group corresponds to a single scalar or array
variable in a code. In what follows, the notation
~<group>.<attribute>~ will be used to identify an attribute within a
group. For example, ~nucleus.charge~ refers to the
~charge~ attribute in the ~nucleus~ group. It is an array of type
~float~ with dimensions ~nucleus.num~, the attribute describing the
number of nuclei.
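The group/attribute layout described above can be pictured with a small in-memory model. This is only an illustrative sketch, not the TREXIO API: the ~SPEC~ dictionary, the ~resolve_shape~ helper and the sample values are assumptions made for the example; only the names ~nucleus.num~ and ~nucleus.charge~ come from the text.

```python
# Hypothetical model of groups and attributes; not the TREXIO implementation.
# Each attribute is described by its type and by the names of its dimensions.
SPEC = {
    "nucleus": {
        "num":    {"type": "dim",   "dims": []},               # scalar
        "charge": {"type": "float", "dims": ["nucleus.num"]},  # 1D array
    }
}

def resolve_shape(spec, data, group, attribute):
    """Return the concrete shape of <group>.<attribute>.

    Every dimension is itself a <group>.<attribute> name whose value
    is looked up in the stored data.
    """
    shape = []
    for dim in spec[group][attribute]["dims"]:
        dim_group, dim_attribute = dim.split(".")
        shape.append(data[dim_group][dim_attribute])
    return tuple(shape)

# Illustrative content: three nuclei (values chosen for the example only).
data = {"nucleus": {"num": 3, "charge": [8.0, 1.0, 1.0]}}

shape = resolve_shape(SPEC, data, "nucleus", "charge")
scalar_shape = resolve_shape(SPEC, data, "nucleus", "num")
```

In this model, ~nucleus.charge~ resolves to a one-dimensional array whose length is the stored value of ~nucleus.num~, while ~nucleus.num~ itself is a scalar.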
** Data types
So that TREXIO can be used in any language, we use a limited number
of data types. The main data types are ~int~ for integers,
~float~ for floating-point values, and ~str~ for
character strings. For complex numbers, their real and imaginary
parts are stored as ~float~. To minimize the risk of integer
overflow and accuracy loss, numerical data types are stored using
64-bit representations by default. However, in specific cases where
integers are bounded (such as orbital indices in four-index
integrals), the smallest possible representation is used to reduce the
file size. The API handles any necessary type conversions.
There are also two types derived from ~int~: ~dim~ and ~index~.
~dim~ is used for dimensioning variables, which are positive integers
used to specify the dimensions of an array. In the previous example,
~nucleus.num~ is a dimensioning variable that specifies the
dimensions of the ~nucleus.charge~ array. ~index~ is used for
integers that correspond to array indices, because some languages
(such as C or Python) use zero-based indexing, while others (such as
Fortran) use one-based indexing. For convenience, values of the
~index~ type are shifted by one when TREXIO is used in one-based
languages to be consistent with the semantics of the language.
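The index shift can be sketched as follows, under the assumption that values of the ~index~ type are kept zero-based in storage and shifted only at the language boundary. The two helper functions are hypothetical and are not part of the TREXIO API.

```python
def to_language_index(stored, one_based):
    """Convert stored (assumed zero-based) indices to the caller's convention."""
    shift = 1 if one_based else 0
    return [i + shift for i in stored]

def from_language_index(values, one_based):
    """Convert indices given in the caller's convention back to zero-based."""
    shift = 1 if one_based else 0
    return [i - shift for i in values]

stored = [0, 1, 4]                                       # as kept in the file
fortran_view = to_language_index(stored, one_based=True)
c_view = to_language_index(stored, one_based=False)
round_trip = from_language_index(fortran_view, one_based=True)
```

A one-based caller sees every index incremented by one, a zero-based caller sees the stored values unchanged, and the round trip restores the original values.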
You may also encounter some ~dim readonly~ variables. Their values
are automatically computed and written by the TREXIO library, so
they are read-only and cannot be (over)written by the user.
Arrays can be stored in either dense or sparse formats. If the
sparse format is selected, the data is stored in coordinate format.
For example, the element ~A(i,j,k,l)~ is stored as a quadruplet of
integers $(i,j,k,l)$ along with the corresponding value. Typically,
two-dimensional arrays are stored as dense arrays, while arrays with
higher dimensions are stored in sparse format.
Sparse data structures can be too large to fit in memory, in which
case the data must be fetched through multiple function calls that
perform I/O on buffers. For more information on how to read/write
sparse data structures, see the [[./examples.html][examples]].
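As a rough illustration of the coordinate format and of buffered access, the sketch below converts a small four-index array into (indices, value) pairs and then consumes them in fixed-size buffers. The function names, the threshold parameter and the buffer size are assumptions made for this example; the actual TREXIO calls are described in the examples page.

```python
from itertools import islice

def dense_to_coo(a, n, threshold=0.0):
    """Yield ((i, j, k, l), value) for elements whose magnitude exceeds threshold."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    value = a[i][j][k][l]
                    if abs(value) > threshold:
                        yield (i, j, k, l), value

def chunks(iterable, size):
    """Yield successive lists of at most `size` items, mimicking buffered I/O."""
    iterator = iter(iterable)
    while True:
        buffer = list(islice(iterator, size))
        if not buffer:
            return
        yield buffer

n = 2
a = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(n)]
a[0][0][0][0] = 1.0
a[0][1][0][1] = 0.5
a[1][1][1][1] = 0.25

coo = list(dense_to_coo(a, n))
buffers = list(chunks(coo, 2))   # each buffer would correspond to one I/O call
```

Only three of the sixteen elements are stored, and they are processed in two buffers, so the full array never needs to be held in memory by the consumer.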
For the Configuration Interaction (CI) and Configuration State
Function (CSF) groups, the ~buffered~ data type is introduced, which
allows similar incremental I/O as for ~sparse~ data but without the
need to write indices of the sparse values.
For determinant lists (integer bit fields), the ~special~ attribute
is present in the type. This means that the source code is not
produced by the generator, but hand-written.
Some data may be complex. In that case, the real part should be stored
in the variable, and the imaginary part will be stored in the variable
with the same name suffixed by ~_im~.
* The TREXIO library
#+BEGIN_EXPORT html
</td>
<td>
<img src="trex_lib.png" alt="TREX in a library"
align="left" width="300" vspace="20" hspace="20" />
</td></tr>
</table>
#+END_EXPORT
The TREXIO library is written in the C language, and is licensed under
the open-source 3-clause BSD license to allow for use in all types of
quantum chemistry software, whether commercial or not.
The design of the library is divided into two main sections: the
front-end and the back-end. The front-end serves as the interface
between users and the library, while the back-end acts as the
interface between the library and the physical storage.
** The front-end
By using the TREXIO library, users can store and extract data in a
consistent and organized manner. The library provides a user-friendly
API, including functions for reading, writing, and checking for the
existence of data. The functions follow the pattern
~trexio_[has|read|write]_<group>_<attribute>~, where the
group and attribute specify the particular data being accessed. It
also includes an error handling mechanism, in which each function call
returns an exit code of type ~trexio_exit_code~, explaining
the type of error.
This can be used to catch exceptions and improve debugging in the
upstream user application.
To ensure the consistency of the data, the attributes can only be
written if all the other attributes on which they explicitly depend
have been written. For example, as the ~nucleus.coord~ array is
dimensioned by the number of nuclei ~nucleus.num~, the ~nucleus.coord~
attribute can only be written after ~nucleus.num~. However, the
library is not aware of non-explicit dependencies, such as the
relation between the electron repulsion integrals (ERIs) and MO
coefficients. A complete control of the consistency of the data is
therefore impossible, so attributes are /immutable/ by default. By
allowing data to be written only once, the
risk of modifying data in a way that creates inconsistencies is
reduced. For example, if the ERIs have already been written, it would
be inconsistent to later modify the MO coefficients. To allow for
flexibility, the library also allows for the use of an /unsafe/
mode, in which data can be overwritten. However, this mode carries
the risk of producing inconsistent files, and the ~metadata~ group's
~unsafe~ attribute is set to ~1~ to indicate that the file has
potentially been modified in a dangerous way. This attribute can be
manually reset to ~0~ if the user is confident that the modifications
made are safe.
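The write rules described above can be modeled with a toy class. This is a sketch under stated assumptions and not the TREXIO implementation: ~ToyFile~, its dependency table, and the use of Python exceptions in place of exit codes are all hypothetical, and the toy sets its unsafe flag as soon as the file is opened in unsafe mode, which is a modeling choice.

```python
class ToyFile:
    """Toy model of write-once attributes with explicit dependencies."""

    # Explicit dependencies: an array can only be written after its dimension.
    DEPENDS = {
        "nucleus.coord": ["nucleus.num"],
        "nucleus.charge": ["nucleus.num"],
    }

    def __init__(self, unsafe=False):
        self.unsafe = unsafe
        self.data = {}
        if unsafe:
            # Flag the file as potentially modified in a dangerous way.
            self.data["metadata.unsafe"] = 1

    def has(self, name):
        return name in self.data

    def write(self, name, value):
        missing = [d for d in self.DEPENDS.get(name, []) if d not in self.data]
        if missing:
            raise LookupError(f"{name} requires {missing} to be written first")
        if name in self.data and not self.unsafe:
            raise PermissionError(f"{name} already exists and is immutable")
        self.data[name] = value

f = ToyFile()
outcomes = []
for name, value in [
    ("nucleus.coord", [[0.0, 0.0, 0.0]]),  # rejected: nucleus.num is missing
    ("nucleus.num", 1),
    ("nucleus.coord", [[0.0, 0.0, 0.0]]),
    ("nucleus.num", 2),                    # rejected: already written
]:
    try:
        f.write(name, value)
        outcomes.append("ok")
    except (LookupError, PermissionError) as err:
        outcomes.append(type(err).__name__)

u = ToyFile(unsafe=True)
u.write("nucleus.num", 1)
u.write("nucleus.num", 2)                  # allowed in unsafe mode
```

In safe mode the second write of ~nucleus.num~ is refused, while the unsafe file accepts it and keeps its flag set to ~1~.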
** The back-end
At present, TREXIO supports two back-ends: one relying only on the
C standard library to produce plain text files (the so-called /text/
back-end), and one relying on the HDF5 library.
With the text back-end, the TREXIO "file" is a directory containing
multiple text files, one for each group. This back end is intended
to be used in development environments, as it gives the user
access to standard tools such as ~diff~ and ~grep~.
In addition, text files are better suited than binary files to
version control systems such as Git, so this format can also be
used for storing reference data for unit tests.
HDF5 is a binary file format and library for storing and managing
large amounts of data in a hierarchical structure. It allows users
to manipulate data in a way similar to how files and directories
are manipulated within the file system. It is widely used in
scientific and engineering applications, and is known for its high
performance and ability to handle large data sets efficiently. The
HDF5 library achieves this performance through its memory mapping
mechanism and supports advanced features such as serial and
parallel I/O, chunking, and compression filters. However, HDF5
files are in binary format, so additional tools such as ~h5dump~
are required to view them in a human-readable form.
The TREXIO HDF5 back-end is the recommended choice for production
environments, as it provides high I/O performance. Furthermore,
all data is stored in a single file, making it especially suitable
for parallel file systems like Lustre. These file systems are
optimized for large, sequential I/O operations and are not
well-suited for small, random I/O operations. When multiple small
files are used, the file system may become overwhelmed with
metadata operations like creating, deleting, or modifying files,
which can adversely affect performance.
In a benchmarking program designed to compare the two back-ends of
the library, the HDF5 back-end was found to be significantly faster
than the text back-end. The program wrote a wave function made up
of 100 million Slater determinants and measured the time taken to
write the Slater determinants and CI coefficients. The HDF5
back-end achieved a speed of $10.4\times10^6$ Slater determinants
per second and a data transfer rate of 406 MB/s, while the text
back-end had a speed of $1.1\times10^6$ determinants per second and
a transfer rate of 69 MB/s. These results were obtained on a DELL
960 GB mix-use solid-state drive (SSD). The HDF5 back-end was able
to achieve a performance level close to the peak performance of the
SSD, while the text back-end's performance was limited by the speed
of the CPU for performing binary to ASCII conversions.
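The benchmark figures quoted above imply a few derived quantities, worked out in the short calculation below. It assumes that 1 MB means $10^6$ bytes and that the quoted rates hold over the whole run; the derived numbers are therefore estimates, not additional measurements.

```python
n_det = 100e6                      # Slater determinants written

hdf5_rate, text_rate = 10.4e6, 1.1e6     # determinants per second
hdf5_bw, text_bw = 406e6, 69e6           # bytes per second (1 MB = 1e6 bytes assumed)

speedup = hdf5_rate / text_rate          # about 9.5
hdf5_time = n_det / hdf5_rate            # about 9.6 s
text_time = n_det / text_rate            # about 91 s

# Average number of bytes transferred per determinant (coefficient included).
hdf5_bytes_per_det = hdf5_bw / hdf5_rate  # about 39 bytes
text_bytes_per_det = text_bw / text_rate  # about 63 bytes
```

Under these assumptions the HDF5 back-end is roughly nine to ten times faster in determinants per second, and each determinant also occupies fewer bytes on disk than in the text representation.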
In addition to the HDF5 and text back-ends, it is also possible to
introduce new back-ends to the library. For example, a back-end
could be created to support object storage systems, such as those
used in cloud-based applications or for archiving in open data
repositories.
** Supported languages
One of the main benefits of using C as the interface for a library is
that it is easy to use from other programming languages. Many
programming languages, such as Python or Julia, provide built-in
support for calling C functions, which means that it is relatively
straightforward to write a wrapper that allows a library written in C
to be called from another language.
In general, libraries with a C interface are the easiest to use from
other programming languages, because C is widely supported and has a
simple, stable application binary interface (ABI). Other languages,
such as Fortran and C++, may have more complex ABIs and may
require more work to interface with them.
TREXIO has been employed in codes developed in various programming
languages, including C, C++, Fortran, Python, OCaml, and Julia.
Because Julia is designed to call C functions without additional
manual interfacing, the TREXIO C header file was integrated
automatically into Julia programs using the ~CBindings.jl~ package.
In contrast, specific bindings have been provided for Fortran, Python,
and OCaml to simplify the user experience.
In particular, the binding for Fortran is not distributed as multiple
compiled Fortran module files (~.mod~), but instead as a single
Fortran source file (~.F90~). The distribution of the source file
instead of the compiled module has multiple benefits. It ensures that
the TREXIO module is always compiled with the same compiler as the
client code, avoiding the compatibility problem of ~.mod~ files
between different compiler versions and vendors. The single-file
model requires very few changes in the build system of the user's
codes, and it facilitates the search for the interface of a particular
function. In addition, advanced text editors can parse the TREXIO
interface to propose interactive auto-completion of the TREXIO
function names to the developers.
Finally, the Python module, partly generated with SWIG and fully
compatible with NumPy, allows Python users to interact with the
library in a more intuitive and user-friendly way. Using the Python
interface is likely the easiest way to begin using TREXIO and
understanding its features. In order to help users get started with
TREXIO and understand its functionality, tutorials in Jupyter
notebooks are available on GitHub
(https://github.com/TREX-CoE/trexio-tutorials), and can be executed
via the Binder platform.
** Source code generation and documentation
Source code generation is a valuable technique that can significantly
improve the efficiency and consistency of software development. By
using templates to generate code automatically, developers can avoid
manual coding and reduce the risk of errors or inconsistencies. This
approach is particularly useful when a large number of functions
follow similar patterns, as in the case of the TREXIO library, where
functions are named according to the pattern
~trexio_[has|read|write]_<group>_<attribute>~.
By generating these functions from the format specification using
templates, the developers ensure that the resulting code follows a
consistent structure.
The description of the format is written in a text file in the Org
format. Org is a structured plain text format, containing information
expressed in a lightweight markup language similar to the popular
Markdown language. While Org was introduced as a mode of the GNU
Emacs text editor, its basic functionalities have been implemented in
most text editors such as Vim, Atom or VS Code.
There are multiple benefits in using the Org format. The first
benefit is that the Org syntax is easy to learn and allows for the
insertion of equations in \LaTeX{} syntax. Additionally, Org files
can be easily converted to HyperText Markup Language (HTML) or
Portable Document Format (PDF) for generating documentation. The
second benefit is that GNU Emacs is a programmable text editor and
code blocks in Org files can be executed interactively, similar to
Jupyter notebooks. These code blocks can also manipulate data defined
in tables and this feature is used to automatically transform tables
describing groups and attributes in the documentation into a
JavaScript Object Notation (JSON) file.
This JSON file is then used by a Python script to generate the needed
functions in the C language, as well as header files and some files
required for the Fortran, Python, and OCaml interfaces.
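The idea of deriving function names from a JSON description can be sketched in a few lines. The JSON layout used here (each attribute mapped to a type and a list of dimensions) is an assumption made for the illustration, and the loop is a simplified stand-in for the actual generator script, which fills complete source templates rather than producing names only.

```python
import json

# Assumed, simplified layout: group -> attribute -> [type, dimensions].
spec_json = """
{
  "nucleus": {
    "num":    ["dim",   []],
    "charge": ["float", ["nucleus.num"]]
  }
}
"""

def function_names(spec):
    """Expand trexio_[has|read|write]_<group>_<attribute> for every attribute."""
    names = []
    for group, attributes in spec.items():
        for attribute in attributes:
            for operation in ("has", "read", "write"):
                names.append(f"trexio_{operation}_{group}_{attribute}")
    return names

names = function_names(json.loads(spec_json))
```

Two attributes in one group yield six function names, which shows why adding a row to a table of the specification is enough to extend the generated API.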
With this approach, contributions to the development of the TREXIO
library can be made simply by adding new tables to the Org file, which
can be submitted as /pull requests/ on the project's GitHub
repository (https://github.com/trex-coe/trexio). Overall, this
process allows for a more efficient and consistent development process
and enables contributions from a wider range of individuals,
regardless of their programming skills.
** Availability
The TREXIO library is designed to be portable and easy to install
on a wide range of systems. It follows the C99 standard to ensure
compatibility with older systems, and can be configured with either
the GNU Autotools or the CMake build systems. The only external
dependency is the HDF5 library, which is widely available on HPC
platforms and as packages on major Linux distributions. Note that
it is possible to disable the HDF5 back-end at configuration time,
allowing TREXIO to operate only with the text back-end and have
zero external dependencies. This can be useful for users who may
not be able to install HDF5 on certain systems.
TREXIO is distributed as a tarball containing the source code,
generated code, documentation, and Fortran interface. It is also
available as a binary ~.deb~ package for Debian-based Linux
distributions and as packages for Guix, Spack and Conda. The Python
module can be found in the PyPI repository, the OCaml binding is
available in the official OPAM repository, and the ~.deb~ packages
are available in Ubuntu 23.04.


@ -1,11 +1,11 @@
# -*- mode: org; -*-
#+HTML_LINK_HOME: index.html
#+OPTIONS: H:4 num:t toc:t \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t d:(HIDE)
#+OPTIONS: H:4 num:t toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t d:(HIDE)
####### #+SETUPFILE: ../docs/org-html-themes/org/theme-readtheorg.setup
#+INFOJS_OPT: toc:t mouse:underline path:org-info.js
#+INFOJS_OPT: mouse:underline path:org-info.js
#+HTML_HEAD: <link rel="stylesheet" title="Standard" href="trexio.css" type="text/css" />
#+STARTUP: align nodlcheck hidestars oddeven lognotestate

BIN
docs/trex_lib.png Normal file

Binary file not shown (size: 1.6 MiB).

BIN
docs/trex_specs.png Normal file

Binary file not shown (size: 1.5 MiB).

BIN
docs/trexio.png Normal file

Binary file not shown (size: 425 KiB).

BIN
docs/trexio_logo.png Normal file

Binary file not shown (size: 37 KiB).


@ -2,34 +2,57 @@
#+PROPERTY: comments org
#+SETUPFILE: ../docs/theme.setup
--------------------------------
#+BEGIN_EXPORT html
<script async src="https://cse.google.com/cse.js?cx=a67f8ab65a66f97f2"></script>
<div class="gcse-search"></div>
#+END_EXPORT
--------------------------------
------------------
TREXIO is an open-source file format and library developed for the storage and
manipulation of data produced by quantum chemistry calculations. It was
designed with the goal of providing a reliable and efficient method of storing
and exchanging wave function parameters and matrix elements.
- [[./tutorial_benzene.html][Tutorial]]
The library consists of a front-end implemented in the C programming language
and two different back-ends: a text back-end and a binary back-end utilizing
the HDF5 library enabling fast read and write speeds. It is compatible with a
variety of platforms and has interfaces for Fortran, Python, and OCaml.
--------------------------------
#+BEGIN_EXPORT html
<table style="width:100%">
<tr><td style="width:50%">
#+END_EXPORT
- [[./intro.html][Motivation]]
- [[./lib.html][The TREXIO library]]
- [[./trex.html][Data stored with TREXIO]]
- [[./tutorial_benzene.html][Tutorial]]
- [[./examples.html][How-to guide]]
- [[./templator_front.html][Front end API]]
- [[./templator_hdf5.html][HDF5 back end]]
- [[./templator_text.html][TEXT back end]]
#+BEGIN_EXPORT html
</td>
<td>
<img src="trexio.png" alt="T-Rex talking about chemistry"
align="right" width="300"/>
</td></tr>
</table>
#+END_EXPORT
--------------------------------
The TREXIO library defines a standard format for storing wave functions,
together with a C-compatible API such that it can be easily used in any
programming language.
The source code of the library is available at
https://github.com/trex-coe/trexio
and bug reports should be submitted at
https://github.com/trex-coe/trexio/issues.
The TREXIO library is licensed under the open-source 3-clause BSD license.
------------------
[[https://trex-coe.eu/sites/default/files/inline-images/euflag.jpg]] [[https://trex-coe.eu][TREX: Targeting Real Chemical Accuracy at the Exascale]] project has received funding from the European Union's Horizon 2020 - Research and Innovation program - under grant agreement no. 952165. The content of this document does not represent the opinion of the European Union, and the European Union is not responsible for any use that might be made of such content.


@ -81,7 +81,7 @@ function extract_doc()
${org} \
--load ${CONFIG_TANGLE} \
-f org-html-export-to-html &> /dev/null
mv ${local_html} ${DOCS}
mv -f ${local_html} ${DOCS}
rm -f "${local_html}~"
}
@ -99,7 +99,7 @@ function main() {
# Create documentation
cd ${SRC}
for dir in ${SRC}/templates_*/ ${TREXIO_ROOT}/
for dir in ${SRC}/templates_*/ ${TREXIO_ROOT}/ ${TREXIO_ROOT}/docs
do
dir=${dir%*/}
echo ${dir}


@ -1,58 +1,17 @@
#+TITLE: TREX Configuration file
#+TITLE: Data stored in TREXIO
#+STARTUP: latexpreview
#+SETUPFILE: docs/theme.setup
This page contains information about the general structure of the
TREXIO library. The source code of the library can be automatically
generated based on the contents of the ~trex.json~ configuration file,
which itself is generated from different sections (groups) presented
below.
For simplicity, the singular form is always used for the names of
groups and attributes, and all data are stored in atomic units.
The dimensions of the arrays in the tables below are given in
column-major order (as in Fortran), and the ordering of the dimensions
is reversed in the produced ~trex.json~ configuration file as the
library is written in C.
All quantities are saved in TREXIO files in atomic units. The
dimensions of the arrays in the tables below are given in column-major
order (as in Fortran), and the ordering of the dimensions is reversed
in the produced ~trex.json~ configuration file as the library is
written in C.
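The reversal of the dimension order between column-major and row-major conventions can be checked with a small calculation. The helper functions and the ~(3, 4)~ shape below are assumptions chosen for the illustration; they only show that an array documented with column-major dimensions ~(3, N)~ has the same memory layout as a row-major array with dimensions ~(N, 3)~.

```python
def col_major_offset(index, dims):
    """Flat offset of `index` when the first dimension varies fastest (Fortran)."""
    offset, stride = 0, 1
    for i, d in zip(index, dims):
        offset += i * stride
        stride *= d
    return offset

def row_major_offset(index, dims):
    """Flat offset of `index` when the last dimension varies fastest (C)."""
    offset = 0
    for i, d in zip(index, dims):
        offset = offset * d + i
    return offset

fortran_dims = (3, 4)                       # e.g. three components for four items
c_dims = tuple(reversed(fortran_dims))      # (4, 3)

same_layout = all(
    col_major_offset((x, a), fortran_dims) == row_major_offset((a, x), c_dims)
    for x in range(3)
    for a in range(4)
)
```

Because every element lands at the same flat offset in both conventions, reversing the order of the dimensions is sufficient and no data needs to be transposed.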
TREXIO currently supports ~int~, ~float~ and ~str~ types for both
single attributes and arrays. Note that some attributes might have
~dim~ type (e.g. ~num~ of the ~nucleus~ group). This type is treated
exactly in the same way as ~int~ with the only difference that ~dim~
variables cannot be negative. This additional constraint is required
because ~dim~ attributes are used internally to allocate memory and to
check array boundaries in the memory-safe API. Most of the time, the
~dim~ variables contain the ~num~ suffix.
You may also encounter some ~dim readonly~ variables.
It means that the value is automatically computed and written by the
TREXIO library, thus it is read-only and cannot be (over)written by the
user.
In Fortran, arrays are 1-based and in most other languages the
arrays are 0-based. Hence, we introduce the ~index~ type which is a
1-based ~int~ in the Fortran interface and 0-based otherwise.
For sparse data structures such as electron repulsion integrals,
the data can be too large to fit in memory and the data needs to be
fetched using multiple function calls to perform I/O on buffers.
For more information on how to read/write sparse data structures, see
the [[./examples.html][examples]]. The ~sparse~ data representation implies the
[[https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)][coordinate list]] representation, namely the user has to write a list
of indices and values.
For the Configuration Interaction (CI) and Configuration State Function (CSF)
groups, the ~buffered~ data type is introduced, which allows similar incremental
I/O as for ~sparse~ data but without the need to write indices of the sparse values.
For determinant lists (integer bit fields), the ~special~ attribute is present in the type.
This means that the source code is not produced by the generator, but hand-written.
Some data may be complex. In that case, the real part should be stored
in the variable, and the imaginary part will be stored in the variable
with the same name suffixed by ~_im~.
#+begin_src python :tangle trex.json :exports none
#+begin_src python :tangle trex.json :exports none
{
#+end_src
#+end_src
* Metadata (metadata group)