25 KiB
Important algorithms for CIPSI
#+LaTeX_CLASS_OPTIONS:[aspectratio=169]
Determinant-driven methods
-
Integral-driven : sequential access to $\mathcal{O}(N^4)$ integrals, indirect access to vectors
for (i,j,k,l,integral) in all_integrals: pairs = find_determinant_pairs(i,j,k,l,ijkl) for (d1,d2) in pairs: do_work(d1,d2)
-
Determinant-driven : sequential access to vectors, indirect access to integrals
for d1 in determinants: for d2 in determinants: i,j,k,l = get_excitation(d1,d2) do_work(d1,d2)
- Integral-driven: $\mathcal{O}(\nmo^4)$
- Determinant-driven: $\mathcal{O}(\Ndet^2)$
- Efficient CIPSI: How to be efficient within a determinant-driven approach
Data structures
Slater-Condon's rules
Need for functions : $f(I, J) \rightarrow (i,j,k,l,\phi)$
Double-determinant representation
A Slater determinant can be written as a Waller-Hartree double determinant
\[ |I \rangle = \hat{I}\, |\rangle = -1^p \times \hat{I}_\uparrow \, \hat{I}_\downarrow\, |\rangle = -1^p \times \hat{I}_\uparrow \, |\rangle \otimes\ \hat{I}_\downarrow \, |\rangle \]
Storage:
- 1 determinant: one integer for $\hat{I}_\uparrow$ and one integer for $\hat{I}_\downarrow$
- Set the bit to
1
if the orbital is occupied - $>64$ orbitals: $N_{\text{int}}$ integers for $\hat{I}_\uparrow$ and for $\hat{I}_\downarrow$
Fast determinant comparisons
Bitwise operations (1 CPU cycle):
and
,or
,xor
,shl
,shr
: logicalshl
,shr
: shift left/rightlzcnt
,tzcnt
: Number of leading/trailing zero bitspopcnt
: Number of bits set to1
Example: degree of excitation between $|I\rangle$ and $|J\rangle$:
integer function degree(det_i, det_j, N_int)
integer, intent(in) :: N_int
integer*8, intent(in) :: det_i(N_int,2), det_j(N_int,2)
integer :: two_d, i
two_d = 0
do i=1,N_int
two_d = two_d + popcnt( ieor( det_i(i,1), det_j(i,1) ) ) &
+ popcnt( ieor( det_i(i,2), det_j(i,2) ) )
end do
degree = rshift(two_d,1)
end function degree
Fast determinant comparisons
/scemama/Nuclearistes/src/branch/master/slaterRules.pdf
To get the orbital indices:
number of leading/trailing zeros gives the positions of the 1
's.
Integrals
Constraints
- Integrals require a fast random access
- 8-fold permtutation symmetry $\langle ij|kl \rangle = \langle kj|il \rangle = \cdots$
- Many integrals are zero: need for a sparse data structure
Implementation
- Hash table
- $f(i,j,k,l) -> K$ gives the same $K$ for all similar permutations
- $f(i+1,j,k,l) - f(i,j,k,l)$ is likely to be 1 : locality
- Array (cache) for $128^4$ frequently used integrals
Integrals
Access | Array | Hash table |
$i,j,k,l$ | 9.72 | 125.79 |
$i,j,l,k$ | 9.72 | 120.64 |
$i,k,j,l$ | 10.29 | 144.65 |
$l,k,j,i$ | 88.62 | 125.79 |
$l,k,i,j$ | 88.62 | 120.64 |
Random | 170.00 | 370.00 |
Efficient direct CI
Sorting
Popular misconception
Sorting is not $\mathcal{O}(N \log(N))$ :
(linear in $N$, log in $M$)
. B_ignoreheading
- A is an array of $N$ integer values
-
The bitmask is an integer with only one bit set to one (00001000)
void radix_sort(int* A, size_t N, int bitmask) { if (bitmask == 0) return; int left[N], right[N]; int p=0 ; int q=0 ; for (int i=0 ; i<N ; i++) { if (A[i] & bitmask) { right[q] = A[i]; q++; } else { left [p] = A[i]; p++; } } radix_sort(left , p, bitmask >> 1) ; radix_sort(right, q, bitmask >> 1) ; for (int i=0 ; i<p ; i++) { A[ i] = left [i] ; } for (int i=0 ; i<q ; i++) { A[p+i] = right[i] ; } }
Double-determinant representation of $\Psi$
\[ \mPsi = \sum_\mi \mci |\mi\rangle = \sum_{\mk=1}^{\Ndetup} \sum_{\mm=1}^{\Ndetdn} C_{\textcolor{darkgreen}{km}} \mDup \mDdn \]
- If $\mDup$ and $\mDdn$ are represented as $\nmo$ -bit strings, this transformation can be done in $\mathcal{O}(\Ndet \times \nmo)$ (sorting).
- Searching for same-spin excitations: looping over $\mk$ or $\mm$ : $\mathcal{O}(\Ndetup) \sim \mathcal{O}(\sqrt{\Ndet})$
$\mathcal{H} | \Psi \rangle$
For all $\mi = \mDup \mDdn$ in $\mPsi$:
- Find indices $p$ of $\uparrow$ singles and $\uparrow \uparrow$ doubles \[ \langle \mi | \mathcal{H} | \Psi \rangle = \sum_J \langle \mi | \mathcal{H}| J \rangle c_J = \sum_p \langle \mDup \mDdn | \mathcal{H} | D_p^\uparrow \mDdn \rangle C_{pm} \]
- Find indices $q$ of $\downarrow$ singles and $\downarrow \downarrow$ doubles \[ \langle \mi | \mathcal{H} | \Psi \rangle = \sum_J \langle \mi | \mathcal{H}| J \rangle c_J = \sum_q \langle \mDup \mDdn | \mathcal{H} | \mDup D_q^\downarrow \rangle C_{kq} \]
-
Find indices $pq$ of $\uparrow \downarrow$ doubles:
- Find indices $p$ of $\uparrow$ singles
- Find indices $q$ of $\downarrow$ singles \\
\[ \langle \mi | \mathcal{H} | \Psi \rangle = \sum_J \langle \mi | \mathcal{H}| J \rangle c_J = \sum_{pq} \langle \mDup \mDdn | \mathcal{H} | D_p^\uparrow D_q^\downarrow \rangle C_{pq} \]
Scaling with $N_{\text{det}}$
Parallel efficiency
Stochastic evaluation of the PT2 correction and selection
Epstein-Nesbet Second order correction
Consider a wave function $\mPsi$ expanded on an arbitrary set $\mD$ of $\Ndet$ orthonormal Slater determinants, \[ \mPsi = \sum_{\mi \in \mD} {\mci} | \mi \rangle, \;\;\; \Evar = \frac{ \langle \mPsi | \mH | \mPsi \rangle}{\langle \mPsi | \mPsi \rangle} \]
The Epstein-Nesbet 2nd order correction to the energy is \[ \Ept = \sum_{\ma \in \mA} \frac{ \langle \mPsi | \mH | \ma \rangle \langle \ma | \mH | \mPsi \rangle}{\Evar - \langle \ma | \mH | \ma \rangle } \]
The set $\mA$ contains the Slater determinants
- that are not in $\mD$
- for which $d(\mi,\ma) = 1$ or $2$ for at least one pair $(\mi,\ma)$
Formal scaling
\[ \Ept = \sum_{\ma \in \mA} \frac{ \langle \mPsi | \mH | \ma \rangle \langle \ma | \mH | \mPsi \rangle}{\Evar - \langle \ma | \mH | \ma \rangle} = \sum_{\ma \in \mA} \frac{ \left( \sum_{\mi \in \mD} \mci \langle \mi | \mH | \ma \rangle \right)^2 }{\Evar - \langle \ma | \mH | \ma \rangle } \]
- Size of $\mA$ : size of $(\hat{T}_1 + \hat{T}_2)|\mPsi\rangle$
- Number of non-zero terms : $d(\mi,\ma) \le 2 \sim \Ndet \times \left[ \textcolor{darkgreen}{\left( \nea \times (\nmo - \nea) \right)^2 }\right]$
- Expensive
Solutions to make simulations possible
"Non-general'' but conventional solutions:
- Partition the MO space into different classes (active, virtual, inactive, \emph{etc})
- Use another zeroth-order Hamiltonian (CAS-PT2, NEV-PT2)
Solutions applicable to any wave function:
- Truncation of $\mD$ to consider only contributions due to large
${\mci}$
But: Truncation $\longrightarrow$ bias because $\Ept$ is a sum of same-sign values (negative). - Algorithmic improvement
- Monte Carlo sampling in $\mA$.
But: Statistical error decreases as $\mathcal{O}\left(1/\sqrt{\Nsamples}\right) \Longrightarrow$ Difficult to get $\textcolor{darkgreen}{10^{-5} a.u}$ precision. - Parallelism
:B_ignoreheading:
Central idea
- Choose an arbitrary ordering of $|\mi \rangle$. Natural choice: \[w_\mi = \frac{\mci ^2}{\langle \mPsi | \mPsi \rangle } \]
- Make \emph{disjoint} groups $\mA_{\mi}$ of $| \ma \rangle$ originating from the same generator $| \mi \rangle$
- Each $\mA_\mi$ has its own contribution $\epsi$ to $\Ept$
Central idea
Contribution per internal determinant
\[ \epsi = \sum_{\ma_\mi \in \mA_\mi} \frac{ \left( \langle \mPsi | \mH | \ma_\mi \rangle \right)^2}{\Evar - \langle \ma_\mi | \mH | \ma_\mi \rangle} \]
From $\mathcal{O}(N_\text{det}^2)$ to $\mathcal{O}(N_\text{det}^{3/2})$
\[ \mPsi = \sum_\mi \mci |\mi\rangle = \sum_{\mk=1}^{\Ndetup} \sum_{\mm=1}^{\Ndetdn} C_{\textcolor{red}{km}} \mDup \mDdn \]
- Sorting is $\mathcal{O}(\Ndet)$
- $\langle \mi | \mH | \ma \rangle \langle \ma | \mH | \mj \rangle = 0$ when $d(\mi, \mj) > 4$
- Loop over $\Ndetup$ determinants (rows of the $C$ matrix)
Remove all the rows where $d(\mDup, \mDupp)>4$ ($\sim \mathcal{O}(\sqrt{\Ndet})$) - Loop over $\Ndetdn$ determinants (columns of the $C$ matrix)
Remove all the columns where $d(\mDdn, \mDdnn)>4)$ - The remaining number of determinants is bounded by the size of the CISDTQ space
Finer filtering
\[ \epsi = \sum_{\ma \in \mA_\mi} \frac{ \langle \mPsi'_I | \mH | \ma_\mi \rangle \langle \ma_\mi | \mH | \mPsi'_I \rangle}{\Evar - \langle \ma_\mi | \mH | \ma_\mi \rangle} \]
- We know that all the $| \ma_\mi \rangle$ are singles and doubles with respect to $| \mi \rangle$
- $|\mPsi'_I\rangle$ is the projection of $|\mPsi \rangle$ on the subspace of determinants in $\mD$ which are no more than quadruply excited with respect to $| \mi \rangle$
- For a subset of excitations $ij \rightarrow ab$, $|\mPsi'\rangle$ is filtered further with possible hole/particle constraints
Monte Carlo sampling
\[ \epsi = \sum_{\ma_\mi \in \mA_\mi} \frac{ \left( \langle \mPsi | \mH | \ma_\mi \rangle \right)^2}{\Evar - \langle \ma_\mi | \mH | \ma_\mi \rangle} \]
- $\langle \mPsi | \mH | \ma_\mi \rangle = \sum_{\mj \ge \mi} \mcj\, \langle\, \mj | \mH | \ma_\mi \rangle$
- $\langle \ma_\mi | \mH | \ma_\mi \rangle$ is always large (otherwise $|\ma_\mi \rangle$ would be better in the \textcolor{red}{variational space}, and PT is questionable)
:B_ignoreheading:
- $\forall \mi \in \mD : \epsi \le 0$
- $|\epsi|$ is expected to decrease as $\mci^2$
- The computational cost decreases with $\mi$
Monte Carlo formulation
\[ \Ept = \sum_{\mi \in \mD} \epsi = \sum_{\mi \in \mD} \mpi \frac{\epsi}{\mpi} = \left\langle \frac{\epsi}{\mpi} \right\rangle_{\mpi} \]
Naive sampling
Uniform sampling: \mpi = \frac{1}{\Ndet}$
Improved sampling
Sampling : $\mpi = \mci^2$
Lazy evaluation
Only $\Ndet$ contributions $\epsi \longrightarrow$ all $\epsi$ can be stored in memory.
Lazy Evaluation (Wikipedia)
In programming language theory, lazy evaluation, or call-by-need is an evaluation strategy which delays the evaluation of an expression until its value is needed (non-strict evaluation) and which also avoids repeated evaluations (sharing).
:B_ignoreheading:
def lazy_e(i):
if not e_is_computed[i]:
e[i] = compute_e(i)
e_is_computed[i] = true
return e[i]
Monte Carlo with Lazy Evaluation
\[ \Ept = \sum_{\mi \in \mD} \epsi = \sum_{\mi \in \mD} \mpi \frac{\epsi}{\mpi} = \left\langle \frac{\epsi}{\mpi} \right\rangle_{\mpi} \]
- Draw a generator determinant $|\mi\,\rangle$ with probability $\mpi$
- Increment $n_\mi$, the number of evaluations of $\epsi$
- If $\epsi$ is not already computed, compute it and store its value
- $\Ept \sim \sum_{\mi \in \mD} \frac{n_\mi}{\Nsamples} \frac{\epsi}{\mpi}$
- Statistical error : $\mathcal{O}\left(1/\sqrt{\Nsamples}\right)$
- Lazy evaluation : Exponential acceleration (time to solution)
Monte Carlo with Lazy Evaluation
Monte Carlo with Lazy Evaluation
Monte Carlo with Lazy Evaluation
Monte Carlo with Variance reduction
- Noise can be smoothed out by averaging
- Split $\mD$ into $\mM$ \emph{equiprobable} sets : "Comb"
\[ \Ept = \sum_{\mi \in \mD} \epsi = \sum_{\mk=1}^{\mM} \sum_{\mi_\mk \in \mD_\mk} \epsik \]
New Monte Carlo estimator
\[ \Ept = \left \langle \frac{1}{\mM} \sum_{\mk=1}^{\mM} \frac{\epsik}{{\textcolor{red}{p_{I_k}}}} \right \rangle_{\textcolor{red}{(p_{I_1}, \dots, p_{I_M})}} \]
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Monte Carlo with Variance reduction
Hybrid deterministic/stochastic scheme
- When all the determinants have been drawn, the \emph{exact} $\Ept$ can be computed
- $\Longrightarrow$ The result with zero statistical error can be reached in a finite time
- In typical wave functions, $90\%$ of the norm is on a few determinants
- Compute the few first contributions $\epsi$, and perform the MC in the rest
\[ \Ept = \sum_{\mi \in \mD_D} \epsi + \left \langle \frac{1}{\mM} \sum_{\mk=1}^{\mM} \frac{\epsik}{\textcolor{red}{p_{I_k}}} \right \rangle_{(\textcolor{red}{p_{I \in \mD_S})}} \]
Hybrid deterministic/stochastic scheme
Make the deterministic part grow during the calculation.
At each MC step
- Draw a random number
- Find the determinants selected by the comb (increment $n_\mi$'s)
- Compute the $\epsi$ which have not been yet computed
- Compute deterministically the first non-computed determinant
- If a tooth of the comb is completely filled $\Longrightarrow$ Deterministic
At any time
\[ \Ept(t) = \sum_{\mi \in \mD_D(t)} \epsi + \sum_{\mi \in \mD_S(t)} \frac{1}{\mM(t)} \frac{n_\mi(t)}{\Nsamples(t)} \frac{\epsi}{\mpi} \]
Hybrid deterministic/stochastic scheme
Hybrid deterministic/stochastic scheme
Hybrid deterministic/stochastic scheme
Hybrid deterministic/stochastic scheme
Some timings: Cr$_2$, $2\,10^7$ determinants, 800 cores
Basis | $\Ept$ | Wall-clock time |
cc-pVDZ | \textcolor{red}{$-0.068\,3(1)$} | 14 min |
\textcolor{red}{$-0.068\,36(1)$} | 55 min | |
\textcolor{red}{$-0.068\,361(1)$} | 2.4 hr | |
\textcolor{red}{$-0.068\,360\,604$} | 3 hr | |
cc-pVTZ | \textcolor{red}{$-0.124\,4(5)$} | 19 min |
\textcolor{red}{$-0.124\,7(1)$} | 58 min | |
\textcolor{red}{$-0.124\,63(1)$} | 3.5 hr | |
\textcolor{red}{$-0.124\,642(1)$} | 8.7 hr | |
— | $\sim$ 15 hr (estimated) | |
cc-pVQZ | \textcolor{red}{$-0.155\,8(5)$} | 56 min |
\textcolor{red}{$-0.155\,9(1)$} | 2.5 hr | |
\textcolor{red}{$-0.155\,95(1)$} | 9.0 hr | |
\textcolor{red}{$-0.155\,952(1)$} | 18.5 hr | |
— | $\sim$ 29 hr (estimated) |