# Core implementation

$ \renewcommand{\vec}[1]{ {\bf #1} } \newcommand{\bnabla}{ \vec{\nabla} } \newcommand{\Rey}{Re} \def\vechat#1{ \hat{ \vec{#1} } } \def\mat#1{#1} $

# Spatial representation

## Finite difference stencil — $r$

A function $p(r)$ is represented by the values $p_n=p(r_n)$, where $p(r)$ is evaluated on the $N$ radial points $r_n$; $n=1..N$. There is no point on the axis $r=0$ to avoid singularities in $1/r$ terms. The points are located at the roots of a Tchebyshev polynomial, such that they are clustered near the wall, and slightly so near the axis to compensate for loss of order in the calculation of derivatives. (They are calculated using banded matrices, implying use of fewer points at the axis and boundary).

### Differentiation

Consider Taylor expansions about central point $x_0$, with $k$ neighbouring points each side

These expansions can be written

Derivatives are calculated using weights from the appropriate row of $A^{-1}$.

### Integration

Integrating the expansions above, from $x_0$ to $x$, the indefinite integral may be approximated about $x_0$:

The total integral is approximated by

$ \frac1{2} \sum_n \int_{x_{n-1}}^{x_{n+1}} f(x) \, {\mathrm d}x$

### Matrix representation of linear operators

Linear equations can be written in the matrix form

$\mat{L} \,\vec{p} = \mat{M}\, \vec{q} + \vec{s}.$

$\mat{L}$ and $\mat{M}$ are $N\times N$ matrices representing linear operators. If $\vec{q}$ and $\vec{s}$ are known then the right-hand side is simple to evaluate. Consider the equation

$\label{eq:matmod} \mat{L}\, \vec{p} = \vec{q},$

where $\vec{p}$ is unknown. If the matrix operators are formed by linear combinations of the weights in #Differentiation, using only $k$ neighbouring points to each side, they are banded. Boundary conditions are formed in the same way and are placed in the end rows. The equation is solved by forward-backward substitution, using the banded LU-factorisation of $\mat{L}$.

figure=LU.eps, scale=0.65

## Fourier representaion — $\theta$, $z$

### Expansion of variables

$A(\theta,z) = \sum_{|k|<K}\sum_{|m|<M}\, A_{km} {\mathrm e}^{{\mathrm i} (\alpha kz + m_0 m\theta)}$

$A$ real implies $A_{km} = A_{-k,-m}^*$ (coefficients are conjugate-symmetric). It is sufficient to keep only $m\ge 0$, and for $m=0$ keep $k\ge 0$.

### Implied conditions on the axis

The geometry enforces conditions on each variable at the axis when expanded over Fourier modes in $\theta$. For scalars and the $z$-component of a vector $A_z$, each mode is even in $r$ if $m$ is even, and is odd if $m$ is odd. Each mode for $A_r$ and $A_\theta$ is even in $r$ if $m$ is odd, and is odd if $m$ is even. This implies that either the function or its derivative is zero on the axis.

### Orthogonality

Exponentials

$\int_0^{2\pi} {\mathrm e}^{{\mathrm i}(m+n)} \, {\mathrm d}\theta = 2\pi \, \delta_{m,-n}\, , \qquad \int_0^{\frac{2\pi}{\alpha}} {\mathrm e}^{{\mathrm i}\alpha(k+j)} \, {\mathrm d}z = \frac{2\pi}{\alpha} \, \delta_{k,-j}$

### Volume integral

$E \, = \, \int A \, {\mathrm d}V \, = \, \frac{4\pi^2}{\alpha} \, \int A_{00} \, r\,{\mathrm d}r$

### Energy integral

$E \, = \, {\textstyle \frac1{2}} \int \vec{A} \cdot \vec{A} \, {\mathrm d}V \, = \, \frac{2\pi^2}{\alpha} \, \sum_{km} \int |\vec{A}_{km}|^2 \, r\,{\mathrm d}r$

$E = \sum_{m\ge 0} E_m, \qquad E_m = \left\{ \begin{array}{ll} \displaystyle \frac{2\pi^2}{\alpha}\int\vec{A}_{00}^2 \, r \, {\mathrm d}r + \frac{4\pi^2}{\alpha} \sum_{k>0} \int|\vec{A}_{k0}^2| \, r \, {\mathrm d}r, & m=0, \\[15pt] \displaystyle \frac{4\pi^2}{\alpha} \sum_k \int |\vec{A}_{km}|^2 \, r \, {\mathrm d}r, & m>0 . \end{array} \right.$

# Temporal discretisation

Temporal discritisation is via a second-order Predictor-Corrector scheme, with Euler predictor for the nonlinear terms and Crank-Nicolson corrector. The linear viscous term is treated implicitly with the Crank-Nicolson term.

## PPE-formulation with correct boundary conditions

An influence-matrix technique is used to avoid issues in satisfying the boundary conditions. See The_PPE_formulation.

## Predictor-corrector

The model equation for each Fourier mode is

$\label{eq:harmmod} (\partial_{t} - \nabla^2) f = N,$

where nonlinear terms have been evaluated on each radial point by the transform method, and is solved as in #Matrix_representation_of_linear_operators. The predictor at time $t_q$, with Euler nonlinear terms and implicitness $c$

$\frac{f_1^{q+1}-f^q}{\Delta t} - \left( c\nabla^2 f_1^{q+1} + (1-c)\nabla^2 f^q \right) = N^q ,$

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_1^{q+1} = \left( \frac1{\Delta t} + (1-c)\nabla^2 \right) f^q + N^q .$

Corrector iterations are

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_{j+1}^{q+1} = \left( \frac1{\Delta t} + (1-c)\nabla^2 \right) f^q + c\,N_j^{q+1} + (1-c)\, N^q ,$

or equivalently, for the correction $f_{corr} = f_{j+1}^{q+1} - f_j^{q+1}$

$\left( \frac1{\Delta t} - c\nabla^2 \right) f_{corr} = c\,N_j^{q+1} - c\, N_{j-1}^{q+1} ,$

where $j=1,2,\dots$ and $N_0^{q+1}=N^q$. The size of the correction $\|f_{corr}\|$ must reduce at each iteration. For $c=\frac1{2}$ the scheme is second order such that $\|f_{corr}\| \sim \Delta t^2$. Stability is improved without observable degradation in accuracy using $c=0.501$ .

## Timestep control

$\Delta t = \mathrm{C} \, \min(\Delta \, / \, |\vec{v}| ) , \qquad 0<\mathrm{C}<1,$

where $\mathrm{C}$ is the Courant number.

The timestep should also be small enough such that the corrector norm $\|f_{corr}\|$ is satisfactorily small. This may be important for integrating initial transients, but ought not to be the limiting factor in general.

# The code structure

## Overview of program files

declaration and values of parameters.`parameters.f90`macros for parallelisation; definition of loops for parallel+serial cases`parallel.h``_loop_km_begin`etc.mpi initialisation.`mpi.f90`core initialisation, main timestepping loop, clean up at end.`main.f90`definition of radial discretisation.`meshs.f90`definition of derived types for spectral- and physical-space data.`variables.f90`transform from collocated spectral space to physical space. Uses fftw3 library.`transform.fftw3.f90`subroutines to set up time stepping matrices.`timestep.f90`functions related to the velocity field; implementation of the timestepping and influence matrix method.`velocity.f90`Input-Output: loading and saving of data.`io.f90`

Each of the files contains a '`module`', declaring variables that may be used in other code via '`use modulename`'. With the exception of the fundamental parameters, the first three letters of the module name are used as a prefix in the name of public variables, to make the location of their declaration clear (see following section).

## Naming conventions

### Parameters

Parameters defined within `parameters.f90` are given a prefix indicating the data type. For example:

$N$ | i_N |
integer |

$\Delta t$ | d_timestep |
double |

fixed flux? | b_fixed_flux |
boolean |

Note that the naming convention frees the variable name for dummy variables, e.g.:

integer :: n do n = 1, i_N ...

### Public variables

By default variables declared within a module are accessible to any other module or subroutine that uses the module. To ensure that it is clear where a public variable is declared, variables are given a prefix. Examples:

$\Delta t$ | tim_dt |
timestep.f90 |

$u_r$ | vel_r |
velocity.f90 |

## Ordering the Fourier modes

$A = \sum_{|k|<K}\sum_{|m|<M}\, A_{km} {\mathrm e}^{{\mathrm i}(\alpha kz + m_0 m\theta)},$

Coefficients $A_{km}$ are stored in an order according to the following loop, where
rather than two indices `k` and `m`, the single index `nh` labels the modes:

The index nh is as in the following table for the case K=6, M=4:

NOTES:

`nh = (2*K-1)*m + k`.- Loops for
`m<0`are skipped. Values of coefficients for`m<0`are inferred from the conjugate symmetric property, $A_{km}=A_{-k,-m}^*$. Similarly for`k<0`when`m==0`. - This loop is a predefined macro in
`parallel.h`. The macro`_loop_km_begin`replaces the double-loop above (or its equivalent for parallel computation).

## Data types

The data types below consist of logical data groups to facilitate the passing of data between functions. For clarity, in the following text they are defined as though we were working on a single processor. See Parallel regarding subtle differences in the definitions due to parallelisation.

The most important data types are `type (coll) ` and ` type (phys)`. The transform between the two types is achieved with calls to `tra_coll2phys(...)`, `tra_phys2coll(...)`.

### coll, spec — storage of coefficients

The collocated data `type (coll)` is the principle type used for mode-independent operations.

type coll double precision :: Re(i_N, 0:i_H1) double precision :: Im(i_N, 0:i_H1) end type coll

The element `Re(n,nh)` is the real part of the `nh`$^{\mathrm{th}}$ harmonic evaluated at $r_n$.

The spectral `type (spec)` is a data type for operations independent at each radial point. It is rarely used other than as a transitory data format between the `coll` and `phys` types.

type spec double precision :: Re(0:i_H1, i_N) double precision :: Im(0:i_H1, i_N) end type spec

### phys — real space data

The `type (phys)` is used for data in real space.

type phys double precision :: Re(0:i_Z-1, 0:i_Th-1, i_N) end type phys

The element `Re(k,m,n)` refers to the value at $z_k$, $\theta_m$ and $r_n$.

### rdom — the radial domain

The `type (rdom)` contains information about the radial domain. The public variable `mes_D` is declared in `meshs.f90`. Data includes the number of radial points `mes_D%N`, powers of $r$, $r_n^p$=`mes_D%r(n,p)`, weights for integration
$\int_0^1 f(r)\,r\,{\mathrm d}r$=`dot_product(f(1:i_N),mes_D%intrdr)`, matrices for taking the $p^{\mathrm{th}}$ derivative and more:

type rdom integer :: N double precision :: r(i_N,i_rpowmin:i_rpowmax) double precision :: intrdr(i_N) type (mesh) :: dr(i_KL) ... end type rdom

### mesh, lumesh — storage of banded matrices

The finite-difference stencil has finite width (Finite_Differences) and so matrix operations involve matrices that are banded. The `type (mesh)`, defined in `meshs.f90`, is used for matrices on the radial mesh involving $kl$ points to either side of each node.

type mesh double precision :: M(2*i_KL+1, i_N) end type mesh

For a matrix `A`, the element `M(KL+1+n-j, j) = A(n,j)`. See the man page for the lapack routine `dgbtrf`. The LU-factorisation of a banded matrix is also banded:

type lumesh integer :: ipiv(i_N) double precision :: M(3*i_KL+1, i_N) end type lumesh

The element `M(2*KL+1+n-j, j) = A(n,j)`. Pivoting information is stored along with the matrix.

## Modifications for the parallel implementation

### Practical things to remember

The code has been parallelised using the Message Passing Interface (MPI) in a way designed to be as unintrusive as possible. Sections of the code with special MPI function calls are separated from the serial sections by preprocessor directives. The number of processors is given by the parameter `_Np=_Nr*_Ns` set in `parallel.h`. The rank of a process is given by `mpi_rnk` declared in `mpi.f90`, and `mpi_sze`$=$`_Np`.

For modest parallelisations, it is recommended to vary `_Nr` and to keep `_Ns=1` (radial split only). This ensures a couple of conditions are met that could easily be forgotten when setting up a run, namely that `_Ns` must divide both `i_Z` and `i_M`.

### How the data is distributed

The linear parts of the code (evaluating curls, gradients and matrix inversions for the timestepping) involve operations that do not couple Fourier modes. The `type (coll)` holds all radial points for a subset of Fourier modes - the Fourier modes are distributed over the cores.

Products are evaluated in physical space. Here the data is split over `_Nr` sections in $r$ and over `_Ns` sections in $z$. The `type (phys)` holds all azimuthal points for a
given subsection in $r$ and $z$. The following is an example of rank id numbers (`mpi_rnk`) for the case `_Nr=3`, `_Ns=4` ($r$ downwards, $z$ left to right):

0 3 6 9 1 4 7 10 2 5 8 11

In physical space,

- Two cores have data for the same $z$-section if
`rank1/_Nr==rank2/_Nr`(integer arithmetic). - Two cores have data for the same $r$-section if
`modulo(rank1,_Nr)==modulo(rank2,_Nr)`.

### How the code is adapted

In the definition of `type (coll)` the parameter `i_H1` is replaced by `i_pH1`.

In the definition of `type (phys)` the parameters `i_N,i_Z` are replaced by `i_pN,i_pZ`.

`i_pH1` is the maximum number of modes on an individual core. The variable `var_H` has elements to determine which harmonics are on which processor. The elements `pH0` and `pH1` are the zeroth index and the number of harmonics for the local process, minus one. The corresponding values for all processors are stored in the arrays `pH0_(rank)` and `pH1_(rank)`. For a variable of `type (coll)`, valid values for the harmonic index in `Re(n,nh)` are `nh=0..pH1` but refer to the indices `nh=pH0..pH0+pH1` of the serial case, see #Ordering_the_Fourier_modes.

In physical space `i_pN` is the maximum number of radial points for each process. The location of the radial subdomain on the local processor is given by additional elements of the variable `mes_D`. The elements `pNi` and `pN` are the location of the first (inner) radial point and the number of points. The values for all processors are stored in the arrays `pNi_(rank)` and `pN_(rank)`. For a variable of `type (spec)`, valid values for the radial index in `Re(nh,n)` are `n`=`1..pN` which refer to the indices of $r_n$, where $n$=`pNi..pNi+pN-1`.

Also in physical space `i_pZ` is exactly the number axial points held by the core. Logically, the local index `k` in `[0,i_pZ-1]`corresponds to index `(mpi_rnk/_Ns)*i_pZ + k` (integer arithmetic) in `[0,i_Z-1]`.

The bulk of communication occurs during the data transposes in the functions `var_coll2spec()` and `var_spec2coll()`. For the case `_Np=_Nr, _Ns=1` the data is transferred in `_Np-1` steps. At each step each processor sends and receives a block of data, the source and destination are selected systematically as set in the following loop:

do step = 1, mpi_sze-1 dest = modulo(mpi_rnk+step, mpi_sze) src = modulo(mpi_rnk-step+mpi_sze, mpi_sze) ...

### How the parallelisation works

In the Fourier space, all radial points for a particular Fourier mode are located on the same processor, separate modes may be located on separate processes. For the case `K=6, M=4, _Nr=3, _Ns=2,` the modes are split over `_Np=_Nr*_Ns=6` cores (`mpi_rnk=0..5`) indicated by the 6 coloured sections (each contiguous in `nh`) in the following table:

Here, the m=0,1,2,3 are split into for `_Ns=2` groups, m=0,1 and m=2,3, then each group is split into `_Nr=3` sections indicated by the colours.

For `_Ns=2`, split the 6 cores into two sets of three, denoted
$\{c_0,c_1,c_2\}$ and $\{c_3,c_4,c_5\}$.
Prior to an FFT, the core $c_0$ gathers all Fourier coefficients from within the first set for a given
subset of radial points -- doing this for each core constitutes a transpose with in the set.
Having gathered these coefficients, sufficient data is present to perform the axial FFTs
(summing over the k for a given m).
Now three sets of cores may be identified, arranged to have data for common
radial points in each set, $\{c_0,c_3\}$, $\{c_1,c_4\}$, $\{c_2,c_5\}$.
Within each set m is distributed, with
m=0,1 is on one core m=2,3 on the other. Gathering the m for a
given subsection of axial points, constitutes a second transpose within each of these
sets. Finally the azimuthal FFTs may be performed (sum over m).
The resulting data is split by both radial and axial position, whilst data for all azimuthal points are present on a given core.

In the double parallelisation two transposes are required, and the total data sent is
doubled. However, the number of messages required is
`(_Ns-1)+(_Nr-1)` where `_Np=_Nr*_Ns`.
Then, for a given $N_p=$`_Np`, there are only approximately $\sqrt{N_p}$ messages when `_Nr` and `_Ns` are approximately equal, versus $N_p$ messages when `_Np=_Nr` (and `_Ns=1`).
This can substantially reduce time lost in latency, the time setting up communications.
There are fewer large messages, rather than many small ones.