\journalcode

J

\cauthor

[a]Jérô[email protected] Orlans Coquelle Debionne Basu Homs Santoni De Sanctis

\aff

[a]European Synchrotron Radiation Facility;71, avenue des Martyrs;CS 40220;38043 Grenoble Cedex 9 \countryFrance \aff[b]EMBL Grenoble; 71 avenue des Martyrs; CS 90181; 38042 Grenoble Cedex 9; \countryFrance

Application of signal separation to diffraction image compression and serial crystallography

Julien    Nicolas    Samuel    Shibom    Alejandro    Gianluca    Daniele
Abstract

We present here a real-time analysis of diffraction images acquired at high frame-rate (925 Hz) and its application to macromolecular serial crystallography. The software uses a new signal separation algorithm, able to distinguish the amorphous (or powder diffraction) component from the diffraction signal originating from single crystals. It relies on the ability to work efficiently in azimuthal space and derives from the work performed on pyFAI, the fast azimuthal integration library. Two applications are built upon this separation algorithm: a lossy compression algorithm and a peak-picking algorithm; the performances of both is assessed by comparing data quality after reduction with XDS and CrystFEL.

keywords:
background extraction
keywords:
serial crystallography
keywords:
peak-picking
keywords:
lossy compression
{synopsis}

Precise background assessment and application to single crystal image compression and serial crystallography data pre-processing.

1 Introduction

X-ray macromolecular crystallography is one of the most successful methods to determine the atomic structure of biological molecules. The achievable diffraction quality may often be limited by radiation damage. Although cryogenic conditions permit to extend the lifetime of crystals in the X-ray beam and to increase the maximum absorbed dose before inducing damage, they may hinder physiologically relevant conformations. This limitation has renovated the interest for room temperature macromolecular crystallography by applying a more drastic approach to overcome the radiation damage problem by collecting data from thousands of small crystals, in what became known as serial crystallography. First developed at X-ray free electron laser sources (xfel) [Chapman2011, structure_sx] the method is currently applied also at synchrotron sources [ssx, ssx_desy, ssx_id13, ssx_diamond].

1.1 Serial crystallography using synchrotron sources (SSX)

Serial crystallography consists in exposing thousands of small crystals to the X-ray beam only once in a serial way. Diffraction is collected with a very high flux density, in order to extract the most information from a single shot. This is in contrast with traditional rotational crystallography, where a complete dataset is collected from a single crystal rotated around one (or several) axis. Those SSX images represent a slice through the reciprocal space, thus intersect a lower fraction of the reciprocal space as the crystal is still, in comparison with rotational crystallography. To achieve a complete dataset, thousands of frames have to be collected, individually indexed and then merged. The high flux needed to collect all diffraction signal from a single crystal within a single exposure makes the SSX technique a good candidate to benefit from the 4th generation synchrotron sources, such as the new ESRF-EBS update [EBS]. However, macromolecular crystallography beam-lines (MX) have been extremely specialized towards rotational data-collection and thus require modifications to the experimental setup to perfom SSX experiments. The synchrotron serial crystallography beamline (ID29 at the ESRF) has been built to have a dedicated environment to perform SSX experiments with a high flux (using a larger energy bandwidth with a multilayer monochromator), a high-speed chopper (to separate X-ray pulses), several sample delivery methods and a fast detector.

1.2 Jungfrau 4M detector

Macro-molecular crystallography has enormously progressed in the last decades with to the introduction of photon-counting detectors [pilatus]. With their absence of read-out noise and their fast speed, the main limitation of photon-counting detectors is the achievable count-rate, i.e. how fast the electronic of a pixel is able to count arriving photons. Unlike photon-counting detectors such as the Eiger detector [Eiger], the Jungfrau detector [jungfrau2016] is an integrating detector, thus, it is not limited by the count rate, even under the very intense flux expected when recording Bragg-peaks. To cope with this photons density, every single pixel implements an automatic gain switching mechanism (3 levels) which offers a precision of the order of one third of a keV on the higher gain mode, a precision of the order of one photon in the intermediate level and the ability to cope with thousands of photons in the lower gain mode. Moreover, the Jungfrau is able to operate at two kilohertz, which means the image must be read every half millisecond. Since the Jungfrau detector is an integrating detector, dark current and flat-field correction have to be applied: every pixel has three, so called, pedestals and three gain values (one for each gain level). This large number of parameters per pixel makes the pre-processing of raw signal challenging at full speed: the signal from a single pixel, initially packed with 16 bits per pixel, gets expanded to 32 bits (floating point or integer value), doubling the required bandwidth and the storage size [jungfrau_PSI]. The ID29 beamline features a Jungfrau 4M detector, operating at 1kHz, pace imposed by the chopper and synchronized with the photon bunches from the ESRF. At nominal speed, the detector will produce 16 GBytes of pre-processed data per second, making the data analysis and storage extremely challenging.

1.3 Requirements for online data processing

Serial crystallography is one of the techniques where online data processing is likely to have most impact: millions of images are to be collected, and existing detector already saturate the fastest storage systems, not even considering the cost of this storage. Beside this, only a few percentage of the frames are expected to contain diffraction signal and out of them, a fraction will be indexed, integrated and thus useful to solve the protein structure. Efficient processing of raw images is thus essential for SSX, with 4 levels of increasing complexity:

  • image reconstruction with pedestal correction

  • veto-algorithm: be able to sieve-out images with poor signal

  • save only pixels with diffraction signal

  • precise location of peak position with indexation [toro]

  • real-time integration of diffraction peaks

.

Reconstruction and pedestal correction have been already described in \citeasnounlima2. This document will focus on the subsequent steps: the detection of signal is addressed in section 2, sparse data compression in section 3, and peak-finding in section 4 before drawing some conclusions in section 5.

2 Algorithm for the separation of the amorphous background from the Bragg peaks

2.1 Background scattering

The simplest implementation of Bragg peaks separation is to assume that the background signal originates from scattering of amorphous material giving an isotropic signal that ideally presents only smooth variations. Before background subtraction, the raw signal has to be corrected for dark noise and for any systematic anisotropic effects such as polarization corrections. Unlike X-FEL, synchrotron X-ray beam is better characterized in energy and shows little to none pulse to pulse variability. All anisotropic correction can be easily modelled and taken into account. The same method can be extended to separate Bragg peaks from powder diffraction if the powder signal is isotopic, i.e. without preferred orientation.

The initial implementation of signal separation in pyFAI [pdj2013] was relying on a 2D polar transform followed by a median filter in the azimuthal dimension to calculate the amorphous scattering curve. Although this method has been successfully used for large dataset analysis [brocades], it presents four major drawbacks:

  • The 2D averaging mixes the signal originating from several pixels and blurs the signal.

  • Pixel-splitting is needed to leverage the Moiré effect in the 2D averaging, but this further increases the blurring [moire].

  • The 1D curve obtained after the application of the median filter shows sharp jumps from one azimuthal bin to its neighbour.

  • The median filter is computationally heavy since it is required to sort out every azimuthal bin.

We improved on this by developing a new, efficient way of performing the azimuthal averaging (including the associated uncertainty propagation).

2.2 Efficient azimuthal averaging and uncertainties evaluation

2.2.1 Pre-processing:

The first step of the analysis consists in applying a pixel-wise correction for dark current and several normalization corrections [pyfai_2020]:

Icor=signalnorm=IrawIdarkFΩPAI0subscript𝐼𝑐𝑜𝑟𝑠𝑖𝑔𝑛𝑎𝑙𝑛𝑜𝑟𝑚subscript𝐼𝑟𝑎𝑤subscript𝐼𝑑𝑎𝑟𝑘𝐹Ω𝑃𝐴subscript𝐼0I_{cor}=\frac{signal}{norm}=\frac{I_{raw}-I_{dark}}{F\cdot\Omega\cdot P\cdot A% \cdot I_{0}}italic_I start_POSTSUBSCRIPT italic_c italic_o italic_r end_POSTSUBSCRIPT = divide start_ARG italic_s italic_i italic_g italic_n italic_a italic_l end_ARG start_ARG italic_n italic_o italic_r italic_m end_ARG = divide start_ARG italic_I start_POSTSUBSCRIPT italic_r italic_a italic_w end_POSTSUBSCRIPT - italic_I start_POSTSUBSCRIPT italic_d italic_a italic_r italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_F ⋅ roman_Ω ⋅ italic_P ⋅ italic_A ⋅ italic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG (1)

In equation 1, the numerator (referred as signal hereafter) is the subtraction of the dark current Idarksubscript𝐼𝑑𝑎𝑟𝑘I_{dark}italic_I start_POSTSUBSCRIPT italic_d italic_a italic_r italic_k end_POSTSUBSCRIPT from the the detector’s raw signal Irawsubscript𝐼𝑟𝑎𝑤I_{raw}italic_I start_POSTSUBSCRIPT italic_r italic_a italic_w end_POSTSUBSCRIPT. The denominator (hereafter norm) is a normalization factor composed of the product of F𝐹Fitalic_F: a factor accounting for the flat-field correction, ΩΩ\Omegaroman_Ω: the solid angle subtended by a given pixel, P𝑃Pitalic_P: the polarisation correction term and A𝐴Aitalic_A: the detector’s apparent efficiency due to the incidence angle of the photon on the detector plane. For integrating detectors, photons with high energy see longer sensor path with larger incidence angles compared to the normal thickness, thus they have a higher detection probability. Intensity is also normalized by the incoming flux I0subscript𝐼0I_{0}italic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, but being it independent from the pixel position, this correction can be applied when convenient.

2.2.2 Azimuthal averaging:

Historically, the azimuthal averaging was implemented using histograms [fit2d1996]. Since the geometry of the experimental setup is fixed during the acquisition, a look-up table listing all pixels contributing to each azimuthal-bin can be built and used to speed-up calculations [pyFAI_gpu]. The azimuthal transformation being a linear transformation, it can be implemented as a matrix multiplication, with a sparse-matrix representing the transformation and a dense vector containing the flattened view of the diffraction image. The compressed sparse row (CSR) matrix representation is preferred for its efficiency in performing dot products with dense vectors [SpMV]. The coefficients ci,rsubscript𝑐𝑖𝑟c_{i,r}italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT of the matrix are the fraction of area of a pixel i𝑖iitalic_i falling into the radial bin r𝑟ritalic_r. In the case where pixel splitting is deactivated these coefficients (ci,rsubscript𝑐𝑖𝑟c_{i,r}italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT) are always one (and zero elsewhere) since each pixel contributes to a single bin. The sparse matrix multiplication can be used to sum efficiently values for all pixels belonging to the same bin. The summed signal divided by the summed normalization provides the weight-averaged intensity over all pixels falling in the bin at the distance r𝑟ritalic_r from the center, as formalized in equation 2:

<I>r=ibinrci,rsignaliibinrci,rnormisubscriptexpectation𝐼𝑟subscript𝑖𝑏𝑖subscript𝑛𝑟subscript𝑐𝑖𝑟𝑠𝑖𝑔𝑛𝑎subscript𝑙𝑖subscript𝑖𝑏𝑖subscript𝑛𝑟subscript𝑐𝑖𝑟𝑛𝑜𝑟subscript𝑚𝑖<I>_{r}=\frac{\sum\limits_{i\in bin_{r}}c_{i,r}\cdot signal_{i}}{\sum\limits_{% i\in bin_{r}}c_{i,r}\cdot norm_{i}}< italic_I > start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT ⋅ italic_s italic_i italic_g italic_n italic_a italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG (2)

2.2.3 Uncertainty evaluation from Poisson distribution.

Photon counting detectors, such as Eiger detectors, suffer from hardly any error beside the counting uncertainty which is often referred to as Poisson statistics. This statistical law is described by a single parameter λ𝜆\lambdaitalic_λ which is related to the mean μ𝜇\muitalic_μ and standard deviation σ𝜎\sigmaitalic_σ from a normal distribution by: λ=μ=σ2𝜆𝜇superscript𝜎2\lambda=\mu=\sigma^{2}italic_λ = italic_μ = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Other sources of noise, like the dark current noise in the case of an integrating detector, superimpose quadratically on the Poisson noise for integrating detectors, as presented in equation 3:

varI=(σI)2=<Iraw>+(σdark)2𝑣𝑎subscript𝑟𝐼superscriptsubscript𝜎𝐼2expectationsubscript𝐼𝑟𝑎𝑤superscriptsubscript𝜎𝑑𝑎𝑟𝑘2var_{I}=(\sigma_{I})^{2}=<I_{raw}>+(\sigma_{dark})^{2}italic_v italic_a italic_r start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT = ( italic_σ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = < italic_I start_POSTSUBSCRIPT italic_r italic_a italic_w end_POSTSUBSCRIPT > + ( italic_σ start_POSTSUBSCRIPT italic_d italic_a italic_r italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (3)

During the azimuthal integration, the coefficient of the sparse matrix needs to be squared at the numerator when propagating the variance (equation 4) to have uncertainties σ𝜎\sigmaitalic_σ proportional to the fraction of the pixel considered.

(σr(I))2=ibinrci,r2σi2ibinrci,rnormisuperscriptsubscript𝜎𝑟𝐼2subscript𝑖𝑏𝑖subscript𝑛𝑟superscriptsubscript𝑐𝑖𝑟2superscriptsubscript𝜎𝑖2subscript𝑖𝑏𝑖subscript𝑛𝑟subscript𝑐𝑖𝑟𝑛𝑜𝑟subscript𝑚𝑖(\sigma_{r}(I))^{2}=\frac{\sum\limits_{i\in bin_{r}}c_{i,r}^{2}\cdot\sigma_{i}% ^{2}}{\sum\limits_{i\in bin_{r}}c_{i,r}\cdot norm_{i}}( italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( italic_I ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG (4)

One should distinguish the uncertainty of the mean (sometimes referred to as the standard error of the mean, sem𝑠𝑒𝑚semitalic_s italic_e italic_m), which describes the precision with which the mean is known (and described in \citeasnounpyfai_2020), from the uncertainty of the pixel value (often referred to as standard deviation, std𝑠𝑡𝑑stditalic_s italic_t italic_d) which describes the uncertainty with which the pixel value is known. Those two value differ only by the square root of the number of measurements in the case of an arithmetic mean: sem=std/N𝑠𝑒𝑚𝑠𝑡𝑑𝑁sem=std/\sqrt{N}italic_s italic_e italic_m = italic_s italic_t italic_d / square-root start_ARG italic_N end_ARG with N𝑁Nitalic_N being the number of pixel contributing to the bin. When considering the weighted average, the previous formula becomes:

semr=stdribinrci,r2normi2ibinrci,rnormi𝑠𝑒subscript𝑚𝑟𝑠𝑡subscript𝑑𝑟subscript𝑖𝑏𝑖subscript𝑛𝑟superscriptsubscript𝑐𝑖𝑟2𝑛𝑜𝑟superscriptsubscript𝑚𝑖2subscript𝑖𝑏𝑖subscript𝑛𝑟subscript𝑐𝑖𝑟𝑛𝑜𝑟subscript𝑚𝑖sem_{r}=std_{r}\frac{\sqrt{\sum\limits_{i\in bin_{r}}c_{i,r}^{2}\cdot norm_{i}% ^{2}}}{\sum\limits_{i\in bin_{r}}c_{i,r}\cdot norm_{i}}italic_s italic_e italic_m start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = italic_s italic_t italic_d start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT divide start_ARG square-root start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG (5)

Thus, the more data points are collected, the more precisely the mean value is known but the uncertainty for a given point remains the same. Since this document focuses on the uncertainties of pixel values, the standard deviation will systematically be used from here on.

2.2.4 Uncertainty evaluation from the variance in a bin.

Unlike photon counting detectors, most detectors do not follow the Poisson distribution and therefore the definition of a relation σ2=f(I)superscript𝜎2𝑓𝐼\sigma^{2}=f(I)italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_f ( italic_I ) is not simple, if possible at all. The integrating Jungfrau detector has a complex gain switching mechanism [jungfrau_PSI] which makes this equation complicated. A generic approach is proposed to measure the variance in every single azimuthal bin:

When considering the diffraction of an isotropic compound (liquid, amorphous or perfect powder), all pixels contributing to the same radial bin should see the same flux of photons (after correction of anisotropy like polarization) and the deviation to their intensities can be used to estimate the uncertainty. This approach is of course limited when considering the signal coming from few large crystalites (where rings becomes spotty) but it provides an upper bound for the uncertainty. Variances (thus standard deviations) are usually obtained in a two steps procedure: one pass to calculate the average value (equation 2) and a second to calculate the deviation to the average (equation 6). This double pass approach can be implemented using sparse matrix multiplication. It requires twice the access to each pixel value, and extra storage space, but it is numerically robust (i.e. not prone to numerical-error accumulation).

(σr(I))2=ibinrci,r2normi2(signalinormi<I>r)2ibinrci,r2normi2superscriptsubscript𝜎𝑟𝐼2subscript𝑖𝑏𝑖subscript𝑛𝑟superscriptsubscript𝑐𝑖𝑟2𝑛𝑜𝑟superscriptsubscript𝑚𝑖2superscript𝑠𝑖𝑔𝑛𝑎subscript𝑙𝑖𝑛𝑜𝑟subscript𝑚𝑖subscriptexpectation𝐼𝑟2subscript𝑖𝑏𝑖subscript𝑛𝑟superscriptsubscript𝑐𝑖𝑟2𝑛𝑜𝑟superscriptsubscript𝑚𝑖2(\sigma_{r}(I))^{2}=\frac{\sum\limits_{i\in bin_{r}}c_{i,r}^{2}\cdot norm_{i}^% {2}\cdot(\frac{signal_{i}}{norm_{i}}-<I>_{r})^{2}}{\sum\limits_{i\in bin_{r}}c% _{i,r}^{2}\cdot norm_{i}^{2}}( italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( italic_I ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ ( divide start_ARG italic_s italic_i italic_g italic_n italic_a italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG - < italic_I > start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (6)

and:

(σr(<I>))2=ibinrci,r2normi2(signalinormi<I>r)2(ibinrci,rnormi)2superscriptsubscript𝜎𝑟expectation𝐼2subscript𝑖𝑏𝑖subscript𝑛𝑟superscriptsubscript𝑐𝑖𝑟2𝑛𝑜𝑟superscriptsubscript𝑚𝑖2superscript𝑠𝑖𝑔𝑛𝑎subscript𝑙𝑖𝑛𝑜𝑟subscript𝑚𝑖subscriptexpectation𝐼𝑟2superscriptsubscript𝑖𝑏𝑖subscript𝑛𝑟subscript𝑐𝑖𝑟𝑛𝑜𝑟subscript𝑚𝑖2(\sigma_{r}(<I>))^{2}=\frac{\sum\limits_{i\in bin_{r}}c_{i,r}^{2}\cdot norm_{i% }^{2}\cdot(\frac{signal_{i}}{norm_{i}}-<I>_{r})^{2}}{(\sum\limits_{i\in bin_{r% }}c_{i,r}\cdot norm_{i})^{2}}( italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( < italic_I > ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ ( divide start_ARG italic_s italic_i italic_g italic_n italic_a italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG - < italic_I > start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( ∑ start_POSTSUBSCRIPT italic_i ∈ italic_b italic_i italic_n start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (7)

Single pass implementations of variance calculation are faster than double pass ones since they access pixels only once and offer, in addition, the ability to perform parallel reductions [Blelloch], i.e. work with blocks of pixels. \citeasnounvariance2018 present a complete review on the topic, which introduces a formalism adapted here for crystallography. Assume the weight for a pixel being ωi=cinormisubscript𝜔𝑖subscript𝑐𝑖𝑛𝑜𝑟subscript𝑚𝑖\omega_{i}=c_{i}\cdot norm_{i}italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. If P𝑃Pitalic_P is a partition of the ensemble of pixels falling into a given azimuthal bin, let ΩPsubscriptΩ𝑃\Omega_{P}roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT, VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT and VVP𝑉subscript𝑉𝑃VV_{P}italic_V italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT be the sum of weights (Eq 8), the weighted sum of V𝑉Vitalic_V (Eq. 10) and the weighted sum of deviation squared (Eq 11) over the partition P𝑃Pitalic_P:

ΩP=iPωi=iPcinormisubscriptΩ𝑃subscript𝑖𝑃subscript𝜔𝑖subscript𝑖𝑃subscript𝑐𝑖𝑛𝑜𝑟subscript𝑚𝑖\Omega_{P}=\sum\limits_{i\in P}\omega_{i}=\sum\limits_{i\in P}c_{i}\cdot norm_% {i}roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (8)
ΩΩP=iPωi2=iPci2normi2ΩsubscriptΩ𝑃subscript𝑖𝑃superscriptsubscript𝜔𝑖2subscript𝑖𝑃superscriptsubscript𝑐𝑖2𝑛𝑜𝑟superscriptsubscript𝑚𝑖2\Omega\Omega_{P}=\sum\limits_{i\in P}\omega_{i}^{2}=\sum\limits_{i\in P}c_{i}^% {2}\cdot norm_{i}^{2}roman_Ω roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (9)
VP=iPωivi=iPcisignalisubscript𝑉𝑃subscript𝑖𝑃subscript𝜔𝑖subscript𝑣𝑖subscript𝑖𝑃subscript𝑐𝑖𝑠𝑖𝑔𝑛𝑎subscript𝑙𝑖V_{P}=\sum\limits_{i\in P}\omega_{i}\cdot v_{i}=\sum\limits_{i\in P}c_{i}\cdot signal% _{i}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_s italic_i italic_g italic_n italic_a italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (10)
VVp=iPωi2(viVP/ΩP)2𝑉subscript𝑉𝑝subscript𝑖𝑃superscriptsubscript𝜔𝑖2superscriptsubscript𝑣𝑖subscript𝑉𝑃subscriptΩ𝑃2VV_{p}=\sum\limits_{i\in P}\omega_{i}^{2}\cdot(v_{i}-V_{P}/\Omega_{P})^{2}italic_V italic_V start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT / roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (11)

The weighted average and associated variances are then expressed as:

<I>P=VPΩP=iPcisignaliiPcinormisubscriptexpectation𝐼𝑃subscript𝑉𝑃subscriptΩ𝑃subscript𝑖𝑃subscript𝑐𝑖𝑠𝑖𝑔𝑛𝑎subscript𝑙𝑖subscript𝑖𝑃subscript𝑐𝑖𝑛𝑜𝑟subscript𝑚𝑖<I>_{P}=\frac{V_{P}}{\Omega_{P}}=\frac{\sum\limits_{i\in P}c_{i}\cdot signal_{% i}}{\sum\limits_{i\in P}c_{i}\cdot norm_{i}}< italic_I > start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = divide start_ARG italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_ARG = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_s italic_i italic_g italic_n italic_a italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_P end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_n italic_o italic_r italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG (12)
std2=(σP(I))2=VVPΩΩP𝑠𝑡superscript𝑑2superscriptsubscript𝜎𝑃𝐼2𝑉subscript𝑉𝑃ΩsubscriptΩ𝑃std^{2}=(\sigma_{P}(I))^{2}=\frac{VV_{P}}{\Omega\Omega_{P}}italic_s italic_t italic_d start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ( italic_σ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ( italic_I ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG italic_V italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_ARG (13)
sem2=(σP(<I>))2=VVPΩP2𝑠𝑒superscript𝑚2superscriptsubscript𝜎𝑃expectation𝐼2𝑉subscript𝑉𝑃superscriptsubscriptΩ𝑃2sem^{2}=(\sigma_{P}(<I>))^{2}=\frac{VV_{P}}{\Omega_{P}^{2}}italic_s italic_e italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ( italic_σ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ( < italic_I > ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG italic_V italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (14)

In this formalism, equations 2 and 12 on one side and equations 6 and 13 are actually equivalent. \citeasnounvariance2018 present the way to perform the union of two sub-partitions A𝐴Aitalic_A and B𝐵Bitalic_B of a larger ensemble which opens the doors to parallel reductions, which are especially efficient when implemented on GPU:

ΩAB=ΩA+ΩBsubscriptΩ𝐴𝐵subscriptΩ𝐴subscriptΩ𝐵\Omega_{A\cup B}=\Omega_{A}+\Omega_{B}roman_Ω start_POSTSUBSCRIPT italic_A ∪ italic_B end_POSTSUBSCRIPT = roman_Ω start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT (15)
VAB=VA+VBsubscript𝑉𝐴𝐵subscript𝑉𝐴subscript𝑉𝐵V_{A\cup B}=V_{A}+V_{B}italic_V start_POSTSUBSCRIPT italic_A ∪ italic_B end_POSTSUBSCRIPT = italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + italic_V start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT (16)
VVAb=VVA+ωb2(vbVAΩA)(vbVAbΩAb)𝑉subscript𝑉𝐴𝑏𝑉subscript𝑉𝐴superscriptsubscript𝜔𝑏2subscript𝑣𝑏subscript𝑉𝐴subscriptΩ𝐴subscript𝑣𝑏subscript𝑉𝐴𝑏subscriptΩ𝐴𝑏VV_{A\cup b}=VV_{A}+\omega_{b}^{2}(v_{b}-\frac{V_{A}}{\Omega_{A}})(v_{b}-\frac% {V_{A\cup b}}{\Omega_{A\cup b}})italic_V italic_V start_POSTSUBSCRIPT italic_A ∪ italic_b end_POSTSUBSCRIPT = italic_V italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + italic_ω start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT - divide start_ARG italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_ARG ) ( italic_v start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT - divide start_ARG italic_V start_POSTSUBSCRIPT italic_A ∪ italic_b end_POSTSUBSCRIPT end_ARG start_ARG roman_Ω start_POSTSUBSCRIPT italic_A ∪ italic_b end_POSTSUBSCRIPT end_ARG ) (17)
VVABVVA+VVB+ΩΩB(VAΩBVBΩA)2(ΩAB.ΩA.ΩB2)𝑉subscript𝑉𝐴𝐵𝑉subscript𝑉𝐴𝑉subscript𝑉𝐵ΩsubscriptΩ𝐵superscriptsubscript𝑉𝐴subscriptΩ𝐵subscript𝑉𝐵subscriptΩ𝐴2formulae-sequencesubscriptΩ𝐴𝐵subscriptΩ𝐴superscriptsubscriptΩ𝐵2VV_{A\cup B}\approx VV_{A}+VV_{B}+\frac{\Omega\Omega_{B}(V_{A}\cdot\Omega_{B}-% V_{B}\cdot\Omega_{A})^{2}}{(\Omega_{A\cup B}.\Omega_{A}.\Omega_{B}^{2})}italic_V italic_V start_POSTSUBSCRIPT italic_A ∪ italic_B end_POSTSUBSCRIPT ≈ italic_V italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT + italic_V italic_V start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT + divide start_ARG roman_Ω roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ( italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ⋅ roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT - italic_V start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ⋅ roman_Ω start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( roman_Ω start_POSTSUBSCRIPT italic_A ∪ italic_B end_POSTSUBSCRIPT . roman_Ω start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT . roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_ARG (18)

While equations 15 and 16 are trivial, equation 17 describes the nominator of the variance of an ensemble when adding an extra member b𝑏bitalic_b to the A𝐴Aitalic_A. Unfortunately, the slight difference of formalism between \citeasnounvariance2018 and this work prevent some simplification to occur and leads to the approximate numerator of the variance (VV𝑉𝑉VVitalic_V italic_V) in the case of the union of two ensemble A𝐴Aitalic_A and B𝐵Bitalic_B (eq. 18), used in OpenCL reduction 111see https://github.com/silx-kit/pyFAI/blob/main/doc/source/usage/tutorial/Variance/uncertainties.ipynb. A numerical stability issue can arise from it when VAsubscript𝑉𝐴V_{A}italic_V start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT or VBsubscript𝑉𝐵V_{B}italic_V start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT are very small and this issue is addressed by using double-precision arithmetic when implemented on CPU and double-word arithmetic when running on GPU [double_word].

2.2.5 Comparison of uncertainty models:

Figure 1 (right hand side) presents the uncertainties (for the pixel value) as calculated from a background frame with pure Poisson noise (displayed on the left hand-side, synthetic data) using the two algorithms previously described: the Poisson model or calculated from the variance in the azimuthal bin. While the two curves show similar amplitude, except in the corner of the detector where very few pixels contribute to each of the azimuthal bin, the variability of the “azimuthal” model is much more important from one bin to the neighboring one.

Refer to caption
Figure 1: (a) Simulated diffraction frame with pure azimuthal Poisson noise of a Pilatus 6M detector and (b) uncertainties for pixel intensity as measured with the distance to the mean (azimuthal-model, blue) or from the Poisson model (orange).

2.3 Histogram intensity

The figure 2 presents the diffraction from a single crystal of insulin collected with a Pilatus 6M detector (Fig 2a) and several curves obtained from azimuthal integration of those data: figure 2b is the azimuthally integrated signal (blue curve) where Bragg peaks are seen as spikes on top of a smooth background. Figure 2c presents the uncertainties measured according to the Poisson distribution (orange curve) or the deviation in the ring (blue curve). The later presents much larger values since Bragg peaks contribute a lot to the deviation despite they represent few pixels: this highlights the sensitivity of the mean/std to outliers. In the figure 2d are presented histograms of pixel intensity for pixels laying at 87mm and 160mm from the beam center. Each of those histograms is composed of a bell-shaped distribution with few positive outliers which are usually Bragg peaks. Those histograms in figure 2d have been fitted with a Gaussian curve and both the center (μ𝜇\muitalic_μ) and the width (σ𝜎\sigmaitalic_σ) of the curve match roughly with the average (in 2b) and uncertainties (in 2c).

Refer to caption
Figure 2: Single crystal diffraction frame obtained from insulin with a Pilatus 6M (a) with the azimuthally averaged signal (b), before and after clipping data. Uncertainties are presented in (c) when calculated assuming a Poissonian error model (orange, red) or when measuring the deviation within all pixels in a ring (green, blue). (d) Histogram of intensities for two rings at r=87mm and r=160mm from beam center with the distribution fitted as Gaussian curves: g(x)exp((xμ)22σ2)proportional-to𝑔𝑥𝑒𝑥𝑝superscript𝑥𝜇22superscript𝜎2g(x)\propto exp(-\frac{(x-\mu)^{2}}{2\sigma^{2}})italic_g ( italic_x ) ∝ italic_e italic_x italic_p ( - divide start_ARG ( italic_x - italic_μ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ).

The core idea of the algorithm for background extraction is to model the distribution of background pixels. Unlike Bayesian statistics [Sivia2006] where the cost function is usually tuned to weight less outlier, here, those outliers are simply flagged and discarded. Positive outliers can reasonably be assigned to Bragg peaks and negative outliers to shadows or defective pixels. The distribution is recalculated after discarding pixels for which intensity differ from the average value by more than n𝑛nitalic_n times the standard deviation (Eq. 19), where n𝑛nitalic_n is called the Signal to Noise Ratio (SNR). This clipping re-centers the distribution of remaining pixels since the mean is sensitive to outlier values:

|I<I>|>nσ(I)𝐼expectation𝐼𝑛𝜎𝐼|I-<I>|>n\cdot\sigma(I)| italic_I - < italic_I > | > italic_n ⋅ italic_σ ( italic_I ) (19)

The orange plot in figure 2b presents the average after having discarded those outliers, and the orange and green curve of figure 2c are the uncertainties calculated after this clipping. After clipping, the average and uncertainties curves have lost most of their spikes, which means that most Bragg peaks and shadowed pixel have been discarded.

Of course this works only if the background signal is isotropic, ideally smoothly varying, and there are many more background pixels than peaks or shadowed pixels. While shadows can be handled with a user defined mask, anisotropy in the background scattering, as it is sometimes observed with certain stretched plastic films in fixed-target mode, is much more challenging and will not be addressable with this analysis.

2.4 Sigma-clipping

The sigma-clipping algorithm consists in applying this outlier rejection several times. If the initial distribution is mono-modal, this algorithm enforces gradually the data to be sampled symmetrically around the maximum probability, which is likely to look like a normal distribution. If the initial distribution is more complicated (typically multi-modal), the necessary larger standard-deviation will prevent most outlier pixels from being rejected, making it more conservative. The sigma-clipping algorithm takes two parameters: the number of iterations and the rejection cut-off (SNR). Despite the execution time is proportional to the number of iteration of sigma-clipping, iterations should continue until no more outliers are found, so that the background data can be treated assuming a normal distribution. Since the loop exits as soon as no more outliers were discarded at the clipping step, having an arbitrary large number of iteration is not really an issue for the execution time and the number of actual iteration is usually few (3 is commonly observed).

2.4.1 Limits of the Poissonian approach:

The evaluation of uncertainties based on the variance within a radial shell (azimuthal model) has been developed after numerical artifacts were discovered while performing sigma-clipping with a Poissonian approach. Some azimuthal bins were showing no pixel contribution at all and thus appeared without any mean nor uncertainties, jeopardizing the complete background extraction algorithm. This artifact was directly linked to the usage of Poisson statistics and can be demonstrated with a simple distribution of 2 pixels, having values 1 and 199. The mean of this distribution is 100 and the standard deviation is also close to 100 while the uncertainty derived from a Poissonian law would be close to 10 (i.e. 100100\sqrt{100}square-root start_ARG 100 end_ARG). With the azimuthal error model, both pixels are a 1σ1𝜎1\sigma1 italic_σ from the mean while with the Poissonian error-model, pixels are at 10σ10𝜎10\sigma10 italic_σ. This explains why bins featuring strong Bragg-peaks on top low background got completely emptied from any contributing pixels when sigma-clipping was performed assuming Poissonian noise.

Unlike the Poisson error-model, the uncertainties obtained from the azimuthal-model are resilient to diffraction data coming from several types of samples but show much more variability from one bin to its neighbor (Fig. 1b). PyFAI introduces an hybrid error-model which uses the azimuthal error-model for the sigma-clipping stage which trims the ensemble of pixels to become mono-modale. The uncertainties are then calculated using a the Poisson error-model on the trimmed ensemble.

2.4.2 Clipping threshold

can be automatically calculated based on a variation on Chauvenet’s criterion [chauvenet] where one would accept to discard only a single pixel in a ring with a signal already following a normal law. Thus, the threshold value is adapted to the size𝑠𝑖𝑧𝑒sizeitalic_s italic_i italic_z italic_e of the distribution, i.e. the number of pixels in each ring (Eq. 20), which can reach several thousands and shrinks with iterations. Typically the numerical value for this cut-off varies from 2 to 4.

SNRchauv.=2log(size2π)𝑆𝑁subscript𝑅𝑐𝑎𝑢𝑣2𝑙𝑜𝑔𝑠𝑖𝑧𝑒2𝜋SNR_{chauv.}=\sqrt{2log(\frac{size}{\sqrt{2\pi}})}italic_S italic_N italic_R start_POSTSUBSCRIPT italic_c italic_h italic_a italic_u italic_v . end_POSTSUBSCRIPT = square-root start_ARG 2 italic_l italic_o italic_g ( divide start_ARG italic_s italic_i italic_z italic_e end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG ) end_ARG (20)

The worse case scenario for sigma-clipping corresponds to an initial distribution very far from a normal distribution, as the bimodal distribution seen in the previous paragraph. Another challenging situation occurs close to the detector corners where the background signal is low and the size of the distribution is decreasing. For example this cut-off parameter increases from 2.7 to 3.5 when the size of the ensemble increases from 100 to 1000 elements. So, for a Poissonian detector and a low count rate of one (λ=μ=σ2=1𝜆𝜇superscript𝜎21\lambda=\mu=\sigma^{2}=1italic_λ = italic_μ = italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1), any pixel with intensity greater than 4 is discarded with the ensemble of 100, greater than 5 for the ensemble of 1000 pixels.

3 Application to lossy image compression for X-ray diffraction

Diffraction images from protein crystals usually exhibit an isotropic background on top of which Bragg peaks appear (discarding any diffuse scattering). The sigma-clipping algorithm can be used to select the background level and more importantly the associated uncertainty. This lossy compression algorithm consists in saving only pixels which intensity is above the average background value (μ𝜇\muitalic_μ) plus n𝑛nitalic_n standard deviation (σ𝜎\sigmaitalic_σ).

The decompression simply restores those intense pixels and builds a smooth background for the missing ones (possibly with noise). The cut-off value n𝑛nitalic_n (also called SNRpick𝑆𝑁subscript𝑅𝑝𝑖𝑐𝑘SNR_{pick}italic_S italic_N italic_R start_POSTSUBSCRIPT italic_p italic_i italic_c italic_k end_POSTSUBSCRIPT) controls the amount of data to store. It is linked to the compression ratio: assuming a normal distribution has been enforced at the sigma-clipping stage: 16% of the pixel are to be recorded with n=1𝑛1n=1italic_n = 1; 2.3% for n=2𝑛2n=2italic_n = 2 and only 0.13% for n=3𝑛3n=3italic_n = 3 as depicted in figure 3.

Refer to caption
Figure 3: Normal distribution and probability of having pixels with intensities above a certain threshold. The cut-off parameter governs how much signal is integrally kept, thus the achievable compression rate on the one hand and the limit of quality of data on the other. It is the user’s responsibility to set wisely this threshold.

The compressed data-format consists in:

  • Pixels intensities and positions for pixels worth saving.

  • Average level of the background and associated uncertainties as function of the distance to the beam-center.

  • The distance to the center of every pixel.

Since diffraction analysis software are performing some kind of noise-level analysis, the background signal has to be regenerated with intensity and noise similar to the original data.

3.1 Sparsification / Compression

The sigma-clipping algorithm was originally written for real-time sparsification of single crystal diffraction data and its integration into the LImA2 detector control system [lima] for the Jungfrau 4M detector used at ESRF ID29 is described in [sri2021]. The real-time constrain imposed to develop code running on GPU since those devices are several times faster than equivalently priced processors. All algorithms were developed in OpenCL [opencl_khronos] and implemented in the pyFAI software package (MIT-licence, available on github). A command line tool called ‘sparsify-Bragg’ is available for testing off-line since version 2023.01.

All the pixel coordinates and intensities are stored in a HDF5 container [hdf5] following the NeXus convention [nexus], together with a snippet of Python code explaining how to rebuild the dataset. All sparse datasets (averaged and uncertainties curves, pixel coordinates, etc..) are compressed with the bitshuffle-LZ4 [bitshuffle] lossless compression.

3.2 Densification / Decompression

Since no crystallographic software package can deal with this sparse format (yet), the densification code was developed to regenerate initial frames and the ‘densify-Bragg‘ program was made available as part of the FabIO [fabio] software package (MIT license). The source code is available on github and ‘densify-Bragg‘ is available since version 2022.12, available via usual channels like pip or conda install. The software constrains for this densification code are very different from the one for sparisfication since this code can be used by the final users. For this reason ‘densify-Bragg’ was optimized to run on multi-core CPU. Maybe an important consideration is whether, regardless from the file format, it is necessary to reconstruct the background or not. In fact, some crystallographic reduction program like CrysAlisPro [crysalis] provide a better result with noise-less background while XDS [xds], which performs a deep noise analysis, needs to have the noisy background properly restored. Shaded regions are never reconstructed properly and should be masked adequately in the reduction software.

Future development will focus on a HDF5-plugins able to provide access to densified images from their sparse representation using user-defined function in HDF5 [hdf5-udf]. This will allow any analysis software (already able to read HDF5 files) to treat sparse data as if they were dense, removing the burden of densifing the images from the users.

3.3 Performances on a single crystal protein dataset

The performances for a lossy compression algorithm are to be evaluated along many directions: compression and decompression speeds, compression ratio and degradation of the recorded signal. In the following example we present the sparsification of a Lysozyme (HEWL) dataset obtained using a traditional oscillation data-collection. Data were collected on an Eiger 4M detector [lysozyme], selected for its similarity in size and performances with the Jungfrau detector. Those data are then densified again to regenerate the data and processed in XDS [xds]. Data quality indicators are finally compared between the original dataset and the one which went through the lossy compression presented here.

3.3.1 Compression ratio

After sparsification (picking cut-off: 2σ2𝜎2\sigma2 italic_σ, error-model: Poisonnian), the dataset still weights 103MB which represents a 15x compression ratio compared to the standard procedure. For conformance with the state of the art, the reference dataset was re-compressed using the bitshuffle-LZ4 algorithm [bitshuffle], for which the 1800 frames weight 1500MB (instead of the 5000GB of the original files compressed in LZ4).

The maximum theoretical compression ratio for 2σ2𝜎2\sigma2 italic_σ is 22x (figure 3, neglecting the storage of the background data and effects of the lossless compression). To evaluate the effective maximal compression ratio, the dataset was median-filtered along the image stack to produce an image without peaks. A background dataset of 1800 such images sparsifies into a 11MB HDF5 file which represents a compression ratio of 136x ! Indeed, only 19 pixels were saved per frame and the compressed numerical values are mostly the same, which compresses great with bitshuffle-LZ4.

For ESRF-ID29, where the Jungfrau 4M can operate close to 1 kHz, the pedestal+gain pre-processing convert 16-bits integers into 32-bits floating point values, doubling the bandwidth for data saving. The detector outputs the data via 8×8\times8 × 10Gbit/s network links and the storage is performed via a single 25 Gbit/s link, making a minimum compression ratio of 6.4x.

In production conditions, ID29 users have to trade between:

  • Detector speed: operate at lower speed (231 Hz) to be able to save the pre-processed data.

  • Energy resolution: floating data compress badly, so they are better stored as integer, where the number of ADU per photon can be tuned. With 1 ADU per photon, one gets the best compression rates but one looses the sub-photon energy resolution of the Jungfrau detector.

  • Use the burst mode: acquire only shorter datasets and make pauses to let the data reach the disk in the central storage.

  • Discard pixel of lower intensity: this is a new possibility offered by this algorithm. A cutoff at 1.5σ1.5𝜎1.5\sigma1.5 italic_σ should already providing the 6.4x compression ratio needed (Fig. 3).

3.3.2 Compression speed:

The compression speed has been measured on the computer designed for online data-reduction of the Jungfrau detector [sri2021]: an IBM AC922 using two Power9 processors and two Nvidia Tesla V100 GPU. The sequential execution of the code on the GPU takes about 4 ms to process one image and uses one single CPU-core and a GPU. In production conditions, two such computers are diving each two GPUs, allowing to use the detector at its nominal speed close to 1kHz. The main bottleneck remains the networked saving of the different files: preprocessed files, sparse files, peak positions, accumulated frames, … not all can be saved at all time when operating at full speed.

3.3.3 Decompression speed:

The decompression of those data should typically be performed on a standard workstation (here running two Intel Xeon Gold 6134 CPU @ 3.20GHz): the reconstruction speed is found to take 30 s for the full dataset, while writing of the densified dataset (with bitshuffle-LZ4 compression) takes 45s. Densification is thus faster than writing on disk. The reading time of the input sparse dataset is negligible (<2s).

3.3.4 Quality of the restored dataset:

The densified dataset was processed via XDS and the summary indicator for the quality of the results are compared with the one coming from the reduction of the original dataset. Since those integrator are measured on integral peaks with I/σ>3𝐼𝜎3I/\sigma>3italic_I / italic_σ > 3 and the sparsification was performed with a cut-off of 2, those result should be almost unaffected, which is confirmed in the table 1.

Table 1: Quality indicators after peak integration and averaging using XDS [xds]. Lysozyme (HEWL) dataset provided by Dectris for advertising their Eiger-4M detector [lysozyme].

. Indicator Initial dataset Lossy compressed dataset (2σ2𝜎2\sigma2 italic_σ) 2.91Å 2.06Å all 2.91Å 2.06Å all Completeness 98.8 90.8 93.8% 99.8 90.6 93.5% Robs=|Ih,iIi|Ih,isubscript𝑅𝑜𝑏𝑠subscript𝐼𝑖subscript𝐼𝑖subscript𝐼𝑖R_{obs}=\frac{\sum|I_{h,i}-I_{i}|}{\sum I_{h,i}}italic_R start_POSTSUBSCRIPT italic_o italic_b italic_s end_POSTSUBSCRIPT = divide start_ARG ∑ | italic_I start_POSTSUBSCRIPT italic_h , italic_i end_POSTSUBSCRIPT - italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG start_ARG ∑ italic_I start_POSTSUBSCRIPT italic_h , italic_i end_POSTSUBSCRIPT end_ARG 9.9 57.3 12.5% 9.2 61.2 11.4% Rexpectedsubscript𝑅𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑R_{expected}italic_R start_POSTSUBSCRIPT italic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d end_POSTSUBSCRIPT 8.8 73.2 15.0% 8.2 68.7 12.1% Rmeassubscript𝑅𝑚𝑒𝑎𝑠R_{meas}italic_R start_POSTSUBSCRIPT italic_m italic_e italic_a italic_s end_POSTSUBSCRIPT [Rmeas] 10.3 61.2 13.2% 9.6 65.5 12.0% CC1/2𝐶subscript𝐶12CC_{1/2}italic_C italic_C start_POSTSUBSCRIPT 1 / 2 end_POSTSUBSCRIPT [cc1/2] 99.7 94.0 99.7 99.6 95.4 99.7 <I/σ>expectation𝐼𝜎<I/\sigma>< italic_I / italic_σ > 25.80 5.39 10.52 26.33 4.09 10.17

Of course those data were collected on a test sample with very intense signal, but it demonstrates the algorithm does not destroy the signal. But with more challenging samples, exhibiting lower I/σ𝐼𝜎I/\sigmaitalic_I / italic_σ, the threshold for picking pixels has to be lower to ensure all pixels relevant for subsequent analysis are actually preserved. Unless, and as described in \citeasnounveto_2024, this sparsification would be detrimental for the quality of the reduced data.

3.4 Influence of the sparsification on the quality of serial-crystallography data acquired with Eiger detector

Tiny crystals of egg white lysozyme (HEWL) + Gadolinium were deposited on a SiN membrane and this membrane was scanned on the ID30A3 beamline at the ESRF (massif3) using a Eiger 4M detector. The dataset consists of 11637 frames of images, out of which 11512 were properly indexed.

3.4.1 Quality indicator degradation:

The Rfree/Rwork quality indicators [Rfree] were calculated for the initial dataset and compared with the same dataset sparsified at 0.8, 1.0, 1.4 and 2.0σ2.0𝜎2.0\sigma2.0 italic_σ and subsequently re-densified (figure 4).

Refer to caption
Figure 4: Degradation of Rwork and Rfree crystallographic quality indicators on the integral dataset of HEWL+Ga (11k frames) and actual compression rates, when increasing the levels of sparsification.

As expected, the crystallographic quality indicator show a gradual degradation with increasing sparsification but Rfree shows little degradation with little sparsification (0.8σ0.8𝜎0.8\sigma0.8 italic_σ, compression ratio of 2×2\times2 ×).

3.4.2 Quality indicators as function of the resolution shell:

The same Rwork and Rfree quality indicators are reported in figure 5 as function of the resolution shell in order to monitor if the degradation is uniform among shells or if it affects mostly the outer shells. Those indicators compare the inital dataset with the sparsified ones at 0.8σ0.8𝜎0.8\sigma0.8 italic_σ and 2.0σ2.0𝜎2.0\sigma2.0 italic_σ which were subsequently re-densified for the analysis.

Refer to caption
Figure 5: Rwork and Rfree crystallographic quality indicators at different resolution shells obtained from the complete dataset (11k frames) of HEWL+Ga. The quality of the sparsified data (at 0.8σ0.8𝜎0.8\sigma0.8 italic_σ and 2.0σ2.0𝜎2.0\sigma2.0 italic_σ) are compared with the initial dataset.

Both Rwork and Rfree exhibit degradation as expected and this degradation is rather uniform over all shells, it does not affect more the outer-shells.The degradation of Rfree at moderate sparsification (0.8σ0.8𝜎0.8\sigma0.8 italic_σ) is once again very limited.

3.4.3 Ability to phase sparse data and quality of the density map:

One of the main concern with sparsification is that it may degrade the weak anomalous signal which is precious for phasing. The HEWL+Ga dataset was truncated at different length (from 3000 to 7000 out of the 11512 frames) in order to "artificially" decrease the anomalous signal strength in the dataset. The figure 6 shows the number of residues which were properly placed in the electron density map with an automatic procedure.

Refer to caption
Figure 6: Influence of the sparsification (at 0.8σ0.8𝜎0.8\sigma0.8 italic_σ and 2.0σ2.0𝜎2.0\sigma2.0 italic_σ vs initial dataset) on the ability to phase a the HEWL+Ga protein with an artificially reduced number of frames, in order to limit the strength of the anomalous signal.

The sparsified dataset does not show less residues properly placed after the procedure, it even looks marginally better than the initial dataset, especially for larger compression factor where the dataset could be phased with as little as 3500 frames. With Eiger detector data, a sparsification at 0.8σ0.8𝜎0.8\sigma0.8 italic_σ offers a nice 2×2\times2 × extra compression without noticeable degradation of the quality of the processed data.

3.5 Influence of the sparsification on the quality of serial-crystallography data acquired with Jungfrau detector

3.5.1 Dataset description:

Small crystals of NQO1 sample [NQO1], complexed with NADH, were collected at ESRF-ID29 using a fixed-target sample environment and the Jungfrau 4M detector (PDB: 8RFM). The complete dataset represents 574k frames (5 TB) out of which 25809 frames were selected with nice peaks (4.5% of the total). The indexation rate for dense data, and two sparse dataset (cut-off at 1.0σ𝜎\sigmaitalic_σ and 1.4σ𝜎\sigmaitalic_σ) are reported in table 2 and the crystallographic quality indicators are reported in the figure 7 in reciprocal space and figure 8 in real space.

Table 2: Indexation statistics of the NQO1 dataset (25809 frames, 4.5% of the complete dataset)

. dense sparse 1.0σ1.0𝜎1.0\sigma1.0 italic_σ sparse 1.4σ1.4𝜎1.4\sigma1.4 italic_σ Disk space 166 MB 52 MB 32 MB Compression 1×\times× 3.2×\times× 5.2 ×\times× Frames indexed 22874 20176 20384 Indexation rate 88.6% 78.2% 79.0%

3.5.2 Data statistics:

The processing was performed with CrystFEL [CrystFEL] (version 0.11) with the geometry of the detector optimized with the millipede procedure [millepede, millepede2]. Indexing was performed with the default parameters of xgandalf [xgandalf] based on peak-position by peakfinder8 [Cheetah2014].

Refer to caption
Figure 7: Comparison of crystallographic quality indicator as function of the resolution shell for sparsified data (1.0σ1.0𝜎1.0\sigma1.0 italic_σ in orange and 1.4σ1.4𝜎1.4\sigma1.4 italic_σ in green) in comparison to the initial dense data (in blue): (a) signal/noise ratio, (b) Rsplit and (c) CC1/2. The two vertical lines indicate the resolution shell at which the signal/noise ratio drops below the sparsification threshold, i.e. the limit at which all signal is expected to be lost.

From the <I>/σexpectation𝐼𝜎<I>/\sigma< italic_I > / italic_σ curve, figure 7a, one can see there is a systematically lower SNR from sparsified data in comparison with the initial dataset, regardless to the resolution shell. This could be due to a too large amount of noise added when densifying the data, especially that this dataset was acquired (and processed) at the full energy resolution of the Jungfrau detector, here 485 ADU/photon (in comparison, an Eiger has 1 ADU/photon). The SNR curve obtained on uncompressed data was used to assess the resolution shell at which there is no more signal expected to be saved, those are the two vertical lines in green (1.4σ1.4𝜎1.4\sigma1.4 italic_σ) and orange (1.0σ1.0𝜎1.0\sigma1.0 italic_σ). On the right of those lines, there is supposed to be no more signal, while on the left, there is supposed to be no degradation of the signal in an ideal case. The evolution of Rsplit and CC1/2, plotted in figure 7 b and c, show a degradation of the quality which is much earlier, by half an Angstrom, in comparison with what would have been expected.

3.5.3 Refinement statistics:

Those data were finally refined in real-space using phenix.refine [phenix] until a resolution of 2.7Å, which is noticeably better than what can be expected from the sparse data (limited to 3.0Å and 3.1Å for 1.0σ1.0𝜎1.0\sigma1.0 italic_σ and 1.4σ1.4𝜎1.4\sigma1.4 italic_σ, respectively). The results are summarized in figure 8.

Refer to caption
Figure 8: Comparison of crystallographic quality indicator in real space after refinement using phenix.refine as function of the resolution shell for sparsified data (1.0σ1.0𝜎1.0\sigma1.0 italic_σ in orange and 1.4σ1.4𝜎1.4\sigma1.4 italic_σ in green) in comparison to the initial dense data (in blue): (a) Completeness of the dataset, (b) Rwork (dotted) and Rfree (dashed) and (c) CCwork1/2superscriptsubscriptabsent12𝑤𝑜𝑟𝑘{}_{1/2}^{work}start_FLOATSUBSCRIPT 1 / 2 end_FLOATSUBSCRIPT start_POSTSUPERSCRIPT italic_w italic_o italic_r italic_k end_POSTSUPERSCRIPT (dotted) and CCfree1/2superscriptsubscriptabsent12𝑓𝑟𝑒𝑒{}_{1/2}^{free}start_FLOATSUBSCRIPT 1 / 2 end_FLOATSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_r italic_e italic_e end_POSTSUPERSCRIPT. The two vertical lines indicate the resolution shell at which the signal/noise ratio drops below the sparsification threshold, i.e. the limit at which all signal is expected to be lost.

The figure 8a contains the completeness of the dataset as function of the resolution shell. It confirms the sparsification algorithm works as expected, and that the signal starts to degrade where the SNR drops bellow the picking threshold. The figure 8b superimposes the Rwork in dotted lines (full dataset) in comparison with the Rfree where the fit is performed on 95% of the dataset and the quality assessment is performed on the remaining 5% (dashed line). Thus Rwork is always expected to be better than Rfree. While the Rwork for the initial dataset looks much better than the sparsified version, the degradation is very limited for the Rfree, especially between the initial and the sparsified dataset at 1σ1𝜎1\sigma1 italic_σ. The same is observed for the CC1/2 in figure 8c: the initial dataset shows a clear degradation between the work and the free version, but this degradation is less important for the sparsified version. The CCfree1/2superscriptsubscriptabsent12𝑓𝑟𝑒𝑒{}_{1/2}^{free}start_FLOATSUBSCRIPT 1 / 2 end_FLOATSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_r italic_e italic_e end_POSTSUPERSCRIPT indicators hardly degrades between the initial dataset and the two sparsified ones, confirming the ability to solve the protein structure from sparse data.

3.6 Conclusion on sparsification

The sparsification algorithm presented here is generally applicable to any kind of single crystal diffraction experiment where the background is isotropic. This excludes notably diffuse scattering experiment but it is generally applicable to small molecules crystallography and macro-molecular crystallography and even serial crystallography, when the images are nicely centro-symmetric and shadows are properly masked out. The additional compression offered by sparsification is especially interesting for serial crystallography where datasets of millions of frames are collected per experiment. As any lossy compression, the user has still the responsibility of choosing wisely the threshold level as it will limit later-on the quality of the results one may extract from those data. This is of crucial importance for macromolecular crystallography where valuable information is still present in reflection with SNR less than one [cc1/2]. We have demonstrated that a sparsification at 0.8σ0.8𝜎0.8\sigma0.8 italic_σ, which has an actual compression of 2.0×2.0\times2.0 × on Eiger data (2.6×2.6\times2.6 × for Jungfrau data), preserves nicely the electron density map and is hardly distinguishable from the uncompressed data. The sparsification at 1.0σ1.0𝜎1.0\sigma1.0 italic_σ, offering a 2.6×2.6\times2.6 × compression on Eiger data (3.2×3.2\times3.2 × for Jungfrau), preserves enough signal to refine a protein in with very limited degradation of the Rfree and CCfree1/2superscriptsubscriptabsent12𝑓𝑟𝑒𝑒{}_{1/2}^{free}start_FLOATSUBSCRIPT 1 / 2 end_FLOATSUBSCRIPT start_POSTSUPERSCRIPT italic_f italic_r italic_e italic_e end_POSTSUPERSCRIPT quality indicators. The Jungfrau data show systematically larger compression rate for sparsification because data was collected with 485 ADU/photon (and is tunable) while Eiger detector always operate at 1 ADU/photon (i.e. it has less noise).

In comparison with ROIBIN-SZ [roibin], which preserves all pixels in the neighborhood of identified peaks and stores the background heavily binned, the sparsification presented here does not require a complete peak-search thus, it is simpler, with fewer parameters to be tuned, and is able to save any pixel which is intense enough. Nevertheless, peak-finding is of crucial important for identifying Bragg-peaks, one of the first stage of any data reduction pipeline.

4 Application to peak-finding for serial crystallography

A classical way of pre-processing serial-crystallography data is to shrink the amount of data by sieving-out empty or bad frames, only keeping the frames which deserve processing. This is the role of the "veto-algorithm".

The sigma clipping algorithm provides us with the background (average and deviation) and is used to pick pixels which are likely to be part of Bragg peaks, like peakfinder8 [Cheetah2014] does. For this, several additional checks are performed on a local neighbourhood which is a small square patch (typically 3x3 or 5x5 pixels, user defined):

  • Is the considered pixel the maximum of the local neighborhood ?

  • Are enough pixels of the local neighborhood satisfying the SNR condition? (user defined parameter)

For each peak, the coordinates of the centroïd, the sum of data and its propagated deviation are recorded and reported. Those peak-position are saved into a HDF5 file (represented figure 9) following the CXI format [cxi] which can be read from CrystFEL [CrystFEL]. CrystFEL allows to swap peak-picking algorithms [zaefferer, Cheetah2014, robustpeakfinder] and indexing tools [xds, mosflm, taketwo, xgandalf, pinkindexer].

Refer to caption
Figure 9: Peak-picking CXI-file produced by pyFAI and visualized with the silx viewer [silx]. The left-hand side contains the HDF5 tree structure while the right-hand side presents the default plot with the number of peaks found per frame.

The serial crystallography beamline at the ESRF (ID29) uses a LImA2 monitor [lima2] as visualization tool. It is inspired from NanoPeakCell [nanopeakcell] for online visualization and feeds back information to check if peaks found correspond actually to the crystal lattice expected for the sample. We will first compare the peak-picking algorithm with some reference implementation on a single frame before evaluating the quality of the picked points on a serial-crystallography dataset.

4.1 Comparison of picked peaks

Figure 10 presents the comparison between the original peakfinder8 described in \citeasnounCheetah2014, interfaced in Python via OnDA [onda] and the version implemented into pyFAI on the same Pilatus 6M image, already used in figure 2.

Refer to caption
Figure 10: Comparison of the reference peakfinder8 interfaced with Onda (in orange, execution time 300ms) and the version from pyFAI (in green, execution time 10ms) on top of a Pilatus 6M diffraction frame of an insulin crystal. The subplot on the right is a close-up to the red rectangle.

In figure 10, most peaks found by both implementation match and correspond to Bragg reflections, the close-up on the right allows to visualize Bragg spots in the image. There are more green peaks (found by pyFAI) closer to the beam center while more orange peaks (found by Onda) are located in the outer-shell. This plot was made with a minimum SNR of 3 and a noise level of 1 since the Pilatus detector is Poissonian. Peaks were registered if four pixels meet the SNR criterion in a 5x5 pixels patch around the peak. Those parameters have been tuned to obtain a comparable number of peaks with both implementations: 290 with pyFAI and 293 with Onda. The similarity of those figures allows a direct comparison of peaks found per resolution shell, histograms which are plotted in figure 11.

Refer to caption
Figure 11: Number of peaks found in the different resolution shells for the peakfinder implemented in pyFAI and in Onda. The width of each radial-shell is two inverse nanometer in qlimit-from𝑞q-italic_q -space. The dashed lines represents the number of those peaks which have successfully been indexed, i.e. located at less than two pixels away from an expected reflection (calculated by xgandalf).

The figure 11 presents the histogram of q𝑞qitalic_q-values (modulus of the scattering vector) of the peaks found with different methods, i.e. the number of peaks having a given q-value. For readability, those histograms (bin width: 2nm12𝑛superscript𝑚12nm^{-1}2 italic_n italic_m start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT) have been represented as plots with the horizontal axis labeled in d-spacing (q=2π/d𝑞2𝜋𝑑q=2\pi/ditalic_q = 2 italic_π / italic_d). The analysis of those histograms confirms that the implementation in pyFAI is getting more points closer to the beam-center while the reference implementation is picking more point at larger q-values. This is likely due to the curvature of the Debye-Scherrer ring: the original version of peakfinder8 evaluates the variance in a neighborhood defined by some radius around the point of interest and the variance is higher close to the beam-center due to the important curvature of those rings. On the opposite, pyFAI knows about this curvature and measures the variance along the ring. Points picked by Onda at larger q-values do not look like Bragg-peaks but this could be a side-effect of the parameter tuning to get the same number of peaks for both algorithms.

The same figure presents with dashed lines the number of peaks which are coincident (within 2 pixels) with expected reflections (after indexing using xgandalf). It demonstrates that the additional peaks found by pyFAI in the inner-shell (d>3𝑑3d>3italic_d > 3Å) are consistent with Bragg-peaks, thus valuable for indexing, while the additional peaks found by Onda in the outer shell (d<2𝑑2d<2italic_d < 2Å) are not Bragg-peaks, and thus are detrimental.

A word on performances: the Python binding in Onda to the peakfinder8 algorithm from Cheetah runs in 180ms on a high-end server CPU (AMD Epyc 7262) and 300ms on a workstation (Intel Xeon E5-1650 v4). The version available in pyFAI was designed in OpenCL [opencl_khronos, opencl, pyopencl] and runs best on GPUs: 30ms on a AMD Vega 56 and in 10 ms on a Nvidia RTX A5000.

4.2 Quality of the peakfinder on serial crystallography data

The quality of peaks extracted with this algorithm is evaluated on a serial-crystallography dataset. A subset of 1000 frames of the dataset used in section 3.4 was used as a probe and was indexed with the indexamajig𝑖𝑛𝑑𝑒𝑥𝑎𝑚𝑎𝑗𝑖𝑔indexamajigitalic_i italic_n italic_d italic_e italic_x italic_a italic_m italic_a italic_j italic_i italic_g tool from CrystFEL. Since all frames show Bragg-peaks (figure 9) the number of indexed frames can be seen as a quality indicator of the peak-picking algorithm used, when all other parameter remain unchanged. The indexation was performed with the xgandalf algorithm [xgandalf] with default settings from CrystFEL v0.10.1 and was provided with the cell parameter: tetragonal a=b=78.77Å;c=39.04Å;α=β=γ=90oformulae-sequence𝑎𝑏78.77Åformulae-sequence𝑐39.04Å𝛼𝛽𝛾superscript90𝑜a=b=78.77\text{\AA};c=39.04\text{\AA};\alpha=\beta=\gamma=90^{o}italic_a = italic_b = 78.77 Å ; italic_c = 39.04 Å ; italic_α = italic_β = italic_γ = 90 start_POSTSUPERSCRIPT italic_o end_POSTSUPERSCRIPT. The table 3 compares the number of frames properly indexed with the different picking algorithms available in CrystFEL and with the algorithm presented here.

Table 3: Indexation rate obtained with xgandalf [xgandalf] from peak positions extracted with different picking algorithms available from CrystFEL [CrystFEL] on a subset of 1000 frames of microcrystals of Lysozyme (HEWL+Ga) collected with an Eiger 4M at ESRF-ID30A3.
Peak-picking xgandalf (default) xgandalf (fast)
method Index. rate run-time Index. rate run-time
Zaef [zaefferer] 10.0% 2178 s 10.0% 430 s
PeakFinder8 [Cheetah2014] 49.5% 10397 s 48.5% 1757 s
PeakFinder9 [FDIP] 44.2% 8328 s 43.5% 1436 s
RobustPF [robustpeakfinder] 22.4% 6314 s 21.2% 1628 s
pyFAI (this contribution) 49.7% 9325 s 49.2% 1595 s

Since the Eiger detector is a counting detector, the global threshold for algorithms zaef and peakfinder8 had to be lowered (to 50, which is the maximum of the background signal on any frame) and the the default SNR value was used for zaef, peakfinder8. The same SNR value of five was used for pyFAI. Default parameters were used for PeakFinder9 and RobustPeakFinder algorithms. The reported run-time correspond to the execution time on a single core of an Intel Xeon Gold 6134.

The indexing rate obtained with the algorithm from pyFAI is in par with the reference implementations like peakfinder8 or peakfinder9 available from CrystFEL. Since time is mostly spent in indexing, sets of peaks which are simpler to index present lower run-time as fewer retry were needed. Retries can be deactivated but they increase significantly the success of indexing. Table 3 presents also the indexing performances when using the option –xgandalf-fast-execution from indexamajig, which is five to six times faster and exhibits a limited degradation of the indexing rate. The peak extraction in pyFAI is about as fast as the sparsification, so it can be used online to perform the pre-analysis and provide peaks to NanoPeakCell. Nevertheless the executable ‘peakfinder‘ available from pyFAI (offline tool) has a total execution time which is much larger: about 30 s for 1000 frames, most of which is spent in reading and writing the different HDF5 files.

4.3 Peak count as veto𝑣𝑒𝑡𝑜vetoitalic_v italic_e italic_t italic_o algorithm

Since the background extraction and peak finding are performed in real-time on the serial-crystallography beamline ID29 at ESRF, the information about the number of Bragg-spots can be used to assess the quality of each individual image and the acquisition system can decide to discard the frame depending on the number of peaks and a live-adjustable threshold.

Since the beginning of operation of ESRF-ID29, in 2022, the beamline operated with the veto𝑣𝑒𝑡𝑜vetoitalic_v italic_e italic_t italic_o algorithm deactivated, the number of peaks found was just recorded for future exploration. The figure 12 presents the indexation rate of "hit" (in blue) and "non-hit" frames (in orange) when changing the threshold for the minimum peaks-count per frame. The expected compression rate is displayed in green. The dataset consists of 80000 lysozyme micro-crystals between two mylar films, raster scanned with a x-ray beam of 11.56 keV at ESRF-ID29. The offline analysis has been performed with CrystFEL (v0.11.0) using ‘peakfinder8‘ and ‘xgandalf‘ as indexer.

Refer to caption
Figure 12: Indexation rate of frames which would be considered as "hit" or "non-hit" as function of the peak-count threshold. The green curve presents the disk space which could have been saved. The sample dataset consists of 80000 frames of lysozyme micro-crystals indexed off-line with Xgandalf in CrystFEL.

Since the Jungfrau is an integrating detector, the detector has relatively more background noise than a photon counting detector: here, data were saved with 8 ADU/photon, making the compression rate of hit and non-hit frames similar. In this example, discarding frames with less than 20 peaks would have allowed to save one third of the network bandwidth and disk space, loosing only 0.16% of frames index-able. The indexation rate of actually recorded frames also increased to 20%. It is noticeable the hit-rate was especially high in this experiment with 84% of frames showing diffraction peaks. One would not expect a user-experiment to have such high concentration of crystals and in normal conditions, the expected compression ratio should be higher. The veto𝑣𝑒𝑡𝑜vetoitalic_v italic_e italic_t italic_o algorithm having proved its robustness, it is now activated for most experiments, thanks to graphical helper tools to asses the optimal threshold during the experiment and allowing to set those parameters on the fly for real-time processing.

4.4 Limitations

There is a strong sensitivity of the indexation rate with data from the Jungfrau detector, related to the description of the detector in CrystFEL and in pyFAI. The Jungfrau 4M detector is built from 8 modules manually assembled and exhibits some residual misalignment, on the order of a few pixels and miss-behaving pixels. This misalignment is much larger than what is commonly encountered with Eiger detectors from Dectris, where misalignment is usually less than one pixel in size (75 µm). On Eiger detector data, where the detector is defined as a single rigid module, the indexation rate is fairly independent of the peak-picking algorithm as described in 4.2. This means the peak positions provided by the veto-algorithm are suitable for indexing and this saves the reading of the complete frame if it cannot be indexed. With data from the Jungfrau detector, the indexing rate was lower than with certain peak-finders algorithms integrated into CrystFEL. Work is ongoing to convert the geometry description from pyFAI, where every pixel is independent, to the geometry used in CrystFEL, and vice-versa. We found this difference of indexation was related to the description of the mask and of pixel position in different software. Until this issue is addressed, the safest is to re-extract peak positions for indexing using one of CrystFEL’s provided peak-finding algorithms.

5 Conclusion

Background analysis of single crystal diffraction images can be implemented efficiently using iterative azimuthal integration and allows the separation of the signal originating from Bragg-peaks from isotropic background.

A lossy compression algorithm for diffraction frames, called sparsification, has been built on top of this signal separation, with the average background saved on the one side and the position and the intensity of most intense pixels (probably belonging to peaks) on the the other. The quality of the compression has been demonstrated on macro-molecular rotational and serial crystallography data. The degradation of the signal has been monitored after a compression-decompression cycle, both in reciprocal space using crystallographic quality indicators and after Fourier transform in direct space where the number of residues placed automatically was checked. The degradation of the signal was also monitored as function of the threshold and sparsification at 0.8σ1.0σ0.8𝜎1.0𝜎0.8\sigma-1.0\sigma0.8 italic_σ - 1.0 italic_σ still enables to reconstruct the molecular structure of proteins, both with Eiger and Jungfrau detectors.

The second application is a peak-finder for serial crystallography which locates peak positions in real time and can be used as veto algorithm to discard images without (enough) diffraction peaks. The peaks picked have been evaluated against state of the art peak-finder algorithms like peakfiner8 and the results were comparable in quality while much faster thanks to the usage of GPU. This veto-algorithm is now used in production at the ESRF serial crystallography beamline (ID29) to save storage space.

One of the strength of this peak-finer algorithm is that it is optimized to work on GPU and thus ideally suited to be coupled with the next generation of crystal indexer [toro] which is running on the same kind of hardware, allowing to save two memory-transfer per frame and maybe achieve real-time integration of serial crystallography data.

\ack

Acknowledgements: The authors would like to thank Andy Götz for the management of this project and Vincent Favre-Nicolin for his support. Thanks also to Gavin Vaughan, scientist at Materials beamline at the ESRF, for the constructive discussion about sigma-clipping versus median filtering. We would like to thank also Jonathan P. Wright and Carlotta Giacobbe for offering us the ability to test those algorithms on small molecule data and validate the concept of sparsification on the ID11 beamline at the ESRF. Pierre Paleo and Jerome Lesaint are also acknowledged for the fruitful discussions on numerical methods developed in this document.

References

  • [1] \harvarditem[Afonine et al.]Afonine, Grosse-Kunstleve, Echols, Headd, Moriarty, Mustyakimov, Terwilliger, Urzhumtsev, Zwart \harvardand Adams2012phenix Afonine, P. V., Grosse-Kunstleve, R. W., Echols, N., Headd, J. J., Moriarty, N. W., Mustyakimov, M., Terwilliger, T. C., Urzhumtsev, A., Zwart, P. H. \harvardand Adams, P. D. \harvardyearleft2012\harvardyearright. Acta Crystallographica Section D, \volbf68(4), 352–367.
    \harvardurlhttps://doi.org/10.1107/S0907444912001308
  • [2] \harvarditem[Barty et al.]Barty, Kirian, Maia, Hantke, Yoon, White \harvardand Chapman2014Cheetah2014 Barty, A., Kirian, R. A., Maia, F. R. N. C., Hantke, M., Yoon, C. H., White, T. A. \harvardand Chapman, H. \harvardyearleft2014\harvardyearright. Journal of Applied Crystallography, \volbf47(3), 1118–1131.
    \harvardurlhttps://doi.org/10.1107/S1600576714007626
  • [3] \harvarditemBlelloch1996Blelloch Blelloch, G. E. \harvardyearleft1996\harvardyearright. Commun. ACM, \volbf39(3), 85–97.
    \harvardurlhttps://doi.org/10.1145/227234.227246
  • [4] \harvarditemBlobel \harvardand Kleinwort2002millepede Blobel, V. \harvardand Kleinwort, C., \harvardyearleft2002\harvardyearright. A new method for the high-precision alignment of track detectors.
    \harvardurlhttps://arxiv.org/abs/hep-ex/0208021
  • [5] \harvarditem[Bordet et al.]Bordet, Kergourlay, Pinto, Blanc \harvardand Martinetto2021brocades Bordet, P., Kergourlay, F., Pinto, A., Blanc, N. \harvardand Martinetto, P. \harvardyearleft2021\harvardyearright. J. Anal. At. Spectrom. \volbf36, 1724–1734.
    \harvardurlhttp://dx.doi.org/10.1039/D1JA00143D
  • [6] \harvarditem[Boutet et al.]Boutet, Lomb, Williams, Barends, Aquila, Doak, Weierstall, DePonte, Steinbrener, Shoeman, Messerschmidt, Barty, White, Kassemeyer, Kirian, Seibert, Montanez, Kenney, Herbst, Hart, Pines, Haller, Gruner, Philipp, Tate, Hromalik, Koerner, van Bakel, Morse, Ghonsalves, Arnlund, Bogan, Caleman, Fromme, Hampton, Hunter, Johansson, Katona, Kupitz, Liang, Martin, Nass, Redecke, Stellato, Timneanu, Wang, Zatsepin, Schafer, Defever, Neutze, Fromme, Spence, Chapman \harvardand Schlichting2012structure_sx Boutet, S., Lomb, L., Williams, G. J., Barends, T. R. M., Aquila, A., Doak, R. B., Weierstall, U., DePonte, D. P., Steinbrener, J., Shoeman, R. L., Messerschmidt, M., Barty, A., White, T. A., Kassemeyer, S., Kirian, R. A., Seibert, M. M., Montanez, P. A., Kenney, C., Herbst, R., Hart, P., Pines, J., Haller, G., Gruner, S. M., Philipp, H. T., Tate, M. W., Hromalik, M., Koerner, L. J., van Bakel, N., Morse, J., Ghonsalves, W., Arnlund, D., Bogan, M. J., Caleman, C., Fromme, R., Hampton, C. Y., Hunter, M. S., Johansson, L. C., Katona, G., Kupitz, C., Liang, M., Martin, A. V., Nass, K., Redecke, L., Stellato, F., Timneanu, N., Wang, D., Zatsepin, N. A., Schafer, D., Defever, J., Neutze, R., Fromme, P., Spence, J. C. H., Chapman, H. N. \harvardand Schlichting, I. \harvardyearleft2012\harvardyearright. Science, \volbf337(6092), 362–364.
    \harvardurlhttps://www.science.org/doi/abs/10.1126/science.1217737
  • [7] \harvarditem[Broennimann et al.]Broennimann, Eikenberry, Henrich, Horisberger, Huelsen, Pohl, Schmitt, Schulze-Briese, Suzuki, Tomizaki, Toyokawa \harvardand Wagner2006pilatus Broennimann, C., Eikenberry, E. F., Henrich, B., Horisberger, R., Huelsen, G., Pohl, E., Schmitt, B., Schulze-Briese, C., Suzuki, M., Tomizaki, T., Toyokawa, H. \harvardand Wagner, A. \harvardyearleft2006\harvardyearright. Journal of Synchrotron Radiation, \volbf13(2), 120–130.
    \harvardurlhttps://doi.org/10.1107/S0909049505038665
  • [8] \harvarditemBrünger1992Rfree Brünger, A. T. \harvardyearleft1992\harvardyearright. Nature, \volbf355, 472–475.
    \harvardurlhttps://doi.org/10.1038/355472a
  • [9] \harvarditem[Casanas et al.]Casanas, Warshamanage, Finke, Panepucci, Olieric, Nöll, Tampé, Brandstetter, Förster, Mueller, Schulze-Briese, Bunk \harvardand Wang2016Eiger Casanas, A., Warshamanage, R., Finke, A. D., Panepucci, E., Olieric, V., Nöll, A., Tampé, R., Brandstetter, S., Förster, A., Mueller, M., Schulze-Briese, C., Bunk, O. \harvardand Wang, M. \harvardyearleft2016\harvardyearright. Acta Crystallographica Section D, \volbf72(9), 1036–1048.
    \harvardurlhttps://doi.org/10.1107/S2059798316012304
  • [10] \harvarditemChaize et al.2018EBS Chaize, J. et al. \harvardyearleft2018\harvardyearright. In Proc. of International Conference on Accelerator and Large Experimental Control Systems (ICALEPCS’17), Barcelona, Spain, 8-13 October 2017, no. 16 in International Conference on Accelerator and Large Experimental Control Systems, pp. 2010–2015. Geneva, Switzerland: JACoW. Https://doi.org/10.18429/JACoW-ICALEPCS2017-FRAPL07.
    \harvardurlhttp://jacow.org/icalepcs2017/papers/frapl07.pdf
  • [11] \harvarditem[Chapman et al.]Chapman, Fromme, Barty, White, Kirian, Aquila, Hunter, Schulz, DePonte, Weierstall, Doak, Maia, Martin, Schlichting, Lomb, Coppola, Shoeman, Epp, Hartmann, Rolles, Rudenko, Foucar, Kimmel, Weidenspointner, Holl, Liang, Barthelmess, Caleman, Boutet, Bogan, Krzywinski, Bostedt, Bajt, Gumprecht, Rudek, Erk, Schmidt, Homke, Reich, Pietschner, Struder, Hauser, Gorke, Ullrich, Herrmann, Schaller, Schopper, Soltau, Kühnel, Messerschmidt, Bozek, Hau-Riege, Frank, Hampton, Sierra, Starodub, Williams, Hajdu, Timneanu, Seibert, Andreasson, Rocker, Jonsson, Svenda, Stern, Nass, Andritschke, Schröter, Krasniqi, Bott, Schmidt, Wang, Grotjohann, Holton, Barends, Neutze, Marchesini, Fromme, Schorb, Rupp, Adolph, Gorkhover, Andersson, Hirsemann, Potdevin, Graafsma, Nilsson, Spence \harvardand DESY2011Chapman2011 Chapman, H. N., Fromme, P., Barty, A., White, T. A., Kirian, R. A., Aquila, A., Hunter, M. S., Schulz, J., DePonte, D. P., Weierstall, U., Doak, R. B., Maia, F. R. N. C., Martin, A. V., Schlichting, I., Lomb, L., Coppola, N., Shoeman, R. L., Epp, S. W., Hartmann, R., Rolles, D., Rudenko, A., Foucar, L., Kimmel, N., Weidenspointner, G., Holl, P., Liang, M., Barthelmess, M., Caleman, C., Boutet, S., Bogan, M. J., Krzywinski, J., Bostedt, C., Bajt, S., Gumprecht, L., Rudek, B., Erk, B., Schmidt, C., Homke, A., Reich, C., Pietschner, D., Struder, L., Hauser, G., Gorke, H., Ullrich, J., Herrmann, S., Schaller, G., Schopper, F., Soltau, H., Kühnel, K.-U., Messerschmidt, M., Bozek, J. D., Hau-Riege, S. P., Frank, M., Hampton, C. Y., Sierra, R. G., Starodub, D., Williams, G. J., Hajdu, J., Timneanu, N., Seibert, M. M., Andreasson, J., Rocker, A., Jonsson, O., Svenda, M., Stern, S., Nass, K., Andritschke, R., Schröter, C.-D., Krasniqi, F., Bott, M., Schmidt, K. E., Wang, X., Grotjohann, I., Holton, J. M., Barends, T. R. M., Neutze, R., Marchesini, S., Fromme, R., Schorb, S., Rupp, D., Adolph, M., Gorkhover, T., Andersson, I., Hirsemann, H., Potdevin, G., Graafsma, H., Nilsson, B., Spence, J. C. H. \harvardand DESY \harvardyearleft2011\harvardyearright. Nature, \volbf470, 73–77.
    \harvardurlhttps://bib-pubdb1.desy.de/record/87161
  • [12] \harvarditem[Coquelle et al.]Coquelle, Brewster, Kapp, Shilova, Weinhausen, Burghammer \harvardand Colletier2015nanopeakcell Coquelle, N., Brewster, A. S., Kapp, U., Shilova, A., Weinhausen, B., Burghammer, M. \harvardand Colletier, J.-P. \harvardyearleft2015\harvardyearright. Acta Crystallographica Section D, \volbf71(5), 1184–1196.
    \harvardurlhttps://doi.org/10.1107/S1399004715004514
  • [13] \harvarditem[Debionne et al.]Debionne, Homs \harvardand Claustre2022alima2 Debionne, S., Homs, A. \harvardand Claustre, L., \harvardyearleft2022a\harvardyearright. Library for IMage Acquisition version 2.
    \harvardurlhttps://gitlab.esrf.fr/limagroup/lima2
  • [14] \harvarditem[Debionne et al.]Debionne, Homs, Claustre, Kieffer, De Sanctis, Santoni, Goetz \harvardand Meyer2022bsri2021 Debionne, S., Homs, A., Claustre, L., Kieffer, J., De Sanctis, D., Santoni, G., Goetz, A. \harvardand Meyer, J. \harvardyearleft2022b\harvardyearright. In Proceedings of the 14thsuperscript14𝑡14^{th}14 start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT international conference on Synchrotron Radiation Instrumentation (SRI2021).
    \harvardurlhttps://indico.desy.de/event/27430/abstracts/
  • [15] \harvarditemDectris2014lysozyme Dectris, \harvardyearleft2014\harvardyearright.
    \harvardurlhttps://www.dectris.com/support/downloads/datasets/
  • [16] \harvarditemDiederichs \harvardand Karplus1997Rmeas Diederichs, K. \harvardand Karplus, P. A. \harvardyearleft1997\harvardyearright. Nature Structural Biology, \volbf4(4), 269–275.
  • [17] \harvarditemDiederichs \harvardand Wang2017ssx Diederichs, K. \harvardand Wang, M. \harvardyearleft2017\harvardyearright. Serial Synchrotron X-Ray Crystallography (SSX), pp. 239–272. New York, NY: Springer New York.
    \harvardurlhttps://doi.org/10.1007/978-1-4939-7000-1_10
  • [18] \harvarditem[Fang et al.]Fang, Wang, Wu \harvardand Zhang2020moire Fang, F., Wang, T., Wu, S. \harvardand Zhang, G. \harvardyearleft2020\harvardyearright. Information Sciences, \volbf514, 56–70.
    \harvardurlhttps://www.sciencedirect.com/science/article/pii/S0020025519311090
  • [19] \harvarditem[Galchenkova et al.]Galchenkova, Tolstikova, Klopprogge, Sprenger, Oberthuer, Brehm, White, Barty, Chapman \harvardand Yefanov2024veto_2024 Galchenkova, M., Tolstikova, A., Klopprogge, B., Sprenger, J., Oberthuer, D., Brehm, W., White, T. A., Barty, A., Chapman, H. N. \harvardand Yefanov, O. \harvardyearleft2024\harvardyearright. IUCrJ, \volbf11(2), 190–201.
    \harvardurlhttps://doi.org/10.1107/S205225252400054X
  • [20] \harvarditem[Gasparotto et al.]Gasparotto, Barba, Stadler, Assmann, Mendonça, Ashton, Janousch, Leonarski \harvardand Béjar2024toro Gasparotto, P., Barba, L., Stadler, H.-C., Assmann, G., Mendonça, H., Ashton, A. W., Janousch, M., Leonarski, F. \harvardand Béjar, B. \harvardyearleft2024\harvardyearright. Journal of Applied Crystallography, \volbf57(4), 931–944.
    \harvardurlhttps://doi.org/10.1107/S1600576724003182
  • [21] \harvarditem[Gati et al.]Gati, Bourenkov, Klinge, Rehders, Stellato, Oberthür, Yefanov, Sommer, Mogk, Duszenko, Betzel, Schneider, Chapman \harvardand Redecke2014ssx_desy Gati, C., Bourenkov, G., Klinge, M., Rehders, D., Stellato, F., Oberthür, D., Yefanov, O., Sommer, B. P., Mogk, S., Duszenko, M., Betzel, C., Schneider, T. R., Chapman, H. N. \harvardand Redecke, L. \harvardyearleft2014\harvardyearright. IUCrJ, \volbf1(2), 87–94.
    \harvardurlhttps://doi.org/10.1107/S2052252513033939
  • [22] \harvarditem[Gevorkov et al.]Gevorkov, Barty, Brehm, White, Tolstikova, Wiedorn, Meents, Grigat, Chapman \harvardand Yefanov2020pinkindexer Gevorkov, Y., Barty, A., Brehm, W., White, T. A., Tolstikova, A., Wiedorn, M. O., Meents, A., Grigat, R.-R., Chapman, H. N. \harvardand Yefanov, O. \harvardyearleft2020\harvardyearright. Acta Crystallographica Section A, \volbf76(2), 121–131.
    \harvardurlhttps://doi.org/10.1107/S2053273319015559
  • [23] \harvarditem[Gevorkov et al.]Gevorkov, Galchenkova, Mariani, Barty, White, Chapman \harvardand Yefanov2024FDIP Gevorkov, Y., Galchenkova, M., Mariani, V., Barty, A., White, T. A., Chapman, H. N. \harvardand Yefanov, O. \harvardyearleft2024\harvardyearright. Crystals, \volbf14(2).
    \harvardurlhttps://www.mdpi.com/2073-4352/14/2/164
  • [24] \harvarditem[Gevorkov et al.]Gevorkov, Yefanov, Barty, White, Mariani, Brehm, Tolstikova, Grigat \harvardand Chapman2019xgandalf Gevorkov, Y., Yefanov, O., Barty, A., White, T. A., Mariani, V., Brehm, W., Tolstikova, A., Grigat, R.-R. \harvardand Chapman, H. N. \harvardyearleft2019\harvardyearright. Acta Crystallographica Section A, \volbf75(5), 694–704.
    \harvardurlhttps://doi.org/10.1107/S2053273319010593
  • [25] \harvarditem[Ginn et al.]Ginn, Roedig, Kuo, Evans, Sauter, Ernst, Meents, Mueller-Werkmeister, Miller \harvardand Stuart2016taketwo Ginn, H. M., Roedig, P., Kuo, A., Evans, G., Sauter, N. K., Ernst, O. P., Meents, A., Mueller-Werkmeister, H., Miller, R. J. D. \harvardand Stuart, D. I. \harvardyearleft2016\harvardyearright. Acta Crystallographica Section D, \volbf72(8), 956–965.
    \harvardurlhttps://doi.org/10.1107/S2059798316010706
  • [26] \harvarditem[Grieco et al.]Grieco, Boneta, Gavira, Pey, Basu, Orlans, Sanctis, Medina \harvardand Martin-Garcia2024NQO1 Grieco, A., Boneta, S., Gavira, J. A., Pey, A. L., Basu, S., Orlans, J., Sanctis, D. d., Medina, M. \harvardand Martin-Garcia, J. M. \harvardyearleft2024\harvardyearright. Protein Science, \volbf33(4), e4957.
    \harvardurlhttps://onlinelibrary.wiley.com/doi/abs/10.1002/pro.4957
  • [27] \harvarditem[Hadian-Jazi et al.]Hadian-Jazi, Sadri, Barty, Yefanov, Galchenkova, Oberthuer, Komadina, Brehm, Kirkwood, Mills, de Wijn, Letrun, Kloos, Vakili, Gelisio, Darmanin, Mancuso, Chapman \harvardand Abbey2021robustpeakfinder Hadian-Jazi, M., Sadri, A., Barty, A., Yefanov, O., Galchenkova, M., Oberthuer, D., Komadina, D., Brehm, W., Kirkwood, H., Mills, G., de Wijn, R., Letrun, R., Kloos, M., Vakili, M., Gelisio, L., Darmanin, C., Mancuso, A. P., Chapman, H. N. \harvardand Abbey, B. \harvardyearleft2021\harvardyearright. Journal of Applied Crystallography, \volbf54(5), 1360–1378.
    \harvardurlhttps://doi.org/10.1107/S1600576721007317
  • [28] \harvarditem[Hammersley et al.]Hammersley, Svensson, Hanfland, Fitch \harvardand Hausermann1996fit2d1996 Hammersley, A. P., Svensson, S. O., Hanfland, M., Fitch, A. N. \harvardand Hausermann, D. \harvardyearleft1996\harvardyearright. High Press. Res. \volbf14(4-6), 235–248.
  • [29] \harvarditem[Joldes et al.]Joldes, Muller \harvardand Popescu2017double_word Joldes, M., Muller, J.-M. \harvardand Popescu, V. \harvardyearleft2017\harvardyearright. ACM Trans. Math. Softw. \volbf44(2).
    \harvardurlhttps://doi.org/10.1145/3121432
  • [30] \harvarditemKabsch2010xds Kabsch, W. \harvardyearleft2010\harvardyearright. Acta Crystallographica Section D, \volbf66(2), 133–144.
    \harvardurlhttps://doi.org/10.1107/S0907444909047374
  • [31] \harvarditemKarplus \harvardand Diederichs2012cc1/2 Karplus, P. A. \harvardand Diederichs, K. \harvardyearleft2012\harvardyearright. Science, \volbf336(6084), 1030–1033.
    \harvardurlhttps://www.science.org/doi/abs/10.1126/science.1218231
  • [32] \harvarditemKhronos2008opencl_khronos Khronos, \harvardyearleft2008\harvardyearright. The open standard for parallel programming of heterogeneous systems.
    \harvardurlhttp://www.khronos.org/opencl
  • [33] \harvarditemKieffer \harvardand Ashiotis2014pyFAI_gpu Kieffer, J. \harvardand Ashiotis, G. \harvardyearleft2014\harvardyearright. In PROC. OF THE 7th EUR. CONF. ON PYTHON IN SCIENCE (EUROSCIPY 2014).
    \harvardurlhttp://arxiv.org/pdf/1412.6367.pdf
  • [34] \harvarditem[Kieffer et al.]Kieffer, Valls, Blanc \harvardand Hennig2020pyfai_2020 Kieffer, J., Valls, V., Blanc, N. \harvardand Hennig, C. \harvardyearleft2020\harvardyearright. Journal of Synchrotron Radiation, \volbf27(2), 558–566.
    \harvardurlhttps://doi.org/10.1107/S1600577520000776
  • [35] \harvarditemKieffer \harvardand Wright2013pdj2013 Kieffer, J. \harvardand Wright, J. \harvardyearleft2013\harvardyearright. Powder Diffraction, \volbf28(S2), S339–S350.
  • [36] \harvarditemKleinwort2021-2024millepede2 Kleinwort, C., \harvardyearleft2021-2024\harvardyearright. Millepede-ii.
    \harvardurlhttps://gitlab.desy.de/claus.kleinwort/millepede-ii
  • [37] \harvarditem[Klöckner et al.]Klöckner, Pinto, Lee, Catanzaro, Ivanov \harvardand Fasih2012pyopencl Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P. \harvardand Fasih, A. \harvardyearleft2012\harvardyearright. Parallel Computing, \volbf38(3), 157–174.
  • [38] \harvarditem[Knudsen et al.]Knudsen, Sørensen, Wright, Goret \harvardand Kieffer2013fabio Knudsen, E. B., Sørensen, H. O., Wright, J. P., Goret, G. \harvardand Kieffer, J. \harvardyearleft2013\harvardyearright. J. Appl. Cryst. \volbf46, 537–539.
  • [39] \harvarditem[Könnecke et al.]Könnecke, Akeroyd, Bernstein, Brewster, Campbell, Clausen, Cottrell, Hoffmann, Jemian, Männicke, Osborn, Peterson, Richter, Suzuki, Watts, Wintersberger \harvardand Wuttke2015nexus Könnecke, M., Akeroyd, F. A., Bernstein, H. J., Brewster, A. S., Campbell, S. I., Clausen, B., Cottrell, S., Hoffmann, J. U., Jemian, P. R., Männicke, D., Osborn, R., Peterson, P. F., Richter, T., Suzuki, J., Watts, B., Wintersberger, E. \harvardand Wuttke, J. \harvardyearleft2015\harvardyearright. Journal of Applied Crystallography, \volbf48(1), 301–305.
    \harvardurlhttps://doi.org/10.1107/S1600576714027575
  • [40] \harvarditem[Leonarski et al.]Leonarski, Mozzanica, Brückner, Lopez-Cuenca, Redford, Sala, Babic, Billich, Bunk, Schmitt \harvardand Wang2020jungfrau_PSI Leonarski, F., Mozzanica, A., Brückner, M., Lopez-Cuenca, C., Redford, S., Sala, L., Babic, A., Billich, H., Bunk, O., Schmitt, B. \harvardand Wang, M. \harvardyearleft2020\harvardyearright. Structural Dynamics, \volbf7(1), 014305.
    \harvardurlhttps://doi.org/10.1063/1.5143480
  • [41] \harvarditemMaia2012cxi Maia, F. \harvardyearleft2012\harvardyearright. Nature Methods, \volbf9, 854–855.
  • [42] \harvarditem[Maples et al.]Maples, Reichart, Konz, Berger, Trotter, Martin, Dutton, Paggen, Joyner \harvardand Salemi2018chauvenet Maples, M. P., Reichart, D. E., Konz, N. C., Berger, T. A., Trotter, A. S., Martin, J. R., Dutton, D. A., Paggen, M. L., Joyner, R. E. \harvardand Salemi, C. P. \harvardyearleft2018\harvardyearright. The Astrophysical Journal Supplement Series, \volbf238(1), 2.
    \harvardurlhttps://doi.org/10.3847/1538-4365/aad23d
  • [43] \harvarditem[Mariani et al.]Mariani, Morgan, Yoon, Lane, White, O’Grady, Kuhn, Aplin, Koglin, Barty \harvardand Chapman2016onda Mariani, V., Morgan, A., Yoon, C. H., Lane, T. J., White, T. A., O’Grady, C., Kuhn, M., Aplin, S., Koglin, J., Barty, A. \harvardand Chapman, H. N. \harvardyearleft2016\harvardyearright. Journal of Applied Crystallography, \volbf49(3), 1073–1080.
    \harvardurlhttps://doi.org/10.1107/S1600576716007469
  • [44] \harvarditem[Masui et al.]Masui, Amiri, Connor, Deng, Fandino, Höfer, Halpern, Hanna, Hincks, Hinshaw, Parra, Newburgh, Shaw \harvardand Vanderlinde2015bitshuffle Masui, K., Amiri, M., Connor, L., Deng, M., Fandino, M., Höfer, C., Halpern, M., Hanna, D., Hincks, A., Hinshaw, G., Parra, J., Newburgh, L., Shaw, J. \harvardand Vanderlinde, K. \harvardyearleft2015\harvardyearright. Astronomy and Computing, \volbf12, 181–190.
    \harvardurlhttps://www.sciencedirect.com/science/article/pii/S2213133715000694
  • [45] \harvarditem[Mozzanica et al.]Mozzanica, Bergamaschi, Brueckner, Cartier, Dinapoli, Greiffenberg, Jungmann-Smith, Maliakal, Mezza, Ramilli, Ruder, Schaedler, Schmitt, Shi \harvardand Tinti2016jungfrau2016 Mozzanica, A., Bergamaschi, A., Brueckner, M., Cartier, S., Dinapoli, R., Greiffenberg, D., Jungmann-Smith, J., Maliakal, D., Mezza, D., Ramilli, M., Ruder, C., Schaedler, L., Schmitt, B., Shi, X. \harvardand Tinti, G. \harvardyearleft2016\harvardyearright. Journal of Instrumentation, \volbf11(02), C02047–C02047.
    \harvardurlhttps://doi.org/10.1088/1748-0221/11/02/c02047
  • [46] \harvarditem[Nogly et al.]Nogly, James, Wang, White, Zatsepin, Shilova, Nelson, Liu, Johansson, Heymann, Jaeger, Metz, Wickstrand, Wu, Båth, Berntsen, Oberthuer, Panneels, Cherezov, Chapman, Schertler, Neutze, Spence, Moraes, Burghammer, Standfuss \harvardand Weierstall2015ssx_id13 Nogly, P., James, D., Wang, D., White, T. A., Zatsepin, N., Shilova, A., Nelson, G., Liu, H., Johansson, L., Heymann, M., Jaeger, K., Metz, M., Wickstrand, C., Wu, W., Båth, P., Berntsen, P., Oberthuer, D., Panneels, V., Cherezov, V., Chapman, H., Schertler, G., Neutze, R., Spence, J., Moraes, I., Burghammer, M., Standfuss, J. \harvardand Weierstall, U. \harvardyearleft2015\harvardyearright. IUCrJ, \volbf2(2), 168–176.
    \harvardurlhttps://doi.org/10.1107/S2052252514026487
  • [47] \harvarditem[Owen et al.]Owen, Axford, Sherrell, Kuo, Ernst, Schulz, Miller \harvardand Mueller-Werkmeister2017ssx_diamond Owen, R. L., Axford, D., Sherrell, D. A., Kuo, A., Ernst, O. P., Schulz, E. C., Miller, R. J. D. \harvardand Mueller-Werkmeister, H. M. \harvardyearleft2017\harvardyearright. Acta Crystallographica Section D, \volbf73(4), 373–378.
    \harvardurlhttps://doi.org/10.1107/S2059798317002996
  • [48] \harvarditemPetitdemange et al.2018lima Petitdemange, S. et al. \harvardyearleft2018\harvardyearright. In Proc. of International Conference on Accelerator and Large Experimental Control Systems (ICALEPCS’17), Barcelona, Spain, 8-13 October 2017, no. 16 in International Conference on Accelerator and Large Experimental Control Systems, pp. 886–890. Geneva, Switzerland: JACoW. Https://doi.org/10.18429/JACoW-ICALEPCS2017-TUPHA194.
    \harvardurlhttp://jacow.org/icalepcs2017/papers/tupha194.pdf
  • [49] \harvarditem[Powell et al.]Powell, Johnson \harvardand Leslie2013mosflm Powell, H. R., Johnson, O. \harvardand Leslie, A. G. W. \harvardyearleft2013\harvardyearright. Acta Crystallographica Section D, \volbf69(7), 1195–1203.
    \harvardurlhttps://doi.org/10.1107/S0907444912048524
  • [50] \harvarditemReal \harvardand de Bayser2021hdf5-udf Real, L. C. V. \harvardand de Bayser, M., \harvardyearleft2021\harvardyearright. User-defined functions for hdf5.
    \harvardurlhttps://arxiv.org/abs/2109.11709
  • [51] \harvarditemRigaku Oxford Diffraction2015crysalis Rigaku Oxford Diffraction, \harvardyearleft2015\harvardyearright. Crysalispro software system.
    \harvardurlhttps://www.rigaku.com/fr/products/smc/crysalis
  • [52] \harvarditemSchubert \harvardand Gertz2018variance2018 Schubert, E. \harvardand Gertz, M. \harvardyearleft2018\harvardyearright. In Proceedings of the 30th International Conference on Scientific and Statistical Database Management, SSDBM ’18. New York, NY, USA: Association for Computing Machinery.
    \harvardurlhttps://doi.org/10.1145/3221269.3223036
  • [53] \harvarditemSivia \harvardand Skilling2006Sivia2006 Sivia, D. S. \harvardand Skilling, J. \harvardyearleft2006\harvardyearright. Data Analysis - A Bayesian Tutorial. Oxford Science Publications. Oxford University Press, 2nd ed.
  • [54] \harvarditem[Stone et al.]Stone, Gohara \harvardand Shi2010opencl Stone, J. E., Gohara, D. \harvardand Shi, G. \harvardyearleft2010\harvardyearright. Computing in Science and Engineering, \volbf12(3), 66–73.
  • [55] \harvarditemThe HDF Group2000-2021hdf5 The HDF Group, \harvardyearleft2000-2021\harvardyearright. Hierarchical data format version 5.
    \harvardurlhttp://www.hdfgroup.org/HDF5
  • [56] \harvarditemToledo1997SpMV Toledo, S. \harvardyearleft1997\harvardyearright. IBM Journal of Research and Development, \volbf41(6), 711–725.
  • [57] \harvarditem[Underwood et al.]Underwood, Yoon, Gok, Di \harvardand Cappello2023roibin Underwood, R., Yoon, C., Gok, A., Di, S. \harvardand Cappello, F. \harvardyearleft2023\harvardyearright. Synchrotron Radiation News, \volbf36(4), 17–22.
    \harvardurlhttps://doi.org/10.1080/08940886.2023.2245722
  • [58] \harvarditem[Vincent et al.]Vincent, Valls, Payno, Kieffer, Solé, Paleo, Naudet, de Nolf, Knobel, Garriga, Retegan, Rovezzi, Fangohr, Kenter, UUSim, Favre-Nicolin, Nemoz, Picca, Caswell, Bertoldo, Campbell, Wright, Communie, Kotanski, Coutinho, N0B0dY, Kim, schooft, Farago \harvardand linupi2021silx Vincent, T., Valls, V., Payno, H., Kieffer, J., Solé, V. A., Paleo, P., Naudet, D., de Nolf, W., Knobel, P., Garriga, J., Retegan, M., Rovezzi, M., Fangohr, H., Kenter, P., UUSim, Favre-Nicolin, V., Nemoz, C., Picca, F.-E., Caswell, T. A., Bertoldo, J. P. C., Campbell, A., Wright, C. J. C., Communie, G., Kotanski, J., Coutinho, T., N0B0dY, Kim, S.-W., schooft, Farago, T. \harvardand linupi, \harvardyearleft2021\harvardyearright. silx-kit/silx: 1.0.0: 2021/12/06. Zenodo.
    \harvardurlhttps://doi.org/10.5281/zenodo.5761269
  • [59] \harvarditem[White et al.]White, Kirian, Martin, Aquila, Nass, Barty \harvardand Chapman2012CrystFEL White, T. A., Kirian, R. A., Martin, A. V., Aquila, A., Nass, K., Barty, A. \harvardand Chapman, H. N. \harvardyearleft2012\harvardyearright. Journal of Applied Crystallography, \volbf45(2), 335–341.
    \harvardurlhttps://doi.org/10.1107/S0021889812002312
  • [60] \harvarditemZaefferer2000zaefferer Zaefferer, S. \harvardyearleft2000\harvardyearright. Journal of Applied Crystallography, \volbf33(1), 10–25.
    \harvardurlhttps://doi.org/10.1107/S0021889899010894
  • [61]