Notebook exercise - working with spectral data (or patterns)#

Experimental characterisation methods like NMR and IR commonly produce datasets in the form of 1-dimensional (or sometimes 2D) spectra. For simple 1D NMR experiments, for example, the spectrum is usually output in the form of intensity vs. chemical shift \(\delta\) in ppm.

Analysing spectra#

Determining information about the structure of molecules in the sample usually starts with extracting information about the peaks present in the NMR spectrum. The peak information can then be interpreted to provide specific information about the composition and connectivity of the molecule.

In general, fitting peaks with a function that models their shape serves several purposes: it helps build a fundamental understanding of the nature of the data (e.g. does the method and/or specific instrument produce a particular peak shape?), and, if the peaks are modelled well, the information on their centres, amplitudes and widths can be extracted more precisely.

Aim of this notebook exercise#

To get an overview of how 1D spectral data can be analysed to acquire information about the peaks - positions of the centres, widths, intensities, etc. - we will look at a simple NMR spectrum and practise some pandas, matplotlib and scipy skills along the way.

Processing spectra involves modelling data. This exercise also gives us a chance to look more closely at how a model is being applied to the data and see some of the considerations and issues you should be aware of as you work with even simple data problems.

Note

To see the complete notebook, click here.

# import statements
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, peak_widths
from scipy.optimize import curve_fit

The 13C NMR spectrum of ethanol is stored as a csv file (13C_EtOH.csv) in the data directory.

It is a good idea to check the contents of a file before you try to load it using Python.

This is particularly true for data files: it lets you see the structure of the file, whether there are any header lines that you might want to skip over and whether the column names are present, for example.

Tip

You can run shell (terminal) commands from a Jupyter notebook. More info here: https://tinyurl.com/2yv37k2x
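If shell commands are not available in your environment, a minimal pure-Python stand-in for head might look like this (a sketch; the data file path in the usage comment is an assumption about the repository layout):

```python
# A minimal pure-Python stand-in for the `head` shell command: return the
# first n lines of a file without reading the whole thing into memory.
def head(path, n=5):
    """Return the first n lines of a text file (fewer if the file is short)."""
    lines = []
    with open(path) as f:
        for _ in range(n):
            line = f.readline()
            if not line:        # end of file reached
                break
            lines.append(line.rstrip("\n"))
    return lines

# e.g. print("\n".join(head("data/13C_EtOH.csv")))  # path is an assumption
```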

# Use the `head` shell command to look at the first few lines of the file 
# Read the NMR spectrum from the csv into a pandas dataframe called nmr_spec

# nmr_spec = ...
# Uncomment line below and run cell to complete code to load the csv file
# %load ../code_snips/load_nmr_csv.txt
# Plot the spectrum
fig, ax = plt.subplots()
ax.plot(nmr_spec["ppm"], nmr_spec["intensity"])

plt.xlabel("Chemical shift $\\delta$ / ppm")
plt.ylabel("Intensity")

plt.xlim((70, 0))
plt.title(r"$\mathregular{^{13}C}$ NMR spectrum of ethanol")
plt.show()
[Figure: 13C NMR spectrum of ethanol - intensity vs. chemical shift \(\delta\) / ppm]

The code above uses matplotlib’s object-oriented interface (pyplot.subplots and Axes.plot) to graph the spectrum. This might be how you have mostly seen plotting in matplotlib so far.

Pandas also provides a plot method on its DataFrame and Series objects that offers a convenient way to visualise data using matplotlib (it uses matplotlib by default but can be changed to others, e.g. plotly).

For example, it will pick up axis labels from the DataFrame columns, but you can also modify them if preferred.
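For instance, a minimal sketch on a toy DataFrame (the values are made up, and a recent pandas that supports the ylabel keyword is assumed):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Toy spectrum-like DataFrame (made-up values) to show DataFrame.plot
df = pd.DataFrame({"ppm": [60.0, 40.0, 20.0], "intensity": [0.1, 0.9, 0.2]})

ax = df.plot(x="ppm", y="intensity", legend=False, ylabel="Intensity")
ax.invert_xaxis()  # NMR convention: chemical shift decreases left to right
print(ax.get_xlabel())  # the x column name is picked up automatically
```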

# Try it yourself:
# Plot an equivalent graph of the NMR spectrum using DataFrame.plot()

# Your code here...



# Uncomment the line below and run the cell to see some code that works
# %load ../code_snips/df_plot.txt

Decomposing the spectrum into a set of peaks can sometimes be incorporated into the processing that is done by the experimental acquisition software. There are also various Python packages dedicated to different types of spectroscopic data which can facilitate integration into automated data processing pipelines.

To get a picture of what is going on in this process, we can use some of the general methods available in scipy’s signal processing and optimisation subpackages to analyse the peaks in our simple NMR spectrum.


Peak processing in scipy#

In CHEM501, you used some of the functions available in SciPy’s optimize subpackage to fit population distributions by modelling the probability density function (PDF) as a curve that follows a Gaussian (or Lorentzian) function.

We can use the same process to fit peaks in experimentally measured datasets like NMR spectra.

The peaks in NMR spectra are usually described as Lorentzian functions, but sometimes Gaussian or pseudo-Voigt (a mixture of Gaussian and Lorentzian) shapes are used. For our 13C NMR spectrum, we can define the Lorentzian and Gaussian functions:

Lorentzian

\[y = \frac{A}{\pi} \frac{W/2}{(x-x_0)^2+(W/2)^2}\]

where A is the amplitude of the peak, W is the full width at half maximum (FWHM) and x0 is the value of x at the peak centre.


Gaussian

\[y = A \cdot \frac{1}{\sigma\sqrt{2\pi}}\;\exp(-\frac{(x-x_0)^2}{2\sigma^2})\]

Here, x0 rather than \(\mu\) is used to represent the centre of the peak (for the normal PDF, \(\mu\) was the mean of the distribution); \(\sigma\) and the FWHM, \(W\), of the peak are related by \(W = \sigma\sqrt{8\ln 2}\).

# These functions will calculate a peak using a Gaussian or Lorentzian function as defined above.

def gaussian(x_array, ampl, centre, width):
    """Generate a signal with a Gaussian shape."""
    sigma = width/np.sqrt(8*np.log(2))
    return ampl*(1/(sigma*np.sqrt(2*np.pi)))*np.exp(-(x_array-centre)**2/(2*sigma**2))

def lorentzian(x_array, ampl, centre, width):
    """Generate a signal with a Lorentzian shape."""
    h_width = width/2
    return ampl/np.pi * h_width/((x_array-centre)**2 + h_width**2)
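As a quick numerical sanity check (a sketch, separate from the exercise), we can confirm that both functions produce a peak whose FWHM matches the width argument and whose integrated area is close to ampl:

```python
import numpy as np

# Peak functions as defined above
def gaussian(x_array, ampl, centre, width):
    sigma = width / np.sqrt(8 * np.log(2))
    return ampl * (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(x_array - centre) ** 2 / (2 * sigma ** 2))

def lorentzian(x_array, ampl, centre, width):
    h_width = width / 2
    return ampl / np.pi * h_width / ((x_array - centre) ** 2 + h_width ** 2)

x = np.linspace(-50, 52, 1_000_001)   # wide, fine grid around a peak at 1
dx = x[1] - x[0]
for func in (gaussian, lorentzian):
    y = func(x, ampl=10, centre=1, width=0.1)
    area = y.sum() * dx               # Riemann sum; should be close to ampl
    above = x[y >= y.max() / 2]       # points at or above half maximum
    fwhm = above[-1] - above[0]       # should be close to width
    print(f"{func.__name__}: area = {area:.2f}, FWHM = {fwhm:.3f}")
```

Note that the Lorentzian’s long tails mean its area over a finite window falls slightly short of the full analytic value.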

Before using the functions to try fitting the NMR peaks, we can look at an example of what the two peak shapes look like by simulating a peak generated by the two functions and plotting them:

# Write some code here to generate a peak of each type centred at 1, with a width of 0.1 and amplitude of 10.
# Tip: the numpy linspace method makes it straightforward to generate an equally-spaced array of numbers for the
# x-axis data.

# Your code here...


# Once you have the two peaks, plot them on the same set of axes to compare them.

# Your code here...

# Uncomment the line below and run the cell to see some code that works
# %load ../code_snips/gen_peaks.txt

We now know that the functions can model the shape of Lorentzian and Gaussian peaks. So we should be able to use them to try to fit the peaks in the NMR spectrum. The fitting process optimises the model’s parameters to obtain a calculated spectrum that is as close as possible to the experimentally observed data.

Both the Lorentzian and Gaussian models have three parameters that can be varied to modify the peak shape: the amplitude, position of the peak’s centre and the peak’s width. Depending on the type of measurement, all three can provide information about the analyte.

We will use scipy.optimize.curve_fit to fit the peaks in the spectrum. It uses non-linear least squares to fit a function - in this case the Lorentzian we defined - to a set of data.

See also

Here is a basic intro to Least Squares Optimisation.


Peak finding - getting initial model parameters#

You might remember that the optimisation needs a set of initial guesses for the parameters to fit the curve. For one or two peaks, that is easily done by hand, but we can use scipy’s signal processing scipy.signal.find_peaks to do this in Python.

The find_peaks function locates local maxima in the 1D array it is passed by comparing data points with neighbouring values.

Try running the function on the intensity data of the NMR spectrum.

# Pass the NMR intensities to the find_peaks function and check the results (take a look at the `find_peaks` docs to see what it returns). 
# Note the shape of the array storing the indices of the peaks in the intensity array.

peak_idx, peak_info = find_peaks(nmr_spec["intensity"])
peak_idx

array([   1,    3,    5, ..., 6989, 6994, 6998], shape=(2320,))

Running the peak finding on the intensities without providing any extra constraints results in find_peaks locating over 2000 peaks in the spectrum, rather than the two we see when we plot the spectrum. The additional “peaks” are weak-intensity local maxima. The vast majority (all but two) are points that are higher in intensity than their surroundings, but not significantly higher than the background noise.

find_peaks can also take other arguments that filter the peaks it finds based on the properties of the peaks. The information it returns will also depend on the arguments passed. Check the docs and modify the call to find_peaks to filter the peaks and get the information required as initial guesses for the Lorentzian peaks.
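To see the effect of these filters, here is a sketch on synthetic data (the peak positions, heights and noise level below are invented, not taken from the real spectrum):

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
x = np.linspace(0, 70, 7001)                      # synthetic "ppm" axis
# Two sharp synthetic peaks plus weak noise, loosely mimicking the spectrum
signal = (100 * 0.05**2 / ((x - 18) ** 2 + 0.05**2)
          + 100 * 0.05**2 / ((x - 58) ** 2 + 0.05**2)
          + rng.normal(0, 0.5, x.size))

idx_all, _ = find_peaks(signal)                   # unconstrained: noise too

# height filters out weak noise maxima, distance enforces a minimum
# separation (in samples), and width=0 makes find_peaks also return
# width information for each surviving peak
idx, info = find_peaks(signal, height=10, distance=100, width=0)

print(len(idx_all), len(idx))       # many spurious maxima vs. 2 real peaks
print(x[idx])                       # approximate peak centres
print(sorted(info.keys()))          # heights, widths, prominences, ...
```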

# Modify the call to `find_peaks` below to filter the peaks to just the two real resonances.
# Your call should include parameters so that the function returns enough peak information to get initial values for 
# peak amplitude, centre and width.

peak_idx, peak_info = find_peaks(nmr_spec["intensity"])

# uncomment the line below and run the cell to load the complete code
# %load ../code_snips/find_peaks_filter.txt

# Your code here...


print(peak_idx)
peak_info

Since find_peaks is not aware of the data along the x-axis, i.e. the chemical shift, there is still some work to do to pull out the peak centres and the widths of the peaks in ppm - at the moment, the widths are given in terms of the number of data points.
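One way to do the conversion, sketched on toy data (the axis, peak indices and widths below are made up), assuming an evenly spaced ppm axis:

```python
import numpy as np
import pandas as pd

# Toy, evenly spaced chemical-shift axis (descending, as in many NMR files)
ppm = pd.Series(np.linspace(70, 0, 7001))
peak_idx = np.array([1200, 5200])                  # hypothetical peak indices
peak_info = {"widths": np.array([2.77, 3.58])}     # widths in data points

# Peak centres: index straight into the ppm axis
peak_info["centres"] = ppm.iloc[peak_idx].to_numpy()

# Point spacing in ppm (diff() gives row-to-row differences)
spacing = ppm.diff().abs().mean()
peak_info["widths"] = peak_info["widths"] * spacing

print(peak_info["centres"], peak_info["widths"])
```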

# Use the indices of the peaks to work out the peak centres in ppm and add them to the peak_info dictionary with the key "centres".

# Your code here...


# uncomment the line below and run the cell to load some code to do this
# %load ../code_snips/peak_centres_ppm.txt
# The widths are currently expressed in terms of number of data points, i.e. " 'widths': array([2.7734915 , 3.57548198]) " 
# means the FWHM of the first peak spans 2.77 data-point spacings, which assumes the points are evenly spaced.
# Work out the widths in ppm. Update the widths in the peak_info dictionary to these values.
# Hint: You can calculate the distance between adjacent rows in a pandas Series using the diff() method.


# Your code here...


# uncomment the line below and run the cell to load some code to do this
# %load ../code_snips/widths_ppm.txt
# Check the peak information
peak_info
# Tidy up the peak information

peak_info = {key: val for key, val in peak_info.items() if key in ["peak_heights", "widths", "centres"]}

OK! Finally, let’s try fitting the Lorentzian model to the NMR peaks.


Peak fitting - optimising the model to get precise peak information#

Now you are ready to run the least squares optimisation to fit the curves to the peaks in the spectrum.

For this simple spectrum, the peaks are well separated so can be fitted separately and the baseline is very low and flat. Check the curve_fit docs if you need a reminder of how to call the function, passing in the information from the peak_info dictionary as the initial values for the parameters.

One thing to note: The amplitude of the fitted curve will be the integrated intensity - the area under the peak. You can estimate the initial area as that of a rectangle of height = peak height and width = peak width.
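As an illustration of the recipe on a single synthetic peak (all numbers below are invented), a curve_fit call using the rectangular area approximation for the initial amplitude might look like this:

```python
import numpy as np
from scipy.optimize import curve_fit

# Lorentzian as defined earlier in the notebook
def lorentzian(x_array, ampl, centre, width):
    h_width = width / 2
    return ampl / np.pi * h_width / ((x_array - centre) ** 2 + h_width ** 2)

rng = np.random.default_rng(1)
x = np.linspace(56, 60, 401)
# Synthetic "measured" peak: known parameters plus noise
y = lorentzian(x, ampl=5.0, centre=58.0, width=0.1) + rng.normal(0, 0.2, x.size)

# Initial guesses as they might come from find_peaks (hypothetical values):
height, centre0, width0 = 31.8, 58.0, 0.1
area0 = height * width0          # rectangular approximation to the peak area

popt, pcov = curve_fit(lorentzian, x, y, p0=[area0, centre0, width0])
perr = np.sqrt(np.diag(pcov))    # 1-sigma uncertainties on the parameters
print(popt, perr)
```

The optimised parameters should land close to the values used to generate the synthetic peak.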

# Run `curve_fit` for each of the peaks in the NMR spectrum and store the output in the lists popt_list and pcov_list

popt_list = []
pcov_list = []

# Your code here...




# uncomment the line below and run the cell to load some code to do this
# %load ../code_snips/run_curve_fits.txt


display(popt_list, pcov_list)

Assessing the fit#

Check the optimised parameters of the fitted Lorentzian peaks stored in popt_list. These will be [amplitude, centre, width] for each peak.

Take another look at the spectrum. Do these values look reasonable? It might be difficult to tell with the areas, but remember the rectangular approximation.

The covariance matrices for the fitted functions are in pcov_list. The diagonal elements are the variances of the parameters and these can be used to estimate the errors (uncertainties) on the parameters (see docs for info).


# Calculate the 1-standard deviation errors for each of the parameters of the peaks. 
# Report the parameter values and the associated errors.


def report_peak_fit(popt, perr):
    """Print peak function parameters and 1-sigma errors"""
    report = (f"Amplitude: {popt[0]:.3f} +/- {perr[0]:.3f}\n" 
              f"Centre: {popt[1]:.4f} +/- {perr[1]:.4f}\n"
              f"Width: {popt[2]:.4f} +/- {perr[2]:.4f}\n")

    print(report)

# Write a comment to explain what this code is doing.
errors = [np.sqrt(np.diag(pcov)) for pcov in pcov_list]

for i, pk in enumerate(popt_list):
    print(f"Peak {i+1}")
    report_peak_fit(pk, errors[i])

We can also overlay a spectrum calculated from the optimised peak parameters to see if the fitted peaks look reasonable by eye.

# This function simulates a spectrum using the specified peak shape function and a set of parameters passed as a dictionary.
from simulate import simulate_spectrum
def collate_fitted_peak_parameters(popt):
    """ Make a dictionary of peak parameters """

    parameters = ["ampl", "centre", "width"]
    return dict(zip(parameters, popt))

fitted_peaks = [collate_fitted_peak_parameters(popt) for popt in popt_list] # assemble a list of dictionaries for peak parameters

lor_spec_x, lor_spec_y = simulate_spectrum(lorentzian, nmr_spec["ppm"], fitted_peaks)
# y_sigma = nmr_spec["intensity"].std()
# y_errors = [y_sigma for y in nmr_spec["intensity"]]

nmr_spec.plot(x="ppm", 
              y="intensity",
              xlabel="Chemical shift $\\delta$ / ppm",
              ylabel="Intensity",
              kind="scatter",
              marker=".",
              label="Measured")

plt.plot(lor_spec_x, lor_spec_y, color="red", label="Calculated")

plt.legend()
plt.xlim((70, 0))
plt.title(r"$\mathregular{^{13}C}$ NMR spectrum of ethanol")
plt.show()

Does this look like a good fit? Try changing the x-limits to check the fit of the peaks more closely.


Performance metrics#

We can also use a variety of metrics to assess how well the modelled spectrum fits the experimental data.

\(R^2\) (coefficient of determination) is a statistical measure of how well the model explains the variability of the dependent variable (here, the intensity). Its form is quite intuitive and it is defined as follows:

\[R^2 = 1 - \frac{SSR}{SST}\]

where:

  • SSR (Sum of Squared Residuals): also called the residual sum of squares (RSS), this is the quantity minimised by least squares. It measures the total squared differences between the observed values and the predicted values from the model.

    \[SSR = \sum (y_i - \hat{y}_i)^2\]
  • SST (Total Sum of Squares): measures the total variability in the observed data (without any model), based on the mean \(\bar{y}\):

    \[SST = \sum (y_i - \bar{y})^2\]
  • If the model fits perfectly, \( SSR = 0 \) and \( R^2 = 1 \), meaning all variability in the data is explained by the model.

  • If the model is no better than simply using the mean \( \bar{y} \), then \( SSR = SST \) and \( R^2 = 0 \).

  • If the model is worse than using the mean (e.g., a bad fit), \( R^2 \) can be negative.

In least squares fitting, reducing SSR improves the model fit and increases \( R^2 \), so a good fit has a high \( R^2 \) close to 1.
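The three cases in the bullet points above can be checked with a few lines of numpy (a sketch, separate from the exercise):

```python
import numpy as np

def r_squared(y_obs, y_pred):
    ssr = np.sum((y_obs - y_pred) ** 2)           # sum of squared residuals
    sst = np.sum((y_obs - y_obs.mean()) ** 2)     # total sum of squares
    return 1 - ssr / sst

y = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y, y))                          # perfect model -> 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # mean-only model -> 0.0
print(r_squared(y, y[::-1]))                    # worse than the mean -> -3.0
```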

We could calculate \(R^2\) manually (code in cell below), but scikit-learn’s metrics package makes it straightforward to calculate many measures of model performance.


# To calculate r^2:

# residuals = lor_spec_y - nmr_spec["intensity"]
# squared_residuals = residuals ** 2
# SSR = squared_residuals.sum()
# SST = ((nmr_spec["intensity"] - nmr_spec["intensity"].mean())**2).sum()
# r2 = 1-(SSR/SST)
# print(r2)
# Import the relevant function from sklearn.metrics to calculate r^2, the coefficient of determination


# uncomment the line below and run the cell to load some code to do this
# %load ../code_snips/r2_sklearn.txt

Key Points Summary:#

  • Real-World Data Modelling: In many areas of chemistry (e.g., NMR, IR spectroscopy), fitting models to experimental data is crucial for extracting meaningful information, such as peak positions and intensities.

  • Peak Fitting: Fitting peaks with a model function helps extract more precise information about the data (e.g., determining peak centers, amplitudes, and widths) compared to raw data points.

  • Use of Least Squares: Least squares optimisation plays a central role in fitting models to experimental data. It helps minimise the difference between the model predictions and the observed data, providing the best-fit parameters for the model.

  • Practical Considerations: When fitting models, especially in noisy or complex data, it is important to consider the quality of the fit and how well the model reflects the underlying data. This exercise highlights common issues in fitting, such as choosing an appropriate model and evaluating the fit.

  • Iterative Process: Fitting is often an iterative process. You may need to try different initial guesses for the model parameters, refine the model, and evaluate how well the fit matches the data.

Things to try and/or consider:#

  • Write some code to add a line that shows the residuals as a difference plot below the final graph of the experimental and calculated data.

  • Repeat the fit using the Gaussian function. Is this a better or worse model for the NMR peak shapes?

  • What happens if the initial guesses for the peak function’s parameters are not close to the actual values? Try fitting the peaks with one of the centres far from the real location. What happens?

  • We treated the spectrum using a very generic peak fitting process. Specialised NMR analysis libraries have methods to deal with more complex data much more efficiently (but many will still be using least squares underneath). Can you think of what additional complexities 1H NMR might pose, for example? What issues might arise if peaks are much closer together?

  • The 7000 points of the NMR data have effectively been reduced to six numbers. This poses some questions about how data is stored, reported and used for further analysis. What factors might be important when making those decisions? Are there any disadvantages of only having the peak information available? Would the choice be different (how, why) in different scenarios?

  • You can see how it might be possible to automate this process for NMR spectra and other types of measured datasets. What steps would be needed now to interpret the information to translate it to knowledge about the molecule? How straightforward is this to automate?