Workshop - Exploratory Data Analysis


In this workshop, we will work with a dataset of thermochemical data for some molecules to explore what features or descriptors are influential in their melting and/or boiling points.

Useful resources#

We will be using some of the Python libraries you have already seen, plus Seaborn, which you might not have used yet. Here are some quick start guides and tutorials that might come in useful.

Pandas
- 10 minutes to pandas
Matplotlib
- Quick start guide
RDKit
- Getting started with the RDKit in Python
- RDKit tutorial from 2021 - this covers a lot of ground. We won’t be covering reactions (towards the end of the notebook).
- There are also lots of videos on YouTube, and of course ChatGPT (though I am not sure how well it does with RDKit, as the documentation is patchy).

You might also find some useful bits and pieces in the Molecular fingerprints notebook in the module book.

Note

You can find a notebook with code for the data cleaning, visualisation of the initial data and calculation of molecular descriptors here.

Visualising factors affecting thermochemical properties of organic compounds#

Let’s start by importing some libraries:

time (needed to include a sleep)
requests
pandas
numpy
matplotlib
seaborn

# TODO: Write your import statements here.
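
A minimal sketch of the imports for the libraries listed above. The aliases (`np`, `pd`, `plt`, `sns`) are the usual community conventions rather than anything the workshop requires.

```python
import time  # used later to pause between web requests

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```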


# rdkit has a complicated structure, so we will start with these and maybe add some later

from rdkit import Chem
from rdkit.Chem import (
                        AllChem,
                        rdCoordGen,
                        Draw,
                        rdFingerprintGenerator,
                        PandasTools,
                        Descriptors
                        )

from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs

from IPython.display import SVG
from ipywidgets import interact,fixed,IntSlider

Loading the data#

The data is stored in a flat CSV file called alcohol_acid_phys_data.csv in the data directory.

Check the data in the file (try the ‘head’ command)
Read the data into a pandas dataframe
Display the dataframe

# TODO:

# 0. Check the data in the file (try the 'head' command)
# 1. Read the data into a pandas dataframe
# 2. Display the dataframe
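
The steps above could be sketched as follows. So that the sketch runs without the workshop file, it reads a tiny in-memory CSV. The column names (`mp_C`, `bp_C`) and the rounded values are illustrative assumptions, not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/alcohol_acid_phys_data.csv so the sketch runs
# anywhere; the column names and values are for illustration only.
sample_csv = io.StringIO(
    "Class,IUPAC name,mp_C,bp_C\n"
    "1-alkanol,Methanol,-98,65\n"
    "1-alkanol,Ethanol,-114,78\n"
)

# In the workshop you would pass the file path instead:
# df = pd.read_csv("data/alcohol_acid_phys_data.csv")
df = pd.read_csv(sample_csv)

print(df.head())  # first rows of the dataframe
print(df.shape)   # (number of rows, number of columns)
```

In a Jupyter notebook on a Unix-like system, `!head data/alcohol_acid_phys_data.csv` should print the first raw lines of the file before you load it.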

Cleaning the data#

We need to do at least a little cleaning of the data. We can check the number of rows and the data type of each column using the DataFrame.info() method.

There are lots of pKa values missing. We are not going to use the pKa values, so we can drop those columns.

Some rows are missing densities. More importantly, some are missing melting and/or boiling points, which are the properties we are interested in.

It might be possible to look these up somewhere, like the NIST Chemistry WebBook, which unfortunately does not seem to have a convenient API (there are unofficial ones if you search on the web). For now, we can simply drop these rows.

# TODO:
# 1. Drop the two pKa columns
# 2. Drop the rows with NaN values in density, melting point and boiling point columns.
# 3. Check the info again to see if the changes have been made.
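
A minimal sketch of these three steps on a made-up dataframe. The column labels (`pKa1`, `pKa2`, `density`, `mp`, `bp`) are assumptions, so check the real labels with `df.columns` first.

```python
import numpy as np
import pandas as pd

# Made-up rows; the real column names in the CSV may differ.
df = pd.DataFrame(
    {
        "IUPAC name": ["Methanol", "Ethanol", "Propan-1-ol"],
        "pKa1": [np.nan, 15.9, np.nan],
        "pKa2": [np.nan, np.nan, np.nan],
        "density": [0.79, np.nan, 0.80],
        "mp": ["-98", "-114", "-126"],
        "bp": ["65", "78", np.nan],
    }
)

# 1. Drop the columns we will not use.
df = df.drop(columns=["pKa1", "pKa2"])

# 2. Drop rows with a missing value in any of the listed columns.
df = df.dropna(subset=["density", "mp", "bp"])

# 3. Confirm the remaining columns, non-null counts and dtypes.
df.info()
```

Passing `subset` to `dropna` limits the check to the columns that matter, so a missing value elsewhere (a common name, for example) does not remove the row.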

Still a few issues:

The Class and IUPAC name columns have some odd characters which appear to encode whitespace, e.g. Alkanedioic\r\nacid.
The .info() output shows that the melting and boiling points have the object dtype, i.e. they are stored as strings, which suggests there are non-numerical values. If you look at the columns, some entries contain “d” or “s”, sometimes alongside a number, probably to denote “decomposed” or “sublimed”.

Pandas has str.contains and str.replace methods for its Series structure. Try using these to check and remove the encoded characters in those columns.

Can you think of a way to deal with the non- or partly numeric phase change values?

Hint

Could this help?

# TODO:

# 1. Ensure only numeric values are present in the melting point, boiling point columns
# 2. Remove the encoded whitespace characters from the 'Class' and 'IUPAC name' columns
# 3. Convert the melting point, boiling point columns to numeric values.
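
One possible approach, sketched on made-up values: replace the encoded line break with `str.replace`, pull the first number out of each phase-change string with `str.extract`, and convert with `pd.to_numeric`. The column labels and the example strings are assumptions. Keeping the number from an entry like "189 d" treats a decomposition temperature as a melting point, which is a judgement call; you may prefer to coerce such entries to NaN instead.

```python
import pandas as pd

# Made-up examples of the problems described above.
df = pd.DataFrame(
    {
        "Class": ["Alkanedioic\r\nacid", "1-alkanol"],
        "mp": ["189 d", "-114"],
        "bp": ["s", "78"],
    }
)

# Check which entries contain the encoded line break, then replace it with a space.
has_break = df["Class"].str.contains("\r\n", regex=False)
df["Class"] = df["Class"].str.replace("\r\n", " ", regex=False)

for col in ["mp", "bp"]:
    # Keep the first signed number in each string; entries with no number become NaN.
    number = df[col].str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    df[col] = pd.to_numeric(number, errors="coerce")

print(df)
print(df.dtypes)
```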

Some of the compounds do not have common names. We could either drop the column or fill the missing values with something like “unknown” or “none”.

# TODO:

# Clean column with missing compounds' common names
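
Both options can be sketched in a couple of lines. The column label `Common name` and the rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; "Common name" is an assumed column label.
df = pd.DataFrame(
    {
        "IUPAC name": ["Ethanoic acid", "Heptan-2-ol"],
        "Common name": ["Acetic acid", np.nan],
    }
)

# Option 1: replace missing names with a placeholder string.
filled = df.assign(**{"Common name": df["Common name"].fillna("none")})

# Option 2: drop the column altogether.
dropped = df.drop(columns=["Common name"])

print(filled)
print(dropped)
```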

If you converted the mp and bp columns to numeric types using pd.to_numeric with errors="coerce" then you will probably now have some additional null values in those columns, so those rows can be dropped.

# TODO: Drop any remaining rows with NaN values in mp/bp columns

Finally, we have a clean dataset with no missing values and the correct dtypes.

We can look at the summary statistics for the numerical columns we currently have, but there’s not much there yet.

There is one more thing we can do to tidy this data.

You may not be so familiar with the pandas category dtype. It is used when a variable takes a limited number of values.

Check the number of unique values for the columns. Which one could be treated as categorical data?

# TODO: Check for categorical columns and change the data type to 'category' if necessary
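
A short sketch of how this check and conversion might look, using made-up rows. It assumes `Class` turns out to be the column with only a few distinct values.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame(
    {
        "Class": ["1-alkanol", "1-alkanol", "Alkanoic acid", "Alkanoic acid"],
        "IUPAC name": ["Methanol", "Ethanol", "Methanoic acid", "Ethanoic acid"],
    }
)

# Few unique values relative to the number of rows suggests a categorical column.
print(df.nunique())

# Convert the column and list the categories pandas has found.
df["Class"] = df["Class"].astype("category")
print(df["Class"].cat.categories)
```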

Visualising the data#

Have a look at this brilliant seaborn tutorial developed by Charles J. Weiss at Augustana University in South Dakota.

Some of the data used has a similar structure to this dataset.

There are no hard and fast rules about which types of plot to use to visualise your data, but the data types of the columns make some plots more suitable than others for showing particular variables and the relationships between them.

Try plotting the data to visualise some of the following:

The distribution of different classes of compound in the data set
Identify if there are any outliers for the thermochemical data or density
The distribution of boiling points, melting point and/or density with the class of the compound
Identify any correlations between the numerical features and the melting and/or boiling point.
- Is there any difference for different classes of compound?

Are there any other interesting patterns or trends in the data that you have observed?
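
One possible starting point for the first few items above, sketched with seaborn on illustrative values. The column labels (`Class`, `bp`, `density`) are assumptions; swap in the cleaned workshop dataframe and its real labels.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative values only; replace with the cleaned workshop dataframe.
df = pd.DataFrame(
    {
        "Class": ["1-alkanol"] * 3 + ["Alkanoic acid"] * 3,
        "bp": [65.0, 78.0, 97.0, 101.0, 118.0, 141.0],
        "density": [0.79, 0.79, 0.80, 1.22, 1.05, 0.99],
    }
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# How many compounds are there in each class?
sns.countplot(data=df, x="Class", ax=axes[0])

# Boiling point distribution per class; points beyond the whiskers are candidate outliers.
sns.boxplot(data=df, x="Class", y="bp", ax=axes[1])

# Relationship between two numerical columns, coloured by class.
sns.scatterplot(data=df, x="density", y="bp", hue="Class", ax=axes[2])

fig.tight_layout()
```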

Adding some descriptors#

We have a list of compounds and a small number of observed values and descriptors. We can add a few more by calculating them using RDKit, but we only have IUPAC names, so we need to obtain a more rigorous representation to use with RDKit.

The Chemical Identifier Resolver (CIR) service is run by the CADD Group at the NCI/NIH as part of their Cactus server. It is used in the Molecular fingerprints notebook.

# Here is a function so the process of getting the SMILES can be repeated for multiple compounds.
# It includes a sleep time (`time.sleep`) to avoid overloading the server.

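from urllib.parse import quote


def get_smiles_from_name(name):
    """Gets SMILES string from the Cactus API given a chemical name."""

    ROOT_URL = "https://cactus.nci.nih.gov/chemical/structure/"
    # Percent-encode the name so characters such as spaces or '#' are safe in a URL path
    identifier = quote(str(name))
    representation = "smiles"

    query_url = f"{ROOT_URL}{identifier}/{representation}"

    response = requests.get(query_url)
    time.sleep(0.05)  # brief pause so repeated calls do not overload the server
    if response:  # a requests.Response is truthy when the status code is below 400
        return response.text
    else:
        print(f"Failed to get SMILES for {name}")
        return "not found"
        # raise Exception(f"Cactus request failed for {name}: {response.status_code}")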

# TODO: Get a list of SMILES strings for the compounds in the dataframe and add this to the 
# dataframe as a new column.
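
A sketch of one way to build the new column. The helper `add_smiles_column` and the stand-in `fake_resolver` are hypothetical names introduced here so the example runs offline; in the workshop you would pass `get_smiles_from_name` as the resolver. The column labels are also assumptions.

```python
import pandas as pd


def add_smiles_column(df, resolver, name_col="IUPAC name", smiles_col="SMILES"):
    """Return a copy of df with a SMILES column built by calling resolver(name)."""
    out = df.copy()
    # Look each distinct name up only once to limit requests to the server.
    lookup = {name: resolver(name) for name in out[name_col].unique()}
    out[smiles_col] = out[name_col].map(lookup)
    return out


# Stand-in resolver so the sketch runs offline (hypothetical helper).
def fake_resolver(name):
    known = {"Methanol": "CO", "Ethanol": "CCO"}
    return known.get(name, "not found")


df = pd.DataFrame({"IUPAC name": ["Methanol", "Ethanol", "Not a compound"]})
df = add_smiles_column(df, fake_resolver)
print(df)
```

Rows where the lookup returned "not found" can be filtered out, for example with `df[df["SMILES"] != "not found"]`, before passing the SMILES strings to RDKit.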

Let’s generate some descriptors for these molecules using RDKit.

There is a tutorial on calculating descriptors, and they are listed in the Getting Started guide.

RDKit needs an RDKit molecule object (rdkit.Chem.Mol) to calculate the descriptors. You can create a separate list of molecules based on the SMILES strings in the dataframe, or you can use RDKit’s PandasTools module to work with them in a DataFrame.

Have a look at the molecular fingerprints notebook for some code to get started getting the RDKit molecules.

Choose around 5 additional descriptors to calculate for each compound.
It is up to you how you handle the calculations and getting the new data combined with the existing dataframe.

Here is one option:

You could use the getMolDescriptors function in the descriptors tutorial as a starting point to calculate the new descriptors and add them to a dictionary that can be read into a dataframe.
You can then use pd.concat to combine the dataframe with your thermochemical data with the new descriptors.
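
As a rough sketch of this option, under stated assumptions: the two-row dataframe and the `SMILES` column label are made up, and the five descriptor names are an arbitrary choice from `Descriptors.descList`. It selects descriptors by name instead of by index, so it is an alternative to, not a completion of, the `getMolDescriptors` exercise.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Made-up two-row frame standing in for the workshop data with a SMILES column.
df = pd.DataFrame(
    {"IUPAC name": ["Ethanol", "Ethanoic acid"], "SMILES": ["CCO", "CC(=O)O"]}
)

# An arbitrary choice of descriptor names taken from Descriptors.descList.
chosen = ["MolWt", "TPSA", "NumHDonors", "NumHAcceptors", "NumRotatableBonds"]
functions = {name: fn for name, fn in Descriptors.descList if name in chosen}

rows = []
for smiles in df["SMILES"]:
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES cannot be parsed
    if mol is None:
        rows.append({name: None for name in chosen})
    else:
        rows.append({name: functions[name](mol) for name in chosen})

# Build the descriptor frame on the same index so the rows line up in pd.concat.
desc_df = pd.DataFrame(rows, index=df.index)
df = pd.concat([df, desc_df], axis=1)
print(df)
```

Reusing `df.index` for the descriptor frame matters if rows were dropped during cleaning, because `pd.concat(..., axis=1)` aligns on index labels and not on row position.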

# Add RDKit molecule objects to the dataframe

for idx, desc in enumerate(Descriptors.descList):
    print(f"{idx} {desc[0]}")

MaxAbsEStateIndex
MaxEStateIndex
MinAbsEStateIndex
MinEStateIndex
qed
SPS
MolWt
HeavyAtomMolWt
ExactMolWt
NumValenceElectrons
NumRadicalElectrons
MaxPartialCharge
MinPartialCharge
MaxAbsPartialCharge
MinAbsPartialCharge
FpDensityMorgan1
FpDensityMorgan2
FpDensityMorgan3
BCUT2D_MWHI
BCUT2D_MWLOW
BCUT2D_CHGHI
BCUT2D_CHGLO
BCUT2D_LOGPHI
BCUT2D_LOGPLOW
BCUT2D_MRHI
BCUT2D_MRLOW
AvgIpc
BalabanJ
BertzCT
Chi0
Chi0n
Chi0v
Chi1
Chi1n
Chi1v
Chi2n
Chi2v
Chi3n
Chi3v
Chi4n
Chi4v
HallKierAlpha
Ipc
Kappa1
Kappa2
Kappa3
LabuteASA
PEOE_VSA1
PEOE_VSA10
PEOE_VSA11
PEOE_VSA12
PEOE_VSA13
PEOE_VSA14
PEOE_VSA2
PEOE_VSA3
PEOE_VSA4
PEOE_VSA5
PEOE_VSA6
PEOE_VSA7
PEOE_VSA8
PEOE_VSA9
SMR_VSA1
SMR_VSA10
SMR_VSA2
SMR_VSA3
SMR_VSA4
SMR_VSA5
SMR_VSA6
SMR_VSA7
SMR_VSA8
SMR_VSA9
SlogP_VSA1
SlogP_VSA10
SlogP_VSA11
SlogP_VSA12
SlogP_VSA2
SlogP_VSA3
SlogP_VSA4
SlogP_VSA5
SlogP_VSA6
SlogP_VSA7
SlogP_VSA8
SlogP_VSA9
TPSA
EState_VSA1
EState_VSA10
EState_VSA11
EState_VSA2
EState_VSA3
EState_VSA4
EState_VSA5
EState_VSA6
EState_VSA7
EState_VSA8
EState_VSA9
VSA_EState1
VSA_EState10
VSA_EState2
VSA_EState3
VSA_EState4
VSA_EState5
VSA_EState6
VSA_EState7
VSA_EState8
VSA_EState9
FractionCSP3
HeavyAtomCount
NHOHCount
NOCount
NumAliphaticCarbocycles
NumAliphaticHeterocycles
NumAliphaticRings
NumAmideBonds
NumAromaticCarbocycles
NumAromaticHeterocycles
NumAromaticRings
NumAtomStereoCenters
NumBridgeheadAtoms
NumHAcceptors
NumHDonors
NumHeteroatoms
NumHeterocycles
NumRotatableBonds
NumSaturatedCarbocycles
NumSaturatedHeterocycles
NumSaturatedRings
NumSpiroAtoms
NumUnspecifiedAtomStereoCenters
Phi
RingCount
MolLogP
MolMR
fr_Al_COO
fr_Al_OH
fr_Al_OH_noTert
fr_ArN
fr_Ar_COO
fr_Ar_N
fr_Ar_NH
fr_Ar_OH
fr_COO
fr_COO2
fr_C_O
fr_C_O_noCOO
fr_C_S
fr_HOCCN
fr_Imine
fr_NH0
fr_NH1
fr_NH2
fr_N_O
fr_Ndealkylation1
fr_Ndealkylation2
fr_Nhpyrrole
fr_SH
fr_aldehyde
fr_alkyl_carbamate
fr_alkyl_halide
fr_allylic_oxid
fr_amide
fr_amidine
fr_aniline
fr_aryl_methyl
fr_azide
fr_azo
fr_barbitur
fr_benzene
fr_benzodiazepine
fr_bicyclic
fr_diazo
fr_dihydropyridine
fr_epoxide
fr_ester
fr_ether
fr_furan
fr_guanido
fr_halogen
fr_hdrzine
fr_hdrzone
fr_imidazole
fr_imide
fr_isocyan
fr_isothiocyan
fr_ketone
fr_ketone_Topliss
fr_lactam
fr_lactone
fr_methoxy
fr_morpholine
fr_nitrile
fr_nitro
fr_nitro_arom
fr_nitro_arom_nonortho
fr_nitroso
fr_oxazole
fr_oxime
fr_para_hydroxylation
fr_phenol
fr_phenol_noOrthoHbond
fr_phos_acid
fr_phos_ester
fr_piperdine
fr_piperzine
fr_priamide
fr_prisulfonamd
fr_pyridine
fr_quatN
fr_sulfide
fr_sulfonamd
fr_sulfone
fr_term_acetylene
fr_tetrazole
fr_thiazole
fr_thiocyan
fr_thiophene
fr_unbrch_alkane
fr_urea

# From https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html

def getMolDescriptors(mol, descriptor_list=None, missingVal=None):
    ''' calculate the full list of descriptors for a molecule
    
        missingVal is used if the descriptor cannot be calculated
    '''
    res = {}
    if descriptor_list is None:
        for nm,fn in Descriptors._descList:
            # some of the descriptor functions can throw errors if they fail, catch those here:
            try:
                val = fn(mol)
            except:
                # print the error message:
                import traceback
                traceback.print_exc()
                # and set the descriptor value to whatever missingVal is
                val = missingVal
            res[nm] = val
    # TODO: Add an else clause to handle a list of numbers corresponding to the descriptor indices
    else:
        pass
    return res

# TODO: Add the descriptors to the dataframe as new columns

Back to visualisation#

Using your new seaborn skills, visualise the distributions and identify any correlations in your new data.

You will probably find plots like pairplots or heatmaps of more use now that you have a few more variables.

Summary#

You have used the pandas library to clean and prepare a dataset, and to get descriptive statistics for the data.
You have visualised distributions and relationships in the data to look for anomalies and patterns.
You have used an API to obtain molecular identifiers/representations for a set of compounds.
You have generated molecular descriptors for a set of compounds using RDKit.