Workshop - Exploratory Data Analysis#
In this workshop, we will work with a dataset of thermochemical data for some molecules to explore what features or descriptors are influential in their melting and/or boiling points.
Useful resources#
We will be using some of the python libraries you have already seen and Seaborn, which you might not have yet. Here are some quick start guides and/or tutorials that might come in useful.
Pandas
Matplotlib
RDKit
RDKit tutorial from 2021 - this covers a lot of ground. We won’t be talking about reactions (towards end of notebook)
There are also lots of videos on YouTube and of course ChatGPT (though I am not sure how well it does with RDKit, probably because the documentation is patchy).
You might also find some useful bits and pieces in the Molecular fingerprints notebook in the module book.
Note
You can find a notebook with code for the data cleaning, visualisation of the initial data and calculation of molecular descriptors here.
Visualising factors affecting thermochemical properties
of organic compounds#
Let’s start by importing some libraries:
time (needed to include a sleep)
requests
pandas
numpy
matplotlib
seaborn
# TODO: Write your import statements here.
# rdkit has a complicated structure, so we will start with these and maybe add some later
from rdkit import Chem
from rdkit.Chem import (
AllChem,
rdCoordGen,
Draw,
rdFingerprintGenerator,
PandasTools,
Descriptors
)
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
from IPython.display import SVG
from ipywidgets import interact,fixed,IntSlider
Loading the data#
The data is stored in a flat csv file in the data
directory called alcohol_acid_phys_data.csv
.
Check the data in the file (try the ‘head’ command)
Read the data into a pandas dataframe
Display the dataframe
# TODO:
# 0. Check the data in the file (try the 'head' command)
# 1. Read the data into a pandas dataframe
# 2. Display the dataframe
Cleaning the data#
We need to do at least a little cleaning of the data. We can check the data for the number of rows and the data types in each column using DataFrame.info()
method.
There are lots of pKa values missing. We are not going to use the pKa values, so we can drop those columns.
Some rows are missing densities. And more importantly, some are missing melting and/or boiling points, which is the property we are interested in.
It might be possible to look these up somewhere, like the NIST Chemistry WebBook which unfortunately does seem not have a convenient API (there are unofficial ones if you search on the web). For now we can also drop these rows.
# TODO:
# 1. Drop the two pKa columns
# 2. Drop the rows with NaN values in density, melting point and boiling point columns.
# 3. Check the info again to see if the changes have been made.
Still a few issues:
The
Class
andIUPAC name
columns have some odd characters which appear to encode whitespace, e.g. Alkanedioic\r\nacid.The .info() shows that the melting and boiling points have object, i.e. string data types, which suggests there are non-numerical values. If you look at the columns, some numbers have “d” or “s” sometimes with a number, probably to denote “decomposed” or “sublimed” maybe.
Pandas has str.contains
and str.replace
methods for its Series
structure. Try using these to check and remove the encoded characters in those columns.
Can you think of a way to deal with the non- or partly numeric phase change values?
Hint
Could this help?
# TODO:
# 1. Ensure only numeric values are present in the melting point, boiling point columns
# 2. Remove the encoded whitespace characters from the 'Class' and 'IUPAC name' columns
# 3. Convert the melting point, boiling point columns to numeric values.
Some of the compounds do not have common names. We could either drop the column or fill the missing values with something like “unknown” or “none”.
# TODO:
# Clean column with missing compounds' common names
If you converted the mp and bp columns to numeric types using pd.to_numeric
with errors="coerce"
then you will probably now have some additional null values in those columns, so those rows can be dropped.
# TODO: Drop any remaining rows with NaN values in mp/bp columns
Finally, we have a clean dataset with no missing values and the correct dtypes.
We can look at the summary statistics for the numerical columns we currently have, but there’s not much there yet.
There is one more thing we can do to tidy this data.
You may not be so familiar with the pandas category
dtype. It is used when a variable takes a limited number of values.
Check the number of unique values for the columns. Which one could be treated as categorical data?
# TODO: Check for categorical columns and change the data type to 'category' if necessary
Visualising the data#
Have a look at this brilliant seaborn tutorial developed as by Charles J. Weiss at Augustana University in South Dakota.
Some of the data used has a similar structure to this dataset.
There are no hard and fast rules about which types of plots to use to visualise your data, but the data types of the columns will mean some are more suitable to look at the data and relationships for certain variables.
Try plotting the data to visualise some of the following:
The distribution of different classes of compound in the data set
Identify if there are any outliers for the thermochemical data or density
The distribution of boiling points, melting point and/or density with the class of the compound
Identify any correlations between the numerical features and the melting and/or boiling point.
Is there any difference for different classes of compound?
Are there any other interesting patterns or trends in the data that you have observed?
Adding some descriptors#
We have a list of compounds and a small number of observed values and descriptors. We can add a few more by calculating them using RDKit, but we only have IUPAC names, so we need to obtain a more rigorous representation to use with RDKit.
The Chemical Identifier Resolver (CIR) service is run by the CADD Group at the NCI/NIH as part of their Cactus server. It is used in the Molecular fingerprints notebook.
# Here is a function so the process of getting the SMILES can be repeated for multiple compounds.
# It includes a sleep time (`time.sleep`) to avoid overloading the server.
def get_smiles_from_name(name):
"""Gets SMILES string from the Cactus API given a chemical name."""
ROOT_URL = "https://cactus.nci.nih.gov/chemical/structure/"
identifier = name
representation = "smiles"
query_url = f"{ROOT_URL}{identifier}/{representation}"
response = requests.get(query_url)
time.sleep(0.05)
if response:
return response.text
else:
print(f"Failed to get SMILES for {name}")
return "not found"
# raise Exception(f"Cactus request failed for {name}: {response.status_code}")
# TODO: Get a list of SMILES strings for the compounds in the dataframe and add this to the
# dataframe as a new column.
Let’s generate some descriptors for these molecules using RDKit.
There is a tutorial on calculating descriptors, and they are listed in the Getting Started guide.
RDKit needs a RDKit.molecule
to calculate the descriptors. You can create a separate list of molecules based on the SMILES strings in the dataframe, or you can use RDKit’s PandasTools module to work with them in a DataFrame.
Have a look at the molecular fingerprints notebook for some code to get started getting the RDKit molecules.
Choose around 5 additional descriptors to calculate for each compound.
It is up to you how you handle the calculations and getting the new data combined with the existing dataframe.
Here is one option:
You could use the getMolDescriptors function in the descriptors tutorial as starting point to calculate the new descriptors and add them to dictionary that can be read into a dataframe.
You can then use
pd.concat
to combine the dataframe with your thermochemical data with the new descriptors.
# Add RDKit molecule objects to the dataframe
for idx, desc in enumerate(Descriptors.descList):
print(f"{idx} {desc[0]}")
0 MaxAbsEStateIndex
1 MaxEStateIndex
2 MinAbsEStateIndex
3 MinEStateIndex
4 qed
5 SPS
6 MolWt
7 HeavyAtomMolWt
8 ExactMolWt
9 NumValenceElectrons
10 NumRadicalElectrons
11 MaxPartialCharge
12 MinPartialCharge
13 MaxAbsPartialCharge
14 MinAbsPartialCharge
15 FpDensityMorgan1
16 FpDensityMorgan2
17 FpDensityMorgan3
18 BCUT2D_MWHI
19 BCUT2D_MWLOW
20 BCUT2D_CHGHI
21 BCUT2D_CHGLO
22 BCUT2D_LOGPHI
23 BCUT2D_LOGPLOW
24 BCUT2D_MRHI
25 BCUT2D_MRLOW
26 AvgIpc
27 BalabanJ
28 BertzCT
29 Chi0
30 Chi0n
31 Chi0v
32 Chi1
33 Chi1n
34 Chi1v
35 Chi2n
36 Chi2v
37 Chi3n
38 Chi3v
39 Chi4n
40 Chi4v
41 HallKierAlpha
42 Ipc
43 Kappa1
44 Kappa2
45 Kappa3
46 LabuteASA
47 PEOE_VSA1
48 PEOE_VSA10
49 PEOE_VSA11
50 PEOE_VSA12
51 PEOE_VSA13
52 PEOE_VSA14
53 PEOE_VSA2
54 PEOE_VSA3
55 PEOE_VSA4
56 PEOE_VSA5
57 PEOE_VSA6
58 PEOE_VSA7
59 PEOE_VSA8
60 PEOE_VSA9
61 SMR_VSA1
62 SMR_VSA10
63 SMR_VSA2
64 SMR_VSA3
65 SMR_VSA4
66 SMR_VSA5
67 SMR_VSA6
68 SMR_VSA7
69 SMR_VSA8
70 SMR_VSA9
71 SlogP_VSA1
72 SlogP_VSA10
73 SlogP_VSA11
74 SlogP_VSA12
75 SlogP_VSA2
76 SlogP_VSA3
77 SlogP_VSA4
78 SlogP_VSA5
79 SlogP_VSA6
80 SlogP_VSA7
81 SlogP_VSA8
82 SlogP_VSA9
83 TPSA
84 EState_VSA1
85 EState_VSA10
86 EState_VSA11
87 EState_VSA2
88 EState_VSA3
89 EState_VSA4
90 EState_VSA5
91 EState_VSA6
92 EState_VSA7
93 EState_VSA8
94 EState_VSA9
95 VSA_EState1
96 VSA_EState10
97 VSA_EState2
98 VSA_EState3
99 VSA_EState4
100 VSA_EState5
101 VSA_EState6
102 VSA_EState7
103 VSA_EState8
104 VSA_EState9
105 FractionCSP3
106 HeavyAtomCount
107 NHOHCount
108 NOCount
109 NumAliphaticCarbocycles
110 NumAliphaticHeterocycles
111 NumAliphaticRings
112 NumAmideBonds
113 NumAromaticCarbocycles
114 NumAromaticHeterocycles
115 NumAromaticRings
116 NumAtomStereoCenters
117 NumBridgeheadAtoms
118 NumHAcceptors
119 NumHDonors
120 NumHeteroatoms
121 NumHeterocycles
122 NumRotatableBonds
123 NumSaturatedCarbocycles
124 NumSaturatedHeterocycles
125 NumSaturatedRings
126 NumSpiroAtoms
127 NumUnspecifiedAtomStereoCenters
128 Phi
129 RingCount
130 MolLogP
131 MolMR
132 fr_Al_COO
133 fr_Al_OH
134 fr_Al_OH_noTert
135 fr_ArN
136 fr_Ar_COO
137 fr_Ar_N
138 fr_Ar_NH
139 fr_Ar_OH
140 fr_COO
141 fr_COO2
142 fr_C_O
143 fr_C_O_noCOO
144 fr_C_S
145 fr_HOCCN
146 fr_Imine
147 fr_NH0
148 fr_NH1
149 fr_NH2
150 fr_N_O
151 fr_Ndealkylation1
152 fr_Ndealkylation2
153 fr_Nhpyrrole
154 fr_SH
155 fr_aldehyde
156 fr_alkyl_carbamate
157 fr_alkyl_halide
158 fr_allylic_oxid
159 fr_amide
160 fr_amidine
161 fr_aniline
162 fr_aryl_methyl
163 fr_azide
164 fr_azo
165 fr_barbitur
166 fr_benzene
167 fr_benzodiazepine
168 fr_bicyclic
169 fr_diazo
170 fr_dihydropyridine
171 fr_epoxide
172 fr_ester
173 fr_ether
174 fr_furan
175 fr_guanido
176 fr_halogen
177 fr_hdrzine
178 fr_hdrzone
179 fr_imidazole
180 fr_imide
181 fr_isocyan
182 fr_isothiocyan
183 fr_ketone
184 fr_ketone_Topliss
185 fr_lactam
186 fr_lactone
187 fr_methoxy
188 fr_morpholine
189 fr_nitrile
190 fr_nitro
191 fr_nitro_arom
192 fr_nitro_arom_nonortho
193 fr_nitroso
194 fr_oxazole
195 fr_oxime
196 fr_para_hydroxylation
197 fr_phenol
198 fr_phenol_noOrthoHbond
199 fr_phos_acid
200 fr_phos_ester
201 fr_piperdine
202 fr_piperzine
203 fr_priamide
204 fr_prisulfonamd
205 fr_pyridine
206 fr_quatN
207 fr_sulfide
208 fr_sulfonamd
209 fr_sulfone
210 fr_term_acetylene
211 fr_tetrazole
212 fr_thiazole
213 fr_thiocyan
214 fr_thiophene
215 fr_unbrch_alkane
216 fr_urea
# From https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html
def getMolDescriptors(mol, descriptor_list=None, missingVal=None):
''' calculate the full list of descriptors for a molecule
missingVal is used if the descriptor cannot be calculated
'''
res = {}
if descriptor_list is None:
for nm,fn in Descriptors._descList:
# some of the descriptor fucntions can throw errors if they fail, catch those here:
try:
val = fn(mol)
except:
# print the error message:
import traceback
traceback.print_exc()
# and set the descriptor value to whatever missingVal is
val = missingVal
res[nm] = val
# TODO: Add else clause to handle a list numbers corresponding to the descriptor indices
else:
pass
return res
# TODO: Add the descriptors to the dataframe as new columns
Back to visualisation#
Using your new seaborn skills, visualise the distributions and identify any correlations in your new data.
You will probably find plots like pairplots or heatmaps of more use now that you have a few more variables.
Summary#
You have used the pandas library to clean and prepare a dataset, and to get descriptive statistics for the data.
You have visualised distributions and relationships in the data to look for anomalies and patterns.
You have used an API to obtain molecular identifiers/representations for a set of compounds.
You have generated molecular descriptors for a set of compounds using RDKit.