# If you do not have sqlite installed, you can install it by uncommenting the following command:
#!conda install -y sqlite
# If you do not have sqlalchemy installed, you can install it by uncommenting the following command:
#!conda install -y sqlalchemy
# only needed for mySQL
#!conda install -y mysql-connector-python
import pandas as pd
import sqlalchemy as db
from sqlalchemy import inspect
# from sqlalchemy import create_engine
Workshop - Structured data and digital chemistry#
Data storage in the automated lab#
Being able to carry out data processing and analysis in Python and then add the transformed data to a database management system (DBMS) not only makes data flows in research labs more effective, but also aligns with FAIR principles.
Effective data storage becomes increasingly important with the growing integration of automation, and is key to making the most of advances in data analysis and prediction.[1, 2]
Computationally-generated data can bring its own issues, particularly in terms of volume. As discussed previously, experimentally-generated and measured data comes in many different forms, from varied sources, and can vary greatly in terms of its structure; for example, instrument readings, reaction conditions, experimental outcomes, and metadata about samples.
Managing the processing and storage of data from a lab setting is complex, even for relatively simple experimental workflows. Keeping the data structured, persistent, and easily accessible is crucial for reproducibility, analysis, and automation. The integration of automated processes might alleviate issues around structuring metadata and organising measurements, but can increase challenges in terms of the volume of data produced.
Database systems for chemistry#
Relational database management systems (RDBMS) offer one approach to structuring data storage: They store data in tables, where relationships between data points are explicitly defined. Unlike spreadsheets, which can contain duplicated or inconsistent information, relational databases enforce data integrity and reduce redundancy through normalisation.
Advantages of Relational Databases#
Persistence: Data is stored safely and remains available even after the program or system shuts down.
Structure and integrity: Data is organised into tables with clearly defined relationships, reducing errors and inconsistencies.
Scalability and efficiency: Efficient querying through SQL (Structured Query Language) allows rapid retrieval and filtering of large datasets.
Accessibility: Multiple users and systems can access and manipulate the data concurrently without conflicts.
Using a structured storage system like a relational database (db) enables a transition from unstructured data (e.g. scattered experimental records in spreadsheets) to a system where data is systematically stored, so that it remains accessible and useful for downstream automation and decision-making.
Aims#
Part 1 SQL + Python#
A sample of data from a set of lab experiments is available in the data directory as a flat file: cox_experiments.csv.
Use the SQLAlchemy package to connect to a SQLite database
Read a sample of data from an experimental lab study into a pandas dataframe
Process the data in pandas
Load the data into the SQLite database to store in a structured, persistent format
Part 2 Database structure#
One of the key advantages of using a relational database system is that it imposes some degree of structure on the data. This can help with ensuring the data is complete and avoids redundancy. Simply storing all the available data in a single table is not the most effective way to achieve those goals. Designing the structure of the database using concepts such as data normalisation can help to avoid inconsistencies in the data and overly complex relationships between tables.
Using the experiments table as a starting point
Consider how the data could be organised using the concept of normalisation
Using the current columns, propose a design that achieves at least third normal form (3NF)
List the tables required
Columns in each table
Any primary and foreign key relationships you think would be needed
Suggest additional columns and/or tables you would add to improve the completeness or functionality of the database
Investigate ways to visualise the design of a database, such as EER diagrams
Note
A notebook with some completed code for part 1 is available here
Part 1 SQL + Python#
Structured Query Language (SQL)#
Structured Query Language (SQL) is a language used to query and manipulate databases. In CHEM501, you used a SQLite database to store data in a structured form. You were able to connect to the database using Python and execute queries in raw SQL, including adding data from a pandas DataFrame to a table in a SQL database.
SQL and Python#
Connecting Python to a database such as a SQL db offers the advantage of being able to carry out data processing, analysis, and modelling using the wide range of Python libraries, standardising data processes before storing the processed and analysed data in a persistent, structured format that is accessible to multiple users.
Automation and integration with data sources and workflows
Python can enable integration with lab instruments, data pipelines, and machine learning models.
This can facilitate automated data entry, validation, and retrieval without manual intervention.
Data processing and analysis
Python provides a wide variety of libraries (e.g. NumPy, Pandas, SciPy, etc.) for analysing and visualising stored data.
Data can be accessed directly from the database, processed, and piped into further optimisation or scientific models.
Reproducibility & Consistency
Developing Python scripts for data processing can ensure that it is standardised and reproducible across experiments.
SQLAlchemy#
SQLAlchemy is a package that lets you interact with a SQL database using Python. It provides an abstraction layer over raw SQL, so you do not need to write SQL statements directly. This can make database interactions easier and more ‘Pythonic’.
It also means that it is more straightforward to change the underlying SQL database (e.g. from SQLite to mySQL or PostgreSQL) if you decide to later.
SQLAlchemy manages connections to the database, so you do not need to worry about closing connections. The abstraction away from raw SQL queries means data validation can be built in and can prevent security issues like SQL injection.
Here is a SQLAlchemy with SQLite tutorial that will help as you work through the notebook. The main points in this notebook focus on getting started quickly with familiar Python data structures, showing how straightforward working with pandas dataframes and a SQL database can be.
# TODO: read the data into a dataframe
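One possible version of this cell, assuming the csv is at data/cox_experiments.csv (adjust the path to match your setup):
# read the sample experimental data into a dataframe (the path is an assumption)
experiment_df = pd.read_csv("data/cox_experiments.csv")
experiment_df.head()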
Pandas can output a dataframe directly to a SQL table using the dataframe’s to_sql method and a SQLAlchemy connection.
Tasks#
Create an engine to connect to the sqlite database
SQLAlchemy’s create_engine method makes an engine object representing a way to connect to a db. If the database does not already exist, it will be created.
Connect to the database
The engine object is then instructed to make the connection using the connect function.
Create an object for the metadata
Information about the structure of the database can be accessed or manipulated via a MetaData object.
Output the contents of the dataframe to a database table
Use the pandas dataframe to_sql method to load the dataframe contents into a table in the SQL db. The if_exists argument specifies what happens if the table already exists. It returns the number of rows added to the database.
# TODO: create a connection to the database
# Call the engine and connection objects engine and connection
# Create an object to hold the table metadata
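A minimal sketch of these steps; the database filename chem_workshop.db is illustrative:
# create an engine for a local SQLite file (the file is created if it does not exist)
engine = db.create_engine("sqlite:///chem_workshop.db")
# open a connection to the database
connection = engine.connect()
# object to hold information about the database structure
metadata = db.MetaData()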
# If you want to list all existing tables in the database
# inspector = inspect(engine)
# inspector.get_table_names()
# TODO: Load the contents of the experiment DataFrame into a table called test_experiments in the database
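For example, reusing the experiment_df dataframe from earlier:
# write the dataframe to a table, replacing it if it already exists;
# the return value is the number of rows written to the database
experiment_df.to_sql("test_experiments", engine, if_exists="replace", index=False)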
You can check the columns of the new table in the SQL database by reading the table and printing their names and data types:
# This code assumes the SQL db table you created is called test_experiments
# It creates a table object that you can use to query the database and uses
# the metadata object to reflect the table from the db so you can see the
# existing structure of the table and columns
# table = db.Table('test_experiments', metadata, autoload_with=engine)
# for col in table.columns:
# print(col, col.type)
The pandas to_sql method uses the data types of the dataframe columns to set the column types in the SQL table. The pandas dtypes need to be correct to ensure the correct structure of the db table is created.
# TODO: Check the dtypes of the pandas DataFrame columns and compare them to the dtypes of the columns in the SQL table
# experiment_df.info()
The date column of the dataframe is currently stored as a string, so should be converted to a datetime. temperature is an int64, but using a float would better allow for higher precision in future data.
# TODO: change the date column to datetime and temperature to float
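A possible conversion, using the column names described above:
# parse the date strings and widen temperature to float
experiment_df["date"] = pd.to_datetime(experiment_df["date"])
experiment_df["temperature"] = experiment_df["temperature"].astype(float)
experiment_df.dtypes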
It is possible to change the data types of the table, but as the table has very little data, we can just delete/replace it to start again.
Here, the existing table will be replaced by the to_sql function, so you do not have to explicitly delete (drop) the table. You do need to clear the MetaData object so that it does not impose the previous structure of the table.
# The line below explicitly deletes the table from the database.
# table.drop(engine)
# TODO: use the clear method to remove the previous table metadata from the metadata object
# This allows the table to be redefined with the new datatypes
# TODO: Check that data types have been changed in the SQL table
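Putting these steps together, one possible sequence (reusing the names defined earlier):
# forget the previously reflected table definition
metadata.clear()
# reload the dataframe, replacing the old table with the new dtypes
experiment_df.to_sql("test_experiments", engine, if_exists="replace", index=False)
# reflect the table again and print the column types to check
table = db.Table("test_experiments", metadata, autoload_with=engine)
for col in table.columns:
    print(col, col.type)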
At this point, you have created a SQLite database and a table within it to store the sample experimental data from a csv file.
You have imposed a specific structure on the data so that specific data types are required for some columns.
As well as loading data directly from a pandas dataframe into an sql database table, you can do the opposite process: Read the contents of the db table into a dataframe to check that the data does actually persist in the database.
# TODO: Read data from the SQLite database into a dataframe. Try the read_sql_table method
# or use a query to read the data, e.g.:
# query = "SELECT * FROM test_experiments"
# result_df = pd.read_sql(query, engine)
# result_df
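Alternatively, read_sql_table pulls a whole table into a dataframe in one call:
# read the full table back to confirm the data persists in the database
result_df = pd.read_sql_table("test_experiments", engine)
result_df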
# This gives you an idea of how to filter the data returned from the database, e.g.
# query = "SELECT * FROM test_experiments WHERE temperature > 20"
# result_df = pd.read_sql(query, engine)
# result_df
Other flavours of SQL#
SQLite is an excellent option for local storage, prototyping, and simple applications. It is lightweight and requires no setup, so it is well suited to single-user applications or embedded systems.
More fully-featured versions of SQL are available, both commercial (e.g. Microsoft SQL Server, Oracle DB, IBM Db2) and open source (PostgreSQL, MySQL, MariaDB), in addition to versions for cloud and big data.
PostgreSQL and MySQL are both widely-used open source, true server-based databases designed for multi-user environments. They support concurrency, security, and large-scale operations.
If you have time or you want to try out a real SQL database system, you can install MySQL and connect to it using SQLAlchemy. The MySQL Community Edition is free to use under the GPL licence. It has a user-friendly graphical interface, MySQL Workbench, that makes it easy to visualise the database contents and structure.
To use SQLAlchemy with MySQL, you will also need to install a Python driver such as MySQL Connector (available on pip or conda as mysql-connector-python) or PyMySQL.
A major advantage of using SQLAlchemy is that it makes swapping the underlying SQL db system relatively pain-free (usually). The code below shows how to create a new connection to a MySQL database and load the dataframe into a table.
# DB password is stored in a separate file - add this file to gitignore if you are using git
# from db_cred import mspwd
# # only needed for mySQL
# connection_string = f"mysql+mysqlconnector://root:{mspwd}@127.0.0.1:3306/chem_test"
# engine = db.create_engine(connection_string)  # add echo=True to log db actions
# metadata = db.MetaData()
# experiment_df
# experiment_df.to_sql("experiments", engine, if_exists='append', index=False)
Part 2 Database structure#
Brief introduction to data normalisation#
When designing a relational database, a key goal is eliminating redundancy and ensuring data integrity. This process is called normalisation. It involves organising data into multiple related tables instead of storing everything in a single, unstructured table.
Why normalise data?#
Reduces redundancy: Avoids storing the same information multiple times (e.g., CAS numbers for chemicals).
Improves consistency: Changes in one place update automatically across the database.
Enhances scalability: Structured data is easier to query, update, and expand.
The experiment data example#
The experimental data you looked at in the earlier part has several issues in the way it is currently structured. It repeats chemical names and CAS numbers for coformers, and stores file paths as plain text, which can make querying difficult. A normalised database would separate:
Experiments - linked to unique chemicals, solvents, and characterisation data.
Chemicals - each chemical has a unique CAS number, reducing repetition.
Characterisation Data - storing file paths separately allows easy retrieval.
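As an illustration, a minimal SQLAlchemy sketch of two of these tables; the table and column names are indicative, not a complete solution:
# separate metadata object for the normalised design
norm_metadata = db.MetaData()
# each chemical is stored once, identified by a primary key
chemicals = db.Table(
    "chemicals", norm_metadata,
    db.Column("chemical_id", db.Integer, primary_key=True),
    db.Column("name", db.String(100), nullable=False),
    db.Column("cas_number", db.String(20), unique=True),
)
# experiments reference chemicals via a foreign key instead of repeating names and CAS numbers
experiments = db.Table(
    "experiments", norm_metadata,
    db.Column("experiment_id", db.Integer, primary_key=True),
    db.Column("date", db.DateTime),
    db.Column("temperature", db.Float),
    db.Column("coformer_id", db.Integer, db.ForeignKey("chemicals.chemical_id")),
)
# create the tables in the connected database
norm_metadata.create_all(engine)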
For you to consider#
What issues might arise if the same chemical has multiple CAS numbers in different rows?
How could we store characterisation data (NMR, PXRD) in a structured way instead of file paths in a single table?
How can we design a system where updating or correcting a chemical’s CAS number does not require changing existing experiment records?
What other information could be included in the database that would make the current data more complete or useful?
See also
This post introduces several common schemas for database design: 6 Database Schema Designs and How to Use Them
Design a normalised structure for the example data#
The experimental dataset you have looked at is a flat file that stores all information in a single table.
Suggest a new set of tables to accommodate the data
Decide what columns (and think about data types) the tables need
The structure should normalise the data up to at least third normal form (3NF)
Identify primary keys and/or foreign keys that can be used to embed relationships between tables
Sketch or explore ways to visualise the database structure, such as ER diagrams
If you have time#
Create a new SQLite or MySQL database with the normalised structure
You can use raw SQL with sqlite or SQLAlchemy using Python (see the sketch after this list)
OR if you have MySQL installed, create the database in MySQL Workbench and generate the structure
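As a starting point for the raw SQL route, the chemicals table sketched earlier could be created like this (names still indicative):
# execute raw SQL through the SQLAlchemy engine
with engine.connect() as conn:
    conn.execute(db.text("""
        CREATE TABLE IF NOT EXISTS chemicals (
            chemical_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            cas_number TEXT UNIQUE
        )
    """))
    conn.commit()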
Summary#
As you worked through this notebook, you have
used the SQLAlchemy package to interact with a SQLite database
loaded data from a pandas dataframe directly into a SQL database
read data from an SQL database table into a dataframe
gained awareness of core principles of database design
considered an example of data normalisation based on a typical flat file containing experimental lab information
For you to think about#
In your project, what are the sources of the data you will be working with?
How will they be processed before you store them?
How will they be stored? As flat files distributed across locations/servers or consolidated somehow?
How will you deal with metadata and connecting data from different sources, experiments, calculations?
If you were designing a data management system, where would you start? What kind of structure would be appropriate?
How could this improve the quality of the data (e.g. reduce errors, redundancy) or improve accessibility in the context of your project and/or research group?
References#
1. R. Duke, V. Bhat and C. Risko, Data storage architectures to accelerate chemical discovery: data accessibility for individual laboratories and the community, Chem. Sci., 2022, 13, 13646–13656.
2. K. M. Jablonka, L. Patiny and B. Smit, Making the collective knowledge of chemistry open and machine actionable, Nat. Chem., 2022, 14, 365–376.