Tutorial
We’ll show here how to explain molecular property prediction tasks without access to the gradients or any properties of a molecule. To set up this exercise, we need a black box model. We’ll use something simple here: the model is a classifier that says whether a molecule contains an alcohol (1) or not (0). Let’s implement this model first.
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
# set-up rdkit drawing preferences
IPythonConsole.ipython_useSVG = True
IPythonConsole.drawOptions.drawMolsSameScale = False
def model(smiles):
    # parse the SMILES string into an RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    # match an oxygen with at least one attached hydrogen (a hydroxyl group)
    match = mol.GetSubstructMatches(Chem.MolFromSmarts("[O;!H0]"))
    return 1 if match else 0
Let’s now try it out on some molecules
smi = "CCCCCCO"
print("f(s)", model(smi))
Chem.MolFromSmiles(smi)
f(s) 1
smi = "OCCCCCCO"
print("f(s)", model(smi))
Chem.MolFromSmiles(smi)
f(s) 1
smi = "c1ccccc1"
print("f(s)", model(smi))
Chem.MolFromSmiles(smi)
f(s) 0
Counterfactual explanations
Let’s now explain the model using counterfactuals, pretending we don’t know how it works.
import exmol
instance = "CCCCCCO"
space = exmol.sample_space(instance, model, batched=False)
cfs = exmol.cf_explain(space, 1)
exmol.plot_cf(cfs)
We can see that removing the alcohol is the smallest change that affects the prediction for this molecule. Let’s view the space and see where these counterfactuals lie.
exmol.plot_space(space, cfs)
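Beyond the plots, the sampled space and counterfactuals are just lists of exmol Example objects, so you can also inspect them directly. Below is a minimal sketch; the attribute names (smiles, similarity, yhat) are assumptions based on the exmol Example dataclass, so check the exmol documentation for the exact fields.
# sketch: print each counterfactual's SMILES, its similarity to the original
# molecule, and the black-box prediction (attribute names are assumptions)
for cf in cfs:
    print(cf.smiles, cf.similarity, cf.yhat)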
Explain using substructures
Now we’ll try to explain our model using substructures.
exmol.lime_explain(space)
exmol.plot_descriptors(space)
This seems like a pretty clear explanation. Let’s take a look at using substructures that are actually present in the molecule.
import skunk
exmol.lime_explain(space, descriptor_type="ECFP")
svg = exmol.plot_descriptors(space, return_svg=True)
skunk.display(svg)
svg = exmol.plot_utils.similarity_map_using_tstats(space[0], return_svg=True)
skunk.display(svg)
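If you are working outside a notebook, you can also write the returned SVG string to a file instead of displaying it inline; this is plain Python rather than an exmol feature.
# save the similarity map SVG to a file for viewing or embedding elsewhere
with open("similarity_map.svg", "w") as f:
    f.write(svg)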
We can see that most of the model’s behavior is explained by the presence of the alcohol group, as expected.
Text
We can prepare a natural language summary of these results using exmol:
exmol.lime_explain(space, descriptor_type="ECFP")
e = exmol.text_explain(space)
for ei in e:
    print(ei[0], end="")
To prepare the natural language summary, we convert these results into a prompt that a language model like GPT-3 can parse. Insert the printed output into a language model to get a summary.
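For example, here is a minimal sketch of assembling the printed explanation into a prompt by hand; the prompt wording is only an illustration, and exmol.text_explain_generate below does this for you.
# sketch: build a prompt from the text_explain output (prompt wording is illustrative)
prompt = "The following are observations about a molecule predicted to be active:\n"
prompt += "".join(ei[0] for ei in e)
prompt += "\nSummarize in plain language why the molecule is predicted to be active."
print(prompt)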
Or you can pass it directly by installing the langchain package and setting up an OpenAI key:
print(exmol.text_explain_generate(e, property_name="active"))