API

add_descriptors(examples, descriptor_type='MACCS', mols=None)

Add descriptors to passed examples

Parameters:

examples (List[Example]) – List of examples
descriptor_type (str) – Kind of descriptors to return, choose between ‘Classic’, ‘ECFP’, or ‘MACCS’. Default is ‘MACCS’.
mols (Optional[List[Any]]) – Can be used if you already have rdkit Mols computed.

Return type:

List[Example]

Returns:

List of examples with added descriptors

cf_explain(examples, nmols=3, filter_nondrug=None)

From given Examples, find closest counterfactuals (see Getting Started)

Parameters:

examples (List[Example]) – Output from sample_space()
nmols (int) – Desired number of molecules
filter_nondrug (Optional[bool]) – Whether or not to filter out non-drug molecules. Default is True if input passes filter

Return type:

List[Example]
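The idea of a "closest counterfactual" can be illustrated with a small, self-contained sketch. This is not the library's implementation: ToyExample is a hypothetical stand-in that keeps only a few of the fields of Example, and the selection rule (rank label-flipped molecules by similarity) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyExample:
    """Hypothetical stand-in for Example with only the fields used here."""
    smiles: str
    similarity: float  # Tanimoto similarity relative to the base molecule
    yhat: float        # model output (a class label in this sketch)
    is_origin: bool = False


def closest_counterfactuals(examples: List[ToyExample], nmols: int = 3) -> List[ToyExample]:
    """Return up to nmols examples whose label differs from the base,
    ordered from most to least similar to the base."""
    base = next(e for e in examples if e.is_origin)
    flipped = [e for e in examples if not e.is_origin and e.yhat != base.yhat]
    flipped.sort(key=lambda e: e.similarity, reverse=True)
    return flipped[:nmols]


toy = [
    ToyExample("CCO", 1.0, 1, is_origin=True),
    ToyExample("CCN", 0.8, 0),
    ToyExample("CCC", 0.9, 1),   # same label as base, so not a counterfactual
    ToyExample("CO", 0.6, 0),
    ToyExample("CCCl", 0.7, 0),
]
top_two = closest_counterfactuals(toy, nmols=2)
```

In real use, the list passed to cf_explain() would come from sample_space() rather than being built by hand.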

check_multiple_aromatic_rings(mol)

clear_descriptors(examples)

Clears all descriptors from examples

Parameters:

examples (List[Example]) – list of examples
descriptor_type – type of descriptor to clear, if None, all descriptors are cleared

Return type:

List[Example]

get_basic_alphabet()

Returns set of interpretable SELFIES tokens

Generated by removing P and most ionization states from selfies.get_semantic_robust_alphabet()

Return type:

Set[str]

Returns:

Set of interpretable SELFIES tokens

get_functional_groups(mol, return_all=False, cutoff=300)

Get a set of functional groups present in a molecule, sorted by priority, avoiding overlaps.

Parameters:

mol (Any) – RDKit molecule
return_all (bool) – If True, will return all functional groups found in the molecule
cutoff (int) – Maximum rank of functional groups to consider based on popularity (increase to include groups like methyl, ethyl, etc.)

Return type:

set[str]

Returns:

set of unique functional group names present in the molecule. If mol is None, returns an empty set.

lime_explain(examples, descriptor_type='MACCS', return_beta=True)

From given Examples, find descriptor t-statistics (see the documentation index)

Parameters:

examples (List[Example]) – Output from sample_space()
descriptor_type (str) – Desired descriptors, choose from ‘Classic’, ‘ECFP’, or ‘MACCS’
return_beta (bool) – Whether or not the function should return regression coefficient values

merge_text_explains(*args, filter=None)

Merge multiple text explanations into one and sort.

Return type:

List[Tuple[str, float]]
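A minimal sketch of merging and sorting explanation lists of the documented type List[Tuple[str, float]]. The sort key used here (descending absolute value of the score) is an assumption, the filter argument is not modeled, and merge_and_sort is a hypothetical helper rather than the library function.

```python
from typing import List, Tuple

Explanation = Tuple[str, float]  # (text, score)


def merge_and_sort(*explanations: List[Explanation]) -> List[Explanation]:
    """Concatenate several explanation lists and order them by |score|.

    Assumption: larger absolute scores are more important and come first.
    """
    merged = [item for group in explanations for item in group]
    return sorted(merged, key=lambda item: abs(item[1]), reverse=True)


merged = merge_and_sort(
    [("Is there an alcohol?", 0.5), ("Is there a halogen?", -2.0)],
    [("Is there an amine?", 1.0)],
)
```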

name_morgan_bit(m, bitInfo, key)

Get the name of a Morgan bit using a SMARTS dictionary

Parameters:

m (Any) – RDKit molecule
bitInfo (Dict[Any, Any]) – bitInfo dictionary from rdkit.Chem.AllChem.GetMorganFingerprint
key (int) – bit key corresponding to the fingerprint you want to have named

Return type:

Optional[str]

Returns:

Name of the bit, or None if no match is found

plot_cf(exps, fig=None, figure_kwargs=None, mol_size=(200, 200), mol_fontsize=10, nrows=None, ncols=None)

Draw the given set of Examples in a grid

Parameters:

exps (List[Example]) – Small list of Example which will be drawn
fig (Any) – Figure to plot onto
figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure
mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels
mol_fontsize (int) – minimum font size passed to rdkit
nrows (Optional[int]) – number of rows to draw in grid
ncols (Optional[int]) – number of columns to draw in grid

plot_descriptors(examples, output_file=None, fig=None, figure_kwargs=None, title=None, return_svg=False)

Plot descriptor attributions from given set of Examples.

Parameters:

examples (List[Example]) – Output from sample_space()
output_file (Optional[str]) – Output file name to save the plot - optional except for ECFP
fig (Any) – Figure to plot on to
figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure
title (Optional[str]) – Title for the plot
return_svg (bool) – Whether to return svg for plot

plot_space(examples, exps, figure_kwargs=None, mol_size=(200, 200), highlight_clusters=False, mol_fontsize=8, offset=0, ax=None, cartoon=False, rasterized=False)

Plot chemical space around example and annotate given examples.

Parameters:

examples (List[Example]) – Large list of Example which make up the points
exps (List[Example]) – Small list of Example which will be annotated
figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure
mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels
highlight_clusters (bool) – if True, cluster indices are rendered instead of Example.yhat
mol_fontsize (int) – minimum font size passed to rdkit
offset (int) – offset annotations to allow colorbar or other elements to fit into plot.
ax (Any) – axis onto which to plot
cartoon (bool) – do cartoon outline on points?
rasterized (bool) – raster the scatter?

rcf_explain(examples, delta=(-1, 1), nmols=4, filter_nondrug=None)

From given Examples, find closest counterfactuals (see Getting Started). This version works with regression: a counterfactual is an example whose prediction is higher or lower than that of the base by the given margin.

Parameters:

examples (List[Example]) – Output from sample_space()
delta (Union[Any, Tuple[float, float]]) – float or tuple of hi/lo indicating margin for what is counterfactual
nmols (int) – Desired number of molecules
filter_nondrug (Optional[bool]) – Whether or not to filter out non-drug molecules. Default is True if input passes filter

Return type:

List[Example]
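The role of delta can be sketched as follows. This is an illustration under stated assumptions, not the library's code: samples are plain tuples rather than Example objects, a float delta is treated as a symmetric margin, and split_by_delta is a hypothetical helper.

```python
from typing import List, Tuple, Union

# (smiles, similarity to base, predicted value); a simplified stand-in for Example
Sample = Tuple[str, float, float]


def split_by_delta(
    base_yhat: float,
    samples: List[Sample],
    delta: Union[float, Tuple[float, float]] = (-1.0, 1.0),
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into lower and higher regression counterfactuals.

    Assumption: a float delta means a symmetric margin (-delta, +delta).
    Each returned list is ordered from most to least similar to the base.
    """
    if isinstance(delta, (int, float)):
        lo, hi = -abs(delta), abs(delta)
    else:
        lo, hi = min(delta), max(delta)
    lower = [s for s in samples if s[2] <= base_yhat + lo]
    higher = [s for s in samples if s[2] >= base_yhat + hi]
    lower.sort(key=lambda s: s[1], reverse=True)
    higher.sort(key=lambda s: s[1], reverse=True)
    return lower, higher


samples = [("A", 0.9, 2.5), ("B", 0.8, 3.2), ("C", 0.7, 0.5), ("D", 0.6, 1.0)]
lower, higher = split_by_delta(2.0, samples, delta=(-1.0, 1.0))
```

With a base prediction of 2.0 and the default margins, only predictions at or below 1.0 or at or above 3.0 count as counterfactuals, so sample "A" (2.5) is excluded.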

run_chemed(origin_smiles, num_samples, similarity=0.1, fp_type='ECFP4', _pbar=None)

This method is similar to STONED but works by querying PubChem

Parameters:

origin_smiles (str) – Base SMILES
num_samples (int) – Minimum number of returned molecules. May return fewer due to network timeout or exhausting tree
similarity (float) – Tanimoto similarity to use in query (float between 0 to 1)
fp_type (str) – Fingerprint type

Return type:

Tuple[List[str], List[float]]

Returns:

SMILES and SCORES
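For reference, Tanimoto similarity between two fingerprints can be written in a few lines when each fingerprint is represented as the set of its on-bit indices. This is a generic sketch of the metric, not code from the library; the bit sets below are made up, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: |A intersect B| / |A union B|."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen for this sketch
    return len(bits_a & bits_b) / len(union)


# Hypothetical on-bit indices for two fingerprints
fp_origin = {1, 2, 3}
fp_other = {2, 3, 4}
score = tanimoto(fp_origin, fp_other)
```

A value of 1.0 means identical fingerprints and 0.0 means no shared bits, which is why the similarity argument is constrained to the range 0 to 1.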

run_custom(origin_smiles, data, fp_type='ECFP4', _pbar=None, **kwargs)

This method is similar to STONED but uses a custom dataset provided by the user

Parameters:

origin_smiles (str) – Base SMILES
data (List[Union[str, Mol]]) – List of SMILES or RDKit molecules
fp_type (str) – Fingerprint type

Return type:

Tuple[List[str], List[float]]

Returns:

SMILES and SCORES

run_stoned(start_smiles, fp_type='ECFP4', num_samples=2000, max_mutations=2, min_mutations=1, alphabet=None, return_selfies=False, _pbar=None)

Run the STONED SELFIES algorithm. Typically not used, call sample_space() instead.

Parameters:

start_smiles (str) – SMILES string to start from
fp_type (str) – Fingerprint type
num_samples (int) – Number of total molecules to generate
max_mutations (int) – Maximum number of mutations
min_mutations (int) – Minimum number of mutations
alphabet (Union[List[str], Set[str], None]) – Alphabet to use for mutations, typically from get_basic_alphabet()
return_selfies (bool) – If SELFIES should be returned as well

Return type:

Union[Tuple[List[str], List[float]], Tuple[List[str], List[str], List[float]]]

Returns:

SELFIES, SMILES, and SCORES if return_selfies is True, otherwise SMILES and SCORES

sample_space(origin_smiles, f, batched=True, preset='medium', data=None, method_kwargs=None, num_samples=None, stoned_kwargs=None, quiet=False, use_selfies=False, sanitize_smiles=True)

Sample chemical space around given SMILES

This will evaluate the given function and run the run_stoned() function over the chemical space around the molecule. num_samples defaults to 3,000 when using STONED and 150 when using chemed. When using custom, num_samples is set to the length of the data list. When using synspace, num_samples is set to 1,000. See run_stoned() and run_chemed() for more details. synspace comes from the synspace package <https://github.com/whitead/synspace>, which generates synthetically feasible molecules from a given SMILES.

Parameters:

origin_smiles (str) – starting SMILES
f (Union[Callable[[str, str], List[float]], Callable[[str], List[float]], Callable[[List[str], List[str]], List[float]], Callable[[List[str]], List[float]]]) – A function which takes in SMILES or SELFIES and returns predicted value. Assumed to work with lists of SMILES/SELFIES unless batched = False
batched (bool) – If f is batched
preset (str) – Can be “wide”, “medium”, “narrow”, “chemed”, “custom”, or “synspace”. Determines how far across chemical space is sampled. Try “chemed” preset to only sample pubchem compounds.
data (Optional[List[Union[str, Mol]]]) – If not None and preset is “custom” will use this data instead of generating new ones.
method_kwargs (Optional[Dict]) – More control over STONED, CHEMED and CUSTOM can be set here. See run_stoned(), run_chemed() and run_custom()
num_samples (Optional[int]) – Number of desired samples. Can be set in method_kwargs (overrides) or here. None means default for preset
stoned_kwargs (Optional[Dict]) – Backwards compatible alias for method_kwargs
quiet (bool) – If True, will not print progress bar
use_selfies (bool) – If True, will use SELFIES instead of SMILES for f
sanitize_smiles (bool) – If True, will sanitize all SMILES

Return type:

List[Example]

Returns:

List of generated Example
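A hedged sketch of preparing a model function f for sample_space(). The scorer score_one is a toy stand-in for a trained model, and make_batched and sample_around are hypothetical helpers. The exmol import is deferred into sample_around so the helpers can be defined and tested without the package installed.

```python
from typing import Callable, List


def score_one(smiles: str) -> float:
    """Toy stand-in for a trained model: fraction of characters that are 'O'."""
    return smiles.count("O") / max(len(smiles), 1)


def make_batched(single: Callable[[str], float]) -> Callable[[List[str]], List[float]]:
    """Wrap a one-molecule scorer so it accepts a list of SMILES."""
    def batched(smiles_list: List[str]) -> List[float]:
        return [single(s) for s in smiles_list]
    return batched


def sample_around(origin_smiles: str):
    """Sample chemical space around origin_smiles using the toy model.

    Only keyword arguments documented above are used.
    """
    import exmol  # deferred import; requires the exmol package

    return exmol.sample_space(
        origin_smiles,
        make_batched(score_one),
        batched=True,
        preset="narrow",
        quiet=True,
    )


batched_model = make_batched(score_one)
predictions = batched_model(["CCO", "CC"])
```

If the model only handles one molecule at a time, passing the unwrapped function with batched=False is the alternative described in the parameter list.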

text_explain(examples, descriptor_type='maccs', count=5, presence_thresh=0.2, include_weak=None)

Take an example and convert t-statistics into text explanations

Parameters:

examples (List[Example]) – Output from sample_space()
descriptor_type (str) – Type of descriptor, either “maccs”, or “ecfp”.
count (int) – Number of text explanations to return
presence_thresh (float) – Threshold for presence of descriptor in examples
include_weak (Optional[bool]) – Include weak descriptors. If not set, the function first runs with this set to False and, if no descriptors are found, re-runs with it set to True

Return type:

List[Tuple[str, float]]
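The include_weak fallback described above follows a simple retry pattern, sketched here in isolation. The names with_weak_fallback and toy_explain, the scores, and the cutoffs are all hypothetical; only the control flow mirrors the documented behavior.

```python
from typing import Callable, List, Optional, Tuple

Explanation = Tuple[str, float]


def with_weak_fallback(
    explain: Callable[[bool], List[Explanation]],
    include_weak: Optional[bool] = None,
) -> List[Explanation]:
    """Run explain strictly first; retry with weak descriptors if nothing is found.

    An explicit include_weak value is respected and disables the retry.
    """
    if include_weak is not None:
        return explain(include_weak)
    strict = explain(False)
    return strict if strict else explain(True)


def toy_explain(include_weak: bool) -> List[Explanation]:
    """Toy explainer: keep descriptors whose |t| passes a made-up cutoff."""
    scores = [("alcohol", 0.4), ("amine", -0.2)]
    cutoff = 0.1 if include_weak else 1.0
    return [(name, t) for name, t in scores if abs(t) >= cutoff]


auto = with_weak_fallback(toy_explain)
strict_only = with_weak_fallback(toy_explain, include_weak=False)
```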

text_explain_generate(text_explanations, property_name, llm_model='gpt-4o', single=True)

Insert text explanations into template, and generate explanation.

Parameters:

text_explanations – List of text explanations
property_name – Name of property
llm_model – Language model to use
single – Whether to use a prompt about a single molecule or multiple molecules

Return type:

str

class Descriptors(descriptor_type, descriptors, descriptor_names, plotting_names=(), tstats=())

Molecular descriptors

descriptor_names: tuple

descriptor_type: str: Descriptor type

descriptors: tuple: Descriptor values

plotting_names: tuple = ()

tstats: tuple = ()

class Example(smiles, selfies, similarity, yhat, index, position=<factory>, is_origin=False, cluster=0, label=None, descriptors=None)

Example of a molecule

cluster: int = 0: Index of cluster, can be -1 for no cluster

descriptors: Descriptors = None: Descriptors for this example

index: int: Index relative to other examples

is_origin: bool = False: True if base

label: str = None: Label for this example

position: ndarray: PCA projected position from similarity

selfies: str: SELFIES for molecule, as output from selfies.encoder()

similarity: float: Tanimoto similarity relative to base

smiles: str: SMILES string for molecule

yhat: float: Output of model function

insert_svg(exps, mol_size=(200, 200), mol_fontsize=10)

Replace rasterized image files with SVG versions of molecules

Parameters:

exps (List[Example]) – The molecules for which images should be replaced. Typically just counterfactuals or some small set
mol_size (Tuple[int, int]) – If mol_size was specified, it needs to be re-specified here

Return type:

str

Returns:

SVG string that can be saved or displayed in a Jupyter notebook

moldiff(template, query)

Compare the two rdkit molecules.

Parameters:

template – template molecule
query – query molecule

Return type:

Tuple[List[int], List[int]]

Returns:

list of modified atoms in query, list of modified bonds in query

plot_space_by_fit(examples, exps, beta, mol_size=(200, 200), mol_fontsize=8, offset=0, ax=None, figure_kwargs=None, cartoon=False, rasterized=False)

Plot chemical space around example by LIME fit and annotate given examples. Adapted from plot_space().

Parameters:

examples (List[Example]) – Large list of Example which make up the points
exps (List[Example]) – Small list of Example which will be annotated
beta (List) – beta output from lime_explain()
mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels
mol_fontsize (int) – minimum font size passed to rdkit
offset (int) – offset annotations to allow colorbar or other elements to fit into plot.
ax (Any) – axis onto which to plot
figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure
cartoon (bool) – do cartoon outline on points?
rasterized (bool) – raster the scatter?

similarity_map_using_tstats(example, mol_size=(300, 200), return_svg=False)

Create similarity map for example molecule using descriptor t-statistics. Only works for ECFP descriptors

Parameters:

example (Example) – Example object
mol_size (Tuple[int, int]) – size of molecule image
return_svg (bool) – return svg instead of saving to file

Return type:

Optional[str]

Returns:

svg if return_svg is True, else None

trim(im)

Implementation of whitespace trim

credit: https://stackoverflow.com/a/10616717

Parameters:

im – PIL image

Returns:

PIL image