API

add_descriptors(examples, descriptor_type='MACCS', mols=None)

Add descriptors to passed examples

Parameters:
  • examples (List[Example]) – List of examples

  • descriptor_type (str) – Kind of descriptors to return, choose between ‘Classic’, ‘ECFP’, or ‘MACCS’. Default is ‘MACCS’.

  • mols (List[Any]) – Can be used if you already have rdkit Mols computed.

Return type:

List[Example]

Returns:

List of examples with added descriptors

cf_explain(examples, nmols=3, filter_nondrug=None)

From given Examples, find closest counterfactuals (see Getting Started)

Parameters:
  • examples (List[Example]) – Output from sample_space()

  • nmols (int) – Desired number of molecules

  • filter_nondrug (Optional[bool]) – Whether or not to filter out non-drug molecules. Default is True if input passes filter

Return type:

List[Example]
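The selection idea behind cf_explain() can be sketched without the library. This is a simplified illustration, not the package's implementation: the Ex class is a hypothetical stand-in for Example, the base molecule is assumed to be the first element, and the real function may apply further steps such as the non-drug filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ex:
    """Hypothetical stand-in for Example with only the fields used here."""
    smiles: str
    similarity: float
    yhat: int
    is_origin: bool = False


def closest_counterfactuals(examples: List[Ex], nmols: int = 3) -> List[Ex]:
    # Assumption: the base (origin) molecule is the first element.
    base = examples[0]
    # Keep only molecules whose predicted label differs from the base.
    flipped = [e for e in examples[1:] if e.yhat != base.yhat]
    # Most similar first, then truncate to the requested count.
    flipped.sort(key=lambda e: e.similarity, reverse=True)
    return [base] + flipped[:nmols]


space = [
    Ex("CCO", 1.0, 1, is_origin=True),
    Ex("CCN", 0.6, 0),
    Ex("CCC", 0.8, 1),
    Ex("CCCl", 0.4, 0),
    Ex("CO", 0.7, 0),
]
picked = closest_counterfactuals(space, nmols=2)
print([e.smiles for e in picked])  # ['CCO', 'CO', 'CCN']
```

CCC is the most similar neighbour but keeps the base label, so it is not a counterfactual and is skipped.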

check_multiple_aromatic_rings(mol)
clear_descriptors(examples)

Clears all descriptors from examples

Parameters:
  • examples (List[Example]) – list of examples

  • descriptor_type – type of descriptor to clear, if None, all descriptors are cleared

Return type:

List[Example]

get_basic_alphabet()

Returns set of interpretable SELFIES tokens

Generated by removing P and most ionization states from selfies.get_semantic_robust_alphabet()

Return type:

Set[str]

Returns:

Set of interpretable SELFIES tokens

lime_explain(examples, descriptor_type='MACCS', return_beta=True)

From given Examples, find descriptor t-statistics (see the index page)

Parameters:
  • examples (List[Example]) – Output from sample_space()

  • descriptor_type (str) – Desired descriptors, choose from ‘Classic’, ‘ECFP’, or ‘MACCS’

  • return_beta (bool) – Whether or not the function should return regression coefficient values

merge_text_explains(*args, filter=None)

Merge multiple text explanations into one and sort.

Return type:

List[Tuple[str, float]]

name_morgan_bit(m, bitInfo, key)

Get the name of a Morgan bit using a SMARTS dictionary

Parameters:
  • m (Any) – RDKit molecule

  • bitInfo (Dict[Any, Any]) – bitInfo dictionary from rdkit.Chem.AllChem.GetMorganFingerprint

  • key (int) – bit key corresponding to the fingerprint you want to have named

Return type:

str

plot_cf(exps, fig=None, figure_kwargs=None, mol_size=(200, 200), mol_fontsize=10, nrows=None, ncols=None)

Draw the given set of Examples in a grid

Parameters:
  • exps (List[Example]) – Small list of Example which will be drawn

  • fig (Any) – Figure to plot onto

  • figure_kwargs (Dict) – kwargs to pass to plt.figure

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • mol_fontsize (int) – minimum font size passed to rdkit

  • nrows (int) – number of rows to draw in grid

  • ncols (int) – number of columns to draw in grid

plot_descriptors(examples, output_file=None, fig=None, figure_kwargs=None, title=None, return_svg=False)

Plot descriptor attributions from given set of Examples.

Parameters:
  • examples (List[Example]) – Output from sample_space()

  • output_file (str) – Output file name to save the plot - optional except for ECFP

  • fig (Any) – Figure to plot on to

  • figure_kwargs (Dict) – kwargs to pass to plt.figure

  • title (str) – Title for the plot

  • return_svg (bool) – Whether to return svg for plot

plot_space(examples, exps, figure_kwargs=None, mol_size=(200, 200), highlight_clusters=False, mol_fontsize=8, offset=0, ax=None, cartoon=False, rasterized=False)

Plot chemical space around example and annotate given examples.

Parameters:
  • examples (List[Example]) – Large list of Example which make up points

  • exps (List[Example]) – Small list of Example which will be annotated

  • figure_kwargs (Dict) – kwargs to pass to plt.figure

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • highlight_clusters (bool) – if True, cluster indices are rendered instead of Example.yhat

  • mol_fontsize (int) – minimum font size passed to rdkit

  • offset (int) – offset annotations to allow colorbar or other elements to fit into plot.

  • ax (Any) – axis onto which to plot

  • cartoon (bool) – do cartoon outline on points?

  • rasterized (bool) – raster the scatter?

rcf_explain(examples, delta=(-1, 1), nmols=4, filter_nondrug=None)

From given Examples, find closest counterfactuals (see Getting Started). This version works with regression, so a counterfactual is an example whose prediction is higher or lower than that of the base by at least the given margin.

Parameters:
  • examples (List[Example]) – Output from sample_space()

  • delta (Union[Any, Tuple[float, float]]) – float or tuple of hi/lo indicating margin for what is counterfactual

  • nmols (int) – Desired number of molecules

  • filter_nondrug (Optional[bool]) – Whether or not to filter out non-drug molecules. Default is True if input passes filter

Return type:

List[Example]
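The role of delta can be illustrated with a small sketch. It is not the package's code: the function name is hypothetical, the tuple is assumed to be ordered (lower margin, upper margin) as the default (-1, 1) suggests, and a single float d is assumed to mean the symmetric margin (-|d|, |d|).

```python
from typing import Tuple, Union

Delta = Union[float, Tuple[float, float]]


def classify_shift(base_yhat: float, yhat: float, delta: Delta = (-1, 1)) -> str:
    """Label a prediction as lower, higher, or within the margin around the base."""
    # Assumption: a bare number is read as a symmetric margin.
    if isinstance(delta, (int, float)):
        lo, hi = -abs(delta), abs(delta)
    else:
        lo, hi = delta
    change = yhat - base_yhat
    if change <= lo:
        return "lower"
    if change >= hi:
        return "higher"
    return "within margin"


print(classify_shift(2.0, 3.5))             # higher
print(classify_shift(2.0, 0.5))             # lower
print(classify_shift(2.0, 2.3))             # within margin
print(classify_shift(2.0, 2.6, delta=0.5))  # higher
```

Only examples labelled "lower" or "higher" would qualify as regression counterfactuals under these assumptions.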

run_chemed(origin_smiles, num_samples, similarity=0.1, fp_type='ECFP4', _pbar=None)

This method is similar to STONED but works by querying PubChem

Parameters:
  • origin_smiles (str) – Base SMILES

  • num_samples (int) – Minimum number of returned molecules. May return fewer due to network timeout or exhausting tree

  • similarity (float) – Tanimoto similarity to use in query (float between 0 to 1)

  • fp_type (str) – Fingerprint type

Return type:

Tuple[List[str], List[float]]

Returns:

SMILES and SCORES

run_custom(origin_smiles, data, fp_type='ECFP4', _pbar=None, **kwargs)

This method is similar to STONED but uses a custom dataset provided by the user

Parameters:
  • origin_smiles (str) – Base SMILES

  • data (List[Union[str, Mol]]) – List of SMILES or RDKit molecules

  • fp_type (str) – Fingerprint type

Return type:

Tuple[List[str], List[float]]

Returns:

SMILES and SCORES
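As a rough sketch of what the scores could look like, the block below computes Tanimoto similarity between fingerprints represented as sets of on-bits. It assumes the returned scores are Tanimoto similarities to the base molecule, as the similarity parameter of run_chemed() suggests. The bit sets are invented for illustration, not real ECFP4 fingerprints, and score_dataset is a hypothetical helper.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def score_dataset(
    origin_bits: Set[int], data_bits: Dict[str, Set[int]]
) -> Tuple[List[str], List[float]]:
    """Return parallel lists of SMILES and similarity scores."""
    smiles = list(data_bits)
    scores = [tanimoto(origin_bits, data_bits[s]) for s in smiles]
    return smiles, scores


# Hypothetical on-bit sets standing in for real fingerprints.
origin = {1, 2, 3, 4}
candidates = {"CCO": {1, 2, 3}, "CCN": {1, 2, 5, 6}, "c1ccccc1": {7, 8}}
smiles, scores = score_dataset(origin, candidates)
for s, score in zip(smiles, scores):
    print(f"{s}: {score:.2f}")
```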

run_stoned(start_smiles, fp_type='ECFP4', num_samples=2000, max_mutations=2, min_mutations=1, alphabet=None, return_selfies=False, _pbar=None)

Run the STONED SELFIES algorithm. Typically not used; call sample_space() instead.

Parameters:
  • start_smiles (str) – SMILES string to start from

  • fp_type (str) – Fingerprint type

  • num_samples (int) – Number of total molecules to generate

  • max_mutations (int) – Maximum number of mutations

  • min_mutations (int) – Minimum number of mutations

  • alphabet (Union[List[str], Set[str]]) – Alphabet to use for mutations, typically from get_basic_alphabet()

  • return_selfies (bool) – If SELFIES should be returned as well

Return type:

Union[Tuple[List[str], List[float]], Tuple[List[str], List[str], List[float]]]

Returns:

SELFIES, SMILES, and SCORES generated or SMILES and SCORES generated
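The parameters above can be pictured with a toy mutation routine over SELFIES-like tokens. This is a sketch under assumptions, not the package's algorithm: it assumes a mutation is a random insertion, replacement, or deletion of one token, it loops over each mutation count from min_mutations to max_mutations, and it performs no chemical validity check or decoding. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def mutate_once(tokens: List[str], alphabet: Sequence[str], rng: random.Random) -> List[str]:
    """Apply one random insertion, replacement, or deletion to a copy of tokens."""
    tokens = list(tokens)
    op = rng.choice(["insert", "replace", "delete"])
    if op == "insert" or not tokens:
        # Insert at any position, including the end (also used when tokens is empty).
        tokens.insert(rng.randint(0, len(tokens)), rng.choice(alphabet))
    elif op == "replace":
        tokens[rng.randrange(len(tokens))] = rng.choice(alphabet)
    else:
        del tokens[rng.randrange(len(tokens))]
    return tokens


def mutate(tokens: List[str], alphabet: Sequence[str], n_mutations: int, rng: random.Random) -> List[str]:
    """Apply n_mutations successive single-token mutations."""
    tokens = list(tokens)
    for _ in range(n_mutations):
        tokens = mutate_once(tokens, alphabet, rng)
    return tokens


# rng.choice needs a sequence, so a set alphabet is sorted into a list first.
alphabet = sorted({"[C]", "[N]", "[O]", "[=C]"})
start = ["[C]", "[C]", "[O]"]
rng = random.Random(0)
min_mutations, max_mutations = 1, 2
variants = [
    mutate(start, alphabet, n, rng)
    for n in range(min_mutations, max_mutations + 1)
    for _ in range(3)
]
for v in variants:
    print("".join(v))
```

A restricted alphabet, such as the one returned by get_basic_alphabet(), limits which tokens can appear in the mutated strings.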

sample_space(origin_smiles, f, batched=True, preset='medium', data=None, method_kwargs=None, num_samples=None, stoned_kwargs=None, quiet=False, use_selfies=False, sanitize_smiles=True)

Sample chemical space around given SMILES

This will evaluate the given function and run run_stoned() over the chemical space around the molecule. num_samples will be set to 3,000 by default if using STONED and 150 if using chemed. If using custom, num_samples will be set to the length of the data list. If using synspace, num_samples will be set to 1,000. See run_stoned() and run_chemed() for more details. synspace comes from the package synspace <https://github.com/whitead/synspace>. It generates synthetically feasible molecules from a given SMILES.
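The default sample counts described above can be summarised in a small lookup. This is a sketch, not the package's code: resolve_num_samples is a hypothetical helper, the "wide", "medium", and "narrow" presets are assumed to be the STONED ones, and an explicitly supplied value is assumed to take precedence over the preset default, with method_kwargs winning over the num_samples argument.

```python
from typing import Any, Dict, List, Optional

# Assumption: these three presets are the ones that use STONED.
STONED_PRESETS = {"wide", "medium", "narrow"}


def resolve_num_samples(
    preset: str,
    num_samples: Optional[int] = None,
    method_kwargs: Optional[Dict[str, Any]] = None,
    data: Optional[List[Any]] = None,
) -> int:
    """Pick the sample count from explicit settings or the preset default."""
    method_kwargs = method_kwargs or {}
    # A value in method_kwargs overrides the num_samples argument.
    if "num_samples" in method_kwargs:
        return method_kwargs["num_samples"]
    if num_samples is not None:
        return num_samples
    if preset == "custom":
        return len(data or [])
    if preset == "chemed":
        return 150
    if preset == "synspace":
        return 1000
    if preset in STONED_PRESETS:
        return 3000
    raise ValueError(f"Unknown preset: {preset}")


print(resolve_num_samples("medium"))                    # 3000
print(resolve_num_samples("chemed"))                    # 150
print(resolve_num_samples("custom", data=["CCO", "CCN"]))  # 2
```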

Parameters:
  • origin_smiles (str) – starting SMILES

  • f (Union[Callable[[str, str], List[float]], Callable[[str], List[float]], Callable[[List[str], List[str]], List[float]], Callable[[List[str]], List[float]]]) – A function which takes in SMILES or SELFIES and returns predicted value. Assumed to work with lists of SMILES/SELFIES unless batched = False

  • batched (bool) – If f is batched

  • preset (str) – Can be “wide”, “medium”, “narrow”, “chemed”, “custom”, or “synspace”. Determines how far across chemical space is sampled. Try the “chemed” preset to sample only PubChem compounds.

  • data (List[Union[str, Mol]]) – If not None and preset is “custom” will use this data instead of generating new ones.

  • method_kwargs (Dict) – More control over STONED, CHEMED and CUSTOM can be set here. See run_stoned(), run_chemed() and run_custom()

  • num_samples (int) – Number of desired samples. Can be set in method_kwargs (overrides) or here. None means default for preset

  • stoned_kwargs (Dict) – Backwards compatible alias for method_kwargs

  • quiet (bool) – If True, will not print progress bar

  • use_selfies (bool) – If True, will use SELFIES instead of SMILES for f

  • sanitize_smiles (bool) – If True, will sanitize all SMILES

Return type:

List[Example]

Returns:

List of generated Example
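The difference between a batched and an unbatched f can be shown with a small adapter. This is an illustrative sketch only: batchify and count_oxygens are hypothetical names, the model is a toy, and the single-argument (SMILES only) form of f is used.

```python
from typing import Callable, List


def batchify(f: Callable[[str], float]) -> Callable[[List[str]], List[float]]:
    """Wrap a one-molecule-at-a-time function so it accepts a list of SMILES."""
    def batched_f(smiles: List[str]) -> List[float]:
        return [f(s) for s in smiles]
    return batched_f


def count_oxygens(smiles: str) -> float:
    # Toy "model": counts the character O in the SMILES string.
    return float(smiles.count("O"))


model = batchify(count_oxygens)
print(model(["CCO", "OCCO", "CCC"]))  # [1.0, 2.0, 0.0]
```

A function shaped like model matches the batched form, while one shaped like count_oxygens would be passed with batched=False.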

text_explain(examples, descriptor_type='maccs', count=5, presence_thresh=0.2, include_weak=None)

Take an example and convert t-statistics into text explanations

Parameters:
  • examples (List[Example]) – Output from sample_space()

  • descriptor_type (str) – Type of descriptor, either “maccs”, or “ecfp”.

  • count (int) – Number of text explanations to return

  • presence_thresh (float) – Threshold for presence of descriptor in examples

  • include_weak (Optional[bool]) – Include weak descriptors. If not set, the function first runs with this set to False and, if no descriptors are found, re-runs with it set to True.

Return type:

List[Tuple[str, float]]
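The include_weak fallback can be sketched as a two-pass call. The names here are hypothetical: find stands for whatever routine collects explanations for a given include_weak setting, and toy_find is a stub that only has a weak descriptor to report.

```python
from typing import Callable, List, Optional, Tuple

Explanation = List[Tuple[str, float]]


def explain_with_fallback(
    find: Callable[[bool], Explanation], include_weak: Optional[bool] = None
) -> Explanation:
    """Run strictly first; retry with weak descriptors only if nothing was found."""
    if include_weak is not None:
        # An explicit setting is respected and no retry happens.
        return find(include_weak)
    result = find(False)
    if not result:
        result = find(True)
    return result


def toy_find(include_weak: bool) -> Explanation:
    # Hypothetical stub: only a weak descriptor is available.
    return [("weak descriptor", 1.2)] if include_weak else []


print(explain_with_fallback(toy_find))                      # [('weak descriptor', 1.2)]
print(explain_with_fallback(toy_find, include_weak=False))  # []
```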

text_explain_generate(text_explanations, property_name, llm_model='gpt-4o', single=True)

Insert text explanations into template, and generate explanation.

Parameters:
  • text_explanations – List of text explanations

  • property_name – Name of property

  • llm_model – Language model to use

  • single – Whether to use a prompt about a single molecule or multiple molecules

Return type:

str

class Descriptors(descriptor_type, descriptors, descriptor_names, plotting_names=(), tstats=())

Molecular descriptors

descriptor_names: tuple
descriptor_type: str

Descriptor type

descriptors: tuple

Descriptor values

plotting_names: tuple = ()
tstats: tuple = ()
class Example(smiles, selfies, similarity, yhat, index, position=<factory>, is_origin=False, cluster=0, label=None, descriptors=None)

Example of a molecule

cluster: int = 0

Index of cluster, can be -1 for no cluster

descriptors: Descriptors = None

Descriptors for this example

index: int

Index relative to other examples

is_origin: bool = False

True if base

label: str = None

Label for this example

position: ndarray

PCA projected position from similarity

selfies: str

SELFIES for molecule, as output from selfies.encoder()

similarity: float

Tanimoto similarity relative to base

smiles: str

SMILES string for molecule

yhat: float

Output of model function
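The field layout above can be mirrored with a plain dataclass to show how the defaults behave. ExampleSketch is a hypothetical stand-in, not the package's class: position is a list here rather than an ndarray, and descriptors is left untyped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleSketch:
    """Hypothetical mirror of the Example fields listed above."""
    smiles: str
    selfies: str
    similarity: float
    yhat: float
    index: int
    # "position=<factory>" means each instance gets its own fresh default.
    position: List[float] = field(default_factory=list)
    is_origin: bool = False
    cluster: int = 0
    label: Optional[str] = None
    descriptors: Optional[object] = None


base = ExampleSketch("CCO", "[C][C][O]", 1.0, 0.9, 0, is_origin=True)
other = ExampleSketch("CCN", "[C][C][N]", 0.6, 0.2, 1)

# Separate instances do not share the default position list.
base.position.append(0.0)
print(other.position)  # []

# Select the base molecule and rank the rest by similarity.
examples = [other, base]
origin = next(e for e in examples if e.is_origin)
neighbours = sorted(
    (e for e in examples if not e.is_origin),
    key=lambda e: e.similarity,
    reverse=True,
)
print(origin.smiles, [e.smiles for e in neighbours])  # CCO ['CCN']
```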

insert_svg(exps, mol_size=(200, 200), mol_fontsize=10)

Replace rasterized image files with SVG versions of molecules

Parameters:
  • exps (List[Example]) – The molecules for which images should be replaced. Typically just counterfactuals or some small set

  • mol_size (Tuple[int, int]) – If mol_size was specified, it needs to be re-specified here

Return type:

str

Returns:

SVG string that can be saved or displayed in a Jupyter notebook

moldiff(template, query)

Compare the two rdkit molecules.

Parameters:
  • template – template molecule

  • query – query molecule

Return type:

Tuple[List[int], List[int]]

Returns:

list of modified atoms in query, list of modified bonds in query

plot_space_by_fit(examples, exps, beta, mol_size=(200, 200), mol_fontsize=8, offset=0, ax=None, figure_kwargs=None, cartoon=False, rasterized=False)

Plot chemical space around example by LIME fit and annotate given examples. Adapted from plot_space().

Parameters:
  • examples (List[Example]) – Large list of Example which make up points

  • exps (List[Example]) – Small list of Example which will be annotated

  • beta (List) – beta output from lime_explain()

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • mol_fontsize (int) – minimum font size passed to rdkit

  • offset (int) – offset annotations to allow colorbar or other elements to fit into plot.

  • ax (Any) – axis onto which to plot

  • figure_kwargs (Dict) – kwargs to pass to plt.figure

  • cartoon (bool) – do cartoon outline on points?

  • rasterized (bool) – raster the scatter?

similarity_map_using_tstats(example, mol_size=(300, 200), return_svg=False)

Create similarity map for example molecule using descriptor t-statistics. Only works for ECFP descriptors

Parameters:
  • example (Example) – Example object

  • mol_size (Tuple[int, int]) – size of molecule image

  • return_svg (bool) – return svg instead of saving to file

Return type:

Optional[str]

Returns:

svg if return_svg is True, else None

trim(im)

Implementation of whitespace trim

credit: https://stackoverflow.com/a/10616717

Parameters:

im – PIL image

Returns:

PIL image