API

add_descriptors(examples, descriptor_type='MACCS', mols=None, multiple_bases=None)

Add descriptors to passed examples

Parameters
  • examples (List[Example]) – List of examples

  • descriptor_type (str) – Kind of descriptors to return, choose between ‘Classic’, ‘ECFP’, or ‘MACCS’. Default is ‘MACCS’.

  • mols (Optional[List[Any]]) – Can be used if you already have rdkit Mols computed.

  • multiple_bases (Optional[bool]) – Consider multiple bases for plotting (default: infer from examples)

Return type

List[Example]

Returns

List of examples with added descriptors

cf_explain(examples, nmols=3)

From given Examples, find closest counterfactuals (see Getting Started)

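Parameters
  • examples (List[Example]) – Output from sample_space()

  • nmols (int) – Desired number of molecules

Return type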

List[Example]
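
As a rough illustration of what "closest counterfactuals" means, the sketch below picks the most similar sampled molecules whose predicted label differs from the base. The Sample tuple and closest_counterfactuals() are hypothetical stand-ins written for this example. They are not exmol's implementation, which may also take clusters into account and may handle the base molecule differently.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    """Hypothetical stand-in for Example with only the fields used here."""
    smiles: str
    similarity: float  # Tanimoto similarity relative to base
    yhat: int          # predicted class label
    is_origin: bool = False


def closest_counterfactuals(samples: List[Sample], nmols: int = 3) -> List[Sample]:
    """Return up to nmols samples whose label differs from the base, most similar first."""
    base = next(s for s in samples if s.is_origin)
    flipped = [s for s in samples if not s.is_origin and s.yhat != base.yhat]
    flipped.sort(key=lambda s: s.similarity, reverse=True)
    return flipped[:nmols]


samples = [
    Sample("CCO", 1.0, 1, is_origin=True),
    Sample("CCN", 0.8, 0),
    Sample("CCC", 0.9, 1),   # same label as base, so not a counterfactual
    Sample("CCCl", 0.6, 0),
    Sample("CO", 0.7, 0),
    Sample("CC", 0.5, 0),
]
cfs = closest_counterfactuals(samples, nmols=3)
print([s.smiles for s in cfs])
```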

get_basic_alphabet()

Returns set of interpretable SELFIES tokens

Generated by removing P and most ionization states from selfies.get_semantic_robust_alphabet()

Return type

Set[str]

Returns

Set of interpretable SELFIES tokens

lime_explain(examples, descriptor_type='MACCS', return_beta=True, multiple_bases=None)

From given Examples, find descriptor t-statistics (see the documentation index)

Parameters
  • examples (List[Example]) – Output from sample_space()

  • descriptor_type (str) – Desired descriptors, choose from ‘Classic’, ‘ECFP’, or ‘MACCS’

  • multiple_bases (Optional[bool]) – Consider multiple bases for explanation (default: infer from examples)

  • return_beta (bool) – Whether or not the function should return regression coefficient values
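
To make the relation between a regression coefficient (beta) and its t-statistic concrete, here is a minimal sketch for a single binary descriptor using ordinary least squares. It is an assumption-laden simplification: slope_t_statistic() is a hypothetical helper, the fit is unweighted and uses one descriptor only, and it should not be read as the regression that lime_explain() performs.

```python
import math
from typing import Sequence, Tuple


def slope_t_statistic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit y = intercept + beta * x by least squares and return (beta, t-statistic of beta)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    # Residual sum of squares with n - 2 degrees of freedom
    rss = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / standard_error


# Toy data: descriptor present (1) or absent (0), and the model output for each molecule
descriptor = [0, 1, 0, 1, 1, 0]
yhat = [1.0, 2.1, 0.9, 1.9, 2.2, 1.1]
beta, t_stat = slope_t_statistic(descriptor, yhat)
print(beta, t_stat)
```

A large absolute t-statistic indicates that the descriptor's estimated effect is large relative to its uncertainty.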

plot_cf(exps, fig=None, figure_kwargs=None, mol_size=(200, 200), mol_fontsize=10, nrows=None, ncols=None)

Draw the given set of Examples in a grid

Parameters
  • exps (List[Example]) – Small list of Example which will be drawn

  • fig (Optional[Any]) – Figure to plot onto

  • figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • mol_fontsize (int) – minimum font size passed to rdkit

  • nrows (Optional[int]) – number of rows to draw in grid

  • ncols (Optional[int]) – number of columns to draw in grid

plot_descriptors(examples, output_file=None, fig=None, figure_kwargs=None, title=None, multiple_bases=None, return_svg=False)

Plot descriptor attributions from given set of Examples.

Parameters
  • examples (List[Example]) – Output from sample_space()

  • output_file (Optional[str]) – Output file name for saving the plot; optional except for ECFP

  • fig (Optional[Any]) – Figure to plot on to

  • figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure

  • title (Optional[str]) – Title for the plot

  • multiple_bases (Optional[bool]) – Consider multiple bases for explanation (default: infer from examples)

  • return_svg (bool) – Whether to return svg for plot

plot_space(examples, exps, figure_kwargs=None, mol_size=(200, 200), highlight_clusters=False, mol_fontsize=8, offset=0, ax=None, cartoon=False, rasterized=False)

Plot chemical space around example and annotate given examples.

Parameters
  • examples (List[Example]) – Large list of Example which make up points

  • exps (List[Example]) – Small list of Example which will be annotated

  • figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • highlight_clusters (bool) – if True, cluster indices are rendered instead of Example.yhat

  • mol_fontsize (int) – minimum font size passed to rdkit

  • offset (int) – offset annotations to allow colorbar or other elements to fit into plot.

  • ax (Optional[Any]) – axis onto which to plot

  • cartoon (bool) – if True, draw a cartoon outline on points

  • rasterized (bool) – if True, rasterize the scatter plot

rcf_explain(examples, delta=(- 1, 1), nmols=4)

From given Examples, find closest counterfactuals (see Getting Started). This version works with regression, so a counterfactual is an example whose output is higher or lower than that of the base by the given margin.

Parameters
  • examples (List[Example]) – Output from sample_space()

  • delta (Union[Any, Tuple[float, float]]) – float or tuple of hi/lo indicating margin for what is counterfactual

  • nmols (int) – Desired number of molecules

Return type

List[Example]
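
The role of delta can be sketched as follows. This is only an illustration under stated assumptions: Sample and regression_counterfactuals() are hypothetical, the tuple is read as (lower, upper) to match the default (-1, 1), and a bare float is treated as a symmetric margin. exmol's own selection logic may differ.

```python
from typing import List, NamedTuple, Tuple, Union


class Sample(NamedTuple):
    """Hypothetical stand-in for Example with only the fields used here."""
    smiles: str
    similarity: float
    yhat: float
    is_origin: bool = False


def regression_counterfactuals(
    samples: List[Sample],
    delta: Union[float, Tuple[float, float]] = (-1.0, 1.0),
) -> Tuple[List[Sample], List[Sample]]:
    """Return (lower, higher) counterfactuals, each ordered by decreasing similarity."""
    if isinstance(delta, (int, float)):
        delta = (-abs(delta), abs(delta))  # assumption: scalar means symmetric margin
    lo, hi = delta
    base = next(s for s in samples if s.is_origin)
    others = sorted(
        (s for s in samples if not s.is_origin),
        key=lambda s: s.similarity,
        reverse=True,
    )
    lower = [s for s in others if s.yhat - base.yhat <= lo]
    higher = [s for s in others if s.yhat - base.yhat >= hi]
    return lower, higher


samples = [
    Sample("CCO", 1.0, 2.0, is_origin=True),
    Sample("CCN", 0.9, 2.5),   # within the margin, not a counterfactual
    Sample("CCCO", 0.8, 3.5),
    Sample("CC", 0.7, 0.5),
    Sample("CCCCO", 0.6, 4.0),
    Sample("C", 0.5, 0.0),
]
lower, higher = regression_counterfactuals(samples, delta=(-1.0, 1.0))
print([s.smiles for s in lower], [s.smiles for s in higher])
```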

run_chemed(origin_smiles, num_samples, similarity=0.1, fp_type='ECFP4', _pbar=None)

This method is similar to STONED but works by querying PubChem

Parameters
  • origin_smiles (str) – Base SMILES

  • num_samples (int) – Minimum number of returned molecules. May return fewer due to network timeout or exhausting tree

  • similarity (float) – Tanimoto similarity to use in query (float between 0 to 1)

  • fp_type (str) – Fingerprint type

Return type

Tuple[List[str], List[float]]

Returns

SMILES and SCORES
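
For readers unfamiliar with the similarity threshold, the following sketch computes Tanimoto similarity for fingerprints represented as sets of on-bit indices and filters candidates by a threshold. The fingerprints here are made-up integers, not real ECFP4 bits, and tanimoto() is a hypothetical helper rather than the routine used by run_chemed() or PubChem.

```python
from typing import Dict, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / len(a | b)


origin_fp = {1, 2, 3, 4}
candidate_fps: Dict[str, Set[int]] = {
    "mol_a": {3, 4, 5, 6},     # shares 2 of 6 distinct bits
    "mol_b": {1, 2, 3, 4, 5},  # shares 4 of 5 distinct bits
    "mol_c": {7, 8, 9},        # shares nothing
}

similarity = 0.1  # same meaning as the similarity parameter: minimum Tanimoto to keep
scores = {name: tanimoto(origin_fp, fp) for name, fp in candidate_fps.items()}
kept = [name for name, score in scores.items() if score >= similarity]
print(scores, kept)
```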

run_custom(origin_smiles, data, fp_type='ECFP4', _pbar=None, **kwargs)

This method is similar to STONED but uses a custom dataset provided by the user

Parameters
  • origin_smiles (str) – Base SMILES

  • data (List[Union[str, Mol]]) – List of SMILES or RDKit molecules

  • fp_type (str) – Fingerprint type

Return type

Tuple[List[str], List[float]]

Returns

SMILES and SCORES

run_stoned(start_smiles, fp_type='ECFP4', num_samples=2000, max_mutations=2, min_mutations=1, alphabet=None, return_selfies=False, _pbar=None)

Run the STONED SELFIES algorithm. Typically not used, call sample_space() instead.

Parameters
  • start_smiles (str) – SMILES string to start from

  • fp_type (str) – Fingerprint type

  • num_samples (int) – Number of total molecules to generate

  • max_mutations (int) – Maximum number of mutations

  • min_mutations (int) – Minimum number of mutations

  • alphabet (Union[List[str], Set[str], None]) – Alphabet to use for mutations, typically from get_basic_alphabet()

  • return_selfies (bool) – If SELFIES should be returned as well

Return type

Union[Tuple[List[str], List[float]], Tuple[List[str], List[str], List[float]]]

Returns

SELFIES, SMILES, and SCORES generated or SMILES and SCORES generated
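
The mutation step that min_mutations, max_mutations, and alphabet control can be pictured with the toy sketch below, which applies random token replacements, insertions, and deletions to a list of SELFIES-like tokens. Everything here is illustrative: mutate_tokens() and sample_mutants() are hypothetical helpers, the alphabet is a small hand-written set, and no decoding to SMILES or similarity scoring is done, so this is not what run_stoned() itself executes.

```python
import random
from typing import Iterable, List, Sequence


def mutate_tokens(tokens: Sequence[str], alphabet: Sequence[str], rng: random.Random) -> List[str]:
    """Apply one random edit: replace, insert, or delete a token."""
    tokens = list(tokens)
    choice = rng.choice(["replace", "insert", "delete"])
    if choice == "delete" and len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    elif choice == "insert":
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(alphabet))
    else:
        # Replacement; also used when a delete would empty the token list
        tokens[rng.randrange(len(tokens))] = rng.choice(alphabet)
    return tokens


def sample_mutants(
    start: Sequence[str],
    alphabet: Iterable[str],
    num_samples: int,
    min_mutations: int = 1,
    max_mutations: int = 2,
    seed: int = 0,
) -> List[str]:
    """Generate num_samples mutated token strings from a non-empty start token list."""
    rng = random.Random(seed)
    alphabet = sorted(alphabet)  # sort so results do not depend on set ordering
    mutants = []
    for _ in range(num_samples):
        tokens = list(start)
        for _ in range(rng.randint(min_mutations, max_mutations)):
            tokens = mutate_tokens(tokens, alphabet, rng)
        mutants.append("".join(tokens))
    return mutants


start = ["[C]", "[C]", "[O]"]
alphabet = {"[C]", "[N]", "[O]", "[F]", "[=C]"}
mutants = sample_mutants(start, alphabet, num_samples=5)
print(mutants)
```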

sample_space(origin_smiles, f, batched=True, preset='medium', data=None, method_kwargs=None, num_samples=None, stoned_kwargs=None, quiet=False, use_selfies=False, sanitize_smiles=True)

Sample chemical space around given SMILES

This will evaluate the given function and run the run_stoned() function over chemical space around molecule. num_samples will be set to 3,000 by default if using STONED and 150 if using chemed.

Parameters
  • origin_smiles (str) – starting SMILES

  • f (Union[Callable[[str, str], List[float]], Callable[[str], List[float]], Callable[[List[str], List[str]], List[float]], Callable[[List[str]], List[float]]]) – A function which takes in SMILES or SELFIES and returns predicted value. Assumed to work with lists of SMILES/SELFIES unless batched = False

  • batched (bool) – If f is batched

  • preset (str) – Can be wide, medium, or narrow. Determines how far across chemical space is sampled. Try “chemed” preset to only sample commercially available compounds.

  • data (Optional[List[Union[str, Mol]]]) – If not None and preset is “custom” will use this data instead of generating new ones.

  • method_kwargs (Optional[Dict]) – More control over STONED, CHEMED and CUSTOM can be set here. See run_stoned(), run_chemed() and run_custom()

  • num_samples (Optional[int]) – Number of desired samples. Can be set in method_kwargs (overrides) or here. None means default for preset

  • stoned_kwargs (Optional[Dict]) – Backwards compatible alias for method_kwargs

  • quiet (bool) – If True, will not print progress bar

  • use_selfies (bool) – If True, will use SELFIES instead of SMILES for f

  • sanitize_smiles (bool) – If True, will sanitize all SMILES

Return type

List[Example]

Returns

List of generated Example
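
The difference between a batched and a non-batched f can be shown with a small wrapper. This is a sketch only: count_oxygens() is a made-up toy model and as_batched() is a hypothetical helper, neither of which is part of the library. With batched=False the library is described as accepting the single-molecule form directly, so a wrapper like this is only needed if you prefer to pass a batched function.

```python
from typing import Callable, List


def count_oxygens(smiles: str) -> float:
    """Toy single-molecule model: number of 'O' characters in the SMILES."""
    return float(smiles.count("O"))


def as_batched(f_single: Callable[[str], float]) -> Callable[[List[str]], List[float]]:
    """Wrap a function of one SMILES so that it accepts a list of SMILES."""
    def f_batched(smiles_list: List[str]) -> List[float]:
        return [f_single(s) for s in smiles_list]
    return f_batched


batched_f = as_batched(count_oxygens)
preds = batched_f(["CCO", "OCCO", "CC"])
print(preds)
```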

class Descriptors(descriptor_type, descriptors, descriptor_names, tstats=())

Molecular descriptors

descriptor_names: tuple
descriptor_type: str

Descriptor type

descriptors: tuple

Descriptor values

tstats: tuple = ()

class Example(smiles, selfies, similarity, yhat, index, position=array(None, dtype=object), is_origin=False, cluster=0, label=None, descriptors=None)

Example of a molecule

cluster: int = 0

Index of cluster, can be -1 for no cluster

descriptors: Descriptors = None

Descriptors for this example

index: int

Index relative to other examples

is_origin: bool = False

True if base

label: str = None

Label for this example

position: ndarray = array(None, dtype=object)

PCA projected position from similarity

selfies: str

SELFIES for molecule, as output from selfies.encoder()

similarity: float

Tanimoto similarity relative to base

smiles: str

SMILES string for molecule

yhat: float

Output of model function

insert_svg(exps, mol_size=(200, 200), mol_fontsize=10)

Replace rasterized image files with SVG versions of molecules

Parameters
  • exps (List[Example]) – The molecules for which images should be replaced. Typically just counterfactuals or some small set

  • mol_size (Tuple[int, int]) – If mol_size was specified, it needs to be re-specified here

Return type

str

Returns

SVG string that can be saved or displayed in a Jupyter notebook

moldiff(template, query)

Compare the two rdkit molecules.

Parameters
  • template – template molecule

  • query – query molecule

Return type

Tuple[List[int], List[int]]

Returns

list of modified atoms in query, list of modified bonds in query

plot_space_by_fit(examples, exps, beta, mol_size=(200, 200), mol_fontsize=8, offset=0, ax=None, figure_kwargs=None, cartoon=False, rasterized=False)

Plot chemical space around example by LIME fit and annotate given examples. Adapted from plot_space().

Parameters
  • examples (List[Example]) – Large list of Example which make up points

  • exps (List[Example]) – Small list of Example which will be annotated

  • beta (List) – beta output from lime_explain()

  • mol_size (Tuple[int, int]) – size of rdkit molecule rendering, in pixels

  • mol_fontsize (int) – minimum font size passed to rdkit

  • offset (int) – offset annotations to allow colorbar or other elements to fit into plot.

  • ax (Optional[Any]) – axis onto which to plot

  • figure_kwargs (Optional[Dict]) – kwargs to pass to plt.figure

  • cartoon (bool) – if True, draw a cartoon outline on points

  • rasterized (bool) – if True, rasterize the scatter plot

similarity_map_using_tstats(example)

Create similarity map for example molecule using descriptor t-statistics. Only works for ECFP descriptors.

trim(im)

Implementation of whitespace trim

credit: https://stackoverflow.com/a/10616717

Parameters

im – PIL image

Returns

PIL image
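
The idea behind a whitespace trim is to find the bounding box of all non-background pixels and crop to it. The sketch below shows that idea on a plain nested list of pixel values rather than a PIL image, so it runs without extra dependencies. trim_grid() is a hypothetical helper, and taking the top-left pixel as the background colour is an assumption of this example.

```python
from typing import List, Optional


def trim_grid(pixels: List[List[int]], background: Optional[int] = None) -> List[List[int]]:
    """Crop a 2D grid of pixel values to the bounding box of non-background pixels."""
    if background is None:
        background = pixels[0][0]  # assumption: top-left pixel is background
    rows = [i for i, row in enumerate(pixels) if any(p != background for p in row)]
    if not rows:
        return pixels  # only background present, leave unchanged
    width = len(pixels[0])
    cols = [j for j in range(width) if any(row[j] != background for row in pixels)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]


image = [
    [255, 255, 255, 255],
    [255, 0, 10, 255],
    [255, 255, 0, 255],
    [255, 255, 255, 255],
]
trimmed = trim_grid(image)
print(trimmed)
```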