Minimum Information About a Bioactive Entity (MIABE)

Version 0.4

Sandra Orchard

Henning Hermjakob

Input/support to date

Steve Bryant NCBI

Dominic Clark EMBL-EBI

Ian Dix AstraZeneca

Ola Engkvist AstraZeneca

Mark Forster Syngenta

Michael Gilson BindingDB

Martin Grigorov Nestlé

Kim Hammond-Kosack Rothamsted

Lee Harland Pfizer

Andrew Hopkins U. Dundee

Christopher Larminie GSK

Elena Lo Piparo Nestlé

John Overington EMBL-EBI

Chris Southern EMBL-EBI

Christoph Steinbeck EMBL-EBI

Janet Thornton EMBL-EBI

David Wishart DrugBank

Feedback to

Introduction

The process of identifying and developing molecules with useful bioactive properties, such as pharmaceuticals and pesticides, is fraught with difficulty, and many compounds will fall by the wayside on the road from New Chemical Entity to licensed product. In the pharmaceutical industry only a very small percentage of Investigational New Drugs will make it through to clinical usage. The causes of this high attrition rate are many, with lack of efficacy, unexpected drug side effects, and undesirable drug-drug interactions being just some of the more common pitfalls in the drug discovery process. Similarly, in the world of pesticides, compounds which prove to be active but show undesirable side-effects against organisms other than their original target will not make it through to the market.

However, published reports of the activities of these ‘failed’ compounds, in addition to detailed information on those which go on to become fully licensed, commercially-available bioactive entities, are crucial for an understanding of how improved molecules may be developed. Details of their molecular structure and mechanism of action may give clues as to how related analogues may be developed that hit the same target, but with improved effectiveness or increased selectivity for a specific target over closely related molecules. A full disclosure of observed toxicity or an understanding of the pharmacokinetic properties of an agent may assist in improving these properties in subsequent generations of molecules. Even those molecules which fell by the wayside at an early stage in the development process may have a role to play as tools, to enable the verification of potential new targets which may have been identified by micro-array or proteomic studies in diseased tissues.

In 2002, Hopkins and Groom introduced the concept of the ‘Druggable Genome’ [1], suggesting that some proteins/protein families in the (human) genome are more amenable to modulation by small exogenous molecules than others. Since that time, work has been directed at finding ways to develop a computational approach to calculate druggability and to predict ‘druggable’ proteins. In 2006 Overington et al. estimated the number of molecular targets for approved drugs to be as low as 266 protein targets [2], out of the 20,400 protein coding genes [3] in the human genome. A further 58 targets are either from pathogenic organisms or are non-protein molecules, and similar models may be developed to look at the susceptibility of plant, parasitic, bacterial or viral genomes to xenobiotics. This apparently very low percentage of success may be significantly increased by adding, to the available data on licensed drugs, information which lies in the literature and in databases held by commercial companies on molecular targets which have been successfully modulated but whose agents subsequently failed to reach the end of the discovery pipeline due to unfavourable pharmacokinetics, or toxicity linked to molecule class rather than target.

As a consequence of the current productivity crisis, the pharmaceutical and biotechnology industries are increasingly disposed towards pre-competitive release of compound-related bioactivity data into public domain repositories. It is becoming widely acknowledged that the production of such an aggregated resource allows a combinatorial increase in the value of the information, in that the information that can be mined from it will be far more valuable than that from any single isolated collection. Increasingly, companies are recognising that regarding such an exercise as a pre-competitive activity allows not only a general benefit to the fields of human health and welfare but also a commercial advantage, in that the data need only be collected once. The manual harvesting and curation of data is an expensive process and involves resources often not available in even the largest pharmaceutical company. Additionally, any one company has only a limited range of in-house products aimed at an equally restricted number of targets. By adding to these the knowledge contributed by other groups in both the commercial and academic sectors, a broader understanding of the druggability of a potential target protein may be reached, well before expensive resources are committed to an active research program. This pre-competitive activity has wider benefits, for example in the fields of academic or non-profit development of drugs for orphan and tropical diseases.

However, in order to fully understand the context, methods, data and conclusions that pertain to the published description of an experiment, detailed background information needs to be included. The diversity of experimental designs and analytical techniques is becoming more of a problem, not only as new methods supplant old ones but also as the scale of data production increases due to the widespread application of automation. While the problem is unlikely to be completely soluble, it can be significantly reduced by the specification of this metadata (‘data about the data’). By being associated with the results, this metadata makes both the biological and methodological contexts of the experiment explicit. The archetype of such a specification is the Minimum Information about a Microarray Experiment (MIAME) [4]. Many journals and funding agencies now require that authors reporting microarray-based transcriptomics experiments comply with the MIAME checklist as a prerequisite for publication. The adoption and development of such specifications has had an impact well beyond merely increasing the comprehension and comparability of journal articles; the most important effect is to facilitate the transfer of data from journal articles into databases (i.e. converting unstructured to structured data) in a form that will allow mining across combined data sets.

We therefore propose a new such document, the Minimum Information About a Bioactive Entity (MIABE), that is predominantly concerned with, but is not restricted to, bioactive chemical compounds. We believe the timing of this is apposite for a number of reasons. The first of these is the revolution in bioactive chemical information catalysed by the appearance of ChEBI [5] and PubChem [6] towards the end of 2004. When the “missing entity” of chemical structure was embedded within the global Web of bioinformatic relationships it became possible to search across biological effects, protein names, sequence data, and chemical information. The second is that we now have the deposition, not just of HTS results but also other types of bioactivity screening data, directly linked to chemical structure information in public repositories such as PubChem Bioassay [6], ChemBank [7], and, in the near future, ChEMBL (www.ebi.ac.uk/chembl). Finally, increasing legislation is requiring that information on these compounds be more readily available. For example, on 1st June 2007, the REACH (Registration, Evaluation, Authorisation and Restriction of Chemical substances) legislation came into force across the European Union. This requires that additional information on chemicals be made available, dependent on the quantity imported or manufactured, and it has been estimated that 30,000 chemicals may need to be re-evaluated in accordance with these requirements [8].

MIABE: Principles and process.

In order that the maximum benefit be derived from the publication of data on one or a series of bioactive entities, it is important that certain crucial information is included in every such paper, such that the properties of these molecules are fully represented, their effects on biological systems (both positive and negative) are accurately detailed and any factors which may contribute to the activity of the molecule are stated. To this end, a number of industry and academic groups have come together to lay out a checklist of the information that is felt to be important to include with any published dataset: the minimum information required for the activity of a molecule to be both fully understood and compared with other molecules, either sharing a common chemical structure or a similar mechanism of action. It is intended that these guidelines be adhered to by anyone planning to publish such a paper, by the implementers of resources such as databases to hold this information, and by any body or institution funding such discovery work and requiring the results to be published at the end of the granting period. As with other such reporting guidelines [4], MIABE adheres to the criteria of Sufficiency (a reader should be able to understand and critically evaluate the interpretation and conclusions, and to support their experimental corroboration) and Practicability (the guidelines should not be so burdensome as to prohibit their widespread use).

It should be noted that the scope of this series of documents is limited to data regarded as pre-clinical in drug studies; the publication of clinical data is a subject more appropriate for a separate effort. It is also recognised that the list of requirements described in the documentation represents the entire path taken by a molecule from synthesis to pre-clinical development. Many compounds which fail to fulfil one or more of the criteria necessary for the development of a lead compound may only travel a part of this route but still be a valuable research tool, so all data generated on such a compound should appear in any publication on its activity. Similarly, data on a successful agent may be published in multiple papers; in this case the minimum reporting guidelines should be followed over the series of articles if this is more appropriate. Finally, it should also be remembered that, although this document targets bioactive agents, data on inactive, closely related analogues and orthologues is often of equal value, providing negative controls and information for those trying to build activity into a molecular scaffold. Such data should also be fully reported using these guidelines.

The production and development of MIABE documents

This parent document exists to make explicit the scope, purpose and manner of use of the MIABE guidelines that accompany it and, as such, should be stable, as the principles described should remain valid for the foreseeable future. Each domain within the drug discovery process will then be discussed in a modular guideline document, the content of which may be more subject to change over time as both experimental techniques and the formats for recording data develop and increase. Initial versions of all modules will be produced through the Pharmaceutical Industry Forum hosted at the European Bioinformatics Institute and made available for extensive community input prior to publication, through pre-publication on the Human Proteomics Organisation Proteomics Standards Initiative (HUPO-PSI) website and documentation process [9]; they will also be accessible via the MIBBI portal (www.mibbi.org/index.php/Projects/MIABE, see below). New documents and proposed updates to existing guidelines will be advertised on appropriate websites and discussion groups, and input will be requested from domain experts. Each guideline document will then enter the formal peer-review process of the journal in which it is to be published. Updates to the documents will be discussed, agreed and published when necessary, but it is intended to make these as infrequent as advances in technology will allow, thus providing long periods of stability to encourage adoption and implementation. Potential authors may move between the guidelines (and other related community efforts), ensuring that their data conform to those aspects of the checklists which are relevant to their publication.

Data Formats

It is still common practice in many publications for a compound to be described by no more than the structural representation of a pharmacophore, with the modified regions then exemplified and the resulting molecules given an identifier, often specific either to the originating source or to that particular publication. Specific rules and conventions for compound nomenclature have been developed and allow the specific reconstruction of a compound’s structure, if strictly adhered to. However, as the needs of computational chemistry became more central to the daily workings of molecular science, a more systematic and computationally translatable means of describing compounds was required, and this led to the development of SMILES strings [10] and the International Chemical Identifier (InChI) [11]. Whilst it is recognised that it would be impractical to publish several hundred of these identifiers in a large paper describing the structure-activity relationships of several hundred compounds, it is becoming increasingly important that such data be made available for published small molecules, such that subsequent data capture no longer relies on individuals piecing together molecule descriptions from disparate figures and tables within a publication to redraw the molecule in electronic format.
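As an illustration of how such identifiers might accompany a paper without appearing in its main text, the following minimal Python sketch writes and re-reads a tab-separated supplementary table that maps publication-local compound identifiers to SMILES and InChI strings. The record layout, the column names and the local identifiers ("cpd-1", "cpd-2") are assumptions made for this example and are not prescribed by MIABE; ethanol and aspirin simply stand in for a published compound series.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CompoundRecord:
    local_id: str  # identifier used within the paper (hypothetical)
    smiles: str    # SMILES string for the structure
    inchi: str     # InChI string for the same structure


# Two well-known molecules used as stand-ins for a compound series.
RECORDS = [
    CompoundRecord("cpd-1", "CCO", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    CompoundRecord(
        "cpd-2",
        "CC(=O)OC1=CC=CC=C1C(=O)O",
        "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    ),
]


def write_table(records):
    """Serialise the records as tab-separated text with a header row."""
    buffer = io.StringIO()
    names = [f.name for f in fields(CompoundRecord)]
    writer = csv.DictWriter(
        buffer, fieldnames=names, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def read_table(text):
    """Rebuild the records from text produced by write_table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [CompoundRecord(**row) for row in reader]


if __name__ == "__main__":
    print(write_table(RECORDS))
```

A table of this kind could be deposited as supplementary material, so that a database curator recovers each structure directly from its identifier rather than redrawing it from a scaffold figure and a substituent table.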

It has not previously been the practice in the field of bioactive molecules to consider data exchange formats other than the published paper; however, as public domain data repositories become established, the requirement to exchange data between them, or for users to download non-redundant datasets in a common format, will become of increasing importance. Such a common format already exists and has been publicly available and in wide usage for some years in the molecular interaction field. The HUPO PSI-MI XML2.5 interchange format is capable of capturing extensive details about many interaction types, including bioactive entities and their target molecules, the biological role of each molecule within that interaction, a detailed description of interacting domains, and the kinetic parameters of the interaction [12]. The ability to describe the structure of individual molecules and to carry meta-data on that molecule is also inherent to the format. The format is supported by data management and analysis tools, has been adopted by major interaction data providers and tool developers, and is used in visualisation and analytical software. Additionally, a simpler, tab-delimited format, MITAB2.5, has been developed for the benefit of users who require only minimal information in an easy-to-access configuration. It is suggested that supporting the continued development of this format, rather than attempting to “reinvent the wheel”, will be the most practical step forward. This will allow producers of drug-target information to merge their data with the information existing in the molecular interaction databases and to use PSI-MI XML2.5 compliant resources, such as Cytoscape (www.cytoscape.org), to visualise small molecule data in conjunction with cellular interactomes or pathways.
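To give a sense of how lightweight the tab-delimited form is to consume, the sketch below reads one MITAB2.5-style line into a dictionary. The fifteen-column ordering is assumed from the published PSI-MI TAB 2.5 layout and is not enumerated in this document; the short column names are labels chosen here for readability, and the example row is invented purely to show the layout (it is not a curated record). The parser is deliberately simplified and does not handle a "|" character occurring inside a quoted value.

```python
# Column order assumed from the PSI-MI TAB 2.5 layout; the short names
# are illustrative labels, not part of the format itself.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab25_line(line):
    """Split one tab-delimited line into {column: list of values}.

    Multiple values in a cell are separated by "|"; a lone "-" marks an
    empty cell and is returned as an empty list.
    """
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(MITAB25_COLUMNS):
        raise ValueError(
            f"expected {len(MITAB25_COLUMNS)} columns, found {len(cells)}"
        )
    return {
        name: [] if cell == "-" else cell.split("|")
        for name, cell in zip(MITAB25_COLUMNS, cells)
    }


# Invented example: a small molecule (ChEBI identifier) paired with a
# protein target (UniProtKB accession); most cells are left empty.
EXAMPLE_LINE = "\t".join([
    'chebi:"CHEBI:15365"', "uniprotkb:P23219",
    "-", "-", "-", "-",
    "-", "-", "-",
    "-", "taxid:9606(human)",
    'psi-mi:"MI:0407"(direct interaction)',
    "-", "-", "-",
])

if __name__ == "__main__":
    record = parse_mitab25_line(EXAMPLE_LINE)
    for column in MITAB25_COLUMNS:
        print(column, record[column])
```

Returning every cell as a list keeps single-valued and multi-valued columns uniform, which simplifies merging rows from several sources; a production reader would more likely rely on an existing PSI-MI library than on hand-written splitting.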