Feed aggregator

chem-bla-ics : Bioclipse 2.6.2 with recent hacks #1: Wikidata & Linked Data Fragments

Planet Bioclipse - Sat, 2015-04-18 17:30
Bioclipse dialog to upload chemical
structures to an OpenTox repository.Us chem- and bioinformaticians have it easy when it comes to Open Science. Sure, writing documentation, doing unit testing, etc, takes a lot of time, but testing some new idea is done easily. Yes, people got used to that, so trying to explain that doing it properly actually takes long (documentation, unit testing) can be rather hard.

Important for this is a platform that allows you to easy experiment. For many biologists this environment is R or Python. To me, with most of the libraries important to me written in Java, this is Groovy (e.g. see my Groovy Cheminformatics book) and Bioclipse (doi:10.1186/1471-2105-8-59). Sometimes these hacks grow to be full papers, like with what started with OpenTox support (doi:10.1186/1756-0500-4-487) which even paved (for me) the way to the eNanoMapper project!

But often these hacks are just for me personal, or at least initially. However, I have no excuse to not make this available to a wider audience too. Of course, the source code is easy, and I normally have even the smallest Bioclipse hack available somewhere on GitHub (look for bioclipse.* repositories). But it is getting even better, now that Arvid Berg (Bioclipse team) gave me the pointers to ensure you can install those hacks, taking advantage from Uppsala's build system.

So, from now on, I will blog how to install Bioclipse hacks I deem useful for a wider audience, starting with this post on my Wikidata/Linked Data Fragments hack I used to get more CAS registry number mappings to other identifiers.

Install Bioclipse 2.6.2
The first thing you need is Bioclipse 2.6.2. That's the beta release of Bioclipse, and required for my hacks. From this link you can download binary nightly builds for GNU/Linux, MS-Windows, and OS/X. For the first two 32 and 64 bit build are available. You may need to install Java and version 1.7 should do fine. Unpack the archive, and then start the Bioclipse executable. For example, on GNU/Linux:

  $ tar zxvf Bioclipse.2.6.2-beta.linux.gtk.x86.tar.gz
  $ cd Bioclipse.2.6.2-beta/
  $ ./bioclipse

Install the Linked Data Fragments manager
The default update site already has a lot of goodies you can play with. Just go to Install → New Feature.... That will give you a nice dialog like this one (which allows you to install the aforementioned Bioclipse-OpenTox feature):



But that update site doesn't normally have my quick hacks. This is where Arvid's pointers come in, which I hope to carefully reproduce here so that my readers can install other Bioclipse extensions too.

Step 1: enable the 'Advanced Mode'
The first step is to enable the 'Advanced Mode'; that is, unless you are advanced, forget about this. Fortunately, the fact that you haven't given up on reading my blog yet is a good indicated you are advanced. Go to the Window → Preferences menu and enable the 'Advanced Mode' in the dialog, as shown here:


When done, click Apply and close the dialog with OK.

Step 2: add an update site from the Uppsala build system
The first step enables you to add arbitrary new update sites, like update sites available from the Uppsala build system, by adding a new menu option. To add new update sites, use this new menu option and select Install → Software from update site...:


By clicking the Add button, you go this dialog where you should enter the update site information:


This dialog will become a recurrent thing in this series, though the content may change from time to time. The information you need to enter is (the name is not too important and can be something else that makes sense to you):

  1. Name: Bioclipse RDF Update Site
  2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.rdf/lastSuccessfulBuild/artifact/site.p2/

After clicking OK in the above dialog, you will return to the Available Software dialog (shown earlier).

Step 3: installing the Linked Data Fragments Feature
The  Available Software dialog will now show a list of features available from the just added update site:


You can see the Linked Data Fragments Feature is now listed which you can select with the checkbox in front of the name (as shown above). The Next button will walk you through a few more pages in this dialog, providing information about dependencies and a page that requires you to accept the Open Source licenses involved. And at the end of these steps, it may require you to reboot Bioclipse.

Step 4: opening the JavaScript Console and verify the new extension is installed
Because the Linked Data Fragments Feature extends Bioclipse with a new, so-called manager (see doi:10.1186/1471-2105-10-397), we need to use the JavaScript Console (or Groovy Console, or Python Console, if you prefer those languages). Make sure the JavaScript Console is open, or do this via the menu Windows → Show View → JavaScript Console and type in the console view man ldf which should result in something like this:


You can also type man ldf.createStore to get a brief description of the method I used to get a Linked Data Fragments wrapper for Wikidata in my previous post, which is what you should reread next.

Have fun and looking forward to hear how you use Linked Data Fragments with Bioclipse!

chem-bla-ics : Getting CAS registry numbers out of WikiData

Planet Bioclipse - Fri, 2015-04-10 20:27
doi 10.15200/winn.142867.72538

I have promised my Twitter followers the SPARQL query you have all been waiting for. Sadly, you had to wait for it for more than two months. I'm sorry about that. But, here it is:
    PREFIX wd: <http://www.wikidata.org/entity/>

    SELECT ?compound ?id WHERE {
      ?compound wd:P231s [ wd:P231v ?id ] .
    }
What this query does is ask for all things (let's call whatever is behind the identifier is a "compound"; of course, it can be mixtures, ill-defined chemicals, nanomaterials, etc) that have a CAS registry identifier. This query results in a nice table of Wikidata identifiers (e.g. Q47512 is acetic acid) and matching CAS numbers, 16298 of them.

Because Wikidata is not specific to the English Wikipedia, CAS numbers from other origin will show up too. For example, the CAS number for N-benzylacrylamide (Q10334928) is provided by the Portuguese Wikipedia:


I used Peter Ertl's cheminfo.org (doi:10.1186/s13321-015-0061-y) to confirm this compound indeed does not have an English page, which is somewhat surprising.

The SPARQL query uses a predicate specifically for the CAS registry number (P231). Other identifiers have similar predicates, like for PubChem compound (P662) and Chemspider (P661). That means, Wikidata can become a community crowdsource of identifier mappings, which is one of the things Daniel Mietchen, me, and a few others proposed in this H2020 grant application (doi:10.5281/zenodo.13906). The SPARQL query is run by the Linked Data Fragments platform, which you should really check out too, using the Bioclipse manager I wrote around that.

The full Bioclipse script looks like:
    wikidataldf = ldf.createStore(
      "http://data.wikidataldf.com/wikidata"
    )

    // P231 CAS
    identifier = "P231"
    type = "cas"

    sparql = """
    PREFIX wd:

    SELECT ?compound ?id WHERE {
      ?compound wd:${identifier}s [ wd:${identifier}v ?id ] .
    }
    """
    mappings = rdf.sparql(wikidataldf, sparql)

    // recreate an empty output file
    outFilename = "/Wikidata/${type}2wikidata.csv"
    if (ui.fileExists(outFilename)) {
      ui.remove(outFilename)
      ui.newFile(outFilename)
    }

    // safe to a file
    for (i=1; i<=mappings.rowCount; i++) {
      wdID = mappings.get(i, "compound").substring(3)
      ui.append(
        outFilename,
        wdID + "," + mappings.get(i, "id") + "\n"
      )
    }
BTW, of course, all this depends on work by many others including the core RDF generation with the Wikidata Toolkit. See also the paper by Erxleben et al. (PDF).


Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D., 2014. Introducing wikidata to the linked data web. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (Eds.), The Semantic Web – ISWC 2014. Vol. 8796 of Lecture Notes in Computer Science. Springer International Publishing, pp. 50-65. URL http://dx.doi.org/10.1007/978-3-319-11964-9_4

Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research. URL http://dx.doi.org/10.5281/zenodo.13906

Ertl, P., Patiny, L., Sander, T., Rufener, C., Zasso, M., Mar. 2015. Wikipedia chemical structure explorer: substructure and similarity searching of molecules from Wikipedia. Journal of Cheminformatics 7 (1), 10+. URL http://dx.doi.org/10.1186/s13321-015-0061-y

chem-bla-ics : Open Notebook Science ONSSP #1: http://onsnetwork.org/

Planet Bioclipse - Wed, 2015-03-11 15:55
As promised, I slowly set out to explore ONSSPs (Open Notebook Science Service Providers). I do not have a full overview of solutions yet but found LabTrove and Open Notebook Science Network. The latter is a more clear ONSSP while the first seems to be the software.

So, my first experiment is with Open Notebook Science Network (ONSN). The platform uses WordPress, a proven technology. I am not a huge fan of the set up which has a lot of features making it sometimes hard to find what you need. Indeed, my first write up ended up as a Page rather than a Post. On the upside, there is a huge community around it, with experts in every city (literally!). But my ONS is now online and you can monitor my Open research with this RSS feed.

One of the downsides is that the editor is not oriented at structured data, though there is a feature for Forms which I may need to explore later. My first experiment was a quick, small hack: upgrade Bioclipse with OPSIN 1.6. As discussed in my #jcbms talk, I think it may be good for cheminformatics if we really start writing up step-by-step descriptions of common tasks.

My first observations are that it is an easy platform to work with. Embedding images is easy, and there should be option for chemistry extensions. For example, there is a Jmol plugin for WordPress, there are plugins for Semantic Web support (no clue which one I would recommend), an extensions for bibliographies are available too, if not mistaken. And, we also already see my ORCID prominently listed, and I am not sure if I did this, or whether this the ONSN people added this as a default feature.

Even better is the GitHub support @ONScience made me aware of, by @benbalter. The instructions were not crystal clear to me (see issues #25 and #26), some suggested fixes (pull request #27), it started working, and I now have a backup of my ONS at GitHub!

So, it looks like I am going to play with this ONSSP a lot more.

chem-bla-ics : First steps in Open Notebook Science

Planet Bioclipse - Wed, 2015-03-11 15:53
Scheme 2 from this Beilstein Journal of Organic
Chemistry paper
by Frank Hahn et al.I blogged a few weeks back I blogged about my first Open Notebook Science entry. The post suggest I will look at a few ONS service providers, but, honestly, Open Notebook Science Network serves my needs well.

What I have in mind, and will soon advocate, is that the total synthesis approach from organic chemistry fits chem- and bioinformatics research. It may not be perfect, and perhaps somewhat artificial (no pun intended), but I like the idea.

Compound to Compound
Basically, a lab notebook entry should be a step of something larger. You don't write Bioclipse from scratch. You don't do a metabolomics pathway enrichment analysis in one step, either. It's steps, each one taking you from one state to another. Ah, another nice analogy (see automata theory)! In terms of organic chemistry, from one compound to another. The importance here is that the analogy shows that there is no step you should not report. The same applies to cheminformatics: you cannot report a QSAR model without explaining how your cleaned up that SDF file you got from paper X (which still commonly is practised).

Methods Sections
Organic chemistry literature has well-defined templates on how to report the method for a reaction, including minimal reporting standards for the experimental results. For example, you must report chemical shifts, an elemental composition. In cheminformatics we do not have such templates, but there is no reason not too. Another feature that must be reported is the yield.

Reaction yield
The analogy with organic chemistry continues: each step has a yield. We must report this. I am not sure how, and this is one of the things I am exploring and will be part of my argument. In fact, the point of keeping track of variance introduced is something I have been advocating for longer. I think it really matters. We, as a research field, now publish a lot of cheminformatics and chemometrics work, without taking into account the yield of methods, though, for obvious reasons, very much more in chemometrics than in cheminformatics. I won't go into that now, but there is indeed a good part of benchmark work, but the point is, any cheminformatics "reaction" step should be benchmarked.

Total synthesis
The final aspect is, is that by taking this analogy, there is a clear protocol how cheminformatics, or bioinformatics, work must be reported: as a sequence of detailed small steps. It also means that intermediate "products" can be continued with in multiple ways: you get a directed graph of methods you applied and results you got.

You get something like this:

Created with Graphviz Workspace.
The EWx codes refer to entries in my lab notebook:
  1. EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names
  2. EW5: Finding nodes in Homo sapiens pathways with IUPAC names
  3. EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names
  4. EW7: converting metabolite Labels into DataNodes in WikiPathways GPML

Open Notebook Science
Of course, the above applies also if you do not do Open Notebook Science (ONS). In fact, the above outline is not really different from how I did my research before. However, I see value in using the ONS approach here. By having it Open, it

  1. requires me to be as detailed as possible
  2. allows others to repeat it
Combine this with the advantage of the total synthesis analogy:
  1. "reactions" can be performed in reasonable time
  2. easy branching of the synthesis
  3. clear methodology that can be repeated for other "compounds
  4. step towards minimal reporting standards for cheminformatics methods
  5. clear reporting structure that is compatible with journal requirements
OK, that is more or less the paper I want to write up and submit to the Jean-Claude Bradley Memorial Issue in the Journal of Cheminformatics and Chemistry Central. It is an idea, something that helps me, and I hope more people find useful bits in this approach.