Thursday, May 15, 2008

Dealing with abundance – getting more out of the science literature than you thought possible

Open access is adding to the abundance of scientific information available to us, and as open access grows, that abundance can be expected to grow fast as well. This is good, because only comprehensive and unfettered access to the science literature will make it possible for us to stay truly abreast of the scientific progress being made.

On the other hand, it will present us with even more challenges than we already face in dealing with all that information. In some disciplines, reading all the papers relevant to our research topic means digesting thousands of papers per year – enough to fill our entire working time. Without assistance from the processing capability and speed of computers, we cannot hope to keep up with emerging trends in our chosen fields.

Few scientists can properly cope with this mushrooming information, and were they to read all the articles relevant to them, they would find that these almost always contain a very large amount of information they already know. That redundant information is usually provided solely for context and readability. The amount of genuinely new information is often surprisingly small and could have been conveyed in one or two sentences, had the context been clear. Yet the essence of the scientific discourse is captured in those few sentences. The surrounding text of an article is, if you wish, the packaging in which the essence is transported – analogous to the mass of fluffy stuff surrounding a breakable item that is being shipped: emballage.

At Knewco, the company I now work for, we aim to provide an environment for concentrating this scientific discourse – 'distilling' it from the abundance of sources, if you wish – and to make it more productive by making it computer-processable. Very few scientists can read and digest all the articles and database entries they would need to in order to synthesize the essence of the knowledge they require. So what we do is enable and foster collaborative intelligence between machine processing power and human brainpower. Knewco 'distills' the essence of the knowledge content from millions of documents, enriching it in the process with linked concepts and context.
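
Knewco's actual technology is not described in detail here, but to make the idea of computer-processable distillation concrete, here is a minimal, purely illustrative sketch in Python. It tags the concepts mentioned in each sentence of a document using a small dictionary; every term and identifier in it is made up for the illustration.

import re

# Toy thesaurus mapping surface terms to canonical concept identifiers.
# All terms and identifiers are hypothetical examples, not Knewco data.
THESAURUS = {
    "malaria": "C:malaria",
    "plasmodium falciparum": "C:p_falciparum",
    "artemisinin": "C:artemisinin",
    "drug resistance": "C:drug_resistance",
}

def tag_concepts(sentence):
    """Return the set of known concepts mentioned in one sentence."""
    text = sentence.lower()
    return {concept for term, concept in THESAURUS.items() if term in text}

def extract_atoms(document):
    """Split a document into sentences and keep each sentence's co-mentioned
    concepts. Each resulting set is one 'atom' of knowledge: a small factual
    association lifted out of the surrounding prose."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    atoms = []
    for sentence in sentences:
        tags = tag_concepts(sentence)
        if len(tags) > 1:  # keep only sentences that connect concepts
            atoms.append(tags)
    return atoms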

This is not the same as making it possible to locate the one right document out of the abundance available. It is identifying 'atoms' of knowledge about a given concept in the literature and combining these atoms into 'molecules' of knowledge (we call those "knowlets" – a knowlet connects facts). Just as a graph can give you at a glance the essence of an enormous array of numbers, the knowlet gives you the essence of an enormous amount of scientific literature. It's like reading a picture instead of text. And as "a picture is worth more than a thousand words", a knowlet could be said to be worth more than the text of a thousand articles. Knowledge redesigned, as it were.
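
Continuing the illustrative sketch above – and again not Knewco's actual method – the tagged sentences can be combined into a simple weighted association profile for one concept, a toy stand-in for a knowlet:

from collections import Counter

def build_knowlet(center, atoms):
    """Combine 'atoms' (sets of co-mentioned concepts) into a 'molecule':
    a weighted profile of the concepts associated with one central concept."""
    knowlet = Counter()
    for atom in atoms:
        if center in atom:
            knowlet.update(atom - {center})
    return knowlet

# Hypothetical usage:
#   atoms = extract_atoms(some_abstract)
#   profile = build_knowlet("C:artemisinin", atoms)
#   profile.most_common(5)  # the concepts most strongly tied to artemisinin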

Perhaps more importantly, since a knowlet is a computer artifact, it can be used to identify related information, predict trends and intersections in data (see it as a kind of topology of knowledge), be combined with knowlets of other, more complex concepts, and be updated in real time to keep the information current up to the minute.
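
Purely as an illustration of what a computer-processable knowledge artifact allows, the toy knowlets sketched above can be compared and refreshed with ordinary code. A cosine similarity between two profiles, for instance, hints at concepts that are related even though they are never mentioned in the same sentence:

import math

def similarity(a, b):
    """Cosine similarity between two knowlet profiles (Counters).
    A high score for concepts whose association profiles overlap, even when
    the concepts never co-occur directly, suggests an implicit link."""
    dot = sum(a[c] * b[c] for c in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def refresh(knowlet, new_atom, center):
    """Fold a newly extracted atom into an existing knowlet, so the profile
    stays current as new literature is processed."""
    if center in new_atom:
        knowlet.update(new_atom - {center})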

For technology of this kind to be optimally effective for scientific knowledge discovery, access to the literature is not by itself sufficient: the source documents must also be computer-readable. Publishers as well as repositories may wish to take this to heart if they are serious about helping to speed up the pace of scientific progress.

Jan Velterop
