Big data for big ecology

As buzz words go, ‘big data’ is right up there just now. It seems that every question you care to think of, in every field from public policy to evolutionary biology, can be hit with the big data hammer. Add an ‘omics’ or two too, and you’re laughing. So I’m slightly ashamed that we decided to call our workshop at the British Ecological Society’s Annual Meeting ‘Big Data for Big Ecology’. But when I say ‘we’ I mean the BES Macroecology Special Interest Group, and Macroecology is – as its name suggests – ‘big ecology’, so it seemed natural to combine this with the buzz word du jour.

And as it turned out, I think we were vindicated. We held the first of two 1-hour workshops in a room that could comfortably seat 50. Over 100 squeezed in, and we had to turn some people away. So clearly the interest is there, perhaps at least partly because ecological ‘big data’ differ from the data collected in other fields, and we’re still working out how best to deal with issues of storage, access, and analysis. This contrasts with some other fields. For instance, sequence data take a pretty standard form, and it’s relatively straightforward to design a system to collate all sequence data – GenBank is testament to this. Ecological data are much more heterogeneous – people measure different things in different systems, there’s no universally agreed common unit of measurement, people work at different spatial scales, in different habitats and environments, and so on. There is also the matter of what we mean by ‘big’. Again, there’s a contrast here with genomics, where a million sequences is now almost a trivially small number. I think in ecology we’re much more likely to be dealing with records in the thousands or hundreds of thousands, so again the computational challenges are different: doing something clever with a large quantity of complex data, rather than with an absolutely huge amount of simpler (or at least, relatively standard) data.

The aim of this first workshop was to introduce a couple of major ecological datasets, then to discuss the issues associated with sharing data. Importantly, by involving figshare, we were able to present some solutions rather than simply rehashing the same old (perceived) problems. I posted a storify of this first hour here, but briefly we heard from Paula Lightfoot, data access officer for the UK’s National Biodiversity Network Trust. The NBN holds >80 million distribution records from around 150 data providers, consisting of almost 800 individual datasets. Data cover a very wide range of taxa, although birds, Lepidoptera and flowering plants make up ¾ of the records. The NBN gateway has always been a fantastic public-facing portal to biodiversity data (go and have a play if you want to confirm this), but these data are underused in research. So for me it was particularly interesting to learn about recent improvements to the NBN’s data delivery system to try to address concerns such as those raised by a BES working group involving several of the Macroecology group (including myself and group chair Nick Isaac). Some of the data on the NBN are sensitive or otherwise have restricted access, but now you can trigger a single request which goes to all relevant data owners. Likewise, you can download information from multiple datasets as a single text file – which, as ecological data analysts, is often all that we want.

Charly Griffiths from the Marine Biological Association data team then gave an overview of the data holdings in Plymouth, which I think was really valuable in raising awareness of some of these phenomenal datasets among the overwhelmingly terrestrial community of the BES. These include the Continuous Plankton Recorder data held by SAHFOS, which at >80 years is among the longest-running and most spatially extensive ecological time series in existence, and the Western Channel Observatory data, one of the very few long-term datasets to collect information across an entire community (“from genes to fish, from seconds to centuries”).

Then we changed tack, from talking about where we might find data, to what we should do with our own. A quick show of hands revealed that almost everyone in the room had used other people’s data in their work; rather fewer had shared their own data. Mark Hahnel from figshare gave a quick demo to show how easy it can be to share all kinds of outputs – from static figures to code to very large datasets – on the figshare platform, where each output instantly gains a DOI, and thus becomes citable.

Given how easy this process is, why don’t more people share their data? Our discussion identified two main objections. First, people remain highly protective of their data, and suspicious that there are armies of people just waiting for it to become public so that they can do the same analyses (only faster) that the data owner had planned. I think this is understandable – ecological data are often collected in pretty extreme environments, involving a huge amount of hard work, and it is natural to want to get the full benefit of this toil before others are able to profit.

There are two counters to this. First, the idealistic one: in most cases you were paid to collect your data, very often with public money; the data are not yours to hoard; you were not funded to advance your career, but to advance science. Second, more pragmatically: it’s unlikely that many people are especially interested in what you do. Only a small fraction of those who are will have both the time to start to work on your data, and the expertise to do anything useful. Fewer still will be inclined to screw you over, especially (and this is important) if you have taken the step of laying out your stall in public (on figshare or wherever). And academic karma will sort them out soon anyway…

The second issue, that of data ownership, is harder to address, regardless of any mandate to make data available. This is a particular problem for someone like me, who uses other people’s data all the time. The value that I add lies in combining existing datasets and analysing them in novel ways. Often I have had to secure various permissions to use the data in the first place, and the extent to which what I have produced is an original data product is not clear. So while my inclination is to share everything, I do have to be very careful that I’m not sharing anything where I have previously signed an agreement to say that I won’t. Even in these cases though it is still possible to share extensive metadata and the code used to access and analyse the data.

Scott Chamberlain, who delivered the second workshop, touched on some of these kinds of issues, as well as potential solutions. Scott and the rest of the ROpenSci team use APIs to access large datasets, and it is perfectly possible for a data provider to restrict access to their data via this API route. In which case, one can publish a load of R code documenting how data were accessed, manipulated and analysed, which could be replicated by anyone having the same data access privileges that you do (often gained through personal contact with the data provider). This could be a really neat solution to accessing multiply-owned datasets. Scott’s presentation is online here, and if you have any interest in accessing data using R, it is a must read, and highly endorsed by all of the 100 or so of us who were at the workshop (see some of the comments in my second storify).

So where do we go from here? That’s a genuine question: we clearly hit a nerve and got a huge amount of interest, so we want to take it forward. But how? Should we be writing a set of standards for ecological data? A catalogue of existing datasets? A set of tutorials? I appreciate that we are far from the only people interested in this, and don’t want to replicate the efforts of others – so maybe a list of these other efforts would be a good place to start? Any thoughts gratefully received, either in the comments here or via Twitter (@besmacroecol, @tomjwebb, #besbigdata) or our Facebook group.

How to Ace an Essay

Preamble

I’ve seen a few posts recently on ‘how to write’ for scientists, from the technical (this on how to write a paper) to the more general (this on how to write clearly). So here’s my contribution: how to write an essay. Now I’m all too well aware that, as Brian McGill, author of the ‘how to write clearly’ piece states, this kind of enterprise inevitably teeters on the edge of hubris (both pieces referred to above fall on the right side, I hasten to add!), so let me make clear my motivation. This is essentially an extended piece of feedback to my undergraduate tutees, following the first essays that they submitted to me this month. The aim of this exercise was to make sure they all achieve a mark for a tutorial essay that is commensurate with their abilities by the end of the academic year, and that they also do as well as they can in exams. Much of what appears below is obvious or well-known; some of it is perhaps rather specific to our assessment procedures. But it directly addresses issues that our students have found challenging on occasions, so on the off chance that it is more widely useful, I thought I’d post it in its entirety here…

Before you start

Read any instructions regarding format, filenames etc. If working in Word, use File>Save as to give your essay an informative filename (including your name) rather than just accepting the default. Don’t call it ‘essay.docx’. Please!

Especially in exams - make sure that everything in your essay is geared towards answering the question.

Read around the subject. Use reasonably general search terms in Google Scholar or Web of Knowledge, and search back through the reference sections of papers you read, and forward through papers that have cited them, for other relevant work. The primary literature (peer-reviewed papers) should be the first and major place you look for information. Looking to e.g. Wikipedia first then following up references from there is not good practice.

Make sure that any notes you take clearly state their source; in addition, if you copy and paste directly from a source, it’s sensible to paste into quotation marks so you’ll remember that you can’t copy directly from your notes into your final essay. So for instance, you might end up in your notes with something like:

Webb et al. PLOS1 2010 show that deep pelagic ocean is underexplored compared to coastal seas: “The deep pelagic ocean is the largest habitat by volume on Earth, yet it remains biodiversity's big wet secret, as it is hugely under-represented in global databases of marine biological records.”

You’ll know then that the first statement can be used in your essay (referencing the paper, of course – see below), but the exact phrasing in the quotation marks should be avoided (unless you want to include it as a direct quote – which can be done from time to time, but only really if the quote is particularly striking).

Because you’ll probably forget to do that from time to time, never copy and paste from your notes into your final essay. This is a good way to plagiarise by accident – you read something in your notes, think it sounds great, forget that you copied it wholesale from somewhere else. (Note: plagiarism-detection software is dumb, but thorough; careless copy-pastes will get picked up.)

For better or worse (OK, for worse) we use MS Word for our essays. Word is a good word processor, but it’s pretty awful at many other things. So keep your formatting simple (single column, don’t try to wrap text prettily around figures, etc.) – it is far more important that your essay is easy to read (decent sized font) and to comment on (consider 1.5 or double line spacing), than that it bears any kind of physical resemblance to a published paper or magazine article.

Next, and most importantly: write a plan. This can take several forms. A nice idea is to try to write a 25 word summary of what you plan to cover in the essay – this forces you to focus on what you think is important to answer the question (and is borrowed from a nice piece on how to write a scientific paper from Conservation Bytes).

You should also write a list of subheadings. Make these quite specific: ‘Introduction’, ‘Body’, ‘Conclusion’ are useless; rather, list the topics or points of view that you think it is important to cover in your essay. You may have a dozen or more for a standard 1200-2000 word essay. This approach has a couple of advantages. First, you can shuffle the subheadings around until the order makes sense to you – each heading should logically lead on from the previous one – and by keeping the plan to hand when you write your essay, you can avoid drifting off on tangents. In addition, by breaking down the essay in this way you are left with a series of short and quite specific paragraphs to write, rather than an entire essay (which can seem daunting). There’s nothing to stop you retaining some subheadings in your final essay, although probably fewer than in your plan.

Finally, as in everything you write, you need to think carefully about your audience. This is a general tip for clear writing in any context (see this post by Brian McGill), but takes on a rather specific meaning when you’re writing an assessed essay: think about whether the person marking it will be marking 150 others (common for exam essays) or 5 others (e.g. in a tutorial situation). In the former case, content and structure become more important than style: you really want to hammer home at the beginning exactly how you plan to answer the question, and at the end to reiterate exactly how you have just answered it. A tutorial situation, where both you and the marker have a little more time, gives you more opportunity to indulge your literary pretensions (although style will not win you marks lost on poor content).

Now you’ve worked out what you want to write, you just need to…

Write the damn thing

Obviously, you should start with an Introduction. This should cover three key points: What, Why, and How.

What is a brief introduction to the topic, and Why says why this is important, what is controversial and/or unknown (and hence why an essay is required). Finally, How sets out exactly how you intend to address the topic in your essay. Don’t be afraid of finishing your introduction with a sentence along the lines of “In this essay, I will show that…”

A note on style: in my view, personal pronouns and opinions are not only fine, but pretty much obligatory for a good essay (although some colleagues will disagree). So yes, you can say “I will show…”, “I think…”, and so on. Far better than “…it will be argued that…” and other passive voice horrors. But keep the tone formal: “do not” rather than “don’t”.

The structure of the Body of your essay will be drawn directly from your essay plan, and you can retain some subheadings if you like. Use diagrams, graphs or other images if appropriate – but only if appropriate. A well-designed or well-chosen figure can really enhance an essay, but a gratuitous one is unlikely to garner you any extra marks.

Finally, you want to draw some Conclusions. Broadly speaking, what you want to do here is to re-state the What and Why from the Introduction, and to sum up How you’ve addressed the issue. If the essay title is a question (and especially in an exam), hammer home here how you have answered it. For example, if the essay asks you to weigh up the evidence between two theories and determine which is more likely to be true, make sure you get into your conclusion something like “In sum, the weight of evidence supports x and I therefore conclude that x is more likely than y…” This is no place for subtlety!

Referencing

OK, this is the tough one. I have been writing scientific prose for so long that referencing is second nature to me, and I can’t for the life of me work out why it causes so many problems. But following discussion in the tutorial, here are a few guidelines.

First: reference stuff you’ve read. Don’t reference stuff you haven’t. I know from personal experience that work gets cited wrongly (e.g. “Webb (2012) states…” when in fact I had stated the opposite!), so don’t rely on somebody else’s interpretation of what Webb (2012) actually said. If you really can’t track down the original source, then you can put e.g. “Darwin (1859, cited in Webb 2012)”, but try not to make a habit of this.

Second: every factual statement you make (that is not something that you have derived for yourself) needs to be backed up with a reference to the literature. If that means that in one paragraph, you reference the same work 10 times, then so be it (although see the point about synthesis below). Sometimes this can be tough – if, for example, you want to quote a very well-known fact, such as the diameter of the Earth, it may be difficult to find a paper that gives this information. Text books can be very useful here in providing an authoritative source of basic information (whilst recognising that every practising scientist will simply have looked at Wikipedia, you shouldn’t!)

In terms of how you actually cite stuff, your best guide is the papers you read, but here are some general pointers: a single author study is cited as Webb (2012), a two author study as Webb & Freckleton (2007), and a study with more than two authors as Webb et al. (2010) (forgive my lack of imagination, it’s late…) Note that ‘et al.’ is short for ‘et alia’, hence the ‘.’ after ‘al’. Notice too that author surname only is given in the text – no initials, no first name.

It’s also useful to think about where the brackets come. So, you can write “As has previously been shown (Webb 2012) this is nonsense…” or “As Webb (2012) showed, this is nonsense…” – in reading phrases with a reference, you mentally skip everything in brackets.
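If it helps to see the author-count rules written down mechanically, here is a minimal Python sketch. The function name `cite` and its `parenthetical` flag are my own inventions for illustration, and the surnames "Second" and "Third" are placeholders rather than real co-authors. It produces both bracket placements described above.

```python
def cite(surnames, year, parenthetical=False):
    """Build an in-text citation from a list of author surnames and a year.

    One author   -> Webb (2012)
    Two authors  -> Webb & Freckleton (2007)
    Three or more -> Webb et al. (2010)
    """
    if not surnames:
        raise ValueError("at least one author surname is required")
    if len(surnames) == 1:
        names = surnames[0]
    elif len(surnames) == 2:
        names = f"{surnames[0]} & {surnames[1]}"
    else:
        # Only the first author's surname is kept; 'al.' keeps its full stop.
        names = f"{surnames[0]} et al."
    if parenthetical:
        # Whole citation in brackets: "...has been shown (Webb 2012)..."
        return f"({names} {year})"
    # Author named in the sentence: "As Webb (2012) showed..."
    return f"{names} ({year})"


print(cite(["Webb"], 2012))
print(cite(["Webb", "Freckleton"], 2007))
# "Second" and "Third" are placeholder surnames, not real co-authors.
print(cite(["Webb", "Second", "Third"], 2010, parenthetical=True))
```

You would never actually need code for this, of course, but the three branches are the whole rule: count the authors, then decide where the brackets go.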

Another point of style: it’s punchier to refer to authors by name, so “Webb et al. (2010) demonstrated that…” rather than “researchers have shown… (Webb et al. 2010)” or worse, “scientists believe… (Webb et al. 2010)”. In doing this, the grammar of the sentence should follow the number of authors to whom you’re referring – so “Webb (2012) states that… he also showed…” but “Webb et al. (2010) state that… they also showed…”

Now: how can you pick up extra points? One of the things we look for is synthesis. One way to demonstrate this is to summarise the findings of several studies in a single sentence, so “…multiple lines of evidence suggest that this effect is weak at best (Smith 2000, Jones et al. 2005, Smith & Jones 2007, Webb 2012)”. Once you get into this habit, then the problem noted above – having to cite a single reference repeatedly over the course of a paragraph – ought to diminish, as each of your paragraphs will be a synthesis of insights you’ve drawn from a number of papers.

Another thing we’re very keen on is critical analysis, which again you can demonstrate through citation, e.g. “Smith (2000) claimed that… However, I agree with Webb et al. (2010) that this was highly unlikely because…” You have shown that there are two points of view, and that, after due consideration and critical analysis, you have come to your own conclusion (the correct one, in this case, naturally).

Finally, everything you reference should appear in a single reference list at the end of the essay (ordered alphabetically by the first author’s surname); nothing that you have not referenced in your essay should be in this list. (Don’t look for a list at the end of this piece, I’m just making stuff up…) Formatting this list is easy: just pick a journal and copy their style (although best to choose a journal that uses the Name (Year) format for referencing, rather than numbered references).

There are numerous minor permutations of formatting; what’s important is that you’re consistent, and that you get the key information in. So, let’s dissect one of my papers:

Webb, T.J. (2012) Marine and terrestrial ecology: unifying concepts, revealing differences. Trends in Ecology & Evolution 27: 535-541

Here we have the author’s name and initials – if there are multiple authors, include all names here. We have the year. Then comes the title of the paper, the name of the journal, the volume of the journal, and the page numbers. That’s all that is necessary, and all of this information should be easily accessible, usually from the front page of any paper you read. For some online articles, you may find only what’s called a ‘doi’ (digital object identifier), which can replace the volume and page no. information (but not the rest). You can cite books in a similar way (authors, year, title, publisher), and chapters from edited books too (authors, year, title, book editors and book title, publisher) – again, have a look at the reference sections of the papers you read for tips. Bold, italics, etc. are optional – again, consistency is the key. Likewise your decision regarding whether or not to abbreviate journal titles (e.g. I could have used TREE for the reference above, but would need to abbreviate all others in my list then too).
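For anyone who finds a recipe easier to follow than a description, the same dissection can be sketched in a few lines of Python. This is only an illustration of one consistent style: the function names, the dictionary layout, and the choice to separate multiple authors with commas are all my assumptions, not a required format.

```python
def format_reference(authors, year, title, journal, volume, pages):
    """Format one journal article as: Authors (Year) Title. Journal Volume: Pages

    `authors` is a list of (surname, initials) pairs; all authors are listed.
    """
    # Assumption: multiple authors are separated by commas.
    names = ", ".join(f"{surname}, {initials}" for surname, initials in authors)
    return f"{names} ({year}) {title}. {journal} {volume}: {pages}"


def sort_references(references):
    """Order reference dictionaries alphabetically by first author's surname."""
    return sorted(references, key=lambda ref: ref["authors"][0][0].lower())


# The paper dissected above, broken into its component parts.
webb_2012 = {
    "authors": [("Webb", "T.J.")],
    "year": 2012,
    "title": "Marine and terrestrial ecology: unifying concepts, revealing differences",
    "journal": "Trends in Ecology & Evolution",
    "volume": 27,
    "pages": "535-541",
}

print(format_reference(**webb_2012))
```

The point of putting the format in one function is the same as the point of the advice: every entry in your list passes through the same template, so consistency comes for free.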

And that’s it. Begin by following this advice, and hopefully as your confidence (and marks) increase you will start to break some of the rules and forge your own style. If you want an independent view, you can download a guide written by University of Exeter students from here.

Oh – one final thing. Spell check. And proofread. Read each other’s essays if that’s helpful. Silly, avoidable mistakes will tip the balance downwards if you’re on the borderline between grades.

B+: When ‘good enough’ is good enough

About a year ago, I built a cupboard to fit into an alcove to one side of the fire in our living room. By any professional standards, it’s a decidedly average piece of joinery. But as a piece of DIY, it is a source of immense (and no doubt disproportionate) pride. It looks how I wanted it to, the joints are neat, edges square, and it functions exactly as intended. Fast forward to this weekend just past, and it seemed about time to finally finish it. And while applying a second coat of Ronseal Quick and Easy Brushing Wax (which performed as expected – if only there were a neat phrase to encapsulate that…), I started pondering on perfectionism.

More precisely, it was as I concluded that the back edge of a shelf, invisible from all accessible angles, could probably forgo its second coat – as I decided, in other words, that ‘good enough’ was, in this case, good enough – that I realised this is perhaps one of the more important lessons I’ve learnt in my working life.

Let’s face it, if you’ve got anywhere at all in science, you’re probably a bit of a perfectionist in at least some areas of your work. You enjoy the challenge posed by difficult problems, and are prepared to work hard until that challenge is met. This is, of course, admirable; it’s also probably necessary in order to get research of any worth done. But as your responsibilities increase beyond research and into management, administration and teaching, and so the number of tasks requiring your attention proliferates (while days remain stubbornly stuck at 24h), then focusing this beam of perfectionism on every single item becomes untenable.

That’s when it’s helpful to realise that sometimes (whisper it) it really doesn’t matter if this isn’t the finest report ever written.

This is simply another way of looking at prioritising. Now, faced with a list of things to do, all of which absolutely have to be done TODAY, prioritising can seem impossible. But although you can’t choose what to do and what to neglect, you can decide which tasks require an A+ effort, and which can be safely relegated to B or C. This grading can occur within different realms of responsibility as well as between them. For a given piece of work, you should ask yourself, what am I doing this for? Who am I doing this for, and do I need to impress them? Most important, what do they expect out of it, what will they do with this output? And then you tailor your effort accordingly. For instance, if a final report on a grant will get graded as either ‘satisfactory’ or ‘not satisfactory’, is there really any point in aiming for ‘excellent’?

This is not meant as an excuse for slapdash or unacceptable work. Rather, the aim is to make sure that your most important tasks – the career-defining paper, the big seminar – get the time they deserve.

Of course, some tasks always warrant an A+ performance. Blogging, for example, is one…