Pushing the Frontier on Text Mining: A conversation with Heather Piwowar

Updated May 17, 2012

SPARC 

Pushing the Frontier of Access for Text Mining: A conversation with Heather Piwowar on one researcher’s attempt to break new ground 

As the academic community struggles to ensure that research articles become more accessible than ever before, a new front in the battle is brewing – a battle to ensure that the information in those articles can also be fully used in the digital research environment.

In particular, researchers increasingly expect that once they have legitimate access to an article (either through an Open Access model or a subscription model), they should be able to process – or “text mine” – the contents of that article to help further their research. However, this is far from easy to accomplish, as University of British Columbia (UBC) postdoctoral student Heather Piwowar recently discovered when she attempted to do just that. It all began with a simple comment on Twitter. In the face of boycotts and bad publicity, are publishers realizing they must loosen their tight grip on usage restrictions as well?

Earlier this year, Heather Piwowar, a UBC postdoc in the Department of Zoology, was engaged in an online conversation with colleagues about available Elsevier content in PubMed Central’s Open Access Subset. Frustrated by the small number of Elsevier articles available for text mining, she tweeted that Elsevier should make its back issues available for text mining, for the progress of science. Much to her surprise, Alicia Wise, Director of Universal Access for Elsevier and a participant in the dialogue, tweeted back and started a dialogue that led to an offer to work with Piwowar to facilitate text-mining access for researchers at her university. (For details of the conversation, see Piwowar’s blog post: http://researchremix.wordpress.com/2012/04/17/elsevier-agrees/)

Through numerous Skype calls and emails, Elsevier executives, UBC librarians, and Piwowar herself have been negotiating an agreement to provide UBC researchers with broad text-mining rights to Elsevier subscription content. While this individual negotiation is not the ideal solution in the eyes of many Open Access advocates (including Piwowar!), many find it an encouraging step that may pressure Elsevier – and other publishers – to make digital articles more useful to researchers.

In a recent conversation, Piwowar shared her story…

Q: What is your research interest?

A: I do research on scientists and what scientists do – specifically how they archive and reuse collected datasets in their research. My studies require collecting evidence of what exactly scientists write in their research papers: how do they describe their data? Ideally I’d be looking for patterns across millions of papers. It is too many papers for me to read with my eyes; I have to read them with my computer. My computer can read the text well enough to find the sorts of sentences I am looking for in the methods and acknowledgements sections, extract the important keywords, keep counts of what it found where, and get the data ready for analysis.
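
[Editor’s illustration] The workflow described above (find relevant sentences in particular sections, extract keywords, keep counts of what was found where) can be sketched in a few lines of Python. This is only a minimal sketch, not Piwowar’s actual tooling: the keyword list, the section names, the toy paper, and the function names `split_sentences` and `mine_paper` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword list; the interview does not say which terms were used.
KEYWORDS = ("dataset", "deposited", "accession", "repository")


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mine_paper(sections):
    """Count keyword hits per (section, keyword) and collect the matching sentences.

    `sections` maps a section name (e.g. "methods") to that section's plain text.
    """
    counts = Counter()
    matches = []
    for name, text in sections.items():
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            for keyword in KEYWORDS:
                if keyword in lowered:
                    counts[(name, keyword)] += 1
                    matches.append((name, keyword, sentence))
    return counts, matches


# Toy input standing in for one paper; real input would be licensed full text.
paper = {
    "methods": "We reused a dataset from an earlier study. Samples were sequenced twice.",
    "acknowledgements": "Raw reads were deposited in a public repository.",
}

counts, matches = mine_paper(paper)
for (section, keyword), n in sorted(counts.items()):
    print(section, keyword, n)
```

Run across many papers, per-section counts like these could be written to a table for later analysis. A real study would likely need a proper sentence tokenizer and more careful matching than plain substring search.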

Q: What limitations have you faced in gathering your research?

A: The biggest barrier is easy access to full text to feed to my computer programs. The PubMed Central Open Access Subset is a good source – it contains open access papers in medicine and biology. The articles are in a zip file, easy to find, easy to download, and clearly marked with permission to do text-mining. Unfortunately, the rest of the literature isn’t like that! There are many open access publications that aren’t in PubMed Central, but they are in scattered domains and there is currently no good way to get them all. Getting content from subscription publishers for text-mining is even worse, because subscription agreements almost always forbid systematic downloading and text-mining. My university has paid for subscriptions, but according to those subscription terms, I can only read the papers with my eyes, not my computer programs. This is stifling my research and the research of others.

Q: Did you ever formally approach Elsevier to ask for permission for text mining?

A: I’m a postdoc, so I didn’t think it was a possibility. I thought by the time I got access, my two-year appointment would be up. Also, I didn’t know where to start or who to talk to…would I just sit down and write: “Dear Elsevier: Please help?” Besides, subscription publishers usually charge research groups extra money for this sort of access, and I didn’t have any extra money.

One way that might have been productive, in retrospect, would have been to talk to a University librarian. I had talked to several librarians earlier in my postdoc about access to a subscription citation database, but they weren’t able to help me in that case… I guess I assumed this situation would be similar. Given that it seemed unlikely to work out, I decided to just buckle down to my research, using the papers I could get access to and try to generalize to the rest. This isn’t the best science, but it is all that seemed achievable.

Q: How did the Twitter exchange that started this whole process go?

A: There was a conversation amongst five of us on Twitter, talking about what Elsevier content was in the PubMed Central Open Access Subset. At some point, Alicia Wise of Elsevier clarified how many Elsevier papers were in the subset. I commented that it was an embarrassingly small percentage of Elsevier’s portfolio, and asked if back papers could be opened up for text mining – I advocated as best I could in 140 characters. Wise’s response was that she was confused by this – subscription content could indeed be text mined. We had a few more messages back and forth, then Wise suggested we continue the conversation in email – Twitter does have its limits!

Q: What happened next?

A: A few days later, Wise set up a Skype call with six high-level Elsevier employees, me, and a UBC librarian. The call lasted one hour. I was asked to describe my research projects that require text mining.

The truth is that many of my research projects – big ones, and ones that take just a few days – require text mining. There aren’t just one or two or three projects…I want to play and explore and figure out what is possible to help inform the methods I propose going forward. Anyway, to keep it easy I described three of my projects that spanned a range of text-mining use cases. One of them was a traditional research project, the second involved identifying and sharing excerpts for citizen science annotation, and the third involved openly disseminating facts and excerpts derived from text mining within a research tool.

I got the strong impression that Elsevier wanted to support my research needs. I left the call thinking a lot might be possible. After they talked to their lawyers, a few days later, Elsevier sent me the connection keys to let me start exploring. They have said they want to make text-mining access available to everyone at the University of British Columbia.

Q: What’s the latest with the process?

A: Elsevier sent a letter to my University librarian, essentially an addendum to the subscription agreement. The agreement is being reviewed and amended by UBC. We expect it to be signed soon. In the meantime, I have a personal agreement with Elsevier to do my research.

Q: Did you think Elsevier’s offer for access to back issues for text mining was for you individually or for the university?

A: At the beginning, I had no idea where this was going to go. Elsevier quickly volunteered to extend access to the university. The University librarians have been very engaged and insightful from the beginning. There is some thought that this could be an opportunity to create a model agreement.

Q: Do you think this particular arrangement holds promise as a new model for text mining?

A: I hope not. I hope it goes a different way. It’s just not scalable to have every researcher or every library negotiate the terms of access. It should just be part of the subscription agreement.

Q: Why isn’t this the right approach?

A: First, Elsevier could easily change its mind – and the terms – going forward. Renewal with these terms depends on their goodwill. Second, even if publishers want to do the right thing, their interests are not aligned with text mining rights for commercial purposes. For-profit publishers want a piece of all pies built on top of the literature, even though publishing costs are covered by subscriptions. Furthermore, many for-profit publishers sell derivative databases: they don’t want competition from commercial text-miners. This sort of monopoly may be good for publishers’ high profit margins, but discouraging commercial reuse comes at the cost of healthy competition for value-added research tools and thus hurts research progress. Third, because publishers control the distribution, text-mining access is limited to their technical capabilities and interests, whereas if it were OA the papers could be replicated and others could innovate with new computer interface access. Finally, this approach only gives text-mining access to subscribers. Often, the best text-miners are in computer science departments. Some universities with great computer science departments don’t have medical schools, so don’t subscribe to medical journals: our most cutting-edge text-miners can’t work on some of our most important problems.

Q: Are you involved in the boycott of Elsevier?

A: Yes, I was one of the early signatories. I have been tweeting and blogging against Elsevier’s business policies and practices since they endorsed the despicable Research Works Act in January. It’s my opinion that Elsevier’s policies and practices are hindering science.

Q: Is it awkward, then, to be negotiating with Elsevier?

A: Nope. Elsevier knew my stance when they approached me on Twitter – it isn’t a secret. My goal is more efficient and effective science. I personally think that gold open access is the best approach for this, but even if funders were to mandate OA going forward, publishers hold the copyright to all the back content. How can we open that up? It is more useful to keep talking with an organization than to just label them as an enemy and walk away.

Q: What impact will this have on your research? 

A: I never thought I was going to get this. It’s happened really quickly. Frankly, I thought Elsevier would say no and it would illuminate the limitations that publishers put on access to material. Now I’m excited to pull in the Elsevier content to my research studies tracking reuse of data sets. It will make my research results stronger.

Q: What’s next?

A: My research results would be even more accurate if I could include results from other publishers. It shouldn’t stop at Elsevier. We plan to now ask other publishers to eliminate their text-mining contract terms too, and communicate our results in the open. No one wants to look more closed than Elsevier right now. I encourage other universities to point to this agreement and expect similar terms.

That said, this approach doesn’t scale. These terms shouldn’t depend on the negotiating prowess of the university. That would be unfair and such a waste of time. These terms should be expected as part and parcel of what it means to subscribe to an article.

Future research methods depend on not just reading the literature with our eyes, but using it with our computers.

Q: Do you have any specific advice for librarians going forward?

A: Imagine what is possible if all research literature could be easily processed by computers. Librarians can be at the center of research through text-mining, if they start leading soon (in the next few months). Insist that subscription contracts are public so that authors know the implications of their publishing decisions (as some universities now do), refuse to sign contracts that discriminate against computers, continue driving for CC-BY open access, and become a hub for text-mining learning on your campus! A cross-discipline mailing list for text-mining practitioners, some learning lunches, and a few did-you-know-this-is-possible guides will inspire and enable researchers on your campus to deeply *use* the literature.

Since this is such a recent – and hotly debated – development, SPARC turned to active members of the research community for their thoughts on this evolving arena. Here’s some reaction to the negotiations between Elsevier and UBC…

Cameron Neylon, a biochemist at the Rutherford Appleton Laboratory in Didcot, England:

We shouldn’t be in this situation in the first place. We should have access to the corpus with no need for negotiations. When we do have subscription access, we don’t always have access to what we need in our research. So we have this ridiculous situation where it’s fine to send hundreds of students in to take notes manually and release the results, but not to do it more efficiently with a machine. Subscription publishers are trapped in this space of trying desperately not to give up the possibility of income.

All that said, we need to give credit where credit is due. Elsevier has been more proactive and open to suggestions in this process – and open to having the conversation in public – than any other subscription publisher. What they have done, for them, is a big step and it’s a positive step. But you have to keep it in context. Once one publisher agrees, the rest are more or less obliged to fall in line, especially when it’s as large as Elsevier. It’s likely to increase pressure.

The fact that the terms have been made openly available sets that now as a minimum expectation going into negotiations. The question here is the balance. Some smart people at Elsevier see where the wind is blowing; others are standing on the beach demanding the tide go back out. The people involved with this deserve some credit for pushing forward. Broadly, I think it’s a good thing.

Lea Starr, Associate University Librarian at the University of British Columbia:

I’m pleased to see it happen. Initially, I was really excited. But now stepping back, my reaction is a bit tempered. We need to have more available through Open Access. I don’t think this solves our access problem, but it’s a good step.

I’m hopeful that Elsevier realizes that the true value of the back content isn’t its income potential – rather, the content can be used to let people use it to publish more research. The fact that Elsevier stepped forward will encourage others to do it.

We’d like to take what Elsevier gave us to make it a model that we can use with other publishers. We are currently working with university counsel for a common addendum. Once the agreement is finalized, we will try to get the word out about the availability of this data to the university. We’ll put it in our own library website, distribute it through the publication of our daily public affairs information, and work with our library liaison.

We hope that the agreement with Elsevier can be applied to our contract with the Canadian Research Knowledge Network, which negotiates for 71 institutions in the country.

Mike Taylor, Research Associate, Department of Earth Sciences, University of Bristol, UK:

It’s good news that Elsevier has made this concession to Heather, but it took a vice president, three directors, a deputy director and an account manager, plus the involvement of a lawyer and who knows how many other people, to establish that one researcher could text-mine Elsevier content for one project (and even that still with restrictions). The process is so ludicrously inefficient that even Elsevier’s own director of universal access is worried that she might be overwhelmed by requests from others who also want text mining access.

Elsevier desperately needs some good press right now. The sane, sensible thing for them to do would be just to lift all restrictions on mining, and say that anyone who has legitimate access to Elsevier-published articles (whether via subscriptions or because they are sponsored articles) is free to use them in text-mining. Elsevier – and other publishers – will argue that the need to do text mining has never been demonstrated. But the speculative mining projects that would demonstrate that demand aren’t happening because the barriers are too high. I don’t think this specific concession is worth very much, but I hope it opens the issue up.

Casey Bergman, Faculty of Life Sciences, University of Manchester:

This is great for Heather’s work, which is really cutting edge and making a big splash. It is also an important precedent for one of the big publishers to publicly agree to go beyond their standard restrictive contracts. Finally, it shows that by taking a non-militant stance with the big publishers, researchers can extract important pragmatic victories.

I think one of the big lessons here is that blogging and tweeting about your troubles getting access to text for mining is a powerful strategy for raising awareness about these issues and getting publishers to change their policies. Big publishers are keenly aware and very sensitive to what is being said about them in the blogosphere and on Twitter. Social media now provides researchers new tools for changing the scholarly landscape, since the “means of production” are not solely in the hands of the big publishers.

Yet, I doubt this will pressure other publishers. Each publisher is its own operational and legal entity. Also, the UBC deal is not finalized. From our experience with the text2genome project it takes six months to two years per publisher to get a final written agreement to mine and release extracted content (http://www.nature.com/news/trouble-at-the-text-mine-1.10184), and often when pushed these deals fall through on the issue of releasing extracted content. I do, however, think this event will make it much harder for Elsevier to say no to other groups in the future, since they have clearly said yes in public already.

The best solution is for governments to mandate universal gold open access with deposition of all scholarly works in a single repository in a single format with a single license. This goal is not attainable immediately, so a better interim solution is for researchers to choose to publish in gold open access journals so that their papers are deposited in various repositories that allow unrestricted text mining. I don’t think this deal with Elsevier will lead to a meaningful opening up of the literature for text mining just yet. While researchers can raise awareness and chip away at this issue on a small scale, real change can only come from governments and funding agencies who have the power to make universal open access a requirement and provide the funds to make this a reality.

Peter Murray-Rust, Reader in Molecular Informatics at the University of Cambridge and Senior Research Fellow of Churchill College: 

I’ve been in close touch with Heather and she is part of a group we’ve set up under the Open Knowledge Foundation to look at a more comprehensive solution to this issue. I’m calling it open content mining because it’s not just restricted to text. I think we have the rights to images, tables, numbers, graphs, video, audio – the technology exists for all of these. I have a long history with several publishers on content mining. Progress from the publisher’s point of view is very limited. Legally, publishers hold the upper hand.

I don’t think the negotiation with the University of British Columbia is progress. I think it takes us backwards. It sets the scene for every library in the world to negotiate with up to 100 publishers and every publisher to negotiate with thousands of research institutions. It’s not efficient; that’s why we have licenses.

It will take several things to turn the tide on this issue. In the U.K., there is hope that access to content mining may be mandated. Politicians are getting effectively involved and are taking the view that publicly funded content must be free. It is also conceivable – but not likely – that publishers might recognize an economic advantage to providing access to content mining. Or, it might take something close to civil disobedience, in which scientists feel they have waited long enough and they download the material anyway – but technically and legally there would be barriers.

The group working on this issue through the Open Knowledge Foundation plans to draft a manifesto on content mining soon. I have no optimism that a natural state of things will take us forward, but have some hope now that politicians have discovered this cause.
