

 


Our corpus is your corpus.

Posted by: Ralf Laemmel


 

[ Here is our P3P corpus, be it your corpus as well. ]
Giving away your corpus (in empirical language analysis or elsewhere) is perhaps not a firmly established practice, but it is nothing original or strange either. It makes a lot of sense! Stop limiting the impact of your research efforts! Stop wasting the time of your community members! Sharing corpora is one of the many good ideas of Research 2.0: see SSE'10 (and friend events), eScience @ Microsoft, R2oSE, ...
 

Computer Science vs. Science

When you do academic CS research in programming- or software-development-related contexts, the culture of validation these days is such that you are often expected to provide online access to your program, application, library, tool, what have you, as an implementation or illustration. Various open-source repositories are used to this end as a backend (a storage facility), but author-hosted download locations of any sort are also widely used. In basic terms, if you write a paper, you include a URL. (There is one exception: if your work leverages Haskell, you can usually include the complete source code right in your paper so that one gets convenient access through copy and paste. Sorry for the silly joke.) Metadata-wise, common practice is nowhere near perfect, but it's perfect compared to what follows.
 
When you do empirical analysis in CS, which results in some statements or data about software/IT artifacts, the culture of validation is essentially that of science. In particular, reproducibility is the crucial requirement. You describe the methodology of your analysis in detail: you define your hypotheses, your input, your techniques for measurement, your results (which you also interpret), your threats to validity, what have you. Downloads aren't integral to science. What would you want to download anyway?

Message of this post?!

 
I suggest that various artifacts of an empirical analysis in CS in general, and in empirical language analysis in particular, qualify as valuable downloads. In this post, I want to call out the corpora (as in corpora of source projects, buildable projects, built projects, runnable projects, executed projects, demos, etc.).
 

Beyond reproducibility in CS

 
What's indeed not yet commonplace (if it is done at all) is convenient access to the corpora underlying empirical analyses. Consider for example Baxter et al.'s paper on structural properties of Java software, or Cranor et al.'s paper on P3P deployment. These are two seminal papers in their fields. I would lose a night of sleep over each of the two corpora.
Wouldn't it be helpful for researchers if such corpora were made available for one-click download, including useful metadata and potentially even tooling? Let's suppose such convenient access became a best practice. First, reproducibility would be improved. Second, derived research would be simplified. Third, incentives for collaboration would be added.
 
I contend that convenient access adds little pain for the original author, but adds huge value for the scientific community. Why should we need to execute the description of some corpus from some paper, if that requires substantial work for us while the corpus could easily be shared by the primary author? Why should we work hard to "reproduce the corpus" if a little help from the original authors would make reproducibility (of the corpus and perhaps most of the research work) a breeze?
 

Naysayers -- get lost

 
I can think of many reasons why 'convenient access' is not getting off the ground. Here are a few obvious ones:
 
  • "It's extra work, even if it is little extra work." This problem can be solved if incentives are created. For instance, publications on empirical analysis with 'convenient access' to the corpus could be rated higher than those without. Also, just as many venues have tool papers, there could be corpus papers.
 
  • "There is sufficient, inconvenient access available already." At least, for one of the two examples above, I fully understand how I could go about gathering the corpus myself, but I have not executed this plan, even though I could really use this corpus in some ongoing research activity. It's just too much work for me. I am effectively hampered in benefitting from the authors' research beyond their immediate results.
 
  • "Provision of convenient access is too difficult." Think of a corpus of Java programs. Suddenly, an access provider gets into the business of configuration management. After all, convenience would imply that the corpus builds and runs out of the box. I think the short-term answer is that access to the corpus without extra "out-of-the-box" magic is still more convenient than no access. The long-term answer is that we may need a notion of remote access to corpora, where I can give you access to my corpus in my environment through appropriate web-based or service-oriented interfaces.
 
  • "Convenient access gives a head start to the competition." I refuse to believe that this really matters much in academic practice. For instance, I am sure that the research groups behind the above-mentioned papers have no "corpus monopoly" in mind. I have not done much work on empirical analysis, but I have experience with papers that "give away details", and I must say that the papers which give away the most typically coincide with those which have the highest impact in all possible ways.
 
  • "There is copyright & Co. in the way." Yes, there is. This is a serious problem, and we had better focus on solving it soon if we want to get anywhere with science and (IT) society in this age. This post would just explode if I tried to comment on that issue here. There are many good ideas around on this issue, and we all understand that some amount of sharing works even now in this very imperfect world of ours. If you are pro-Research 2.0, don't get bogged down by this red herring.
  
Well, I can think of quite a number of other reasons, but I reckon that all the usual suspects have been named, and everything else can be delegated perhaps to some discussion on this blog or elsewhere.
  
Regards,
Ralf Lämmel
  
PS: CS is of an age where empirical research is becoming viral and vital. I am grateful for talking to Jean-Marie occasionally, with his lucid vision of Research 2.0 and linguistics for software languages, two topics that are strongly connected. Empirical analysis of software languages has got to be an integral part of software language linguistics. Specialized software-engineering conferences like SLE, ICPC and MSR, or even big ones like ICSE or OOPSLA, have included empirical research for a while now.

 


Research 2.0, how will it be?

Posted by: Ralf Laemmel


I don't know why some people think that Research 2.0 is sort of fluff. OK, we don't know yet exactly what it is, but this status doesn't suggest to me that we should forbid ourselves from talking about Research 2.0. I am pretty certain that the contemporary model of static publications (measured by numbers and weight of archival information), combined with travel-intensive presentations, makes little sense in view of new ecological awareness and new technical options. We need to find and agree on improved means of publication and communication.

I am happy to see this forum come about.

You can follow me on Twitter if you like.

twitter.com/notquiteabba

 

Regards,

Ralf

 

