| “Old Books” photo by flickr user Iguana Joe, used by permission (CC-by-nc) |
Earlier this week, the Harvard Library announced its new open metadata policy, which was approved by the Library Board earlier this year, along with two initial metadata releases. The policy is straightforward:
The Harvard Library provides open access to library metadata, subject to legal and privacy factors. In particular, the Library makes available its own catalog metadata under appropriate broad use licenses. The Library Board is responsible for interpreting this policy, resolving disputes concerning its interpretation and application, and modifying it as necessary.
The first releases under the policy include the metadata in the DASH repository. Though this metadata has been available through open APIs since early in the repository’s history, the open metadata policy makes clear the open licensing terms that the data is provided under.
The release of a huge percentage of the Harvard Library’s bibliographic metadata for its holdings is likely to have a much bigger impact. We’ve provided 12 million records, the vast majority of Harvard’s bibliographic data, describing Harvard’s library holdings in MARC format. The records are released under a CC0 license, accompanied by a request to adhere to a set of community norms that I think are quite reasonable, primarily calling for attribution to Harvard and our major partners in the release, OCLC and the Library of Congress.
OCLC in particular has praised the effort, saying it “furthers [Harvard's] mandate from their Library Board and Faculty to make as much of their metadata as possible available through open access in order to support learning and research, to disseminate knowledge and to foster innovation and aligns with the very public and established commitment that Harvard has made to open access for scholarly communication. I’m pleased to say that they worked with OCLC as they thought about the terms under which the release would be made.” We’ve gotten nice coverage from the New York Times, Library Journal, and Boing Boing as well.
Many people have asked what we expect people to do with the data. Personally, I have no idea, and that’s the point. I’ve seen over and over that when data is made openly available with the fewest impediments — legal and technical — people are incredibly creative about finding innovative uses for the data that we never could have predicted. Already, we’re seeing people picking up the data, exploring it, and building on it.
(I’m sure I’ve missed some of the ways people are using the data. Let me know if you’ve heard of others, and I’ll update this list.)
As I’ve said before, “This data serves to link things together in ways that are difficult to predict. The more information you release, the more you see people doing innovative things.” These examples are the first evidence of that potential.
John Palfrey, who was really the instigator of the open metadata project, has been especially interested in getting other institutions to make their own collection metadata publicly available, and the DPLA stands ready to help. They’re running a wiki with instructions on how to add your own institution’s metadata to the DPLA service.
It’s hard to list all the people who make initiatives like this possible, since there are so many, but I’d like to mention a few major participants (in addition to John): Jonathan Hulbert, Tracey Robinson, David Weinberger, and Robin Wendler. Thanks to them and the many others who have helped in various ways.
| “Majesty of Law” Statue in front of the Rayburn House Office Building in Washington, D.C., photo by flickr user NCinDC, used by permission (CC-by-nd) |
Here is my written testimony filed in association with my appearance yesterday at the hearing on “Federally Funded Research: Examining Public Access and Scholarly Publication Interests” before the Subcommittee on Investigations and Oversight of the House Committee on Science, Space and Technology. My thanks to Chairman Broun, Ranking Member Tonko, and the committee for allowing me the opportunity to speak with them.
[Update 3/30/12: Coverage from Chronicle of Higher Education. Update 4/2/12: Video of the session is available from the House Science Committee as well.]
Statement of Stuart M. Shieber before the
Committee on Science, Space and Technology
Subcommittee on Investigations and Oversight
U.S. House of Representatives
March 29, 2012
Chairman Broun and Members of the Subcommittee:
My name is Stuart Shieber. I am the James O. Welch, Jr. and Virginia B. Welch Professor of Computer Science at Harvard University. My primary field of research is computational linguistics, the study of human language from a computer science perspective, often with application to the engineering of useful computer systems that manipulate language. As a faculty member, I led the development and enactment of Harvard’s open-access policies. Since October of 2008, I have served in the additional role as the faculty director of Harvard’s Office for Scholarly Communication. Thank you for the opportunity to speak with you today about some of the actions that we have taken at Harvard to provide the broadest possible access to the results of our research.
The mission of the university is to create, preserve, and disseminate knowledge to the benefit of all. In Harvard’s Faculty of Arts and Sciences (FAS), where I hold my faculty post, we codify this in the FAS Grey Book, which states that research policy “should encourage the notion that ideas or creative works produced at the University should be used for the greatest possible public benefit. This would normally mean the widest possible dissemination and use of such ideas or materials.”
At one time, the widest possible dissemination was achieved by distributing the scholarly articles describing the fruits of research in the form of printed issues of peer-reviewed journals, sent to the research libraries of the world for reading by their patrons, and paid for by subscription fees. These fees covered the various services provided to the authors of the articles — management of the peer review process, copy-editing, typesetting, and other production processes — as well as the printing, binding, and shipping of the physical objects.
Thanks to the forward thinking of federal science funding agencies, including NSF, DARPA, NASA, and DOE, we now have available computing and networking technologies that hold the promise of transforming the mechanisms for disseminating and using knowledge in ways not imaginable even a few decades ago. The internet allows nearly instantaneous distribution of content for essentially zero marginal cost to a large and rapidly increasing proportion of humanity. Ideally, this would ramify in a universality of access to research results, thereby truly achieving the widest possible dissemination.
The benefits of such so-called open access are manifold. The signatories of the 2002 Budapest Open Access Initiative state that
The public good [open access] make[s] possible is the world-wide electronic distribution of the peer-reviewed journal literature and completely free and unrestricted access to it by all scientists, scholars, teachers, students, and other curious minds. Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.
From a more pragmatic point of view, a large body of research has shown that public research has a large positive impact on economic growth, and that access to the scholarly literature is central to that impact. Martin and Tang’s recent review of the literature concludes that “there have been numerous attempts to measure the economic impact of publicly funded research and development (R&D), all of which show a large positive contribution to economic growth.”[1] It is therefore not surprising that Houghton’s modeling of the effect of broader public access to federally funded research shows that the benefits to the US economy come to the billions of dollars and are eight times the costs.[2]
Opening access to the literature makes it available not only to human readers, but to computer processing as well. There are some million and a half scholarly articles published each year.[3] No human can read them all or even the tiny fraction in a particular subfield, but computers can, and computer analysis of the text, known as text mining, has the potential not only to extract high-quality structured data from article databases but even to generate new research hypotheses. My own field of research, computational linguistics, includes text mining. I have collaborated with colleagues in the East Asian Languages and Civilization department on text mining of tens of thousands of classical Chinese biographies and with colleagues in the History department on computational analysis of pre-modern Latin texts. Performing similar analyses on the current research literature, however, is encumbered by proscriptions of copyright and contract because the dominant publishing mechanisms are not open.
In Harvard’s response to the Office of Science and Technology Policy’s request for information on public access,[4] Provost Alan Garber highlighted the economic potential for the kinds of reuse enabled by open access.
Public access not only facilitates innovation in research-driven industries such as medicine and manufacturing. It stimulates the growth of a new industry adding value to the newly accessible research itself. This new industry includes search, current awareness, impact measurement, data integration, citation linking, text and data mining, translation, indexing, organizing, recommending, and summarizing. These new services not only create new jobs and pay taxes, but they make the underlying research itself more useful. Research funding agencies needn’t take on the job of providing all these services themselves. As long as they ensure that the funded research is digital, online, free of charge, and free for reuse, they can rely on an after-market of motivated developers and entrepreneurs to bring it to users in the forms in which it will be most useful. Indeed, scholarly publishers are themselves in a good position to provide many of these value-added services, which could provide an additional revenue source for the industry.
Finally, free and open access to the scholarly literature is an intrinsic good. It is in the interest of the researchers generating the research and those who might build upon it, the public who take interest in the research, the press who help interpret the results, and the government who funds these efforts. All things being equal, open access to the research literature ought to be the standard.
Unfortunately, over the last several years, it has become increasingly clear to many that this goal of the “widest possible dissemination” was in jeopardy because of systemic problems in the current mechanisms of scholarly communication, which are not able to take full advantage of the new technologies to maximize the access to research and therefore its potential for social good.
By way of background, I should review the standard process for disseminating research results. Scholars and researchers — often with government funding — perform research and write up their results in the form of articles, which are submitted to journals that are under the editorial control of the editor-in-chief and editorial boards made up of other scholars. These editors find appropriate reviewers, also scholars, to read and provide detailed reviews of the articles, which authors use to improve the quality of the articles. Reviewers also provide advice to the editors on whether the articles are appropriate for publication in the journal, the final decisions being made by the editors. Participants in these aspects of the publishing process are overwhelmingly volunteers, scholars who provide their time freely as a necessary part of their engagement in the research enterprise. The management of this process, handling the logistics, is typically performed by the journal’s publisher, who receives the copyright in the article from the author for its services. The publisher also handles any further production process such as copy-editing and typesetting of accepted articles and their distribution to subscribers through print issue or more commonly these days through online access. This access is provided to researchers by their institutional libraries, which pay for annual subscriptions to the journals.
Libraries have observed with alarm a long-term dramatic rise in subscription costs of journals. The Association of Research Libraries, whose members represent the leading research libraries of the United States and Canada, has tracked serials expenditures for over three decades. From 1986 through 2010 (the most recent year with available data), expenditures in ARL libraries have increased by a factor of almost 5. Even discounting for inflation, the increase is almost 2.5 times. These increases correspond to an annualized rate of almost 7% per year, during a period in which inflation has averaged less than 3%.[5]
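As a rough check on how these figures fit together, the annualized rate can be recovered from the overall growth factor by taking the 24th root (1986 through 2010). The sketch below is illustrative only: it treats "almost 5" and "almost 2.5" as exactly 5 and 2.5, which is an assumption made for the arithmetic, and the function name is hypothetical rather than anything from the ARL data.

```python
def annualized_rate(growth_factor: float, years: int) -> float:
    """Constant yearly rate that compounds to growth_factor over the given number of years."""
    return growth_factor ** (1.0 / years) - 1.0


YEARS = 2010 - 1986  # 24 years of ARL data

# Assumption: the rounded factors quoted in the text are used as exact values.
NOMINAL_FACTOR = 5.0   # "a factor of almost 5"
REAL_FACTOR = 2.5      # "almost 2.5 times" after discounting for inflation

nominal = annualized_rate(NOMINAL_FACTOR, YEARS)
real = annualized_rate(REAL_FACTOR, YEARS)
# The inflation implied by the two factors is their ratio, annualized.
implied_inflation = annualized_rate(NOMINAL_FACTOR / REAL_FACTOR, YEARS)

print(f"nominal growth per year:    {nominal:.1%}")
print(f"real growth per year:       {real:.1%}")
print(f"implied inflation per year: {implied_inflation:.1%}")
```

Under these rounded inputs the nominal rate comes out just under 7% per year and the implied inflation just under 3%, consistent with the figures quoted above.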
Another diagnostic of the market dysfunction in the journal publishing system is the huge disparity in subscription costs between different journals. Bergstrom and Bergstrom showed that even within a single field of research, commercial journals are on average five times more expensive per page than non-profit journals.[6] When compared by cost per citation, which controls better for journal quality, the disparity becomes even greater, a factor of 10 times. Odlyzko notes that “The great disparity in costs among journals is a sign of an industry that has not had to worry about efficiency.”[7] Finally, the extraordinary profit margins, increasing even over the last few years while research libraries’ budgets were under tremendous pressure, provide yet another signal of the absence of a functioning competitive market.
The Harvard library system is the largest academic library in the world, and the fifth largest library of any sort. In attempting to provide access to research results to our faculty and students, the university subscribes to tens of thousands of serials at a cost of about 9 million dollars per year. Nonetheless, we too have been buffeted by the tremendous growth in journal costs over the last decades, with Harvard’s serials expenditures growing by a factor of 3 between 1986 and 2004.[8] Such geometric increases in expenditures could not be sustained indefinitely. Over the years since 2004 our journal expenditure increases have been curtailed through an aggressive effort at deduplication, elimination of print subscriptions, and a painful series of journal cancellations. As a researcher, I know that Harvard does not subscribe to all of the journals that I would like access to for my own research, and if Harvard, with its scale, cannot provide optimal subscription access, other universities without our resources are in an even more restricted position.
Correspondingly, the articles that we ourselves generate as authors are not able to be accessed as broadly as we would like. We write articles not for direct financial gain — we are not paid for the articles and receive no royalties — but rather so that others can read them and make use of the discoveries they describe. To the extent that access is limited, those goals are thwarted.
The economic causes of these observed phenomena are quite understandable. Journal access is a monopolistic good. Libraries can buy access to a journal’s articles only from the publisher of that journal, by virtue of the monopoly character of copyright. In addition, the high prices of journals are hidden from the “consumers” of the journals, the researchers reading the articles, because an intermediary, the library, pays the subscriptions on their behalf. The market therefore embeds a moral hazard. Under such conditions, market failure is not surprising; one would expect inelasticity of demand, hyperinflation, and inefficiency in the market, and that is what we observe. Prices inflate, leading to some libraries canceling journals, leading to further price increases to recoup revenue: a spiral that ends in higher and higher prices paid by fewer and fewer libraries. The market is structured to provide institutions a Hobson’s choice between unsustainable expenditures and reduced access.
The unfortunate side effect of this market dysfunction has been that as fewer libraries can afford the journals, access to the research results they contain is diminished. In 2005, then Provost of Harvard Steven Hyman appointed an ad hoc committee, which I chaired, to examine these issues and make recommendations as to what measures Harvard might pursue to mitigate this problem of access to our writings. Since then, we have been pursuing a variety of approaches to maximize access to the writings of Harvard researchers.
One of these approaches involves the self-imposition by faculty of an open-access policy according to which faculty grant a license to the university to distribute our scholarly articles and commit to providing copies of our manuscript articles for such distribution. By virtue of this kind of policy, the problem of access limitation is mitigated by providing a supplemental venue for access to the articles. Four years ago, in February of 2008, the Faculty of Arts and Sciences at Harvard became the first school to enact such a policy,[9] by unanimous vote as it turned out.
In order to guarantee the freedom of faculty authors to choose the rights situation for their articles, the license is waivable at the sole discretion of the author, so faculty retain control over whether the university is granted this license. But the policy has the effect that by default, the university holds a license to our articles, which can therefore be distributed from a repository that we have set up for that purpose. Since the FAS vote, six other schools at Harvard — Harvard Law School, Harvard Kennedy School of Government, Harvard Graduate School of Education, Harvard Business School, Harvard Divinity School, and Harvard Graduate School of Design — have passed this same kind of policy, and similar policies have been voted by faculty bodies at many other universities as well, including Massachusetts Institute of Technology, Stanford, Princeton, Columbia, and Duke. Notably, the policies have seen broad faculty support, with faculty imposing these policies on themselves typically by unanimous or near unanimous votes.
Because of these policies in the seven Harvard schools, Harvard’s article repository, called DASH (for Digital Access to Scholarship at Harvard),[10] now provides access to over 7,000 articles representing 4,000 Harvard-affiliated authors. Articles in DASH have been downloaded almost three-quarters of a million times.[11] The number of waivers of the license has been very small; we estimate the waiver rate at about 5%. Because of the policy, as faculty authors we are retaining rights to openly distribute the vast majority of the articles that we write.
The process of consultation in preparation for the faculty vote was a long one. I started speaking with faculty committees, departments, and individuals about two years before the actual vote. During that time and since, I have not met a single faculty member or researcher who objected to the principle underlying the open-access policies at Harvard, to obtain the widest possible dissemination for our scholarly results, and have been struck by the broad support for the kind of open dissemination of articles that the policy and the repository allow.
This approach to the access limitation problem, the provision of supplemental access venues, is also seen in the extraordinarily successful public access policy of the National Institutes of Health (NIH), which Congress mandated effective April 2008. By virtue of that policy, researchers funded by NIH provide copies of their articles for distribution from NIH’s PubMed Central (PMC) repository. Today, PMC provides free online access to 2.4 million articles downloaded a million times per day by half a million users.[12] NIH’s own analysis has shown that a quarter of the users are researchers. The hundreds of thousands of articles they are accessing per day demonstrate the large latent demand for articles not being satisfied by the journals’ subscription base. Companies account for another 17%, showing that the policy benefits small businesses and corporations, who need access to scientific advances to spur innovation. Finally, the general public accounts for 40% of the users, some quarter of a million people per day, demonstrating that these articles are of tremendous interest to the taxpayers who fund the research in the first place and who deserve access to the results that they have underwritten.
The standard objection to these open-access policies is that supplemental access to scholarly articles, such as that provided by institutional repositories like Harvard’s DASH or subject-based repositories like NIH’s PubMed Central, could supplant subscription access to such an extent that subscriptions would come under substantial price pressure. Sufficient price pressure, in this scenario, could harm the publishing industry, the viability of journals, and the peer review and journal production processes.
There is no question that the services provided by journals are valuable to the research enterprise, so such concerns must be taken seriously. By now, however, these arguments have been aired and addressed in great detail. I recommend the report “The Future of Taxpayer-Funded Research: Who Will Control Access to the Results?” by my co-panelist Elliott Maxwell,[13] which provides detailed support for the report’s conclusion that “There is no persuasive evidence that increased access threatens the sustainability of traditional subscription-supported journals, or their ability to fund rigorous peer review.” The reasons are manifold, including the fact that supplemental access covers only a fraction of the articles in any given journal, is often delayed relative to publication, and typically provides a manuscript version of the article rather than the version of record. Consistent with this reasoning, the empirical evidence shows no such discernible effect. After four years of the NIH policy, for instance, subscription prices have continued to increase, as have publisher margins. The NIH states that “while the U.S. economy has suffered a downturn during the time period 2007 to 2011, scientific publishing has grown: The number of journals dedicated to publishing biological sciences/agriculture articles and medicine/health articles increased 15% and 19%, respectively. The average subscription prices of biology journals and health sciences journals increased 26% and 23%, respectively. Publishers forecast increases to the rate of growth of the medical journal market, from 4.5% in 2011 to 6.3% in 2014.”[14]
Nonetheless, it does not violate the laws of economics that increased supplemental access (even if delayed) to a sufficiently high proportion of articles (even if to a deprecated version) could put price pressure on subscription journals, perhaps even so much so that journals would not be able to recoup their costs. In this hypothetical case, would that be the end of journals? No, because even if publishers (again, merely by hypothesis and counterfactually) add no value for the readers (beyond what the readers are already getting in the [again hypothetical] universal open access), the author and the author’s institution gain much value: vetting, copyediting, typesetting, and most importantly, imprimatur of the journal. This is value that authors and their institutions should be, would be, and are willing to pay for. The upshot is that journals will merely switch to a different business model, in which the journal charges a one-time publication fee to cover the costs of publishing the article.
I state this as though this publication-fee revenue model is itself hypothetical, but it is not. Open-access journals already exist in the thousands. They operate in exactly the same way as traditional subscription journals — providing management of peer review, production services, and distribution — with the sole exception that they do not charge for online access, so that access is free and open to anyone. The publication-fee revenue model for open-access journals is a proven mechanism. The prestigious non-profit open-access publisher Public Library of Science is generating surplus revenue and is on track to publish some 3% of the world biomedical literature through its journal PLoS ONE alone. The BioMed Central division of the commercial publisher Springer is generating profits for its parent company using the same revenue model. Indeed, the growth of open-access journals over the past few years has been meteoric. There are now over 7,000 open-access journals,[15] many using the publication-fee model, and many of the largest, most established commercial journal publishers — Elsevier, Springer, Wiley-Blackwell, SAGE — now operate open-access journals using the publication-fee revenue model. Were supplemental access to cause sufficient price pressure to put the subscription model in danger, the result would merely be further uptake of this already burgeoning alternative revenue model.
In this scenario, the cost of journal publishing would be borne not by the libraries on behalf of their readers, but by funding agencies and research institutions on behalf of their authors. Already, funding agencies such as Wellcome Trust and Howard Hughes Medical Institute underwrite open access author charges, and in fact mandate open access. Federal granting agencies such as NSF and NIH allow grant funds to be used for open-access publication fees as well (though grantees must prebudget for these unpredictable charges). Not all fields have the sort of grant funding opportunities that could underwrite these fees. For those fields, the researcher’s employing institution, as de facto funder of the research, should underwrite charges for publication in open-access journals. Here again, Harvard has taken an early stand as one of the initial signatories — along with Cornell, Dartmouth, MIT, and University of California, Berkeley — of the Compact for Open-Access Publishing Equity,[16] which commits these universities and the dozen or so additional signatories to establishing mechanisms for underwriting reasonable open-access publication fees. The Compact acknowledges the fact that the services that journal publishers provide are important, cost money, and deserve to be funded, and commits the universities to doing so, albeit with a revenue model that avoids the market dysfunction of the subscription journal system.
The primary advantage of the open-access journal publishing system is the open access that it provides. Since revenue does not depend on limiting access to those willing to pay, journals have no incentive to limit access, and in fact have incentive to provide as broad access as possible to increase the value of their brand. In fact, open-access journals can provide access not only in the traditional sense, allowing anyone to access the articles for the purpose of reading them, but can provide the articles unencumbered by any use restrictions, thereby allowing the articles to be used, re-used, analyzed, and data-mined in ways we are not even able to predict.
A perhaps less obvious advantage of the publication-fee revenue model for open-access journals is that the factors leading to the subscription market failure do not inhere in the publication-fee model. Bergstrom and Bergstrom[17] explain why:
Journal articles differ [from conventional goods such as cars] in that they are not substitutes for each other in the same way as cars are. Rather, they are complements. Scientists are not satisfied with seeing only the top articles in their field. They want access to articles of the second and third rank as well. Thus for a library, a second copy of a top academic journal is not a good substitute for a journal of the second rank. Because of this lack of substitutability, commercial publishers of established second-rank journals have substantial monopoly power and are able to sell their product at prices that are much higher than their average costs and several times higher than the price of higher quality, non-profit journals.
By contrast, the market for authors’ inputs appears to be much more competitive. If journals supported themselves by author fees, it is not likely that one Open Access journal could charge author fees several times higher than those charged by another of similar quality. An author, deciding where to publish, is likely to consider different journals of similar quality as close substitutes. Unlike a reader, who would much prefer access to two journals rather than to two copies of one, an author with two papers has no strong reason to prefer publishing once in each journal rather than twice in the cheaper one.
If the entire market were to switch from Reader Pays to Author Pays, competing journals would be closer substitutes in the view of authors than they are in the view of subscribers. As publishers shift from selling complements to selling substitutes, the greater competition would be likely to force commercial publishers to reduce their profit margins dramatically.
Again, the empirical evidence supports this view. Even the most expensive open-access publication fees, such as those of the prestigious Public Library of Science journals, are less than $3,000 per article, with a more typical value in the $1,000-1,500 range. By contrast, the average revenue per article for subscription journal articles is about $5,000. Thus, the open-access model better leverages free market principles: Despite providing unencumbered access to the literature, it costs no more overall per article, and may end up costing much less, than the current system. The savings to universities and funding agencies could be substantial.
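The per-article comparison above can be made concrete with a few lines of arithmetic. This is only a sketch using the round figures quoted in the text ($5,000 average subscription revenue per article, a $3,000 upper bound on fees, and a typical range of $1,000 to $1,500); the labels and the function name are hypothetical, and the calculation ignores everything except the headline per-article numbers.

```python
# Round figures taken from the paragraph above, in US dollars per article.
SUBSCRIPTION_REVENUE_PER_ARTICLE = 5_000

# Assumption: the "less than $3,000" ceiling is treated as exactly $3,000.
PUBLICATION_FEES = {
    "highest fees (upper bound)": 3_000,
    "typical fee, high end": 1_500,
    "typical fee, low end": 1_000,
}


def savings_fraction(fee: float, baseline: float = SUBSCRIPTION_REVENUE_PER_ARTICLE) -> float:
    """Fraction by which a publication fee undercuts the subscription revenue per article."""
    return 1.0 - fee / baseline


savings = {label: savings_fraction(fee) for label, fee in PUBLICATION_FEES.items()}

for label, fraction in savings.items():
    print(f"{label}: {fraction:.0%} below subscription revenue per article")
```

On these figures, even the most expensive fees sit well below the subscription revenue per article, and typical fees are a small fraction of it.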
I began my comments by quoting the mission of academics such as myself to provide the widest possible dissemination — open access — to the ideas and knowledge resulting from our research. Government, too, has an underlying goal of promoting the dissemination of knowledge, expressed in Thomas Jefferson’s view that “by far the most important bill in our whole code is that for the diffusion of knowledge among the people.”[18] The federal agencies and science policies that this committee oversees have led to knowledge breakthroughs of the most fundamental sort — in our understanding of the physical universe, in our ability to comprehend fundamental biological processes, and, in my own field, in the revolutionary abilities to transform and transmit information.
Open access policies build on these information technology breakthroughs to maximize the return on the taxpayers’ enormous investment in that research, and magnify the usefulness of that research. They bring economic benefits that far exceed the costs. The NIH has shown one successful model, which could be replicated at other funding agencies, as envisioned in the recently re-introduced bipartisan Federal Research Public Access Act (FRPAA).
Providing open access to the publicly-funded research literature — amplifying the “diffusion of knowledge” — will benefit researchers, taxpayers, and every person who gains from new medicines, new technologies, new jobs, and new solutions to longstanding problems of every kind.
[1] Ben R. Martin and Puay Tang, The benefits from publicly funded research, SEWPS Paper No. 161, SPRU—Science and Technology Policy Research, University of Sussex, Brighton (2007). http://www.sussex.ac.uk/spru/documents/sewp161
[2] John Houghton, Economic and Social Returns on Investment in Open Archiving Publicly Funded Research Outputs (July 2010). http://www.arl.org/sparc/bm~doc/vufrpaa
[3] Scholarly Publishing Roundtable, Report and Recommendations from the Scholarly Publishing Roundtable (January 2010). http://www.aau.edu/WorkArea/DownloadAsset.aspx?id=10044
[4] Alan Garber, Harvard response to the White House RFI on public access to research (January 2012). http://osc.hul.harvard.edu/stp-rfi-response-january-2012
[5] Association of Research Libraries, Monograph and Serial Costs in ARL Libraries, 1986-2010 (2010). http://www.arl.org/bm~doc/t2_monser10.xls
[6] Carl T. Bergstrom and Theodore C. Bergstrom, The costs and benefits of library site licenses to academic journals, Proceedings of the National Academy of Sciences, volume 101, number 3 (20 January 2004). http://dx.doi.org/10.1073/pnas.0305628101
[7] Andrew Odlyzko, The Economics of Electronic Journals, First Monday, volume 2, number 8 (4 August 1997). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/542/463
[8] Association of Research Libraries, Monograph and Serial Costs in ARL Libraries, 1986-2010 (2010). http://www.arl.org/bm~doc/t2_monser10.xls
[9] Text of the FAS policy and the other Harvard open-access policies is available at http://osc.hul.harvard.edu/policies.
[11] http://dash.harvard.edu/mydash
[12] National Institutes of Health, NIH Public Access Policy Implications (2012). http://publicaccess.nih.gov/public_access_policy_implications_2012.pdf
[13] Committee for Economic Development. The Future of Taxpayer-Funded Research: Who Will Control Access to the Results? (2012). http://www.ced.org/component/blog/entry/1/765
[14] National Institutes of Health, NIH Public Access Policy Implications (2012). http://publicaccess.nih.gov/public_access_policy_implications_2012.pdf
[15] According to the Directory of Open Access Journals, http://www.doaj.org/.
[16] http://www.oacompact.org/. See also Stuart M. Shieber, Equity for open-access journal publishing, PLoS Biology, volume 7, number 8 (2009). http://dx.doi.org/10.1371/journal.pbio.1000165
[17] Theodore C. Bergstrom and Carl T. Bergstrom, Can ‘author pays’ journals compete with ‘reader pays’?, Nature Web Focus (2004). http://www.nature.com/nature/focus/accessdebate/22.html
[18] Thomas Jefferson, Letter to George Wythe (13 August, 1786). http://hdl.loc.gov/loc.mss/mtj.mtjbib002184
| “Have scientists lost interest again?” |
The “Cost of Knowledge” boycott of Elsevier is in its seventh week. The boycott was precipitated by various practices of the journal publisher, most recently its support for the Research Works Act, a bill that would roll back the NIH public access policy and prevent similar policies by other federal funding agencies.
Early on, several hundred researchers a day were signing on to the pledge not to submit to or edit or review for Elsevier journals, but recently that rate had settled down to about a hundred per day. On February 11, I started tracking the daily totals by scraping the site through a simple scraper I set up at ScraperWiki. I’ve graphed the results in the attached graph, showing raw count of signatories with the blue line (left axis) and the number added since the previous day with the green bars (right axis).
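The green bars in a chart of this kind are just first differences of the scraped cumulative totals. Here is a minimal sketch in Python, assuming the scraper yields one (date, total) pair per day. The function name and the counts are hypothetical placeholders, not the actual boycott data or the ScraperWiki code:

```python
from datetime import date


def daily_additions(totals):
    """Given (date, cumulative_count) pairs sorted by date, return
    (date, added_since_previous_entry) pairs, starting at the second entry."""
    return [
        (day, count - prev_count)
        for (_, prev_count), (day, count) in zip(totals, totals[1:])
    ]


# Made-up cumulative totals, for illustration only.
sample_totals = [
    (date(2012, 2, 11), 5000),
    (date(2012, 2, 12), 5090),
    (date(2012, 2, 13), 5210),
]

for day, added in daily_additions(sample_totals):
    print(day.isoformat(), added)
```

If a day is missing from the scrape, the difference for the next entry covers more than one day, so it would need to be divided by the gap before being compared with the other bars.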
As you can see from the chart, there seems to be a slight drop in activity around weekends, and Sunday February 26 and Monday February 27 had clearly been the slowest days since I’ve been keeping records, and likely since the effort started. On the 27th (red arrow), Elsevier issued its quasi-recantation of support for RWA. (“While we continue to oppose government mandates in this area, Elsevier is withdrawing support for the Research Work Act itself. We hope this will address some of the concerns expressed….”)
The day after Elsevier’s announcement saw a bit of a bump back to previous levels. Was this an instance of the Streisand effect or was the 26-27 dip an aberration? It’s hard to tell. However, since the 27th, it seems clear that the number of pledges is down considerably. It could well be that Elsevier’s tactical approach has worked and it has stanched the spate of boycott pledges, despite the fact that the community was generally unimpressed with Elsevier’s statement, as Peter Suber has cataloged. Alternatively, the current rate of new pledges may just reflect the natural reductions that had been happening over the last few weeks.
Elsevier has not changed its underlying stance. It still “continue[s] to oppose government mandates” for public access, as per RWA. It strongly opposes FRPAA. Have scientists lost interest again?
| “Note the surges…” |
[Update 4/20/2012: Now that a few more weeks have passed, here's an updated figure of the boycott growth. Note the surges around March 18 and April 10. As near as I can make out, these were the result of widely disseminated coverage in Slashdot and the Guardian, respectively. These surges show that the boycott hasn't played itself out yet, and that continued discussion of the boycott is likely to lead to a continued steady rise in the number of signatures.
At the current rate, I expect the number of signatories to hit 10,000 around April 27 or so.]
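A projection of this sort is a simple linear extrapolation: divide the remaining distance to the target by the recent daily rate. The sketch below shows the calculation in Python. The function name and the inputs are hypothetical and deliberately unrelated to the real counts, and the constant-rate assumption is exactly what a burst of press coverage breaks:

```python
import math
from datetime import date, timedelta


def projected_date(today, current_total, daily_rate, target):
    """Date on which the total first reaches the target, assuming the
    daily rate stays constant from today onward."""
    if current_total >= target:
        return today
    if daily_rate <= 0:
        raise ValueError("daily_rate must be positive to reach the target")
    days_needed = math.ceil((target - current_total) / daily_rate)
    return today + timedelta(days=days_needed)


# Made-up inputs, for illustration only.
print(projected_date(date(2012, 1, 1), 900, 25, 1000))
```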
[Update 4/24/2012: Well, my guess was wrong. A big bump of activity in the last few days meant that the boycott broke 10,000 signatures on April 23. I'm not sure who to blame for the renewed interest in the last couple of days. Anyone have any conjectures?]
| “You seem to believe in fairies.” Photo of the Cottingley Fairies, 1917, by Elsie Wright via Wikipedia. |
Aficionados of open access should know about the Journal of Machine Learning Research (JMLR), an open-access journal in my own research field of artificial intelligence, a subfield of computer science concerned with the computational implementation and understanding of behaviors that in humans are considered intelligent. The journal became the topic of some dispute in a conversation that took place a few months ago in the comment stream of the Scholarly Kitchen blog between computer science professor Yann LeCun and scholarly journal publisher Kent Anderson, with LeCun stating that “The best publications in my field are not only open access, but completely free to the readers and to the authors.” He used JMLR as the exemplar. Anderson expressed incredulity:
I’m not entirely clear how JMLR is supported, but there is financial and infrastructure support going on, most likely from MIT. The servers are not “marginal cost = 0” — as a computer scientist, you surely understand the 20-25% annual maintenance costs for computer systems (upgrades, repairs, expansion, updates). MIT is probably footing the bill for this. The journal has a 27% acceptance rate, so there is definitely a selection process going on. There is an EIC, a managing editor, and a production editor, all likely paid positions. There is a Webmaster. I think your understanding of JMLR’s financing is only slightly worse than mine — I don’t understand how it’s financed, but I know it’s financed somehow. You seem to believe in fairies.
Since I have some pretty substantial knowledge of JMLR and how it works, I thought I’d comment on the facts of the matter.
First, some history. JMLR was founded when most of the editorial board of the Kluwer journal Machine Learning (now a Springer journal) resigned to establish JMLR, Inc., a nonprofit to develop and publish the new journal on an open access model. The first editor-in-chief was Leslie Kaelbling, a computer science professor at MIT. The journal’s first papers appeared in October 2000. Its twelfth annual volume just completed this past December.
One of the main things that journal publishers do is manage the logistics of the peer review and filtering of submitted articles. Starting with the former Machine Learning team, the journal put together an editorial board and a cadre of action editors to handle the reviewing process. At the time the journal was launched, there wasn’t the abundance of open-source journal management platforms that is now available. Being computer scientists, the editorial board took the expedient of implementing their own, a custom system that they still use. Much of the clerical effort of tracking the peer review process (assigning papers to action editors, engaging reviewers, tracking reviews, acceptances and rejections, and the like) is automated by the platform. Of course, these days, the platform situation has eased considerably.
The journal does not charge any submission or publication fees and has never done so. It has never taken any advertising. Indeed, it has never had any direct revenue at all. In fact, JMLR, Inc. didn’t even have a bank account until recently; there was no need.
Of course, there are costs, but they are all provided through in-kind support. By far the largest costs are the labor required for peer reviewing and its management by the editorial board, but this is all volunteer effort, as in almost all scholarly journals. The primary people involved, the editor-in-chief, managing editor, and production editor, are all unpaid, contra Anderson’s conjecture. They volunteer for JMLR in their spare time away from their day jobs as computer science professors. MIT implicitly underwrites some clerical help, since Kaelbling’s administrative assistant at MIT does a small amount of work for the journal, amounting to a few hours per year.
The webmaster is a student volunteer. Anderson is right that MIT provides the web server, saving JMLR the tens of dollars per month they would otherwise have to pay for commercial hosting. Kaelbling has paid for the domain name jmlr.org out of her own pocket. The going rate for .org domains is about $15 per year.
In addition to management of the peer review process, publishers provide production services as well, such as copy-editing and typesetting. One of the main motivations for JMLR leaving Kluwer was the sense that the help they were supposed to be providing was sparse and better avoided. Kluwer did no copy-editing of articles. JMLR relies on reviewers for the kind of light copy-editing they always have done in the normal course of reviewing. For accepted articles that require large amounts of language help, the authors are requested to find copy-editing help at their expense; such cases are extremely rare. Other than that, no copy-editing is done. It doesn’t seem to have harmed the journal’s perceived quality.
As for the typesetting of articles, computer science authors typically use the open-source LaTeX typesetting system for writing their articles, a system designed for beautiful typesetting of mathematical material and far better for mathematical typesetting than the typical systems publishers are accustomed to. The process of retypesetting that many journals have historically performed inevitably introduces errors, leading to a product inferior to what computer science authors typically provide. JMLR uses an approach where authors submit camera-ready copy based on a publisher-supplied LaTeX style file. By dropping the retypesetting with an inferior system, errors in the process are eliminated and the quality of typesetting improved. Increasingly, journals in computer science and related fields (mathematics, physics) are moving to this system. In fact, Machine Learning itself accepts LaTeX submissions and provides an appropriate LaTeX style file for authors to use. Thus, the total cost to JMLR for copy-editing and typesetting is zero.
The biggest expense, it turns out paradoxically, is paying a tax accountant. Kaelbling explained the problem to me:
We have to file a bunch of annoying forms to maintain tax exempt status, etc. I have paid for the original incorporation and some amount of the accountant out of my pocket. But I have gotten a couple of donations (totaling $7K) which I have also used for that stuff. It wouldn’t need to be so expensive, except I’m too disorganized and late to keep on top of it myself.
JMLR has always appeared both free online and by subscription in print. The print edition was originally intended to satisfy the desires of authors who hung onto a view that online-only journals may not be viewed as “serious”, but also has the advantage of substantially solving the digital preservation problem for the journal. The print edition of the first four volumes was published by MIT Press, at first quarterly, then semi-quarterly as submissions grew and more articles were accepted. JMLR received no revenue from the print edition and paid no subvention to MIT Press. MIT Press handled all aspects of fulfilling the print subscriptions and kept all the revenues from a quite reasonable subscription fee of just under 30 cents per page. From the fifth volume on, the print edition was taken over by Microtome Publishing under the same zero-zero arrangement. Under Microtome Publishing’s approach, which leverages important aspects of the print editions specific to open-access journals, the subscription cost decreased dramatically over the next few volumes, settling at a steady state of 8 cents per page for the last several volumes.
Adding it all up, a reasonable imputed estimate for JMLR’s total direct costs other than the volunteered labor (that is, tax accountant, web hosting, domain names, clerical work, etc.) is less than $10,000, covering the almost 1,000 articles the journal has published since its founding — about $10 per article. With regard to whose understanding of JMLR’s financing is better than whose, Yann LeCun I think comes out on top.
[Update 3/18/12: In the comments section, Leslie Kaelbling corrects her estimate of outside donations to $3,500, so I should revise my estimate of JMLR’s cost per article to be about $6.50 per article.]
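For readers who want to check the arithmetic, here is a small Python sketch of the two per-article estimates. It rests on two assumptions of mine: "almost 1,000 articles" is rounded to 1,000, and the revised figure is obtained by lowering the $10,000 total by the $3,500 difference between the original ($7K) and corrected ($3,500) donation amounts. The function and variable names are purely illustrative:

```python
def cost_per_article(total_direct_costs, articles_published):
    """Average direct cost per published article, in dollars."""
    if articles_published <= 0:
        raise ValueError("articles_published must be positive")
    return total_direct_costs / articles_published


ARTICLES = 1000  # assumption: "almost 1,000 articles", rounded

original_total = 10_000  # upper bound given in the post
# Assumption: the corrected donation figure lowers the total by $7,000 - $3,500.
revised_total = original_total - (7_000 - 3_500)

original_estimate = cost_per_article(original_total, ARTICLES)
revised_estimate = cost_per_article(revised_total, ARTICLES)

print(f"${original_estimate:.2f}")  # $10.00
print(f"${revised_estimate:.2f}")   # $6.50
```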
How do I know all this about JMLR? Because (full disclosure alert) I am Microtome Publishing. Microtome is a sole proprietorship providing “publishing services in support of open access to the scholarly literature.” I’ve worked with JMLR for many years now, and consequently have gained a good understanding of all aspects of its operations and of the operations of a subscription-based print journal as well. I don’t pretend to have all of the knowledge of a professional publisher by any means. On the other hand, I don’t believe in fairies.
Does JMLR’s success and efficiency mean that all journals could run this way? Of course not. First, computer science journals are in a particularly good situation for being operated at low cost. Computer scientists possess all of the technological expertise required to efficiently manage and operate an online journal. Journal publishing is an information industry and computer scientists are specialists in information processing. Second, the level of volunteerism that JMLR relies on is atypical for the entire spectrum of journals. Paid editorial positions for computer science journals are exceptionally rare; we’re used to the volunteerism of running a journal. As authors, computer scientists are accustomed to performing their own typesetting and we prefer to do it ourselves. JMLR reviewers are relied on for whatever copy-editing is done. Paying professional copy-editors, if that were desired, would add more to the cost per page (though apparently not even Machine Learning’s commercial publisher was doing so when the board left). Third, some of the costs of operating a journal are the overhead costs that are being absorbed by various institutions. An independent publisher would have to pay for office space for staff, for instance, whereas the primary editors use their homes or offices, hiding that cost.
Nonetheless, the success of JMLR does provide a clue that the cost of running a premier journal might be far less than publishers imply, if they were to rethink the process substantially — maybe not $10 per article, but surely far less than the $5,000 average revenue per article that scholarly publishers currently receive. This expectation is borne out by the several non-profit and commercial open-access journal publishers that are able to operate in the black with publication fees a fraction of that average.
Anderson closes his comments on JMLR with these recommendations for LeCun:
You should look at yourself in the mirror, and ask why you don’t understand even the most basic financial realities (computers cost money to run, editors get paid, and webmasters get paid), why you don’t understand how JMLR is funded, how much you’ve benefited from tuition/fee increases foisted on students at +395% over the past decade,[1] and why you feel compelled to argue points you haven’t adequately examined (you tell me how JMLR is funded, and you’ll have much better face validity).
The call not to argue points one hasn’t adequately examined is surely apt.
[1] With regard to tuition hikes foisted on students see my earlier post.
| “…the interpersonal processes that a student goes through…” Harvard students (2008) by E>mar via flickr. Used by permission (CC by-nc-nd) |
Is the pot calling the kettle black? Oh sure, journal prices are going up, but so is tuition. How can universities complain about journal price hyperinflation if tuition is hyperinflating too? Why can’t universities use that income stream to pay for the rising journal costs?
There are several problems with this argument, above and beyond the obvious one that two wrongs don’t make a right.
First, tuition fees aren’t the bulk of a university’s revenue stream. So even if it were true that tuition is hyperinflating at the pace of journal prices, that wouldn’t mean that university revenues were keeping pace with journal prices.
Second, a journal is a monopolistic good. If its price hyperinflates, buyers can’t go elsewhere for a substitute; it’s pay or do without. But a college education can be arranged for at thousands of institutions. Students and their families can and do shop around for the best bang for the buck. (Just do a search for “best college values” for the evidence.) In economists’ parlance, colleges are economic substitutes. So even if it were true that tuition at a given college is hyperinflating at the pace of journal prices, individual students can adjust accordingly. As the College Board says in their report on “Trends in College Pricing 2011”:
Neither changes in average published prices nor changes in average net prices necessarily describe the circumstances facing individual students. There is considerable variation in prices across sectors and across states and regions as well as among institutions within these categories. College students in the United States have a wide variety of educational institutions from which to choose, and these come with many different price tags.
Third, a journal article is a pure information good. What you buy is the content. Pure information goods include things like novels and music CDs. They tend to have high fixed costs and low marginal costs, leading to large economies of scale. But a college education is not a pure information good. Sure, you are paying in part to acquire some particular knowledge, say, by listening to a lecture. But far more important are the interpersonal processes that a student participates in: interacting with faculty, other instructional staff, librarians, other students, in their dormitories, labs, libraries, and classrooms, and so forth. It is through the person-to-person hands-on interactions that a college education develops knowledge, skills, and character.
This aspect of college education has high marginal costs. One would not expect it to exhibit the economies of scale of a pure information good. So even if it were true that tuition is hyperinflating at the pace of journal prices, that would not take the journals off the hook; they should be able to operate with much higher economies of scale than a college by virtue of the type of good they are.[1]
Which makes it all the more surprising that the claims about college tuition hyperinflating at the rate of journals are, as it turns out, just plain false.
Let’s look at what the average Harvard College student pays for his or her education. When we talk about journal prices hyperinflating, we’re not just talking about list prices but about the net prices that libraries actually pay for the journals. If list prices hyperinflate but publishers provide discounts that moderate the inflation, you can’t hold the list prices against them. But the hyperinflation in serials prices has been in net costs—for instance, as recorded by the annual ARL serials price surveys.
Similarly for colleges we need to look at net prices, not list prices. The list price of a college education, the cost of attendance (COA) is the published tuition and fees, room and board; the net price subtracts whatever financial aid the college provides. There is a substantial difference between the two, as this chart shows:
The average net COA (in green) has increased, surely, but not nearly as quickly as the list COA (in blue). And it is the net COA that we are concerned with.
We also need to make sure that we are comparing costs appropriately over time. We should compare in inflation-adjusted dollars. Looking at the net COA normalized against the consumer price index (CPI-U, in 1996 dollars), things look quite different:
Net Harvard COA has been basically flat for the last 15 years or so, with, in fact, a dip in the last few years.
Now, we can place the inflation-adjusted net COA on the same chart as inflation-adjusted net serials expenditures to gauge the comparison between the two. (I normalize them to 1996 = 1 so that the changes over time can be compared.)
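The two adjustments described here, deflating by the CPI and then rescaling so the base year equals 1, can be sketched in a few lines of Python. The function names are my own, and the numbers are placeholders chosen to make the arithmetic easy to follow. They are not actual CPI-U values, Harvard COA figures, or ARL data:

```python
def to_real_dollars(nominal, cpi, base_cpi):
    """Convert a nominal amount to base-year dollars using CPI values."""
    return nominal * base_cpi / cpi


def normalize_to_base(series, base_year):
    """Rescale a {year: value} series so that the base year equals 1."""
    base_value = series[base_year]
    return {year: value / base_value for year, value in series.items()}


# Placeholder numbers for illustration only.
cpi = {1996: 100.0, 2011: 150.0}
nominal_net_coa = {1996: 20_000.0, 2011: 30_000.0}

real_net_coa = {
    year: to_real_dollars(amount, cpi[year], cpi[1996])
    for year, amount in nominal_net_coa.items()
}
index = normalize_to_base(real_net_coa, 1996)
print(index)
```

In this made-up example the nominal price rises by half, but so does the price index, so the normalized series stays at 1: flat in real terms. Applying the same two steps to serials expenditures puts both series on a common scale.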
Is this unique to Harvard? Certainly there is a lot of variation in net COA among different groups of colleges. Public universities have been losing large amounts of their state subsidies over the last few years, leading to real net COA increases. But those tuition increases result from a subsidy being reduced; they aren’t generating windfall revenue increases to pay for journal price increases. Private four-year colleges and universities have not had huge increases in revenues from students. The College Board’s study shows that over the time period they looked at
Average published tuition and fees at private nonprofit four-year colleges and universities are about $3,730 higher (in 2011 dollars) in 2011-12 than they were in 2006-07, but the average net tuition paid by full-time students in this sector declined by $550 in inflation-adjusted dollars over this five-year period. (College Board, “Trends in College Pricing 2011”. Emphasis added.)
You can be the judge as to whether a tuition decrease of $110 per year in real dollars constitutes hyperinflation at journal-price levels.
The hyperinflation in journal prices looks especially bad when we compare it against other pure information goods, say, music CDs. The information technology revolution has led to tremendous efficiencies in generating and distributing information goods. In the presence of competition, this leads in general to great efficiencies and price reductions, a phenomenon we see clearly in CD price history. Why hasn’t the journal publishing industry been able to generate similar efficiency gains over time? Shouldn’t journal prices be going down, not up?
[Update May 15, 2012: NPR's Planet Money podcast for May 11, 2012 featured a story ("The Real Price of College") on the divergence of list and net college prices stating the same conclusion of stable net COA over the last decade. They go into great detail about why list prices are increasing while net prices remain roughly constant.]
Serials expenditure data are from the ARL Annual Surveys.
Harvard COA and net COA data are courtesy of Harvard’s Office of Institutional Research.
CD prices are from the RIAA report “CDs are a better value than ever!”.
Consumer price index data are from the US Department of Labor Bureau of Labor Statistics.
[1] You may say that this aspect of college education is a waste of money. The Khan Academy is able to teach algebra at essentially zero marginal cost. This is true, and its tuition is not hyperinflating either. If students want that kind of education, that’s fine, but for better or worse that is not the type of education provided by institutions that have libraries that subscribe to journals, so it is irrelevant to this argument.
| “…time to switch…” A very old light switch (2008) by RayBanBro66 via flickr. Used by permission (CC by-nc-nd) |
The journal Research in Learning Technology has switched its approach from closed to open access as of New Year’s 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.
This isn’t the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the trajectory that journal publishing ought to follow.
The ALT has done lots of things right in this change. They’ve chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They’ve jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They’ve opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher’s web site.
Here’s hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It’s time to switch.
My friend and ex-colleague Matt Welsh has an interesting post supporting the Research Without Walls pledge, in which he talks about the Harvard open-access policies. He says:
Another way to fight back is for your home institution to require all of your work be made open. Harvard was one of the first major universities to do this. This ambitious effort, spearheaded by my colleague Stuart Shieber, required all Harvard affiliates to submit copies of their published work to the open-access Harvard DASH archive. While in theory this sounds great, there are several problems with this in practice. First, it requires individual scientists to do the legwork of securing the rights and submitting the work to the archive. This is a huge pain and most folks don’t bother. Second, it requires that scientists attach a Harvard-supplied “rider” to the copyright license (e.g., from the ACM or IEEE) allowing Harvard to maintain an open-access copy in the DASH repository. Many, many publishers have pushed back on this. Harvard’s response was to allow its affiliates to get an (automatic) waiver of the open-access requirement. Well, as soon as word got out that Harvard was granting these waivers, the publishers started refusing to accept the riders wholesale, claiming that the scientist could just request a waiver. So the publishers tend to win.
I wrote a response to his post, clarifying some apparent misconceptions about the policy, but it was too long for his blogging platform’s comment system, so I decided to post it here in its entirety. Here it is:
There’s a lot to like about your post, and I agree with much of what you say. But I’d like to clarify some specific issues about the Harvard open-access policies, which are in place at seven of the Harvard schools as well as MIT, Duke, Stanford, and elsewhere.
The policy has two aspects. First, the policy commits faculty to (as you say) “submitting the work to the archive”, that is, providing a copy of the final manuscript of each article, to be deposited into Harvard’s DASH open-access repository. Doing so involves filling out a web form with metadata about the article and uploading a file. But if that is too much trouble, we provide a simpler web form that is tantamount to just uploading the file. Or you can email the file to the OSC. Or one of our “open-access fellows” can make the deposit on your behalf. We also harvest articles from other repositories such as PubMed Central and arXiv. I can’t imagine that providing the articles is “a huge pain”.
Second, by virtue of the policy, Harvard faculty grant a nonexclusive transferable license to the university in all our scholarly articles. This license occurs as soon as copyright vests in the article, so it predates and therefore dominates any later transfer of copyright to a publisher. Since the policy license is transferable, the university can and does transfer it back to the author, so the author automatically retains rights in each article, without having to take any further action. Because of this policy, the “legwork of securing the rights” is actually eliminated. By doing nothing at all, the author retains rights in the article.
You mention attaching a rider to publication agreements. Although we provide an addendum generator to generate such riders, and we recommend that authors use them, attaching an addendum is not required to retain rights. The only point of the addendum is to alert the publisher that the author has already given Harvard non-exclusive rights to the article (though publishers undoubtedly are already aware of the fact; the policy and its license have been widely publicized).
Because we want the policy to work in the interest of faculty and guarantee the free choice of faculty as to the disposition of their works, the license is waivable at the sole discretion of the author. Thus, rights retention moves from an opt-in regime without the policy to an opt-out regime with the policy. The waiver aspect of the policy was not a response to publisher pushback, but has in fact been in the policies from the beginning. The waiver was intended to preserve complete freedom of choice for authors in rights retention.
As is found in many areas (organ donation, 401K participation), participation tends to be much higher with opt-out than opt-in systems, and that holds for rights retention as well. We have found that the waiver rate is extraordinarily low, contra your assumption. For FAS, we estimate it at perhaps 5% of articles. In total, the number of waivers we have issued is in the very low hundreds, out of the many thousands of articles that have been published by Harvard faculty since the policy has been in force. MIT has tracked the waiver rate more accurately, and has reported a 1.5% waiver rate. So for well over 90% of articles, authors are retaining broad rights to use their articles.
The statement that “Many, many publishers have pushed back on this” is false. Fewer than a handful of publishers have established systematic policies to require waivers of the license, which accounts for the exceptionally low waiver rate. Indeed, over a third of all waivers are attributable to a single journal.
The Harvard approach to rights retention and open-access provision for articles is not a silver bullet to solve all problems in scholarly publishing. It has a limited goal: to provide an alternate venue for openly disseminating our articles and to retain the rights to do so. It is extremely successful at that goal. Many thousands of articles have been deposited in DASH, accounting for over half a million downloads. Nonetheless, other efforts need to be made to address the underlying market dysfunction in scholarly publishing, and we are actively engaged there too. For those interested in what we’re up to along those lines, I recommend taking a look at the various posts at my blog, The Occasional Pamphlet, which discusses issues of open access and scholarly communication more generally.
| “…dog-eared in thirty-one places…” |
I’ve been reading Arthur Conan Doyle’s first novel, The Narrative of John Smith, just published for the first time by the British Library. It’s no The Adventures of Sherlock Holmes, that’s for sure. For one thing, he seems to have left out any semblance of plot. But it does incorporate some entertaining pronouncements. Here’s one I identify with highly:
There should be a Society for the Prevention of Cruelty to Books. I hate to see the poor patient things knocked about and disfigured. A book is a mummified soul embalmed in morocco leather and printer’s ink instead of cerecloths and unguents. It is the concentrated essence of a man. Poor Horatius Flaccus has turned to an impalpable powder by this time, but there is his very spirit stuck like a fly in amber, in that brown-backed volume in the corner. A line of books should make a man subdued and reverent. If he cannot learn to treat them with becoming decency he should be forced.
If a bibliophile House of Commons were to pass a ‘Bill for the better preservation of books’ we should have paragraphs of this sort under the headings of ‘Police Intelligence’ in the newspapers of the year 2000: ‘Marylebone Police Court. Brutal outrage upon an Elzevir Virgil. James Brown, a savage-looking elderly man, was charged with a cowardly attack upon a copy of Virgil’s poems issued by the Elzevir press. Police Constable Jones deposed that on Tuesday evening about seven o’clock some of the neighbours complained to him of the prisoner’s conduct. He saw him sitting at an open window with the book in front of him which he was dog-earing, thumb-marking and otherwise ill using. Prisoner expressed the greatest surprise upon being arrested. John Robinson, librarian of the casualty section of the British Museum, deposed to the book, having been brought in in a condition which could only have arisen from extreme violence. It was dog-eared in thirty-one places, page forty-six was suffering from a clean cut four inches long, and the whole volume was a mass of pencil — and finger — marks. Prisoner, on being asked for his defence, remarked that the book was his own and that he might do what he liked with it. Magistrate: “Nothing of the kind, sir! Your wife and children are your own but the law does not allow you to ill treat them! I shall decree a judicial separation between the Virgil and yourself: and condemn you to a week’s hard labour.” Prisoner was removed, protesting. The book is doing well and will soon be able to quit the museum.’
| Portrait of Arthur Conan Doyle by Sidney Paget, c. 1890 |
What a wonderful, wonderful thing it is, though use has dulled our admiration of it! Here are all these dead men lurking inside my oaken case, ready to come out and talk to me whenever I may desire it. Do I wish philosophy? Here are Aristotle, Plato, Bacon, Kant and Descartes, all ready to confide to one their very inmost thoughts upon a subject which they have made their own. Am I dreamy and poetical? Out come Heine and Shelley and Goethe and Keats with all their wealth of harmony and imagination. Or am I in need of amusement on the long winter evenings? You have but to light your reading lamp and beckon to any one of the world’s great storytellers, and the dead man will come forth and prattle to you by the hour. That reading-lamp is the real Aladdin’s wonder for summoning the genii with. Indeed, the dead are such good company that one is apt to think too little of the living.
I know that there are those who think it is a sign of appreciation to write in, dog-ear, underline, highlight, and otherwise modify books — Anne Fadiman lauds such things as carnal acts — but I can’t bring myself to do so. I just can’t.
| “…a drop in the bucket.” Drop I (2007) by Delox – Martin Deák via flickr. Used by permission (CC by-nc-nd) |
At the recent Berlin 9 conference, there was much talk about the role of funding agencies in open-access publication, both through funding-agency-operated journals like the new eLife journal and through direct reimbursement of publication fees. I’ve written in the past about the importance of universities underwriting open-access publication fees, but only tangentially about the role of funding agencies. To correct that oversight, I provide in this post my thoughts on how best to organize a funding agency’s open-access underwriting system.
The motivation for underwriting publication fees is simple. Publishers provide valuable services to authors: management of peer review; production (copy-editing and typesetting); filtering, branding, and imprimatur. Although access to scholarly articles can now be provided at essentially zero marginal cost through digital networks, some means of paying for these so-called first-copy costs needs to be found in order to preserve these services. The natural business model is the open-access journal funded by article processing fees. (Although most current open-access journals charge no article processing fees, I will abuse the term “open-access journal” for this model.) Open-access (OA) journals are no longer an oddity, a fringe phenomenon. The largest scholarly journal on earth, PLoS ONE, is an OA journal. Major publishers (Springer, Elsevier, SAGE, Nature Publishing Group) are now publishing OA journals.
However, OA journals are currently at a significant disadvantage with respect to subscription journals, because universities and funding agencies subsidize the costs of subscription journals in such a way that authors do not need to trade off money used for the subsidy against money used for other purchases. In particular, subscription fees are paid by universities through their library budgets and by funding agencies through their overhead payments that fund those libraries. Authors do not see, let alone attend to, these costs. In such a situation, an author is inclined to publish in a subscription journal, where they do not need to use any moneys that could otherwise be applied to other uses, rather than an OA journal that requires payment of a publication fee. And if authors are unwilling to publish in open-access journals because of the fees, publishers — even those interested and motivated to switch to an OA revenue model — are unable to do so.
The solution is clear: universities and funding agencies should underwrite reasonable OA publication fees just as they do subscription fees. But how should this be done? Each kind of institution needs to provide its fair share of support.
As I’ve written about before, universities can underwrite processing fees on behalf of their faculty, and do so in a way that does not reintroduce a moral hazard, by reimbursing faculty for OA publication fees up to a fixed cap per year. Since these funds can only be used for open access fees, they can’t be traded off against other purchases, so they don’t provide a disincentive against open access journals. On the other hand, since these funds are limited (capped), they provide a market signal to motivate choosing among open access journals so that the economic incentives will militate toward low-cost high-service open access journals.
This is the argument for the Compact for Open-Access Publishing Equity (COPE), a commitment by universities to establish mechanisms for underwriting OA publication fees. COPE has grown well beyond its initial five signatories and is supported by a wide range of institutions and people. Harvard and other COPE signatories have already set up such OA funds, which work in just this way.
Many COPE-compliant OA funds don’t underwrite articles that were developed under research grants, under the view that such funding is the responsibility of the granting institutions. COPE calls for universities to do their fair share of paying OA fees, no less, but no more. Funding agencies need to underwrite their share of OA fees as well, and crucially should do so in a way that respects several important criteria:
Of course, many funders already allow grantees to pay for OA publication fees from their grants. But this method falls afoul of some of these criteria. With respect to criterion (1), grantees are forced to trade off uses of grant moneys to pay OA fees against uses to pay for other research expenses, providing incentive to publish in subscription-fee journals where these costs are hidden. This approach maintains the tilted playing field against OA journals. With respect to criterion (2), because the funds must be expended during the granting period, grantees must predict ahead of time how many articles they will be publishing in OA journals, where they will be publishing them, and those articles must be completed and accepted for publication by the end of the granting period.
The mechanism that satisfies these criteria is for funding agencies to provide non-fungible funds specifically for OA publication fees, funds that are not usable for purchasing other grant-related materials. Funders would establish a policy that grantees could be reimbursed for OA publication fees for articles based on grant-funded research at any time during or after the period of the grant. This satisfies criterion (1) because grantees would no longer have to pay publication fees out of pocket or from grant funds that could be used otherwise. It satisfies criterion (2) because payments can be provided after the end of the grant. (If desired, the delay after the grant ends can be limited to, say, a year or two.) A reasonable requirement for reimbursement of publication fees would be that the article explicitly acknowledge the grant as a source of research funding.
The Wellcome Trust already uses a similar incremental funding system. However, it (inadvisably, in my view) allows the funds to apply to so-called hybrid publication fees, where an additional fee can be paid to make a single article in a subscription journal available open access. These reimbursements should be limited to publication fees for true OA journals, not hybrid fees for subscription journals. Willingness to pay hybrid fees provides an incentive for a publisher to maintain the subscription revenue model for a journal, because the publisher can acquire these funds without converting the journal as a whole to open access. Eschewing hybrid fees is necessary to satisfy criterion (3).
If funders were willing to pay arbitrary amounts for publication fees without limit, a new moral hazard would be introduced into the publishing market. Authors would become price-insensitive and hyperinflation of publication fees would be possible. To retain a functioning market in publication fees, we must be careful in designing the reimbursement scheme for OA journals; we need to make sure that there is still some scarce resource that authors must manage. This can be achieved in a couple of ways, by capping reimbursements or by copayments. First, reimbursement of OA publication fees can be offered only up to a fixed percentage of the grant amount. By way of example, if an average NIH grant is $300,000 (excluding overhead[1]), a cap of, say, 2% would provide up to $6,000 available for OA fees. (Robert Kiley, Head of Digital Services at the Wellcome Trust, estimates that at present rates all funded papers of the Wellcome Trust could be underwritten for about 1.25% of their total granted funds. In the short run, nowhere near that level of underwriting is necessary, since the number of publication-fee-charging OA journals is so small. In the long run, as competition in the publication fee market increases, this number may well go down.) That would cover two PLoS Biology papers, three BMC papers, four or five PLoS ONE papers, eight or so Hindawi papers. A grantee would apply separately for these funds to reimburse reasonable OA fees. Some grantees might use all of these funds, some none, with most falling in the middle (and currently at the low end); but in any case they would not be usable for other purposes. Since these funds can only be used for OA publication fees, they can’t be traded off against other purchases, so there is no disincentive against selecting OA journals. 
On the other hand, since these funds are limited (capped), they provide a market signal to motivate choosing among open access journals so that the economic incentives will militate toward low-cost high-service OA journals. (This can’t be repeated often enough.)
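The arithmetic behind the capped approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cap is taken as a percentage of the grant's direct costs, and the per-article fees are the example fees that appear in the payment table in this post.

```python
def oa_fee_cap(direct_costs, cap_percent):
    """Maximum OA-fee reimbursement for a grant (hypothetical helper)."""
    return direct_costs * cap_percent / 100


def papers_covered(cap, fee_per_article):
    """Number of articles whose fees fit entirely under the cap."""
    return int(cap // fee_per_article)


# Example fees taken from this post; real fees vary by journal and year.
example_fees = {
    "PLoS Biology": 2900,
    "typical BMC journal": 2000,
    "PLoS ONE": 1350,
    "typical Hindawi journal": 700,
}

cap = oa_fee_cap(300_000, 2)  # 2% of a $300,000 grant
for journal, fee in example_fees.items():
    print(f"{journal}: {papers_covered(cap, fee)} article(s) under a ${cap:,.0f} cap")
```

Counting only articles that fit entirely under the cap gives four PLoS ONE papers; a fifth would need a small top-up from elsewhere, which is where the "four or five" above comes from.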
Alternatively, a copayment approach can be used to provide economic pressure to keep publication fees down. Reimbursement would cover only part of the fee, at least at the expensive end of such fees. It is important (criterion 1) that for cost-efficient OA journals, authors should not be out of pocket for any fees. Thus, reimbursement should be at 100% for journals charging less than some threshold amount, say, $1,500. (As publishers become more efficient, this threshold can and should be reduced over time.) Above that level, the funder might pay only a proportion of the fee, say, 50%, so that grantees have some “skin in the game” and are motivated to trade off publication fees against quality of publisher services. With these parameters, the payment schedule would provide for the following kinds of payments:
| Publication fee | Funder pays | Author copays | Examples |
|---|---|---|---|
| $700 | $700 | $0 | typical Hindawi journal, SAGE Open |
| $1350 | $1350 | $0 | PLoS ONE, Scientific Reports |
| $2000 | $1750 | $250 | typical BMC journal |
| $2900 | $2200 | $700 | PLoS Biology |
(What the right parameters of such an approach are may depend on field and may change over time. I don’t propose these as the correct values, but merely provide an example of the workings of such a system.)
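For readers who want to experiment with other parameters, the schedule can be written as a small function. This is a minimal sketch, assuming the example values from the text (full reimbursement up to $1,500 and 50% of any amount above that); the function name and parameter names are hypothetical.

```python
def split_fee(fee, threshold=1500, funder_share_above=0.5):
    """Split a publication fee into (funder pays, author copays).

    The funder covers the fee in full up to `threshold` and a fixed
    share of whatever exceeds it; the author pays the remainder.
    """
    excess = max(fee - threshold, 0)
    funder_pays = min(fee, threshold) + funder_share_above * excess
    return funder_pays, fee - funder_pays


# Reproduce the rows of the example schedule.
for fee in (700, 1350, 2000, 2900):
    funder, author = split_fee(fee)
    print(f"fee ${fee}: funder pays ${funder:.0f}, author copays ${author:.0f}")
```

Lowering `threshold` over time, as suggested above, shifts more of the expensive fees onto the copayment without affecting authors who choose cost-efficient journals.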
These two approaches are complementary. A policy could involve both a per-article copayment and a maximum per-grant outlay.
Finally, criterion (5) calls for implementing such an underwriting scheme as cost-effectively as possible, so that a funder’s research impact is not lessened by paying for publication fees. Indeed, one might expect that impact would be increased by such a move, given that the tiny percentage of funds going to OA fees would mean that those research results were freely and openly available to readers and to machine analysis throughout the world. I would think (and I recall a claim to this effect at Berlin 9) that the impact benefit of providing open access to a funder’s research results is greater than the impact of the marginal funded research grant. To the extent that this is so, it behooves funders to underwrite OA fees even at the expense of funding the incremental research. Nonetheless, there may be no need to forgo funding research just to pay OA fees. Suppose that on the average grant, incremental funds of $200 are used to pay OA publication fees. (With current availability and usage of OA journals, this is likely an overestimate of current demand for OA fees.) Where would this money come from? To the extent that faculty are publishing in OA journals, funders should not need to underwrite subscription journals, so that their overhead rates can be reduced accordingly. An overhead rate of 67% (Harvard’s current rate) would need to be reduced by a minuscule 0.067 percentage points ($200 out of the $300,000 example grant) to compensate. (This is not a typo. The number really is 0.067%, not 6.7%.) This constitutes a percentage reduction in overhead of one part in a thousand, a drop in the bucket. In the longer term, over several years, if usage of the funds rises to, say, $1000 per grant, the overhead rate would need to be reduced by a still tiny 0.33% for cost neutrality. As more OA journals become available and more funds are used, the overhead rate would be adjusted accordingly.
If hypothetically all journals became OA, and all articles incurred these charges, the cost per grant might rise higher to Wellcome Trust’s predicted 1.25% (though by this point competition may have substantially reduced the fees), but then, larger reductions in overhead rates would be met by reduced university costs, since libraries would no longer need to pay subscription fees.
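The overhead figures above follow from simple division, sketched here in Python. The sketch assumes, as the example does, that the overhead rate is applied to a grant's direct costs of $300,000, so that the required reduction in percentage points is the OA funds divided by the direct costs. The function names are hypothetical.

```python
def overhead_points_to_offset(oa_funds_per_grant, direct_costs):
    """Percentage points the overhead rate must drop for cost neutrality."""
    return 100 * oa_funds_per_grant / direct_costs


def direct_costs_from_total(total_award, overhead_rate_percent):
    """Back out direct costs from a total award that includes overhead."""
    return total_award / (1 + overhead_rate_percent / 100)


direct = 300_000
current_rate = 67  # percent

for oa_funds in (200, 1000):
    points = overhead_points_to_offset(oa_funds, direct)
    relative = points / current_rate  # fraction of the overhead itself
    print(f"${oa_funds} per grant: cut the rate by {points:.3f} points "
          f"({relative:.2%} of current overhead)")

# Direct costs implied by a $450,000 total award at a 67% overhead rate.
print(f"${direct_costs_from_total(450_000, current_rate):,.0f}")
```

At $200 per grant the relative change is about one part in a thousand of the overhead itself, which is the "drop in the bucket".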
One of the nice properties of this approach is that it doesn’t require synchronization of the many actors involved. Each funding agency can unilaterally start providing OA fee reimbursement along these lines. Until a critical mass do so, the costs would be minimal. Once a critical mass is obtained, and journals feel confident enough that a sufficient proportion of their author pool will be covered by such a fund to switch to an open-access revenue model, subscription fees to libraries will drop, allowing for overhead rates to be reduced commensurately to cover the increasing underwriting costs. Each actor — author, funder, publisher, university, library — acts independently, with a market mechanism to move all towards a system based on open access.
It is time for funding agencies to take on the responsibility to fund not only research but also its optimal distribution. Part of that responsibility is putting in place an economically sustainable system of underwriting open-access publication fees.
[1] The NIH Data Book reports average grant size for 2010 as around $450,000, which corresponds to something like $270,000 in direct costs, assuming a 67% overhead rate. $300,000 is thus likely on the high side.
| Petrus Spronk, “Architectural Fragment”, 1992. Photo © 2005 Robert Laddish (www.laddish.net), used by permission. |
I’ve just been at the conference in honor of the 30th anniversary of the University of Sao Paulo Integrated Library System (SIBi USP). David Palmer, one of the speakers at the conference, used in his presentation a picture of a wonderful sculpture that I had never seen before, which turned out to be a public art piece at the State Library of Victoria in Melbourne, Australia by Petrus Spronk entitled “Architectural Fragment”. I place a couple of pictures of it here in honor of Spronk’s 72nd birthday, which happens to be today. You can find more images here.
| Petrus Spronk, “Architectural Fragment”, 1992. Photo by flickr user madam3181, used by permission (CC by-nc-nd). |
"Besides getting more data, faster, we also now use much more sophisticated learning algorithms. For instance, algorithms based on logistic regression and that support vector machines can reduce by half the amount of spam that evades filtering, compared to Naive Bayes." (Emphasis added.)
Joshua Goodman, Gordon V. Cormack, and David Heckerman. 2007. Spam and the ongoing battle for the inbox. Communications of the Association for Computing Machinery, volume 50, number 2, page 27.
A common source of run-on sentences is the inclusion of a parenthetical full sentence at the end of another sentence, for instance,
This is an example (there may be others).
This construction is always wrong. Separate the two sentences, as
This is an example. (There may be others.)
or coordinate or subordinate the two, as
This is an example (though there may be others).
or
This is an example (and there may be others).
The following is not correct:
This is an example (however, there may be others).
“However” is an adverb, not a subordinating conjunction.
Writers using MS Word tend to make certain standard errors in their typesetting. For instance, they use hyphens instead of em-dashes (ctrl-alt-hyphen or option-shift-hyphen). Mathematical typesetting is especially bad. There is essentially no way to typeset mathematics well in MS Word. The best solution: LaTeX.
For a while, I've been meaning to comment on the "that"/"which" controversy, the claim that "which" should not be used with restrictive relative clauses, nor "that" for nonrestrictive. From a linguistic point of view, it seems clear that this view is descriptively barren. Geoff Pullum provides a convincing and entertaining argument on Language Log, based on the sentence "The key point, that all the popular reports missed, is that FOXP2 is a transcription factor...". The rarity of sentences like these, in which "that" is used for a nonrestrictive relative clause, leads Pullum to refer to it as "ivory-billed".
I suppose, and am happy to stipulate for the purposes of discussion, that the use of "which" for restrictive relative clauses and "that" for nonrestrictive (or supplemental, as Pullum prefers) is grammatical. Nonetheless, the overwhelming preponderance of occurrences of "which" for nonrestrictive clauses means that the use of "that" in that context is much more likely to give pause to the reader, a kind of cognitive setback. For that reason, a charitable writer (and shouldn't we all strive to be one of those?) ought to use "which" for nonrestrictive relative clauses -- not because it is "wrong" to use "that", or ungrammatical, but because the use of "that" is likely to be jarring to a significant fraction of one's readers. (And I don't only mean the Fowler-type prescriptivist readers, though I suppose there's no reason to be jarring them needlessly either.) An excellent point of evidence is the fact that Pullum had to ask the author directly which meaning he had intended in the ivory-billed sentence; had he used a "which", no clarification would have been needed.
In the particular case of the sentence quoted above, there is no concomitant advantage to using "that" over "which" that would compensate for the negative effect of jarring or confusing the reader. Thus, its use should be prescriptively deprecated. (This issue of compensation allows me to avoid proscriptions against splitting infinitives or dangling prepositions, the slavish following of which leads to circumlocutions and semantic errors. Avoiding these negative effects clearly compensates for the oh so very slight jarring effect on some small fraction of true-believing Fowlerians.) By a similar argument, the use of "which" for restrictive relatives should be deprecated as well in formal writing.
What I am arguing is that even though the language does not enforce the distinction between nonrestrictive and restrictive in terms of "which" versus "that" (and commas versus none), respectively, there is still a good reason to write as if it did. There was nothing wrong in the quoted sentence even under the intended interpretation, just something infelicitous.
Am I trying to have my cake and eat it too? To be able to rail prescriptively while keeping my linguistic descriptivist moral stance? Yes.
Different people have different styles for overall organization of a
technical paper. There is the "continental" style, in which one states the
solution with as little introduction or motivation as possible, sometimes
not even saying what the problem was. Papers in this style tend to start
like this: "Consider a seven-dimensional manifold Q, and define its
hyper-diagonal as the ...." This style is designed to convince the reader
that the author is very smart; how else could he or she have come up with
the answer out of the blue? Readers will have no clue as to whether you
are right or not without incredible efforts in close reading of the paper,
but at least they'll think you're a genius.
Of course, the author didn't come up with the solution out of the blue.
There was a whole history of false starts, wrong attempts, near misses,
redefinitions of the problem. The "historical" style involves
recapitulating all of this history in chronological order. "First I tried
this. That didn't work because of this, so I tried this other way. That
turned out to be stupid. Then I tried this other way...." This is much
better, because a careful reader can probably follow the line of reasoning
that the author went through, and use this as motivation. But the reader
will probably think you are a bit addle-headed. Why would you even think
of trying half the stuff you talked about?
The ideal style is the "rational reconstruction" style. In this style, you
don't present the actual history that you went through, but rather an
idealized history that perfectly motivates each step in the solution. "We
consider the problem of XXX. The obvious thing to try is X. But
such-and-such a pithy example shows that that fails miserably.
Nonetheless, the example points the way naturally to solution Y. This
works better, except for such-and-such an obscure case. We patch solution
Y to handle this case, forming solution Z. Voila." Of course, the author
doesn't tell you that he came up with solution Y before solution X, which
only occurred to him after he came up with solution Z, and he skips
solutions A, B, and C because, in retrospect, they are nowhere on the
natural path to Z, even though at the time he was completely convinced they
were on the right track. The goal in pursuing the rational reconstruction
style is not to convince the reader that you are brilliant (or addle-headed
for that matter) but that your solution is trivial. It takes a certain
strength of character to take that as one's goal. But the advantage of the
reader thinking your solution is trivial or obvious is that it necessarily
comes along with the notion that you are correct.
I've just discovered James Pryor's "Guidelines on Writing a Philosophy Paper". Despite the ostensible limited goal of the guidelines, they are much more broadly applicable than just to philosophy papers. I especially like the characterization of readers as "lazy, stupid, and mean".
People seem to fall prey to adverbials like "however" and "rather" seducing them into running on sentences.
This type of approach has been used in previous models, however, the presented algorithm adopts a different foundation.
But these words are not conjunctions, subordinating or otherwise. They are adverbs, like "on the other hand" or "unfortunately". The following is, presumably, clearly infelicitous.
This type of approach has been used in previous models, unfortunately, the presented algorithm adopts a different foundation.
By the same token, so is the sentence with "however". It is easily corrected:
This type of approach has been used in previous models; however, the presented algorithm adopts a different foundation.
or
This type of approach has been used in previous models. The presented algorithm, however, adopts a different foundation.
Email messages should be treated as personal letters. You wouldn't write a handwritten letter with misspellings, would you? Or a typewritten letter in which you didn't bother to use the shift key? Then you shouldn't do that in an email. Doing so implies to many readers that you don't respect them enough to bother with such "niceties".
On a related topic, by convention, words in all caps in email messages are to be read as if the author were shouting them. This is typically not the intended interpretation. According to RFC 1855:
Use symbols for emphasis. That *is* what I meant. Use underscores for underlining. _War and Peace_ is my favorite book.
The use of the pronoun "he" as a bound pronoun of neutral gender is problematic on two grounds. First, its use is blatantly sexist (although the sexism is of a historical nature, so that those who continue to use "he" in this way have a defensible position). Second, and more importantly, many readers confronted with such a use of "he", including myself, tend to find that it causes a jarring effect as they stop to wonder whether or not the writer intended to imply that the referent of the pronoun is male. Anything that causes a jarring effect like this on a substantial portion of your readers should be avoided, as it serves only to distract them from the important substance of your writing.
Now, I turn to a more recent variant of the same problem. The use of the pronoun "she" as a bound pronoun of neutral gender is problematic on two grounds. First, its use is blatantly sexist (although the sexism is of an anti-historical nature, so that those who continue to use "she" in this way have a defensible position). Second, and more importantly, many readers confronted with such a use of "she", including myself, tend to find that it causes a jarring effect as they stop to wonder whether or not the writer intended to imply that the referent of the pronoun is female. Anything that causes a jarring effect like this on a substantial portion of your readers should be avoided, as it serves only to distract them from the important substance of your writing.
But what alternatives are there? In everyday speech, "they" or "them" is used for this purpose, but this disturbs the sensibilities of prescriptivists, who, I should remind you, are a substantial portion of your readers. And anything that causes a jarring effect like this on a substantial portion of your readers....
Rewriting the sentence is the only practicable alternative. Do it and be done with it.
Pat Winston in his lecture on How to Speak notes that covering up parts of overhead transparencies and revealing them slowly like a strip-tease artist is a technique that drives 10 per cent of your audience nuts. I am in that 10 per cent. The desire to use this technique means only one thing: There is too much information on the slide. Split it into multiple slides. Winston recommends using overlays instead, but overlays are really a different and specialized overhead technique, and are not typically necessary for remedying this problem.
By the way, if you make slides using computerized means and want to use an overlay, consider "implicit" overlays instead. An implicit overlay is a series of separate slides each of which includes the contents of a different prefix of the overlay slides. Implicit overlays have the advantage that no Scotch taping of slide material is required, and no fumbling with the overlay pieces is needed. One just continues placing single sheets on the projector as usual, but each one in the overlay series has some additional material added to the previous one.
A citation is not a first-class participant in a sentence; it cannot serve as a noun phrase. Rather it is a parenthetical -- that is why it appears in parentheses -- and like all parentheticals should be removable without changing the well-formedness of the sentence in which it appears. Thus, the following sentences are ill-formed. (Try reading them without the material in parentheses.)
I have no Facebook account, as explained below, so you can't "friend" me. But you can contact me, via the contact link at left, or the various other methods at my Harvard web page.
Here's why I have no Facebook page. As stated in the Facebook Terms of Service, "you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook ('IP License')." This means that Facebook can do anything they want with information I would post in my Facebook account, without restriction. Of course, they might restrict their actions to only entirely reasonable ones, like obeying my Privacy Settings for instance, but they don't need to, especially since they can change the rules at any time. ("We can change this Statement if we provide you notice (by posting the change on the Facebook Site Governance Page) and an opportunity to comment.") Should I need to provide a blanket use license to my social network provider? Nope. Do I want to? Nope.
In 2006, in Excellence Without a Soul, I noted that the first "Take Back the Night" rally at Harvard took place in 1980. I continued,
From this point on, the issue of rape flared up on a schedule approximating the four-year cycle of college generations—sometimes emerging after three years in the background, sometimes after five, but not every year. Different circumstances bring the issue to the fore in different years, and each time the college community starts from a different place in responding.
Right on schedule, it's back. According to the Crimson, the University "recently appointed student representatives to a special committee to review the sexual misconduct policies of the Office of Sexual Assault Prevention and Response." I am not quite sure what to make of that sentence. Quite possibly I missed the announcement and news reporting on the creation of the committee, but this is the first I have seen of it in either Harvard announcements or the student press. In any case, I seriously doubt that it is the OSAPR itself that creates sexual misconduct policies, de jure anyway (I thought it was the Faculty). Be that as it may, the revival of the "what's rape?" issue seems to be due to the series ("slew," in the Crimson's scrupulously objective journalese) of Title IX complaints against universities, including the Harvard Law School.
… said that she and other students on the committee hoped to push the University instead toward an “enthusiastic consent” model, in which an incident can be called rape in the absence of affirmative agreement.
“The only people who lose out in this model are the rapists,” said [another student], who had also intended to serve on the committee.
[The first student] said that she plans to discuss the stay on student involvement with Rankin, but she might eventually consider leading a “student protest” or “something more radical” than acting through administration-approved channels if she feels that student voices on this issue are not being heard.
TinEye is a reverse image search engine built by Idée currently in beta. Give it an image and it will tell you where the image appears on the web.
Shared by Stuart
My new go-to temporary file sharing service replacement for drop.io and share1t, both defunct.
John Kendall, who taught violin at the college level for more than 50 years and who made "Suzuki" a household name in America, died Jan. 6. He spent more than three decades teaching music at Southern Illinois University Edwardsville, where he founded the Lincoln Quartet and the SIUE Suzuki Program. Under his leadership, the program became an international training center for teachers.
Voting online for public office is a terrifying proposition to most security experts. The paths to subversion or failure are many:
So, terrifying. And yet, I’m now pretty sure it is inevitable.
Today, we bank online, deposit checks and even pay vendors with our smart phones. We can change our mailing address with the postal service and pay parking tickets with our local governments online. We can shop online, socialize online, and debate with our Presidential candidates online. Newt Gingrich announced his Presidential campaign on Twitter.
Just about everyone now carries an Internet-connected personal device. The Internet is everywhere you want it, and just about everywhere you don’t. People are starting to experience the world through augmented reality, using online maps and satellite overlays matched with your current location. The Internet is only going to become more omnipresent, faster. Within a few years, it’s hard to imagine any human activity that doesn’t involve the Internet.
And yet, somehow, we expect people to still be voting in person, on paper? We can’t even get users to take SSL certificate warnings seriously, but we’re going to convince them that voting is so special it has to be done in person? I don’t think so.
I’m not arguing that this is how it should be. I’m definitely not saying that we can secure online voting just like we can secure online banking. In fact I’ve made many of the original arguments, in my dissertation and on this blog, shooting down the bogus arguments that go something like “hey, we can secure online banking, surely we can secure online voting!” No, we don’t know how to do that.
What I’m saying is that, regardless of the state of online voting security, I think it’s a losing battle to expect voting to remain the only activity we still do in person and on paper. With the Oscars moving to online voting, the Federal Voting Assistance Program making $15M available in grants for activities related to online voting (even if it supposedly doesn’t involve online vote casting), parts of Canada moving to online voting, France considering online voting for its 2M+ expats (more than the margin of victory in the last Presidential election), what you’re hearing is the sound of inevitability.
There’s another interesting issue, when you think about problem (4): even if we keep voting on paper in person, voting requires enforced privacy: we have to make sure it’s just you in the voting booth, not you plus a coercer. That’s great. Now, how many ballots do you think we’re going to see next year published on Instagram?
We have a deeper problem here due to the now omnipresent Internet. Voluntary privacy is not dead, since users can choose to isolate themselves. But enforced privacy, privacy imposed on the voter, the kind needed to prevent coercion, that’s quite dead. I’m very concerned about what that means for democracy. But again, this is inevitable.
So, if it’s inevitable, maybe the best we can do is make online voting as secure as possible. We’ll probably have a few disasters, maybe even a few thrown elections. So we’d better start now on the problems we have.
I think we can solve Problem (2) with open-audit, end-to-end voting systems like Helios (though not only Helios; there are others). I think we can minimize the risk of Problem (1) by moving to a longer voting period (1 week instead of 1 day). I suspect we have to eventually give up on some aspects of (4), whether or not we do online voting, though some technical tricks might make voter coercion a good bit more difficult (it’s never completely impossible). The hardest problem is (3): we have no way of ensuring that people are using trustworthy software that captures their intent properly.
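To make the open-audit idea a little more concrete, here is a toy sketch of one piece of it: every cast ballot is published on a public bulletin board, and each voter keeps a fingerprint of their own encrypted ballot so they can check later that it was recorded. This is not Helios's actual protocol. The names `ballot_tracker` and `BulletinBoard`, and the choice of SHA-256 over an opaque ciphertext, are assumptions made purely for illustration, and a real system also needs publicly checkable proofs that the tally is correct.

```python
import hashlib


def ballot_tracker(encrypted_ballot: bytes) -> str:
    """Fingerprint of an encrypted ballot (hypothetical helper)."""
    return hashlib.sha256(encrypted_ballot).hexdigest()


class BulletinBoard:
    """Toy public list of ballot fingerprints that anyone can inspect."""

    def __init__(self) -> None:
        self._trackers: list[str] = []

    def post(self, encrypted_ballot: bytes) -> str:
        # Record the fingerprint and hand it back to the voter as a receipt.
        tracker = ballot_tracker(encrypted_ballot)
        self._trackers.append(tracker)
        return tracker

    def contains(self, tracker: str) -> bool:
        # Any voter or observer can run this check against the public list.
        return tracker in self._trackers


board = BulletinBoard()
receipt = board.post(b"opaque-ciphertext-for-voter-1")
board.post(b"opaque-ciphertext-for-voter-2")

# The voter confirms their ballot was recorded; an altered ballot would not match.
print(board.contains(receipt))
print(board.contains(ballot_tracker(b"tampered-ciphertext")))
```

The receipt reveals nothing about the vote itself, since it is computed over the ciphertext. Note that this check does nothing for Problem (3): it cannot tell the voter whether the software that produced the ciphertext encrypted the choice they actually made.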
Again, I’m not endorsing online voting for public office. I’m saying it’s inevitable, and it’s time to face that inevitability.
This issue of trustworthy user software is a much larger problem than voting. As human activity increasingly moves online, the central question is: what software is truly on the side of the user? How does the user know for sure that the software they’re using is their true agent? There’s only one piece of Internet architecture today that can be the user’s true agent, and that’s the Web browser (which technologists call the User Agent, unsurprisingly.) And, among the web browsers, there’s one that particularly stands out as the ultimate user agent, backed by a company whose mission is focused on the user and only the user.
That’s why I joined Mozilla. Because for voting and beyond, everything people do is online or soon to be online, and users better have an agent on their side. The best agent users can get today is Firefox, and I hope to contribute to making it an even better user agent in the next few years.
[It's worth noting that Mozilla has no intention of getting into the voting business, that's just my personal interest.]
OK, you may now get out your pitchfork.
Shared by Stuart
"I will convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML."
Tokyo University researchers develop scanner that can capture 200 pages in one minute
Yale University has adopted an open access policy for digitized images from its museums, archives, and libraries. Yale has also launched the Discover Yale Digital Commons, which has over 250,000 images.
Here's an excerpt from the announcement:
The goal of the new policy is to make high quality digital images of Yale's vast cultural heritage collections in the public domain openly and freely available.
As works in these collections become digitized, the museums and libraries will make those images that are in the public domain freely accessible. In a departure from established convention, no license will be required for the transmission of the images and no limitations will be imposed on their use. The result is that scholars, artists, students, and citizens the world over will be able to use these collections for study, publication, teaching and inspiration.
On page 287 of Blown to Bits, we discuss the incestuous relationships between the regulators and the regulated in the world of information flows.
And then there is the revolving door. Most communications jobs are in the private sector. FCC employees know that their future lies in the commercial use of the spectrum. Hundreds of FCC staff and officials, including all eight past FCC chairmen, have gone to work for or represented the businesses they regulated. These movements from government to private employment violate no government ethics rules. But FCC officials can be faced with a choice between angering a large incumbent that is a potential employer, and disappointing a marginal start-up or a public interest non-profit. It is not surprising that they remember that they will have to earn a living after leaving the FCC.

Even by historical standards, today's news is appalling. FCC Commissioner Attwell Baker is leaving the FCC to become a lobbyist for Comcast, just four months after voting to approve the controversial merger of Comcast with NBC Universal. We are, once again, going down the path to information monopoly. We have been there before; indeed, we were there already in the late 19th century.
Development work continues, and we’ve got a nice section of the publishing workflow and permissions set completed.
The staging server is set up (private access for now, sorry), and some initial code has been committed to Github.
And, we’ve got a new logo!
There’s also a thumbnail version.
Up for next week: finishing the workflow and permissions, and starting on the authoring piece.
Statement of Stuart M. Shieber before the Committee on Science, Space, and Technology Subcommittee on Investigations and Oversight, U.S. House of Representatives Shieber, Stuart M.
Plan Recognition in Exploratory Domains Gal, Ya'akov; Reddy, Swapna; Shieber, Stuart M.; Rubin, Andee; Grosz, Barbara J. This paper describes a challenging plan recognition problem that arises in environments in which agents engage widely in exploratory behavior, and presents new algorithms for effective plan recognition in such settings. In exploratory domains, agents’ actions map onto logs of behavior that include switching between activities, extraneous actions, and mistakes. Flexible pedagogical software, such as the application considered in this paper for statistics education, is a paradigmatic example of such domains, but many other settings exhibit similar characteristics. The paper establishes the task of plan recognition in exploratory domains to be NP-hard and compares several approaches for recognizing plans in these domains, including new heuristic methods that vary the extent to which they employ backtracking, as well as a reduction to constraint-satisfaction problems. The algorithms were empirically evaluated on people’s interaction with flexible, open-ended statistics education software used in schools. Data was collected from adults using the software in a lab setting as well as middle school students using the software in the classroom. The constraint satisfaction approaches were complete, but were an order of magnitude slower than the heuristic approaches. In addition, the heuristic approaches were able to perform within 4% of the constraint satisfaction approaches on student data from the classroom, which reflects the intended user population of the software. These results demonstrate that the heuristic approaches offer a good balance between performance and computation time when recognizing people’s activities in the pedagogical domain of interest.
Inverting the Turing Test [review of The Most Human Human by Brian Christian] Shieber, Stuart M. In his book The Most Human Human, Brian Christian extrapolates from his experiences at the 2009 Loebner Prize competition, a competition among chatbots (computer programs that engage in conversation with people) to see which is "most human." In doing so, he demonstrates once again that the human being may be the only animal that overinterprets.
A Simple Language for Novel Visualizations of Information Shieber, Stuart M.; Lucas, Wendy While information visualization tools support the representation of abstract data, their ability to enhance one’s understanding of complex relationships can be hindered by a limited set of predefined charts. To enable novel visualization over multiple variables, we propose a declarative language for specifying informational graphics from first principles. The language maps properties of generic objects to graphical representations based on scaled interpretations of data values. An iterative approach to constraint solving that involves user advice enables the optimization of graphic layouts. The flexibility and expressiveness of a powerful but relatively easy to use grammar supports the expression of visualizations ranging from the simple to the complex.
A Language for Specifying Informational Graphics from First Principles Shieber, Stuart M.; Lucas, Wendy Informational visualization tools, such as commercial charting packages, provide a standard set of visualizations for tabular data, including bar charts, scatter plots, pie charts, and the like. For some combinations of data and task, these are suitable visualizations. For others, however, novel visualizations over multiple variables would be preferred but are unavailable in the fixed list of standard options. To allow for these cases, we introduce a declarative language for specifying visualizations on the basis of the first principles on which (a subset of) informational graphics are built. The functionality we aim to provide with this language is presented by way of example, from simple scatter plots to versions of two quite famous visualizations: Minard’s depiction of troop strength during Napoleon’s march on Moscow and a map of the early ARPAnet from the ancient history of the Internet. Benefits of our approach include flexibility and expressiveness for specifying a range of visualizations that cannot be rendered with standard commercial systems.
Synchronous Vector TAG for Syntax and Semantics: Control Verbs, Relative Clauses, and Inverse Linking Shieber, Stuart M.; Nesson, Rebecca Recent work has used the synchronous tree-adjoining grammar (STAG) formalism to demonstrate that many of the cases in which syntactic and semantic derivations appeared to be divergent could be handled elegantly through synchronization. This research has provided syntax and semantics for diverse and complex linguistic phenomena. However, certain hard cases push the STAG formalism to its limits, requiring awkward analyses or leaving no clear solution at all. In this paper we introduce a new variant of STAG, synchronous vector TAG (SV-TAG), and demonstrate that it has the potential to handle hard cases such as control verbs, relative clauses, and inverse linking, while maintaining the simplicity of previous STAG syntax-semantics analyses.
Recognition of Users' Activities using Constraint Satisfaction Reddy, Swapna Cherukupalli; Gal, Ya'akov; Shieber, Stuart M. Ideally designed software allows users to explore and pursue interleaving plans, making it challenging to automatically recognize user interactions. The recognition algorithms presented use constraint satisfaction techniques to compare user interaction histories to a set of ideal solutions. We evaluate these algorithms on data obtained from user interactions with a commercially available pedagogical software package, and find that these algorithms identified users’ activities with 93% accuracy.
Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression Yamangil, Elif; Shieber, Stuart M. We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing. These translation tasks are characterized by the relative ability to commit to parallel parse trees and availability of word alignments, yet the unavailability of large-scale data, calling for a Bayesian tree-to-tree formalism. We formalize nonparametric Bayesian STSG with epsilon alignment in full generality, and provide a Gibbs sampling algorithm for posterior inference tailored to the task of extractive sentence compression. We achieve improvements against a number of baselines, including expectation maximization and variational Bayes training, illustrating the merits of nonparametric inference over the space of grammars as opposed to sparse parametric inference with a fixed grammar.
Complexity, Parsing, and Factorization of Tree-Local Multi-Component Tree-Adjoining Grammar Shieber, Stuart M.; Satta, Giorgio; Nesson, Rebecca Tree-Local Multi-Component Tree-Adjoining Grammar (TL-MCTAG) is an appealing formalism for natural language representation because it arguably allows the encapsulation of the appropriate domain of locality within its elementary structures. Its multicomponent structure allows modeling of lexical items that may ultimately have elements far apart in a sentence, such as quantifiers and Wh-words. When used as the base formalism for a synchronous grammar, its flexibility allows it to express both the close relationships and the divergent structure necessary to capture the links between the syntax and semantics of a single language or the syntax of two different languages. Its limited expressivity provides constraints on movement and, we posit, may have generated additional popularity based on a misconception about its parsing complexity. Although TL-MCTAG was shown to be equivalent in expressivity to TAG when it was first introduced (Weir 1988), the complexity of TL-MCTAG is still not well-understood. This paper offers a thorough examination of the problem of TL-MCTAG recognition, showing that even highly restricted forms of TL-MCTAG are NP-complete to recognize. However, in spite of the provable difficulty of the recognition problem, we offer several algorithms that can substantially improve processing efficiency. First, we present a parsing algorithm that improves on the baseline parsing method and runs in polynomial time when both the fan-out and rank of the input grammar are bounded. Second, we offer an optimal, efficient algorithm for factorizing a grammar to produce a strongly-equivalent TL-MCTAG grammar with the rank of the grammar minimized.
Identifying Uncertain Words within an Utterance via Prosodic Features Pon-Barry, Heather Roberta; Shieber, Stuart M. We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate target word segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.
The Importance of Sub-Utterance Prosody in Predicting Level of Certainty Shieber, Stuart M.; Pon-Barry, Heather Roberta We present an experiment aimed at understanding how to optimally use acoustic and prosodic information to predict a speaker's level of certainty. With a corpus of utterances where we can isolate a single word or phrase that is responsible for the speaker's level of certainty we use different sets of sub-utterance prosodic features to train models for predicting an utterance's perceived level of certainty. Our results suggest that using prosodic features of the word or phrase responsible for the level of certainty and of its surrounding context improves the prediction accuracy without increasing the total number of features when compared to using only features taken from the utterance as a whole.
Criteria for Designing Computer Facilities for Linguistic Analysis Shieber, Stuart M. Abstract: In the natural-language-processing research community, the usefulness of computer tools for testing linguistic analyses is often taken for granted. Linguists, on the other hand, have generally been unaware of or ambivalent about such devices. We discuss several aspects of computer use that are preeminent in establishing the utility for linguistic research of computer tools and describe several factors that must be considered in designing such computer tools to aid in testing linguistic analyses of grammatical phenomena. A series of design alternatives, some theoretically and some practically motivated, is then based on the resultant criteria. We present one way of pinning down these choices which culminates in a description of a particular grammar formalism for use in computer linguistic tools. The PATR-II formalism thus serves to exemplify our general perspective.
Agent Decision-Making in Open Mixed Networks Gal, Ya'akov; Grosz, Barbara J.; Kraus, Sarit; Shieber, Stuart M. Computer systems increasingly carry out tasks in mixed networks, that is in group settings in which they interact both with other computer systems and with people. Participants in these heterogeneous human-computer groups vary in their capabilities, goals, and strategies; they may cooperate, collaborate, or compete. The presence of people in mixed networks raises challenges for the design and the evaluation of decision-making strategies for computer agents. This paper describes several new decision-making models that represent, learn and adapt to various social attributes that influence people's decision-making and presents a novel approach to evaluating such models. It identifies a range of social attributes in an open-network setting that influence people's decision-making and thus affect the performance of computer-agent strategies, and establishes the importance of learning and adaptation to the success of such strategies. The settings vary in the capabilities, goals, and strategies that people bring into their interactions. The studies deploy a configurable system called Colored Trails (CT) that generates a family of games. CT is an abstract, conceptually simple but highly versatile game in which players negotiate and exchange resources to enable them to achieve their individual or group goals. It provides a realistic analogue to multi-agent task domains, while not requiring extensive domain modeling. It is less abstract than payoff matrices, and people exhibit less strategic and more helpful behavior in CT than in the identical payoff matrix decision-making context. By not requiring extensive domain modeling, CT enables agent researchers to focus their attention on strategy design, and it provides an environment in which the influence of social factors can be better isolated and studied.
Recognizing Uncertainty in Speech Shieber, Stuart M.; Pon-Barry, Heather Roberta We address the problem of inferring a speaker’s level of certainty based on prosodic information in the speech signal, which has application in speech-based dialogue systems. We show that using phrase-level prosodic features centered around the phrases causing uncertainty, in addition to utterance-level prosodic features, improves our model’s level of certainty classification. In addition, our models can be used to predict which phrase a person is uncertain about. These results rely on a novel method for eliciting utterances of varying levels of certainty that allows us to compare the utility of contextually-based feature sets. We elicit level of certainty ratings from both the speakers themselves and a panel of listeners, finding that there is often a mismatch between speakers’ internal states and their perceived states, and highlighting the importance of this distinction.
Equity for Open-Access Journal Publishing Shieber, Stuart M. Scholars write articles to be read--the more access to their articles the better--so one might think that the open-access approach to publishing, in which articles are freely available online to all without interposition of an access fee, would be an attractive competitor to traditional subscription-based journal publishing. But open-access journal publishing is currently at a systematic disadvantage relative to the traditional model. I propose a simple, cost-effective remedy to this inequity that would put open-access publishing on a path to become a sustainable, efficient system, allowing the two journal publishing systems to compete on a more level playing field. The issue is important, first, because academic institutions shouldn’t perpetuate barriers to an open-access business model on principle and, second, because the subscription-fee business model has manifested systemic dysfunctionalities in practice. After describing the problem with the subscription-fee model, I turn to the proposal for providing equity for open-access journal publishing--the open-access compact.
Efficiently Parsable Extensions to Tree-Local Multicomponent TAG Nesson, Rebecca; Shieber, Stuart M. Recent applications of Tree-Adjoining Grammar (TAG) to the domain of semantics as well as new attention to syntactic phenomena have given rise to increased interest in more expressive and complex multicomponent TAG formalisms (MCTAG). Although many constructions can be modeled using tree-local MCTAG (TL-MCTAG), certain applications require even more flexibility. In this paper we suggest a shift in focus from constraining locality and complexity through tree- and set-locality to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation. We examine three formalisms, restricted NS-MCTAG, restricted Vector-TAG and delayed TL-MCTAG, that use notions of derivational distance to constrain locality and demonstrate how they permit additional expressivity beyond TL-MCTAG without increasing complexity to the level of set-local MCTAG.
Restricting the weak-generative capacity of synchronous tree-adjoining grammars Shieber, Stuart The formalism of synchronous tree-adjoining grammars, a variant of standard tree-adjoining grammars (TAG), was intended to allow the use of TAGs for language transduction in addition to language specification. In previous work, the definition of the transduction relation defined by a synchronous TAG was given by appeal to an iterative rewriting process. The rewriting definition of derivation is problematic in that it greatly extends the expressivity of the formalism and makes the design of parsing algorithms difficult if not impossible. We introduce a simple, natural definition of synchronous tree-adjoining derivation, based on isomorphisms between standard tree-adjoining derivations, that avoids the expressivity and implementability problems of the original rewriting definition. The decrease in expressivity, which would otherwise make the method unusable, is offset by the incorporation of an alternative definition of standard tree-adjoining derivation, previously proposed for completely separate reasons, thereby making it practical to entertain using the natural definition of synchronous derivation. Nonetheless, some remaining problematic cases call for yet more flexibility in the definition; the isomorphism requirement may have to be relaxed. It remains for future research to tune the exact requirements on the allowable mappings.
Optimal k-arization of synchronous tree-adjoining grammar Nesson, Rebecca; Shieber, Stuart; Satta, Giorgio Synchronous Tree-Adjoining Grammar (STAG) is a promising formalism for syntax-aware machine translation and simultaneous computation of natural-language syntax and semantics. Current research in both of these areas is actively pursuing its incorporation. However, STAG parsing is known to be NP-hard due to the potential for intertwined correspondences between the linked nonterminal symbols in the elementary structures. Given a particular grammar, the polynomial degree of efficient STAG parsing algorithms depends directly on the rank of the grammar: the maximum number of correspondences that appear within a single elementary structure. In this paper we present a compile-time algorithm for transforming a STAG into a strongly-equivalent STAG that optimally minimizes the rank, k, across the grammar. The algorithm performs in O( |G| + |Y| · (L_G)^3 ) time where L_G is the maximum number of links in any single synchronous tree pair in the grammar and Y is the set of synchronous tree pairs of G.
Formal constraints on metarules Shieber, Stuart; Robinson, Jane J.; Stucky, Susan U.; Uszkoreit, Hans Metagrammatical formalisms that combine context-free phrase structure rules and metarules (MPS grammars) allow concise statement of generalizations about the syntax of natural languages. Unconstrained MPS grammars, unfortunately, are not computationally "safe." We evaluate several proposals for constraining them, basing our assessment on computational tractability and explanatory adequacy. We show that none of them satisfies both criteria, and suggest new directions for research on alternative metagrammatical formalisms.
The design of a computer language for linguistic information Shieber, Stuart A considerable body of accumulated knowledge about the design of languages for communicating information to computers has been derived from the subfields of programming language design and semantics. It has been the goal of the PATR group at SRI to utilize a relevant portion of this knowledge in implementing tools to facilitate communication of linguistic information to computers. The PATR-II formalism is our current computer language for encoding linguistic information. This paper, a brief overview of that formalism, attempts to explicate our design decisions in terms of a set of properties that effective computer languages should incorporate.
The semantics of grammar formalisms seen as computer languages Shieber, Stuart; Pereira, Fernando C. N. The design, implementation, and use of grammar formalisms for natural language have constituted a major branch of computational linguistics throughout its development. By viewing grammar formalisms as just a special case of computer languages, we can take advantage of the machinery of denotational semantics to provide a precise specification of their meaning. Using Dana Scott's domain theory, we elucidate the nature of the feature systems used in augmented phrase-structure grammar formalisms, in particular those of recent versions of generalized phrase structure grammar, lexical functional grammar and PATR-II, and provide a denotational semantics for a simple grammar formalism. We find that the mathematical structures developed for this purpose contain an operation of feature generalization, not available in those grammar formalisms, that can be used to give a partial account of the effect of coordination on syntactic features.
Sentence disambiguation by a shift-reduce parsing technique Shieber, Stuart Native speakers of English show definite and consistent preferences for certain readings of syntactically ambiguous sentences. A user of a natural-language-processing system would naturally expect it to reflect the same preferences. Thus, such systems must model in some way the linguistic performance as well as the linguistic competence of the native speaker. We have developed a parsing algorithm---a variant of the LALR(1) shift-reduce algorithm---that models the preference behavior of native speakers for a range of syntactic preference phenomena reported in the psycholinguistic literature, including the recent data on lexical preferences. The algorithm yields the preferred parse deterministically, without building multiple parse trees and choosing among them. As a side effect, it displays appropriate behavior in processing the much discussed garden-path sentences. The parsing algorithm has been implemented and has confirmed the feasibility of our approach to the modeling of these phenomena.
Synchronous tree-adjoining grammars Schabes, Yves; Shieber, Stuart The unique properties of tree-adjoining grammars (TAG) present a challenge for the application of TAGs beyond the limited confines of syntax, for instance, to the task of semantic interpretation or automatic translation of natural language. We present a variant of TAGs, called synchronous TAGs, which characterize correspondences between languages. The formalism's intended usage is to relate expressions of natural languages to their associated semantics represented in a logical form language, or to their translates in another natural language; in summary, we intend it to allow TAGs to be used beyond their role in syntax proper. We discuss the application of synchronous TAGs to concrete examples, mentioning primarily in passing some computational issues that arise in its interpretation.
A viewer for PostScript documents Shieber, Stuart; Marks, Joe; Ginsburg, Adam We describe a PostScript viewer that provides navigation and annotation functionality similar to that of paper documents using simple unified user-interface techniques.
An interactive system for drawing graphs Marks, Joe; Shieber, Stuart; Ryall, Kathy In spite of great advances in the automatic drawing of medium and large graphs, the tools available for drawing small graphs exquisitely (that is, with the aesthetics commonly found in professional publications and presentations) are still very primitive. Commercial tools, e.g., Claris Draw, provide minimal support for aesthetic graph layout. At the other extreme, research prototypes based on constraint methods are overly general for graph drawing. Our system improves on general constraint-based approaches to drawing and layout by supporting only a small set of “macro” constraints that are specifically suited to graph drawing. These constraints are enforced by a generalized spring algorithm. The result is a usable and useful tool for drawing small graphs easily and nicely.
Semi-automatic Delineation of Regions in Floor Plans Shieber, Stuart; Mazer, Murray; Marks, Joe; Ryall, Kathy We propose a technique that uses a proximity metric for delineating partially or fully bounded regions of a scanned bitmap that depicts a building floor plan. A proximity field is defined over the bitmap, which is used both to identify the centers of subjective regions in the image and to assign pixels to regions by proximity. The region boundaries generated by the method tend to match well the subjective boundaries of regions in the image. We discuss incorporation of the technique in a semi-automated interactive system for region identification in floor plans. In contrast to area-filling techniques for delineating areal regions of images, our approach works robustly for partially bounded regions. Furthermore, the frailties of the method that do remain, unlike those of alternative techniques, are well-moderated by simple human intervention.
A simple reconstruction of GPSG Shieber, Stuart Like most linguistic theories, the theory of generalized phrase structure grammar (GPSG) has described language axiomatically, that is, as a set of universal and language-specific constraints on the well-formedness of linguistic elements of some sort. The coverage and detailed analysis of English grammar in the ambitious recent volume by Gazdar, Klein, Pullum, and Sag entitled Generalized Phrase Structure Grammar are impressive, in part because of the complexity of the axiomatic system developed by the authors. In this paper, we examine the possibility that simpler descriptions of the same theory can be achieved through a slightly different, albeit still axiomatic, method. Rather than characterize the well-formed trees directly, we progress in two stages by procedurally characterizing the well-formedness axioms themselves, which in turn characterize the trees.
A uniform architecture for parsing and generation Shieber, Stuart The use of a single grammar for both parsing and generation is an idea with a certain elegance, the desirability of which several researchers have noted. In this paper, we discuss a more radical possibility: not only can a single grammar be used by different processes engaged in various "directions" of processing, but one and the same language-processing architecture can be used for processing the grammar in the various modes. In particular, parsing and generation can be viewed as two processes engaged in by a single parameterized theorem prover for the logical interpretation of the formalism. We discuss our current implementation of such an architecture, which is parameterized in such a way that it can be used for either purpose with grammars written in the PATR formalism. Furthermore, the architecture allows fine tuning to reflect different processing strategies, including parsing models intended to mimic psycholinguistic phenomena. This tuning allows the parsing system to operate within the same realm of efficiency as previous architectures for parsing alone, but with much greater flexibility for engaging in other processing regimes.
Design Galleries: A general approach to setting parameters for computer graphics and animation Gibson, Sarah; Beardsley, Paul; Ruml, Wheeler; Kang, Thomas; Mirtich, Brian; Seims, Joshua; Freeman, William; Hodgins, Jessica; Pfister, Hanspeter; Marks, Joe; Andalman, Brad; Shieber, Stuart Image rendering maps scene parameters to output pixel values; animation maps motion-control parameters to trajectory values. Because these mapping functions are usually multidimensional, nonlinear, and discontinuous, finding input parameters that yield desirable output values is often a painful process of manual tweaking. Interactive evolution and inverse design are two general methodologies for computer-assisted parameter setting in which the computer plays a prominent role. In this paper we present another such methodology: Design Gallery™ (DG) interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different graphics or animations that can be produced by varying a given input-parameter vector. The principal technical challenges posed by the DG approach are dispersion, finding a set of input-parameter vectors that optimally disperses the resulting output-value vectors, and arrangement, organizing the resulting graphics for easy and intuitive browsing by the user. We describe the use of DGs for several parameter-setting problems: light selection and placement for image rendering, both standard and image-based; opacity and color transfer-function specification for volume rendering; and motion control for particle-system and articulated-figure animation.
Design gallery browsers based on 2D and 3D graph drawing Marks, Joe; Ruml, Wheeler; Andalman, Brad; Ryall, Kathy; Shieber, Stuart Many problems in computer-aided design and graphics involve the process of setting and adjusting input parameters to obtain desirable output values. Exploring different parameter settings can be a difficult and tedious task in most such systems. In the Design Gallery™ (DG) approach, parameter setting is made easier by dividing the task more equitably between user and computer. DG interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different designs that can be produced by varying a given set of input parameters. The DG approach has been applied to several difficult parameter-setting tasks from the field of computer graphics: light selection and placement for image rendering; opacity and color transfer-function specification for volume rendering; and motion control for articulated-figure and particle-system animation. The principal technical challenges posed by the DG approach are dispersion (finding a set of input-parameter vectors that optimally disperses the resulting output values) and arrangement (arranging the resulting designs for easy browsing by the user). We show how effective arrangement can be achieved with 2D and 3D graph drawing. While navigation is easier in the 2D interface, the 3D interface has proven to be surprisingly usable, and the 3D drawings sometimes provide insights that are not so obvious in the 2D drawings.
Induction of probabilistic synchronous tree-insertion grammars for machine translation. Nesson, Rebecca; Rush, Alexander; Shieber, Stuart The more expressive and flexible a base formalism for machine translation is, the less efficient parsing of it will be. However, even among formalisms with the same parse complexity, some formalisms better realize the desired characteristics for machine translation formalisms than others. We introduce a particular formalism, probabilistic synchronous tree-insertion grammar (PSTIG), that we argue satisfies the desiderata optimally within the class of formalisms that can be parsed no less efficiently than context-free grammars, and demonstrate that it outperforms state-of-the-art word-based and phrase-based finite-state translation models on training and test data taken from the EuroParl corpus (Koehn, 2005). We then argue that a higher level of translation quality can be achieved by hybridizing our induced model with elementary structures produced using supervised techniques such as those of Groves et al. (2004).
A seed-growth heuristic for graph bisection Ngo, J. Thomas; Shieber, Stuart; Ruml, Wheeler; Marks, Joe We present a new heuristic algorithm for graph bisection, based on an implicit notion of clustering. We describe how the heuristic can be combined with stochastic search procedures and a postprocess application of the Kernighan-Lin algorithm. In a series of time-equated comparisons with large-sample runs of pure Kernighan-Lin, the new algorithm demonstrates significant superiority in terms of the best bisections found.
Generation and synchronous tree-adjoining grammars Schabes, Yves; Shieber, Stuart Tree-adjoining grammars (TAG) have been proposed as a formalism for generation based on the intuition that the extended domain of syntactic locality that TAGs provide should aid in localizing semantic dependencies as well, in turn serving as an aid to generation from semantic representations. We demonstrate that this intuition can be made concrete by using the formalism of synchronous tree-adjoining grammars. The use of synchronous TAGs for generation provides solutions to several problems with previous approaches to TAG generation. Furthermore, the semantic monotonicity requirement previously advocated for generation grammars as a computational aid is seen to be an inherent property of synchronous TAGs.
An interactive constraint-based system for drawing graphs Marks, Joe; Ryall, Kathy; Shieber, Stuart The glide system is an interactive constraint-based editor for drawing small- and medium-sized graphs (50 nodes or fewer) that organizes the interaction in a more collaborative manner than in previous systems. Its distinguishing features are a vocabulary of specialized constraints for graph drawing, and a simple constraint-satisfaction mechanism that allows the user to manipulate the drawing while the constraints are active. These features result in a graph-drawing editor that is superior in many ways to those based on more general and powerful constraint-satisfaction methods.
Empirical testing of algorithms for variable-sized label placement Marks, Joe; Friedman, Stacy; Christensen, Jon; Shieber, Stuart We report an empirical comparison of different heuristic techniques for variable-sized point-feature label placement.
The LinGO redwoods treebank: Motivation and preliminary applications Brants, Thorsten; Flickinger, Dan; Manning, Christopher; Shieber, Stuart; Toutanova, Kristina; Oepen, Stephan The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. While several medium- to large-scale treebanks exist for English (and for other major languages), pre-existing publicly available resources exhibit the following limitations: (i) annotation is mono-stratal, either encoding topological (phrase structure) or tectogrammatical (dependency) information, (ii) the depth of linguistic information recorded is comparatively shallow, (iii) the design and format of linguistic representation in the treebank hard-wires a small, predefined range of ways in which information can be extracted from the treebank, and (iv) representations in existing treebanks are static and over the (often year- or decade-long) evolution of a large-scale treebank tend to fall behind the development of the field. LinGO Redwoods aims at the development of a novel treebanking methodology, rich in nature and dynamic both in the ways linguistic data can be retrieved from the treebank in varying granularity and in the constant evolution and regular updating of the treebank itself. Since October 2001, the project has been working to build the foundations for this new type of treebank, to develop a basic set of tools for treebank construction and maintenance, and to construct an initial set of 10,000 annotated trees to be distributed together with the tools under an open-source license.
Abbreviated text input Shieber, Stuart; Baker, Ellie We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on PDAs or cell phones or by disabled users) by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system's predictions. We propose taking advantage of the duality between prediction and compression. We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by about 30% yet is simple enough that it can be learned easily and generated relatively fluently. Using statistical language processing techniques, we can decode the abbreviated text with a residual word error rate of about 3%, and we expect that simple adaptive methods can improve this to about 1.5%. Because the system's operation is completely independent from the user's, the overhead from cognitive task switching and attending to the system's actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not.
Unifying annotated discourse hierarchies to create a gold standard Carbone, Marco; Shieber, Stuart; Gal, Ya'akov Kobi Human annotation of discourse corpora typically results in segmentation hierarchies that vary in their degree of agreement. This paper presents several techniques for unifying multiple discourse annotations into a single hierarchy, deemed a “gold standard”, the segmentation that best captures the underlying linguistic structure of the discourse. It proposes and analyzes methods that consider the level of embeddedness of a segmentation as well as methods that do not. A corpus containing annotated hierarchical discourses, the Boston Directions Corpus, was used to evaluate the “goodness” of each technique, by comparing the similarity of the segmentation it derives to the original annotations in the corpus. Several metrics of similarity between hierarchical segmentations are computed: precision/recall of matching utterances, pairwise inter-reliability scores (κ), and non-crossing-brackets. A novel method for unification that minimizes conflicts among annotators outperforms methods that require consensus among a majority for the κ and recall metrics, while capturing much of the structure of the discourse. When higher recall is preferred, methods requiring a majority are preferable to those that demand full consensus among annotators.
Arabic diacritization using weighted finite-state transducers Shieber, Stuart; Nelken, Rani Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.
Unifying synchronous tree-adjoining grammars and tree transducers via bimorphisms. Shieber, Stuart We place synchronous tree-adjoining grammars and tree transducers in the single overarching framework of bimorphisms, continuing the unification of synchronous grammars and tree transducers initiated by Shieber (2004). Along the way, we present a new definition of the tree-adjoining grammar derivation relation based on a novel direct inter-reduction of TAG and monadic macro tree transducers.
Referring-expression generation using a transformation-based learning approach Nickerson, Jill; Shieber, Stuart; Grosz, Barbara A natural language generation system must generate expressions that allow a reader to identify the entities to which they refer. This paper describes the creation of referring-expression (RE) generation models developed using a transformation-based learning approach. We present an evaluation of the learned models and compare their performance to the performance of a baseline system, which always generates full noun phrase REs. When compared to the baseline system, the learned models produce REs that lead to more coherent natural language documents and are more accurate and closer in length to those that people use.
Probabilistic synchronous tree-adjoining grammars for machine translation: The argument from bilingual dictionaries. Shieber, Stuart We provide a conceptual basis for thinking of machine translation in terms of synchronous grammars in general, and probabilistic synchronous tree-adjoining grammars in particular. Evidence for the view is found in the structure of bilingual dictionaries of the last several millennia.
Practical secrecy-preserving, verifiably correct and trustworthy auctions. Shieber, Stuart; Parkes, David; Rabin, Michael; Thorpe, Christopher We present a practical system for conducting sealed-bid auctions that preserves the secrecy of the bids while providing for verifiable correctness and trustworthiness of the auction. The auctioneer must accept all bids submitted and follow the published rules of the auction. No party receives any useful information about bids before the auction closes and no bidder is able to change or repudiate her bid. Our solution uses Paillier's homomorphic encryption scheme [25] for zero knowledge proofs of correctness. Only minimal cryptographic technology is required of bidders; instead of employing complex interactive protocols or multi-party computation, the single auctioneer computes optimal auction results and publishes proofs of the results' correctness. Any party can check these proofs of correctness via publicly verifiable computations on encrypted bids. The system is illustrated through application to first-price, uniform-price and second-price auctions, including multi-item auctions. Our empirical results demonstrate the practicality of our method: auctions with hundreds of bidders are within reach of a single PC, while a modest distributed computing network can accommodate auctions with thousands of bids.
Towards collaborative intelligent tutors: Automated recognition of users' strategies. Grosz, Barbara; Rubin, Andee; Yamangil, Elif; Shieber, Stuart; Gal, Ya'akov Kobi This paper addresses the problem of inferring students’ strategies when they interact with data-modeling software used for pedagogical purposes. The software enables students to learn about statistical data by building and analyzing their own models. Automatic recognition of students’ activities when interacting with pedagogical software is challenging. Students can pursue several plans in parallel and interleave the execution of these plans. The algorithm presented in this paper decomposes students’ complete interaction histories with the software into hierarchies of interdependent tasks that may be subsequently compared with ideal solutions. This algorithm is evaluated empirically using commercial software that is used in many schools. Results indicate that the algorithm is able to (1) identify the plans students use when solving problems using the software; (2) distinguish between those actions in students’ plans that play a salient part in their problem-solving and those representing exploratory actions and mistakes; and (3) capture students’ interleaving and free-order action sequences.
Parse disambiguation for a rich HPSG grammar Oepen, Stephan; Flickinger, Dan; Manning, Christopher; Shieber, Stuart; Toutanova, Kristina
The influence of task contexts on the decision-making of humans and computers. Gal, Ya'akov Kobi; Allain, Alex; Grosz, Barbara; Pfeffer, Avrom; Shieber, Stuart Many environments in which people and computer agents interact involve deploying resources to accomplish tasks and satisfy goals. This paper investigates the way that the context in which decisions are made affects the behavior of people and the performance of computer agents that interact with people in such environments. It presents experiments that measured negotiation behavior in two different types of settings. One setting was a task context that made explicit the relationships among goals, (sub)tasks and resources. The other setting was a completely abstract context in which only the payoffs for the decision choices were listed. Results show that people are more helpful, less selfish, and less competitive when making decisions in task contexts than when making them in completely abstract contexts. Further, their overall performance was better in task contexts. A predictive computational model that was trained on data obtained in the task context outperformed a model that was trained under the abstract context. These results indicate that taking context into account is essential for the design of computer agents that will interact well with people.
Extraction phenomena in synchronous TAG syntax and semantics. Shieber, Stuart; Nesson, Rebecca We present a proposal for the structure of noun phrases in Synchronous Tree-Adjoining Grammar (STAG) syntax and semantics that permits an elegant and uniform analysis of a variety of phenomena, including quantifier scope and extraction phenomena such as wh-questions with both moved and in-place wh-words, pied-piping, stranding of prepositions, and topicalization. The tight coupling between syntax and semantics enforced by the STAG helps to illuminate the critical relationships and filter out analyses that may be appealing for either syntax or semantics alone but do not allow for a meaningful relationship between them.
Colored trails: A multiagent system testbed for decision-making research (demonstration) Shieber, Stuart; Grosz, Barbara; Pfeffer, Avrom; Ficici, Sevan; Gal, Ya'akov Kobi With increasing frequency, computer agents participate in collaborative and competitive multiagent domains in which humans reason strategically to make decisions. The deployment of computer agents in such domains requires that the agents understand something about human behavior so that they can interact successfully with people; the computer agents must be sensitive to how people reason in strategic settings as well as to the social utilities people employ to inform their reasoning. To date, these design requirements for computer agents have received relatively little attention. To further research in this area, we are developing the Colored Trails (CT) testbed [5], a configurable and extensible open-source system for use by the research community at large to investigate multiagent decision making.
A writer's collaborative assistant Babaian, Tamara; Shieber, Stuart; Grosz, Barbara In traditional human-computer interfaces, a human master directs a computer system as a servant, telling it not only what to do, but also how to do it. Collaborative interfaces attempt to realign the roles, making the participants collaborators in solving the person's problem. This paper describes Writer's Aid, a system that deploys AI planning techniques to enable it to serve as an author's collaborative assistant. Writer's Aid differs from previous collaborative interfaces in both the kinds of actions the system partner takes and the underlying technology it uses to do so. While an author writes a document, Writer's Aid helps in identifying and inserting citation keys and by autonomously finding and caching potentially relevant papers and their associated bibliographic information from various on-line sources. This autonomy, enabled by the use of a planning system at the core of Writer's Aid, distinguishes this system from other collaborative interfaces. The collaborative design and its division of labor result in more efficient operation: faster and easier writing on the user's part and more effective information gathering on the part of the system. Subjects in our laboratory user study found the system effective and the interface intuitive and easy to use.
Partially ordered multiset context-free grammars and ID/LP parsing Nederhof, Mark-Jan; Shieber, Stuart; Satta, Giorgio We present a new formalism, partially ordered multiset context-free grammars (poms-CFG), along with an Earley-style parsing algorithm. The formalism, which can be thought of as a generalization of context-free grammars with partially ordered right-hand sides, is of interest in its own right, and also as infrastructure for obtaining tighter complexity bounds for more expressive context-free formalisms intended to express free or multiple word-order, such as ID/LP grammars. We reduce ID/LP grammars to poms-grammars, thereby getting finer-grained bounds on the parsing complexity of ID/LP grammars. We argue that in practice, the width of attested ID/LP grammars is small, yielding effectively polynomial time complexity for ID/LP grammar parsing.
A learning approach to improving sentence-level MT evaluation Shieber, Stuart; Kulesza, Alex The problem of evaluating machine translation (MT) systems is more challenging than it may first appear, as diverse translations can often be considered equally correct. The task is even more difficult when practical circumstances require that evaluation be done automatically over short texts, for instance, during incremental system development and error analysis. While several automatic metrics, such as BLEU, have been proposed and adopted for large-scale MT system discrimination, they all fail to achieve satisfactory levels of correlation with human judgments at the sentence level. Here, a new class of metrics based on machine learning is introduced. A novel method involving classifying translations as machine or human-produced rather than directly predicting numerical human judgments eliminates the need for labor-intensive user studies as a source of training data. The resulting metric, based on support vector machines, is shown to significantly improve upon current automatic metrics, increasing correlation with human judgments at the sentence level halfway toward that achieved by an independent human evaluator.
Towards robust context-sensitive sentence alignment for monolingual corpora Shieber, Stuart; Nelken, Rani Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm. Our approach provides a simpler and more robust solution achieving a substantial improvement in accuracy over existing systems.
Does the Turing Test demonstrate intelligence or not? Shieber, Stuart The Turing Test has served as a defining inspiration throughout the early history of artificial intelligence research. Its centrality arises in part because verbal behavior indistinguishable from that of humans seems like an incontrovertible criterion for intelligence, a "philosophical conversation stopper" as Dennett says. On the other hand, from the moment Turing's seminal Mind article was published, the conversation hasn't stopped; the appropriateness of the Test has been continually questioned, and current philosophical wisdom holds that the Turing Test is hopelessly flawed as a sufficient condition for attributing intelligence. In this short article, I summarize for an artificial intelligence audience an argument that I have presented at length for a philosophical audience that attempts to reconcile these two mutually contradictory but well-founded attitudes towards the Turing Test that have been under constant debate since 1950.
Simpler TAG semantics through synchronization Shieber, Stuart; Nesson, Rebecca In recent years Laura Kallmeyer, Maribel Romero, and their collaborators have led research on TAG semantics through a series of papers refining a system of TAG semantics computation. Kallmeyer and Romero bring together the lessons of these attempts with a set of desirable properties that such a system should have. First, computation of the semantics of a sentence should rely only on the relationships expressed in the TAG derivation tree. Second, the generated semantics should compactly represent all valid interpretations of the input sentence, in particular with respect to quantifier scope. Third, the formalism should not, if possible, increase the expressivity of the TAG formalism. We revive the proposal of using synchronous TAG (STAG) to simultaneously generate syntactic and semantic representations for an input sentence. Although STAG meets the three requirements above, no serious attempt had previously been made to determine whether it can model the semantic constructions that have proved difficult for other approaches. In this paper we begin exploration of this question by proposing STAG analyses of many of the hard cases that have spurred the research in this area. We reframe the TAG semantics problem in the context of the STAG formalism and in the process present a simple, intuitive base for further exploration of TAG semantics. We provide analyses that demonstrate how STAG can handle quantifier scope, long-distance WH-movement, interaction of raising verbs and adverbs, attitude verbs and quantifiers, relative clauses, and quantifiers within prepositional phrases.
Comma restoration using constituency information Tao, Xiaopeng; Shieber, Stuart Automatic restoration of punctuation from unpunctuated text has application in improving the fluency and applicability of speech recognition systems. We explore the possibility that syntactic information can be used to improve the performance of an HMM-based system for restoring punctuation (specifically, commas) in text. Our best methods reduce sentence error rate substantially, by some 20%, with an additional 8% reduction possible given improvements in extraction of the requisite syntactic information.
An alternative conception of tree-adjoining derivation Shieber, Stuart; Schabes, Yves The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable, through a compilation to linear indexed grammars, and computationally operational, by virtue of an efficient algorithm for recognition and parsing.
Computing the communication costs of item allocation Rauenbusch, Timothy W.; Grosz, Barbara; Shieber, Stuart Multiagent systems require techniques for effectively allocating resources or tasks among agents in a group. Auctions are one method for structuring communication of agents’ private values for the resource or task to a central decision maker. Different auction methods vary in their communication requirements. This paper makes three contributions to the understanding of the types of group decision making for which auctions are appropriate methods. First, it shows that entropy is the best measure of communication bandwidth used by an auction in messages bidders send and receive. Second, it presents a method for measuring bandwidth usage; the dialogue trees used for this computation are a new and compact representation of the probability distribution of every possible dialogue between two agents. Third, it presents new guidelines for choosing the best auction, guidelines which differ significantly from recommendations in prior work. The new guidelines are based on detailed analysis of the communication requirements of Sealed-bid, Dutch, Staged, Japanese, and Bisection auctions. In contradistinction to previous work, the guidelines show that the auction that minimizes bandwidth depends on both the number of bidders and the sample space from which bidders’ valuations are drawn.
Using restriction to extend parsing algorithms for complex-feature-based formalisms Shieber, Stuart Grammar formalisms based on the encoding of grammatical information in complex-valued feature systems enjoy some currency both in linguistics and natural-language-processing research. Such formalisms can be thought of by analogy to context-free grammars as generalizing the notion of nonterminal symbol from a finite domain of atomic elements to a possibly infinite domain of directed graph structures of a certain sort. Unfortunately, in moving to an infinite nonterminal domain, standard methods of parsing may no longer be applicable to the formalism. Typically, the problem manifests itself as gross inefficiency or even nontermination of the algorithms. In this paper, we discuss a solution to the problem of extending parsing algorithms to formalisms with possibly infinite nonterminal domains, a solution based on a general technique we call restriction. As a particular example of such an extension, we present a complete, correct, terminating extension of Earley's algorithm that uses restriction to perform top-down filtering. Our implementation of this algorithm demonstrates the drastic elimination of chart edges that can be achieved by this technique. Finally, we describe further uses for the technique: parsing other grammar formalisms, including definite-clause grammars; extending other parsing algorithms, including LR methods and syntactic preference modeling algorithms; and efficient indexing.
Translating English into logical form Rosenschein, Stanley J.; Shieber, Stuart A scheme for syntax-directed translation that mirrors compositional model-theoretic semantics is discussed. The scheme is the basis for an English translation system called PATR and was used to specify a semantically interesting fragment of English, including such constructs as tense, aspect, modals, and various lexically controlled verb complement structures. PATR was embedded in a question-answering system that replied appropriately to questions requiring the computation of logical entailments.
A general cartographic labeling algorithm Edmondson, Shawn; Shieber, Stuart; Christensen, Jon; Marks, Joe Some apparently powerful algorithms for automatic label placement on maps use heuristics that capture considerable cartographic expertise but are hampered by provably inefficient methods of search and optimization. On the other hand, no approach to label placement that is based on an efficient optimization technique has been applied to the production of general cartographic maps - those with labeled point, line, and area features - and shown to generate labelings of acceptable quality. We present an algorithm for label placement that achieves the twin goals of practical efficiency and high labeling quality by combining simple cartographic heuristics with effective stochastic optimization techniques.
A semantic-head-driven generation algorithm for unification-based formalisms Pereira, Fernando C. N.; Moore, Robert C.; van Noord, Gertjan; Shieber, Stuart We present an algorithm for generating strings from logical form encodings that improves upon previous algorithms in that it places fewer restrictions on the class of grammars to which it is applicable. In particular, unlike an Earley deduction generator (Shieber, 1988), it allows use of semantically nonmonotonic grammars, yet unlike top-down methods, it also permits left-recursion. The enabling design feature of the algorithm is its implicit traversal of the analysis tree for the string being generated in a semantic-head-driven fashion.
An empirical study of algorithms for point feature label placement Christensen, Jon; Shieber, Stuart; Marks, Joe A major factor affecting the clarity of graphical displays that include text labels is the degree to which labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram so as to maximize legibility. This problem occurs frequently in the production of many types of informational graphics, though it arises most often in automated cartography. In this paper we present a comprehensive treatment of the PFLP problem, viewed as a type of combinatorial optimization problem. Complexity analysis reveals that the basic PFLP problem and most interesting variants of it are NP-hard. These negative results help inform a survey of previously reported algorithms for PFLP; not surprisingly, all such algorithms either have exponential time complexity or are incomplete. To solve the PFLP problem in practice, then, we must rely on good heuristic methods. We propose two new methods, one based on a discrete form of gradient descent, the other on simulated annealing, and report on a series of empirical tests comparing these and the other known algorithms for the problem. Based on this study, the first to be conducted, we identify the best approaches as a function of available computation time.
Lessons from a restricted Turing test Shieber, Stuart We report on the recent Loebner prize competition inspired by Turing's test of intelligent behavior. The presentation covers the structure of the competition and the outcome of its first instantiation in an actual event, and an analysis of the purpose, design, and appropriateness of such a competition. We argue that the competition has no clear purpose, that its design prevents any useful outcome, and that such a competition is inappropriate given the current level of technology. We then speculate as to suitable alternatives to the Loebner prize.
The problem of logical-form equivalence Shieber, Stuart
An Alternative Conception of Tree-Adjoining Derivation Shieber, Stuart; Schabes, Yves The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable through a definition of TAG derivations as equivalence classes of ordered derivation trees, and computationally operational, by virtue of a compilation to linear indexed grammars together with an efficient algorithm for recognition and parsing according to the compiled grammar.
Variations on incremental interpretation Johnson, Mark; Shieber, Stuart The strict competence hypothesis has sparked a small dialogue among several researchers attempting to understand its ramifications for human sentence processing and incremental interpretation in particular. In this paper, we review the dialogue, reconstructing the arguments in an attempt to make them more uniform and crisp, and provide our own analyses of certain of the issues that arise. We argue that strict competence, because it requires a synchronous computation mechanism, may actually lead to more complex, rather than simpler, models of incremental interpretation. Asynchronous computation, which is arguably both psychologically more plausible and conceptually more basic, allows for incremental interpretation to fall out naturally, without additional machinery for interpreting partial constituents. We show that this is true regardless of whether the presumed interpretation mechanism is top-down or bottom-up, contra previous conclusions in the literature, and propose a particular implementation of some of these ideas using a novel representation based on tree-adjoining grammars. The research in this paper was supported in part by grant IRI-9157996 from the National Science Foundation to the first author. The authors would like to thank Fernando Pereira, Edward Stabler, and Mark Steedman for discussions on the topic of this paper and for their comments on previous drafts.
A call for collaborative interfaces Shieber, Stuart In this note, I call for a move towards viewing interfaces as means for people and computers to collaborate on solving problems rather than means for people to control computers. This collaborative perspective on user interfaces can apply quite broadly, and not only provides a source for novel interface techniques but serves as a good tool for analyzing existing interfaces. The view affects thinking on interfaces primarily by motivating a different split in the roles and responsibilities of the two participants in problem-solving, the computer and the user.
Automating the layout of network diagrams with specified visual organization. Kosak, Corey; Shieber, Stuart; Marks, Joe Network diagrams are a familiar graphic form that can express many different kinds of information. The problem of automating network-diagram layout has therefore received much attention. Previous research on network-diagram layout has focused on the problem of aesthetically optimal layout, using such criteria as the number of link crossings, the sum of all link lengths, and total diagram area. In this paper the authors propose a restatement of the network-diagram layout problem in which layout-aesthetic concerns are subordinated to perceptual-organization concerns. The authors present a notation for describing the visual organization of a network diagram. This notation is used in reformulating the layout task as a constrained-optimization problem in which constraints are derived from a visual-organization specification and optimality criteria are derived from layout-aesthetic considerations. Two new heuristic algorithms are presented for this version of the layout problem: one algorithm uses a rule-based strategy for computing a layout; the other is a massively parallel genetic algorithm. The authors demonstrate the capabilities of the two algorithms by testing them on a variety of network-diagram layout problems.
Restricting the weak-generative capacity of synchronous tree-adjoining grammars. Shieber, Stuart The formalism of synchronous tree-adjoining grammars, a variant of standard tree-adjoining grammars (TAG), was intended to allow the use of TAGs for language transduction in addition to language specification. In previous work, the definition of the transduction relation defined by a synchronous TAG was given by appeal to an iterative rewriting process. The rewriting definition of derivation is problematic in that it greatly extends the expressivity of the formalism and makes the design of parsing algorithms difficult if not impossible. We introduce a simple, natural definition of synchronous tree-adjoining derivation, based on isomorphisms between standard tree-adjoining derivations, that avoids the expressivity and implementability problems of the original rewriting definition. The decrease in expressivity, which would otherwise make the method unusable, is offset by the incorporation of an alternative definition of standard tree-adjoining derivation, previously proposed for completely separate reasons, thereby making it practical to entertain using the natural definition of synchronous derivation. Nonetheless, some remaining problematic cases call for yet more flexibility in the definition; the isomorphism requirement may have to be relaxed. It remains for future research to tune the exact requirements on the allowable mappings.
Predicting individual book use for off-site storage using decision trees Shieber, Stuart; Silverstein, Craig We explore various methods for predicting library book use, as measured by circulation records. Accurate prediction is invaluable when choosing titles to be stored in an off-site location. Previous researchers in this area concluded that past use information provides by far the most reliable predictor of future use. Because of the computerization of library data, it is now possible not only to reproduce these earlier experiments with a more substantial data set, but also to compare their algorithms with more sophisticated decision methods. We have found that while previous use is indeed an excellent predictor of future use, it can be improved upon by combining previous use information with bibliographic information in a technique that can be customized for individual collections. This has immediate application for libraries that are short on storage space and wish to identify low-demand titles to move to remote storage. For instance, simulations show that the best prediction method we develop, when used as the off-site storage selection method for the Harvard College Library, would have generated only a fifth as many off-site accesses as compared to a method based on previous use.
Easily searched encodings for number partitioning Shieber, Stuart; Marks, Joe; Ngo, J. Thomas; Ruml, Wheeler Can stochastic search algorithms outperform existing deterministic heuristics for the NP-hard problem Number Partitioning if given a sufficient, but practically realizable amount of time? In a thorough empirical investigation using a straightforward implementation of one such algorithm, simulated annealing, Johnson et al. (Ref. 1) concluded tentatively that the answer is negative. In this paper, we show that the answer can be positive if attention is devoted to the issue of problem representation (encoding). We present results from empirical tests of several encodings of Number Partitioning with problem instances consisting of multiple-precision integers drawn from a uniform probability distribution. With these instances and with an appropriate choice of representation, stochastic and deterministic searches can—routinely and in a practical amount of time—find solutions several orders of magnitude better than those constructed by the best heuristic known (Ref. 2), which does not employ searching. We thank David S. Johnson of AT&T Bell Labs for generously and promptly sharing his test instances. For stimulating discussions, we thank members of the Harvard Animation/Optimization Group (especially Jon Christensen), the Computer Science Department at the University of New Mexico, the Santa Fe Institute, and the Berkeley CAD Group. The anonymous referees made numerous constructive suggestions. We thank Rebecca Hayes for comments concerning the figures. The second author is grateful for a Graduate Fellowship from the Fannie and John Hertz Foundation. We thank the Free Software Foundation for making the GNU Multiple Precision package available. The research described in this paper was conducted mostly while the third author was at Digital Equipment Corporation Cambridge Research Lab. 
This work was supported in part by the National Science Foundation, principally under Grants IRI-9157996 and IRI-9350192 to the fourth author, and by matching grants from Digital Equipment Corporation and Xerox Corporation.
Principles and implementation of deductive parsing Shieber, Stuart; Pereira, Fernando C. N.; Schabes, Yves We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars.
Interactions of scope and ellipsis Pereira, Fernando C. N.; Dalrymple, Mary; Shieber, Stuart Systematic semantic ambiguities result from the interaction of the two operations that are involved in resolving ellipsis in the presence of scoping elements such as quantifiers and intensional operators: scope determination for the scoping elements and resolution of the elided relation. A variety of problematic examples previously noted - by Sag, Hirschbühler, Gawron and Peters, Harper, and others - all have to do with such interactions. In previous work, we showed how ellipsis resolution can be stated and solved in equational terms. Furthermore, this equational analysis of ellipsis provides a uniform framework in which interactions between ellipsis resolution and scope determination can be captured. As a consequence, an account of the problematic examples follows directly from the equational method. The goal of this paper is merely to point out this pleasant aspect of the equational analysis, through its application to these cases. No new analytical methods or associated formalism are presented, with the exception of a straightforward extension of the equational method to intensional logic.
Anaphoric dependencies in ellipsis Shieber, Stuart; Kehler, Andrew
Automatic yellow-pages pagination and layout Marks, Joe; Shieber, Stuart; Johari, Ramesh; Partovi, Ali The compact and harmonious layout of ads and text is a fundamental and costly step in the production of commercial telephone directories (“Yellow Pages”). We formulate a canonical version of Yellow-Pages pagination and layout (YPPL) as an optimization problem in which the task is to position ads and text-stream segments on sequential pages so as to minimize total page length and maximize certain layout aesthetics, subject to constraints derived from page-format requirements and positional relations between ads and text. We present a heuristic-search approach to the YPPL problem. Our algorithm has been applied to a sample of real telephone-directory data, and produces solutions that are significantly shorter and better than the published ones.
Machine learning theory and practice as a source of insight into universal grammar. Shieber, Stuart; Lappin, Shalom In this paper, we explore the possibility that machine learning approaches to natural-language processing being developed in engineering-oriented computational linguistics may be able to provide specific scientific insights into the nature of human language. We argue that, in principle, machine learning results could inform basic debates about language, in one area at least, and that in practice, existing results may offer initial tentative support for this prospect. Further, results from computational learning theory can inform arguments carried on within linguistic theory as well.
Practical secrecy-preserving, verifiably correct and trustworthy auctions Thorpe, Christopher; Shieber, Stuart; Rabin, Michael; Parkes, David We present a practical protocol based on homomorphic cryptography for conducting provably fair sealed-bid auctions. The system preserves the secrecy of the bids, even after the announcement of auction results, while also providing for public verifiability of the correctness and trustworthiness of the outcome. No party, including the auctioneer, receives any information about bids before the auction closes, and no bidder is able to change or repudiate any bid. The system is illustrated through application to first-price, uniform-price and second-price auctions, including multi-item auctions. Empirical results based on an analysis of a prototype demonstrate the practicality of our protocol for real-world applications.
Representation in stochastic search for phylogenetic tree reconstruction Shieber, Stuart; Ohno-Machado, Lucila; Weber, Griffin Phylogenetic tree reconstruction is a process in which the ancestral relationships among a group of organisms are inferred from their DNA sequences. For all but trivial sized data sets, finding the optimal tree is computationally intractable. Many heuristic algorithms exist, but the branch-swapping algorithm used in the software package PAUP is the most popular. This method performs a stochastic search over the space of trees, using a branch-swapping operation to construct neighboring trees in the search space. This study introduces a new stochastic search algorithm that operates over an alternative representation of trees, namely as permutations of taxa giving the order in which they are processed during stepwise addition. Experiments on several data sets suggest that this algorithm for generating an initial tree, when followed by branch-swapping, can produce better trees for a given total amount of time.
Direct parsing of ID/LP grammars Shieber, Stuart The Immediate Dominance/Linear Precedence (ID/LP) formalism is a recent extension of Generalized Phrase Structure Grammar (GPSG) designed to perform some of the tasks previously assigned to metarules, for example, modeling the word-order characteristics of so-called free-word-order languages. It allows a simple specification of classes of rules that differ only in constituent order. ID/LP grammars (as well as metarule grammars) have been proposed for use in parsing by expanding them into equivalent context-free grammars. We develop a parsing algorithm, based on the algorithm of Earley, for parsing ID/LP grammars directly, circumventing the initial expansion phase. A proof of correctness is supplied. We also discuss some aspects of the time complexity of the algorithm and some formal properties associated with ID/LP grammars and their relationship to context-free grammars.
Abbreviated text input using language modeling. Shieber, Stuart; Nelken, Rani We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on mobile computing devices or by disabled users), by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system’s predictions. We propose taking advantage of the duality between prediction and compression. We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently. We decode the abbreviated text using a statistical generative model of abbreviation, with a residual word error rate of 3.3%. The chief component of this model is an n-gram language model. Because the system’s operation is completely independent from the user’s, the overhead from cognitive task switching and attending to the system’s actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not. We report the results of a user study evaluating this method.
The Turing test as interactive proof Shieber, Stuart In 1950, Alan Turing proposed his eponymous test based on indistinguishability of verbal behavior as a replacement for the question "Can machines think?" Since then, two mutually contradictory but well-founded attitudes towards the Turing Test have arisen in the philosophical literature. On the one hand is the attitude that has become philosophical conventional wisdom, viz., that the Turing Test is hopelessly flawed as a sufficient condition for intelligence, while on the other hand is the overwhelming sense that were a machine to pass a real live full-fledged Turing Test, it would be a sign of nothing but our orneriness to deny it the attribution of intelligence. The arguments against the sufficiency of the Turing Test for determining intelligence rely on showing that some extra conditions are logically necessary for intelligence beyond the behavioral properties exhibited by an agent under a Turing Test. Therefore, it cannot follow logically from passing a Turing Test that the agent is intelligent. I argue that these extra conditions can be revealed by the Turing Test, so long as we allow a very slight weakening of the criterion from one of logical proof to one of statistical proof under weak realizability assumptions. The argument depends on the notion of interactive proof developed in theoretical computer science, along with some simple physical facts that constrain the information capacity of agents. Crucially, the weakening is so slight as to make no conceivable difference from a practical standpoint. Thus, the Gordian knot between the two opposing views of the sufficiency of the Turing Test can be cut.
Generation and synchronous tree-adjoining grammars Shieber, Stuart; Schabes, Yves Tree-adjoining grammars (TAG) have been proposed as a formalism for generation based on the intuition that the extended domain of syntactic locality that TAGs provide should aid in localizing semantic dependencies as well, in turn serving as an aid to generation from semantic representations. We demonstrate that this intuition can be made concrete by using the formalism of synchronous tree-adjoining grammars. The use of synchronous TAGs for generation provides solutions to several problems with previous approaches to TAG generation. Furthermore, the semantic monotonicity requirement previously advocated for generation grammars as a computational aid is seen to be an inherent property of synchronous TAGs.
Ellipsis and higher-order unification Pereira, Fernando C. N.; Dalrymple, Mary; Shieber, Stuart We present a new method for characterizing the interpretive possibilities generated by elliptical constructions in natural language. Unlike previous analyses, which postulate ambiguity of interpretation or derivation in the full clause source of the ellipsis, our analysis requires no such hidden ambiguity. Further, the analysis follows relatively directly from an abstract statement of the ellipsis interpretation problem. It predicts correctly a wide range of interactions between ellipsis and other semantic phenomena such as quantifier scope and bound anaphora. Finally, although the analysis itself is stated nonprocedurally, it admits of a direct computational method for generating interpretations.
Semantic-head-driven generation Moore, Robert C.; Pereira, Fernando C. N.; van Noord, Gertjan; Shieber, Stuart We present an algorithm for generating strings from logical form encodings that improves upon previous algorithms in that it places fewer restrictions on the class of grammars to which it is applicable. In particular, unlike a previous bottom-up generator, it allows use of semantically nonmonotonic grammars, yet unlike top-down methods, it also permits left-recursion. The enabling design feature of the algorithm is its implicit traversal of the analysis tree for the string being generated in a semantic-head-driven fashion.
An algorithm for generating quantifier scopings Hobbs, Jerry; Shieber, Stuart The syntactic structure of a sentence often manifests quite clearly the predicate-argument structure and relations of grammatical subordination. But scope dependencies are not so transparent. As a result, many systems for representing the semantics of sentences have ignored scoping or generated scoping mechanisms that have often been inexplicit as to the range of scopings they choose among or profligate in the scopings they allow. In this paper, we present an algorithm, along with proofs of some of its important properties, that generates scoped semantic forms from unscoped expressions encoding predicate-argument structure. The algorithm is not profligate as are those based on permutation of quantifiers, and it can provide a solid foundation for computational solutions where completeness is sacrificed for efficiency and heuristic efficacy.
Evidence against the context-freeness of natural language Shieber, Stuart
Synchronous grammars as tree transducers Shieber, Stuart Tree transducer formalisms were developed in the formal language theory community as generalizations of finite-state transducers from strings to trees. Independently, synchronous tree-substitution and -adjoining grammars arose in the computational linguistics community as a means to augment strictly syntactic formalisms to provide for parallel semantics. We present the first synthesis of these two independently developed approaches to specifying tree relations, unifying their respective literatures for the first time, by using the framework of bimorphisms as the generalizing formalism in which all can be embedded. The central result is that synchronous tree-substitution grammars are equivalent to bimorphisms where the component homomorphisms are linear and complete.