| …going after plagiarists on legal grounds… “Judge Coco Declares Ang Out of Line!” image by flickr user Coco Mault used by permission. |
One of the services that journal publishers claim to provide on behalf of authors is legal support in the case that their work has been plagiarized, and they sometimes cite this as one of the reasons that they require a transfer of rights for publication of articles.
Here’s a recent example of the claim, forwarded to me by a Harvard author of a paper accepted for publication in a Wiley journal.[1] The article falls under Harvard’s FAS open-access policy, by virtue of which the university held a nonexclusive license in the article. The author chose to inform the journal of this license by attaching the Harvard addendum to Wiley’s publication agreement. Wiley’s emailed response to her included this explanation:
You recently had a paper accepted for publication in [journal name] and signed an exclusive license form to which you attached an addendum from Harvard University. Unfortunately, we are unable to accept this addendum, as it conflicts with the rights of the copyright holder (in this case, the [society on whose behalf Wiley publishes the journal]). They guarantee the same rights that our copyright forms guarantee, but Harvard University, unlike Wiley, offers no support if your article is plagiarized or otherwise reused illegally.
(It then went on to list various rights that Wiley grants back to authors of articles in this journal, such as posting manuscripts in repositories, all of which are laudable, though it remained silent on Wiley’s required 12-month distribution embargo.)
I have no problem with publishers requiring waivers of the Harvard open-access policy as a condition of publishing in their journals. They are their journals after all. And in the event, only a small proportion of articles, in the low single digits, end up needing waivers. But I bristle at the transparently disingenuous argumentation for their requirement. They make two separate arguments.
The addendum “conflicts with the rights of the copyright holder”, the society on whose behalf Wiley publishes the journal.
Wait, what? How is the society the copyright holder? Until the author signs a publication agreement, the author is the copyright holder. And the publication agreement itself doesn’t involve a transfer of copyright, but rather, an exclusive license to Wiley on behalf of the society. And anyway, whether it’s an exclusive license or a wholesale transfer of copyright, that doesn’t conflict with the addendum by virtue of the plain words in the addendum: “Notwithstanding any terms in the Publication Agreement to the contrary, Author and Publisher agree as follows:…”.
If the addendum were allowed, “Harvard University, unlike Wiley, offers no support if your article is plagiarized or otherwise reused illegally.”
Suppose that were true. (Though how would Wiley know what support Harvard gives its faculty when their work is plagiarized or used illegally?) Why would that be an issue? Nothing stops Wiley from providing that support on behalf of its authors, with or without the addendum. Either way it still receives an exclusive license from the author. Others illegally using the work are still violating that exclusive license.
Unless of course the violator received a license by virtue of the prior nonexclusive license to the University mentioned in the addendum. Could that happen? Nope. The university uses its license only to authorize a particular set of uses. You can read them at the DASH Terms of Use page (see the “Open Access Policy Articles” section). They do not include plagiarism as a permitted use, or any illegal uses. (Is it even necessary to point that out?) The university also grants the authors themselves the ability to exercise rights in their article. But if someone explicitly received and exercised rights from a rights-holding author, it’s hard to see how that’s an illegal use.
More fundamentally, however, there’s a basic premise that underlies this talk of publishers requiring exclusive rights in order to weed out and prosecute plagiarism, namely, that publishers would not be able to do so if they didn’t acquire exhaustive exclusive rights. But there’s no legal basis to such a premise that I can imagine.
Plagiarism per se is not a rights matter at all, but a violation of the professional conduct expected of scholars. Pursuing plagiarists is a matter of calling their behavior out for what it is, with the concomitant professional opprobrium and dishonor that such behavior deserves. Publishers should feel free to help with that social process; they don’t need any rights to do so.
Being “otherwise used illegally” gets more to the heart of the matter, as rights violations are presumably what the publisher has in mind. But it’s hard to see how publishers would need any rights themselves just to help an author out in prosecuting a rights violation. Suppose a publisher, rather than acquiring exclusive rights in an article, instead had authors license their articles under a CC-BY license. The publisher could still weed out and prosecute illegal uses of the article. There would be fewer opportunities for illegal use, since CC-BY allows lots of salutary kinds of use and reuse, subject only to proper attribution to the author and journal. But illegal uses might still arise from violating the attribution requirement of the CC-BY declaration. Nothing stops the publisher from looking for such gross plagiarisms of the articles they publish that rise to the level of rights violation, and from prosecuting the plagiarists on behalf of the authors. They could even write that into their agreements: “The Author grants the Publisher permission to prosecute violations of this license on the Author’s behalf, etc.”[2]
(As an aside, the offer to prosecute plagiarists and rights violators isn’t much of a benefit in practice. How many instances of publishers going after plagiarists on legal grounds based on the publisher’s holding of rights have there ever been? As Jake said, “Isn’t it pretty to think so?”)
What’s really going on here is not a mystery. The publisher doesn’t like the idea of the author distributing copies of her work. The primary difference between the rights the publisher wants to grant the author and the rights specified in the open-access policy is that the former stipulates that the author not distribute copies of her article for twelve months after publication. The publisher is objecting so as to force a waiver of the open-access policy license to preserve their ability to limit access to the article. Of course, saying “we won’t accept the addendum because we want to limit people reading your article” doesn’t sound nearly as good as “otherwise we couldn’t go after plagiarists”.
Publishers are welcome to require waivers of Harvard’s open-access policies and the similar policies at other institutions, but hiding behind faux arguments in their explanations to authors isn’t attractive. They should come clean on the reasoning: They think it harms their business model.
[1] There’s a long history of this kind of thing. For instance, Peter Suber addressed the issue as raised by the International Association of Scientific, Technical and Medical Publishers way back in 2007.
[2] Not to mention the fact that open accessibility of articles makes plagiarism easier to detect, and therefore provides a disincentive to plagiarize in the first place. For example, researchers at arXiv have reported on experiments for automatically finding cases of plagiarism in its open-access collection. Services like the Open Access Plagiarism Search project sponsored by the German Research Foundation (DFG) are working to make good on this potential.
| …where I should go… “Directions” image by flickr user Peat Bakke used by permission. |
After a several-year postponement while I established and led Harvard’s nascent library skunkworks, the Office for Scholarly Communication, my request for a sabbatical leave has been approved for the 2013–14 academic year. I hope to work on a variety of projects related to library and publishing issues, computational-linguistics–related projects, and teaching preparation.
My plans are to remain in the Boston area during the year, but with perhaps several short external residencies at other institutions that might be interested in having me around for a couple of weeks. If your institution might be such a place (willing to provide an office, a place to crash, maybe a plane ticket), please let me know.
[This is a transient post.]
| …White House… “White House” image by flickr user Trevor McGoldrick. |
As has been widely reported, this past Friday the White House directed essentially all federal funding agencies to develop open access policies over the next few months. I wrote the letter below to be forwarded to faculty at the Harvard schools with open-access policies, to inform them of this important new directive and its relation to the existing Harvard policies.
To: Harvard faculty at schools with open-access policies[1]
From: Stuart Shieber, faculty director, Harvard Office for Scholarly Communication
I write to you with three pieces of good news.
First, the White House on Friday released a new policy memorandum expanding public access to the results of federally funded research. This new policy follows on from two broad-based national policy forums in 2010 and 2012 organized by the White House Office of Science and Technology Policy. It also serves as the response to over 65,000 petitioners to the White House “We the People” site supporting open access. The policy directs essentially all federal funding agencies to develop policies along the lines of the successful public access policy already in place at the National Institutes of Health, guaranteeing that articles based on federally funded research are made freely available to the public within one year of publication. The first piece of good news is that research results will now be much more broadly available and have greater impact.
Public access policies like the NIH’s, which will soon be in place in all major federal funding agencies, come with a responsibility: Researchers must retain sufficient rights to comply with the public access requirement. The second piece of good news is that because your school has an open-access policy voted by the faculty, you already are automatically retaining sufficient rights to comply with the government’s public access requirements. Unless you opt out of rights retention by expressly directing that a waiver of the open-access policy license be granted, the school’s open-access policy has the effect that you retain broad rights in your articles, sufficient to distribute them in compliance with the NIH policy, as well as any new policies that arise from the White House directive or from the bipartisan Fair Access to Science and Technology Research (FASTR) Act that was just introduced in both houses of Congress. Faculty at most Harvard schools are thus exceptionally well placed to take the broadest advantage of the new White House and Congressional initiatives in making our scholarship openly accessible.
The third piece of good news is that there is no need to wait for the White House policy to effect change in funding agency policies. By virtue of Harvard’s open-access policies, you can already provide for open distribution of your articles. Indeed, part of your school’s open access policy is a commitment by all faculty to do just that: provide copies of your final manuscripts to be placed into the DASH repository. The Office for Scholarly Communication stands ready to help with the process. All you need to do is forward us your articles through our “quick submit” form. Your articles can then join the over 9,000 other articles in the repository that have been distributed well over a million times to grateful readers from every continent on earth.
Please do not hesitate to contact us if we can help in any way in broadening access to your scholarship.
[This is a heavily edited transcript of a talk that I gave on January 3, 2013, at a panel on open access at the 87th Annual Meeting of the Linguistic Society of America (LSA, the main scholarly society for linguistics, and publisher of the journal Language), co-sponsored by the Modern Language Association (MLA).]
Thank you for this opportunity to join the others on this panel in talking about open access. I will concentrate in particular on the relationship between open access and the future of scholarly societies. I’m thinking in particular of small to medium scholarly societies, which have small publishing programs that are often central to the solvency of the societies and to their ability to do the important work that they do. In one sense it should be obvious, and I think it’s been made obvious by the previous speakers, that open access meshes well with the missions of scholarly societies. LSA’s mission, for instance, is “to advance the scientific study of language. LSA plays a critical role in supporting and disseminating linguistic scholarship both to professional linguists and to the general public.” [Emphasis added.] So I’ll just assume the societal benefit of open access to researchers and to the general public alike. For the purpose of conversation let’s just take that as given.
Nonetheless, many scholarly societies, and the faculty that support them, are worried that open access – at least as they understand the concept – could exacerbate the serious financial distress that many of those societies are already under, and even undermine their very existence and thereby their ability to carry out this mission. I’ve heard faculty worry that even “green” open access (self-archiving of articles in open-access repositories) could undermine the economics of journal publishing in such a way that their scholarly societies could be endangered.
I want to argue that there is a real threat that many scholarly societies accurately perceive in their publishing programs, but that we must be careful not to misdiagnose this problem. In fact, a general move to open access would be the best outcome for scholarly society publishers. If the entirety of journal publishing magically metamorphosed somehow to an open-access system, scholarly society publishers would be much better off. From a strategic point of view then, the best course of action for scholarly societies and for the faculty and researchers who support them would be to promote a shift to open access as widely and as quickly as possible. Now, the threat that societies perceive is an economic threat, so my remarks will be almost entirely economic in nature; I’m just warning you. My talk certainly will have no linguistic import at all.
Let me turn first to some basic truths about the subscription journal market that I’ve come to realize are important in understanding what the underlying economic issues are.
The first is that different journals — viewed as products, as goods being sold — are in economists’ terms complements, not substitutes. Substitute goods are products like Coke and Pepsi. If you have one it decreases the value of the other to you, as they fulfill similar functions. Complements are products like a left shoe and a right shoe – that’s the most extreme case. If you have one it increases the value to you of the other. There are less extreme cases of economic complements – printers and toner cartridges, peanut butter and jelly, pencils and erasers.
What about scholarly journals? Suppose you’re a patron of a library that subscribes to a bundle of, let’s say, Elsevier journals, including the journal Lingua. Does the library subscription to that journal make you more or less interested in reading, say, Language? (We’re holding cost aside. When thinking about complements or substitutes, it’s just about the value to the consumer, not the cost.) Of course, you’re not less inclined to read Language just because the library subscribes to Lingua. In fact you may be more inclined, because some Lingua articles will cite Language articles. You read the Lingua article, you want to read the Language article it cites. So that would lead you to track down those articles and read them if the library had a subscription. And vice versa: a subscription to Language can increase the value of a subscription to Lingua. So journals are economic complements, not substitutes.
This has important ramifications. Non-substitutive goods don’t compete against each other and complementary goods in fact support each other in the market. If consumers suddenly buy a lot more Coke, Pepsi is worried. But if peanut butter sales skyrocket, the jelly manufacturers are elated. So the complementary nature of individual journals means that there’s limited market competition between journals, and limited competition leads to inefficiency in the journal market. (That’s not to say that there isn’t competition between publishers. But as we’ll see, the primary form of that competition is in competing to acquire journals.)
| Figure 1: Average journal prices in a range of fields, differentiated by commercial and non-profit publishers. Left is based on prices as dollars per page. Right is based on dollars per citation, to normalize for quality. Data are from Bergstrom and Bergstrom, Journal pricing across disciplines, 2002. |
We can see ample evidence of this kind of inefficiency. One clear form of evidence for inefficiency is wide price disparities. The graph in Figure 1 shows average journal prices in a whole range of fields. The data is from a study by Bergstrom and Bergstrom, and they differentiated the cost of the journals by whether the publisher is commercial or non-profit. The dark blue represents the commercial publishers, the light blue the nonprofit publishers. Notice that the commercial publishers on average charge about five times more for their journals than non-profit publishers, as measured by price per page. Now you might think there is a good explanation for this disparity: perhaps these aren’t comparable products. Perhaps the commercial publishers are selling a much better product, higher quality journals, and it’s therefore more expensive to develop them, and that’s what accounts for the price differential. So we can normalize for that by using a proxy for quality. One widely used proxy for quality, admittedly not a great one but at least widely touted by journals themselves through the ubiquitous “Impact Factor”, is the number of citations the journal receives. The second graph in Figure 1 thus shows price per citation. Measured this way, commercial journals are fifteen times more expensive than non-profit journals from the same field. Now, linguistics was not one of the fields examined in this study. But the same holds true here as well. For example, the subscription rate for LSA’s journal Language, published by a non-profit of course, is $3.31 per citation, whereas Elsevier’s Lingua is $32.30 per citation – almost exactly ten times more expensive.
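The Language versus Lingua comparison is simple arithmetic on the two per-citation figures quoted above. A minimal sketch in Python, where the function and constant names are only illustrative:

```python
def price_ratio(expensive: float, cheap: float) -> float:
    """Return how many times larger one per-citation price is than another."""
    return expensive / cheap


# Per-citation subscription prices as quoted in the talk (US dollars).
LANGUAGE_PER_CITATION = 3.31   # LSA's Language (non-profit)
LINGUA_PER_CITATION = 32.30    # Elsevier's Lingua (commercial)

ratio = price_ratio(LINGUA_PER_CITATION, LANGUAGE_PER_CITATION)
print(f"Lingua costs about {ratio:.1f} times as much per citation as Language")
```

The result comes out just under ten, which is the "almost exactly ten times" figure in the text.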
This kind of price differential is a clear sign of market failure, especially as it has been sustained over decades. You just do not get this kind of price disparity preserved over long periods of time in well functioning markets. Go to the various grocery stores in your neighborhood and see if you can find apples at different grocery stores at a price differential of a factor of ten. It does not happen. Such price disparities are a clear sign of inefficiency.
| Figure 2: Elsevier revenues, profit, and profit margin, 2002–2011. Data are from Mike Taylor, The obscene profits of commercial scholarly publishers, 2012. |
The second basic truth is that the good being sold in the subscription market is access, and access is a monopolistic good. The monopoly is enabled by copyright, founded in the government’s ability as codified in Article I Section 8 of the Constitution to provide an exclusive right to the creator of a work for a limited period of time. Subscription publishers acquire exclusive rights to the articles they publish — typically by acquiring copyright, sometimes by acquiring an exclusive license, which is a distinction without a difference — and this allows publishers in theory and in many cases in practice to extract monopoly rents in selling access to the articles. We see evidence of this as well. For example, in Figure 2, I show a graph of the revenues, profits, and most importantly the profit margin, for the publisher Elsevier over the last decade. It’s quite a good business with annual revenues of over $2 billion, but that’s not the big point. The big point is the extraordinary 35–40% profit margins. It’s not just Elsevier. Many large commercial publishers have maintained these kinds of profit margins over a long period of time. An interesting thing to look at is the steady increase in the margins even during the financial crisis starting in 2009 when, for instance, many university endowments and library budgets dropped precipitously. Harvard’s endowment went down by 30% but Elsevier did just fine, and the other large publishers as well. So maintaining those kinds of profit margins again is a sign of the ability to extract monopoly rents.
The third basic truth is that pricing is controlled not at the level of the individual journal but at the level of a bundle of journals. The large publishers have portfolios of hundreds to thousands of journals. They can therefore apply prices to a bundle of journals, not a single journal. They can show vastly different prices to different buyers and use the bundles to incentivize buyers, the libraries, to pay larger fees. The upshot of this point, that pricing happens at the bundle level and not the journal level, is that a library can find it extremely difficult to control its expenditures by canceling individual journals because the publisher can just price the smaller bundle at essentially the same cost as the larger bundle.
I’ll tell you a personal story. Some years ago, Harvard was one of the first universities to cancel the “big deal” with Elsevier. I don’t want to pick on Elsevier. They’re not bad people. They’re a wonderful group of folks. Lots of the large publishers of journals work this way and it’s not because they’re evil or anything like that. I just mention the Elsevier case as a convenient story. Harvard was one of the first universities to cancel its “big deal” and went a la carte on the journals. In the School of Engineering and Applied Sciences, my own school, we had been subscribing to around 130 Elsevier journals in engineering and applied sciences as I recall. We took the opportunity to cancel about 100 of these journals, leaving something like 30 journals, hoping to recoup some costs. And we did. The first year we recouped about 20%. The following year the total cost was back where it had been before the cancellations, and it has increased steadily from there. From the library’s point of view, you can’t win by canceling journals, because the product is not the journal, it’s the bundle.
Edlin and Rubinfeld, in a Law Review article about possible anti-trust implications of this bundling, say “The immediate effect of [bundled pricing] has been to move competition from individual journals to large bundles of journals. … Creating a large bundle of journals to compete with Elsevier or Kluwer seems almost insurmountable. … There are indications that [bundled pricing] is hindering entry. Librarians … say that they would spend more money for journals from smaller and alternative publishers if they could achieve proportionate savings from reductions. By selling electronic bundles, publishers have erected a strategic barrier to entry at just the time that the electronic publishing possibility has made it increasingly possible for alternative publishers to overcome the existing structural barriers.” The fact that competition is at the level of bundles, not at the level of journals, is very important.
| Figure 3: Scholarly journal expenditures percentage increase 1986–2010 compared to consumer price index. Data from Association of Research Libraries. |
When we put all these properties of the journal market together, the end result is market dysfunction and a steady long-term hyperinflation in journal expenditures by libraries. Figure 3 shows a graph of serials expenditures over the last couple of decades, the dark blue line. The light blue line is the consumer price index, a proxy for the ambient rate of inflation. You can see that serials expenditures in research libraries have been going up at something like three times the rate of inflation for decades. Exponential real growth in the cost of journals is an unsustainable state of affairs.
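To see why growth at a multiple of inflation is unsustainable, it helps to compound it. The sketch below is only an illustration: the 3% annual inflation rate is an assumed round number, not a figure from the ARL data, and it reads "three times the rate of inflation" as an annual growth rate three times as large.

```python
def cumulative_growth(annual_rate: float, years: int) -> float:
    """Multiplier on the starting level after compounding annual_rate over years."""
    return (1 + annual_rate) ** years


YEARS = 2010 - 1986                 # span covered by Figure 3
INFLATION_RATE = 0.03               # ASSUMPTION: illustrative annual inflation
SERIALS_RATE = 3 * INFLATION_RATE   # ASSUMPTION: "three times the rate of inflation"

cpi_multiplier = cumulative_growth(INFLATION_RATE, YEARS)
serials_multiplier = cumulative_growth(SERIALS_RATE, YEARS)
real_multiplier = serials_multiplier / cpi_multiplier

print(f"Prices in general grow by a factor of {cpi_multiplier:.1f}")
print(f"Serials expenditures grow by a factor of {serials_multiplier:.1f}")
print(f"Real (inflation-adjusted) serials growth: {real_multiplier:.1f}x")
```

Under these assumed rates, the inflation-adjusted cost of serials roughly quadruples over the period, and the gap keeps widening every year the rates persist.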
I return to the issue of inefficiency. Why is it that the non-profit publishers are so much more efficient than the commercial publishers? Not in every case of course but on average the difference is really striking. There are a couple of possible reasons. One is that the non-profits tend to be scholarly societies who may be motivated not by profit maximization but by service to the field. I think that’s true to a certain extent. But also the non-profits tend to be small publishers with few journals – maybe one, two, three, five, ten journals. Since bundle size governs market power, non-profits have less ability to grow margins. And scholarly societies rightly complain that they’re being squeezed. From the point of view of libraries, if you have to cancel something you can recoup costs if you cancel the journals from a small publisher. You can’t recoup costs if you cancel journals from the large commercial publishers. As a library, what are you going to do? Cancel scholarly society journals, just as the societies have been rightly complaining about.
But notice that the problem that scholarly societies face, a problem that will only increase in a status quo future, is based not on open access but on inherent properties of the subscription market that they participate in. For scholarly societies, the status quo is not a good alternative. Doing nothing is a failing strategy.
The idea of open-access journals is that they provide access to the articles for free. How can this be a better system for scholarly societies, given that much of the societies’ revenues may come from the publishing program?
Open-access journals don’t charge for access, but that doesn’t mean they eschew revenue entirely. Open-access journals are just selling a different good, and therefore participating in a different market. Instead of selling access to readers (or the readers’ proxy, the libraries), they sell publisher services to the authors (or to the authors’ proxy, their research funders).
In fact there are now over 8,500 open-access journals listed in the Directory of Open Access Journals. Some of them have been mentioned already on this panel: Linguistic Discovery, Semantics and Pragmatics. The majority of existing open-access journals, like those journals, don’t charge author-side article-processing charges (APCs). But in the end APCs seem to me the most reasonable, reliable, scalable, and efficient revenue mechanism for open-access journals. This move from reader-side subscription fees to author-side APCs has dramatic ramifications for the structure of the market that the publisher participates in.
The open-access APC market has quite different properties from the subscription market. Recall the basic truths about the subscription market. Journals are complements, not substitutes. There’s limited market competition. The product being sold is a monopolistic good. Pricing is controlled at the bundle level. What are the corresponding properties of the publisher services market, the market that open-access journals participate in? In that market, the purchaser of the good is the author or the author’s proxy, not the reader or reader’s proxy. And from the point of view of an author, two journals are not complements but substitutes. You can publish your article in the Journal of Linguistics or Lingua or better yet in Language. But having published it in one, you have no incentive to publish it in the other. In fact, you’re not allowed to publish in both, making journals perfect substitutes. There is no value to the second journal once you’ve published the article in the first journal, from the point of view of the author trying to get a publication.
So journals compete for authors in a way they don’t for readers, and this competition leads to much greater efficiency. Open-access publishers are highly motivated to provide better services at lower price to compete for authors’ article submissions. We actually see evidence of this competition on both price and quality happening in the market. I won’t go through examples but have written about it previously.
Second, publisher services on the author side are not a monopolistic good. Anyone can provide those services. In fact because the service is a knowledge good, there are exceptionally low barriers to entry. Kai von Fintel and David Beaver can just unilaterally set up Semantics and Pragmatics; maybe they’ll be successful and maybe they won’t. In this case, it turned out pretty well. The low barrier to entry further enhances competition and improves the efficiency of the market.
Finally, pricing is controlled not at the level of the bundle of journals. You don’t care about the bundle of the publisher when you’re an author submitting to a journal. You care about the journal. Actually, pricing is not even at the journal level, but at the level of the individual article. So price competition happens at that level as well, with journals competing for individual articles on price as well as quality.
In summary, the open-access APC market is a more efficient market than the closed-access subscription market for reasons of basic economics. That’s not just my opinion. Claudio Aspesi, senior analyst at Sanford Bernstein studying the finances of publishing companies, has estimated that a transition to open access would lead to Elsevier cutting its margins by 41–89%.
Let me say something about the overall cost for the two kinds of models. The APCs that open-access journals charge range from $0 to around $3,000. The median turns out to be zero. But for those open-access journals that do charge a fee, the mean is around $1,200, and reasonable sustainable fees seem to be shaking out in the $1,000 to $1,500 range. Let’s call it $1,500. Since article processing fees are essentially the totality of revenue that open-access journals receive, the APC is a reasonable figure for average revenue per article. There are open-access publishers who are profitable in that range, including commercial open-access journals.
What’s the corresponding number for subscription journals? What is their average revenue per article? The Scholarly Publishing Roundtable reported total 2008 revenue for scholarly publishing at $8 billion on 1.5 million articles, the vast bulk of that revenue coming from subscription fees. Average revenue per article for subscription journals is, by that measure, over $5,000 an article. Remember that this averages over all of the journals — the high quality and the low alike.
So what’s happening is that authors one way or another are paying. Either you’re paying an APC to an open-access journal or you’re paying with your copyright to a subscription journal, which the publisher then monetizes, turning it into about $5,000 per article. It turns out that if we moved from a subscription journal world to an open-access world, the institutions of the world would go from paying, on average, $5,000 an article to about $1,500. Let’s suppose the $1,500 estimate is unreasonably low. Let’s suppose that really the average APC would be what the most high-end open-access journal, PLoS Biology, now charges – that’s $2,900; call it $3,000. If every article moved from the subscription model to an open-access APC model at the high end of cost – we would still be saving 40%. And more importantly, we would be better executing the scholarly society mission by providing the broadest possible dissemination.
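The per-article comparison can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the talk; the function and variable names are illustrative.

```python
def savings_fraction(current_cost: float, new_cost: float) -> float:
    """Fraction of current spending saved by moving to the new per-article cost."""
    return (current_cost - new_cost) / current_cost


# Figures quoted from the Scholarly Publishing Roundtable (2008).
TOTAL_REVENUE = 8_000_000_000   # dollars
TOTAL_ARTICLES = 1_500_000

revenue_per_article = TOTAL_REVENUE / TOTAL_ARTICLES
print(f"Subscription revenue per article: ${revenue_per_article:,.0f}")

# The talk rounds the subscription figure down to $5,000 per article.
ROUNDED_BASELINE = 5_000
for label, apc in [("typical APC", 1_500), ("high-end APC", 3_000)]:
    saved = savings_fraction(ROUNDED_BASELINE, apc)
    print(f"{label} of ${apc:,}: saves {saved:.0%} per article")
```

Using the rounded $5,000 baseline reproduces the 40% savings for the high-end case. The unrounded figure of a bit over $5,300 would make the savings slightly larger.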
Who wins in this kind of market – a non-monopolistic, competitive market of substitutes where the processing fees are considerably less than the current cost per article for subscription journals? The publisher who wins in that market is the publisher who can provide the best services, including imprimatur, at the lowest price to the author, that is, the publisher who is most efficient. Scholarly society publishers would have a huge lead in this market, because they are manifestly more efficient than commercial publishers by a large factor. If the scholarly journal market were structured as the open-access journal market rather than the subscription journal market, scholarly society publishers would be the big winners. And scholarly societies are beginning to realize that open access could be a boon not only to their mission – that much should be uncontroversial – but also to their solvency. Perhaps for this reason, some 600 scholarly societies, including the LSA, are already publishing open-access journals.
At the root, the reason that scholarly societies benefit from playing in the open-access APC market rather than the closed-access subscription fee market is the difference in the goods being sold. When the good is a journal bundle, the companies with the biggest bundles, the large commercial publishers, win. When the good is publisher services for an individual article, the publishers that can deliver those services for an individual article most efficiently, the non-profit publishers, win. Sure, there are economies of scale, but empirical evidence shows that the scholarly societies are already far better able to efficiently deliver services despite any scale disadvantage.
Now, all that sounds great, but I don’t want to be too positive. As I said at the outset, there is a real worry that society publishers should have about the open-access APC market. But it’s not that they’d be at a competitive disadvantage in that market; I think that they’d have a huge advantage. And it’s important to remember that they’re already at a huge disadvantage in the subscription journal market; the status quo is a failing strategy. Rather the problem is this. Open-access journals are at a disadvantage in their competition for authors against subscription journals. That is, the problem arises across the two markets. When the only kind of journals are open-access journals, scholarly societies have the upper hand. When there are both kinds of journals in the market, both subscription journals and open-access journals, the open-access journals are at a competitive disadvantage because (from the author’s point of view) publishing is free in a subscription journal. (Of course, it’s not really free; it’s just that the research libraries of the world are underwriting the very high $5,000 cost per article.) By contrast, in an open-access APC journal, the author personally could be out, let’s say, $1,200 or $1,500 or whatever. This is a problem not just for scholarly societies but for all publishers exploring the possibility of going fee-based open access.
To make a transition possible, what we, as supporters of scholarly societies, should be working on is placing open-access journals on a level playing field with subscription journals. There’s a principle at stake here, and the principle is this: Dissemination of research results is an inherent part of the research process. This is something that publishers themselves are frequently pointing out — that they are part of the research process. Consequently, the funders of that research should underwrite dissemination of the results. Who are the funders of the research? In science, technology, and medicine, public and private funding agencies are the primary research funders. By this principle then, the funding agencies giving the grants in those areas would be on the hook to pay the $1,000 or $1,200 or $1,500 or $2,900 publication fees. Most funding agencies already will pay for publication costs for open-access journals (though not in an ideal way, which I’ve written about in the past). What about fields where there aren’t funding agencies handing out large grants? In the humanities and social sciences, universities are the de facto primary research funders. Faculty members in universities are doing research in those fields as part of their employment as researchers. As the primary research funders in the humanities and social sciences, in linguistics in particular, the universities that employ us should be on the hook to disseminate the research results that their researchers generate.
This is the principle behind an effort called the Compact for Open-Access Publishing Equity. It was set up by a group of five universities initially — Cornell, Dartmouth, Harvard, MIT, and Berkeley — to place the open-access revenue model on a more level playing field with the subscription model. Since then another dozen or so institutions have signed on. The Compact says that these universities commit to providing a mechanism for underwriting reasonable publication fees for articles written by their faculty and published in fee-based open-access journals. From the point of view of these signatory institutions, and the many other institutions that don’t happen to be signatories but have similar funding policies, if you structure your journal as an open-access journal charging a publication fee, you don’t need to worry that the authors will be personally out of pocket to pay those fees; the university will pay on their behalf.
Given that the open-access publication fee market would be preferable from the point of view of scholarly societies and their members, what should scholarly societies be doing from the strategic point of view? What is in the best interest of us as supporters of scholarly societies? Happily the best interest of scholarly society publishers is the best interest of the scholars themselves, namely as rapid a transition to open access as possible. So scholarly societies should be doing what they can to speed that transition, and I’m glad to say that the LSA and the MLA are working in that direction. I wish all scholarly societies would do so as well.
Of course, the ideal action for a scholarly society is to convert all of its journals to open access. By doing so, they help set expectations among authors that we don’t restrict access to articles, thereby hastening the day that closed-access journals find it impossible to compete for authors.
But some scholarly societies may still find it too worrisome to take such a bold move, not because they disagree with my conclusion that they fare better in an open access world, but because they fear not making it through the transition to that world. I’m sympathetic to that worry. Still, there are important actions that societies can take short of converting all of their journals to open access, actions that will still greatly contribute to changing the expectations of scholars that research results should be and are being made accessible. Scholarly societies can:
Where applicable, Publisher acknowledges that Author’s assignment of copyright or Author’s grant of exclusive rights in the Publication Agreement is subject to Author’s prior grant of a non-exclusive copyright license to Author’s employing institution and/or to a funding entity that financially supported the research reflected in the Article as part of an agreement between Author or Author’s employing institution and such funding entity, such as an agency of the United States government.
(My guess is that it should be possible to generate an English version of such a sentence as well.)
To the extent that we can get a transition to a primarily open-access publishing system to happen, scholarly societies, their members, and the general public will all be much better off, which is a happy confluence of interest. Thank you very much.
[I am pleased to present a guest post from my friend Ann Velenchik, professor of economics at Wellesley College, director of their writing program, and expert monologist. This post is reproduced from her private blog, which I am privileged to have access to, in which she has chronicled her experience with her leukemia diagnosis and treatment over the last three years.]
| …one of my idols… “Oprah Winfrey speaks at the launch of the Born This Way Foundation” image by flickr user HarvardEducation. |
January 18, 2013 — Despite living with Bicycle Boy, I take no interest in competitive cycling. We were in Paris for the end of the Tour de France in 2008, and while Tom, Becca and Nate all stood on benches to see the riders circle the Arc de Triomphe, I was happily drinking an Orangina at a table far from the crowds. Tom has assured me, for years, that Armstrong has clearly been doping, and I frankly didn’t think much more about it.
But I obviously couldn’t escape the news that Oprah would be interviewing Armstrong on TV last night and, because Oprah is one of my idols, I did a little web surfing this morning to find out what was said. And I found something that made me so angry that I had to respond.
In a part of the conversation about why he had started doping, and how he justified it to himself, Armstrong laid some of the blame on the “fighting spirit” he developed during his “battle” with testicular cancer from October 1996 to February 1997.
“That process turned me into a person — it was truly win at all costs,” Armstrong said. “When I was diagnosed, I said, ‘I will do anything I need to do to survive,’ and that’s good. And I took that attitude, that ruthless and relentless and win-at-all-costs attitude into cycling, and that’s bad.”
Let’s leave aside the fact that there’s evidence that he started doping before he got cancer, and that it’s possible that taking a lot of testosterone might have made that cancer worse. Let’s just talk about the idea that the attitude that helped him “win” the cancer battle justifies, or even explains, what the evidence indicates he has done since.
I think it is highly possible that cancer diagnosis and treatment in the prime of his life was a deep and abiding trauma that warped his moral compass. As I have said before, I don’t think cancer is a blessing in disguise, and I don’t think all the lessons we learn there are good ones, let alone worth the price. So I am not even pissed off that he has the audacity to use his status as a cancer patient to explain the appalling way he has treated people.
What pisses me off is his description of the attitude he brought to treatment itself. When he says “I will do anything I need to do to survive…ruthless and relentless and win-at-all-costs…,” as though lying and cheating and doing terrible things to other people were part of the cancer process, that’s when my head starts to explode.
Because, here’s the thing. There isn’t much in cancer treatment that requires lying or cheating, that requires you to sue for libel the people who are actually telling the truth, or that allows you to threaten and bully and defame other people. Yes, there’s a lot of win-at-all-costs to be found there, but those aren’t costs you get to impose on other people.
Lance Armstrong misspoke. Cancer treatment isn’t about being willing to do anything to anyone in order to win. It’s about being willing to endure anything oneself. Here’s my guess. Lance Armstrong is a very bad guy who was a bad guy before he got cancer and perhaps a worse one afterward. He doped because he was getting away with it and getting richer and more famous every minute. He lied and intimidated and threatened and bullied because, as I heard one person say, when he got cornered his strategy was to double down. And maybe his experience as a cancer patient was part of the list of things that made him so broken. But that’s about him, not about cancer.
Government zealotry in prosecuting brilliant people is a repeating theme. It gave rise to one of the great intellectual tragedies of the 20th century, the death of Alan Turing after his appalling treatment by the British government. Sadly, we have just been presented with another case. Aaron Swartz committed suicide at his apartment in New York this week in the face of an overreaching prosecution of his JSTOR download action. I never met him, but I understand from those who knew him well that he was a brilliant, committed person who only acted intending to do good in the world. I’m on the record disagreeing with the particulars of the open access tactic for which he was being prosecuted, on the basis that it was counterproductive. But I empathize with the gut instinct that led to his effort. I hope that it will inspire us all to redouble our efforts to eliminate the needless restraints on the distribution and use of scholarship as Swartz himself was trying to achieve.
| …our little tiff in the late 18th century…“NYC – Metropolitan Museum of Art: Washington Crossing the Delaware” image by flickr user wallyg. Used by permission. |
I’m shortly off to give a talk at the annual meeting of the Linguistic Society of America (on why open access is better for scholarly societies, which I’ll be blogging about soon), but in the meantime, a linguistically related post about punctuation.
Careful readers of this blog (are there any careful readers of this blog? are there any readers at all?) will note that I generally eschew the peculiarly American convention of moving punctuation within a closing quotation mark. Examples from The Occasional Pamphlet abound: here, here, here, here, here, here, here, and here. And that’s just from 2012. It’s surprising how often this punctuation convention comes into play.
Instead, I use the convention that only the stuff being quoted is put within the quotation marks. This is sometimes called the “British” convention, despite the fact that other nationalities use it as well, presumably to emphasize the American/British dualism extant from our little tiff in the late 18th century. I use the “British” convention because the “American” convention is, in technical terms, stupid.
The story goes that punctuation appearing within the quotation mark is more aesthetically pleasing than punctuation outside the quotation mark. But even if that were true, clarity trumps beauty. Moving the punctuation means that when you see a quoted string with some final punctuation, you don’t know if that punctuation is or is not intended to be part of the thing being quoted; it is systematically ambiguous.
Apparently, my view is highly controversial. For example, when working with MIT Press on my book on the Turing test, my copy editor (who, by the way, was wonderful, and amazingly patient) moved all my punctuation around to satisfy the American convention. I moved them all back. She moved them again. We got into a long discussion of the matter; it seems she had never confronted an author who felt strongly about punctuation before. (I presume she had never copy-edited Geoff Pullum, from whom more later.) As a compromise, we left the punctuation the way I liked it—mostly—but she made me add the following prefatory editorial note:
Throughout the text, the American convention of moving punctuation within closing quotation marks (whether or not the punctuation is part of what is being referred to) is dropped in favor of the more logical and consistent convention of placing only the quoted material within the marks.
I would now go on to explain why the “British” convention is better than the “stupid” convention, except that Geoff Pullum has done so much better a job, far better than I ever could. Here is an excerpt from his essay “Punctuation and human freedom” published in Natural Language and Linguistic Theory and reproduced in his book The Great Eskimo Vocabulary Hoax. I recommend the entire essay to you.
I want you to first consider the string ‘the string’ and the string ‘the string.’, noting that it takes ten keystrokes to type the string in the first set of quotes, and eleven to type the string in the second pair. Imagine you wanted to quote me on the latter point. You might want to say (1).
(1) Pullum notes that it takes eleven keystrokes to type the string ‘the string.’
No problem there; (1) is true (and grammatical if we add a final period). But now suppose you want to say this:
(2) Pullum notes that it takes ten keystrokes to type the string ‘the string’.
You won’t be able to publish it. Your copy-editor will change it before the first proof stage to (3), which is false (though regarded by copy-editors as grammatical):
(3) Pullum notes that it takes ten keystrokes to type the string ‘the string.’
Why? Because the copy-editor will insist that when a sentence ends with a quotation, the closing quotation mark must follow the punctuation mark.
I say this must stop. Linguists have a duty to the public to use their expertise in arguing for changes to the fabric of society when its interests are threatened. And we have such a situation here.
What say we all switch over to the logical quotation punctuation approach and save the fabric of society, shall we?
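For anyone who wants to check Pullum’s keystroke counts, here is a minimal sketch in Python. The variable names are mine, not his, and the straight quotes are just Python’s string delimiters.

```python
# The string quoted in Pullum's example (2): only the quoted material.
quoted_in_2 = "the string"
# The string quoted in example (3), after the copy-editor moves the period inside.
quoted_in_3 = "the string."

print(len(quoted_in_2))  # 10
print(len(quoted_in_3))  # 11
```

Moving the period inside the quotation marks changes the thing being quoted, which is exactly why (3) is false.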
| …There’s a “tree” in it… “Fall New England” image by flickr user BrtinBoston. Used by permission. |
I received the attached email, inviting a contribution to a journal called Advances in Forestry Letter. Yes, that’s “Letter” in the singular, which is still optimistic given the number of papers they’ve published so far, viz., none. For a week or so after I received the email, the journal’s web site was down. It’s back up now, and we can glean some further information about this “journal”. It is claimed to be published by “World Academic Publishers” (already listed in Jeffrey Beall’s list of predatory publishers), though the publisher’s site does not list the journal as of this writing. The listing of covered topics from their “Focus and Scope” page seems to have been plagiarized from the corresponding listing for the MDPI journal Forests.
Why am I, a computer scientist, being invited to submit an article on forestry? On the basis of being the author of an article entitled “Optimal k-arization of synchronous tree-adjoining grammar”. (Actually, they got that wrong too. I’m a co-author, along with Rebecca Nesson and Giorgio Satta.) See? There’s a “tree” in it. It must be about forestry.
I have half a mind to submit the article to them (after making it “80% different”) and see what happens.
Dear Shieber Stuart M.
This is from the Editorial Board Office of Journal of Advances in Forestry Letter (AFL). It is my honor to contact you.
Your paper
Title: Optimal k-arization of synchronous tree-adjoining grammar
Author(s):Shieber Stuart M.
has drawn our attention. We found the paper in the subject coverage of AFL. To promote the development and communication of Forestry Engineering, we sincerely invite you to make it 80% different from the original one and submit to AFL. The new papers in this area are extremely warmly welcome.
If you are interested, please submit your manuscript online before Nov. 15, 2012. Your paper will be published with no charge if accepted.
PAPER SUBMISSION WEBSITE:
[removed so as not to improve their page rank]
Best regards,
Editorial Board Office
Advances in Forestry Letter (AFL)
Website: [removed so as not to improve their page rank]
Email:
I’m pleased to forward on the announcement that the Harvard Open Access Project has just released an initial version of a guide on “good practices for university open-access policies”. It was put together by Peter Suber and myself with help from many, including Ellen Finnie Duranceau, Ada Emmett, Heather Joseph, Iryna Kuchma, and Alma Swan. It has already received endorsements from the Coalition of Open Access Policy Institutions (COAPI), Confederation of Open Access Repositories (COAR), Electronic Information for Libraries (EIFL), Enabling Open Scholarship (EOS), Harvard Open Access Project (HOAP), Open Access Scholarly Information Sourcebook (OASIS), Scholarly Publishing and Academic Resources Coalition (SPARC), and SPARC Europe.
The official announcement is provided below, replicated from the Berkman Center announcement.
In anticipation of worldwide Open Access Week, the Harvard Open Access Project is pleased to release version 1.0 of a guide to good practices for university open-access policies.
Gathering together recommendations on drafting, adopting, and implementing OA policies, the guide is based on policies adopted at Harvard, Stanford, MIT, and a couple of dozen other institutions around the world. But it’s not limited to policies of this type and includes recommendations that should be useful to institutions taking other approaches.
The guide is designed to evolve. As co-authors, we plan to revise and enlarge it over time, building on our own experience and the experience of colleagues elsewhere. We welcome suggestions.
The guide deliberately refers to “good practices” rather than “best practices”. On many points, there are multiple, divergent good practices. Good practices are easier to identify than best practices. And there can be wider agreement on which practices are good than on which practices are best.
The current version of the guide has the benefit of the advice of expert colleagues, and the endorsement of projects and organizations devoted to the spread of effective university OA policies. It has been written in consultation with Ellen Finnie Duranceau, Ada Emmett, Heather Joseph, Iryna Kuchma, and Alma Swan, and has already been endorsed by the Coalition of Open Access Policy Institutions (COAPI), Confederation of Open Access Repositories (COAR), Electronic Information for Libraries (EIFL), Enabling Open Scholarship (EOS), Harvard Open Access Project (HOAP), Open Access Scholarly Information Sourcebook (OASIS), Scholarly Publishing and Academic Resources Coalition (SPARC), and SPARC Europe.
Over time we hope to name more consulting experts and endorsing organizations. Please contact us if you or your organization may be interested. We do not assume that consulting experts or endorsing organizations support every recommendation in the guide.
The guide should be useful to institutions considering an OA policy, and to faculty and librarians who would like their institution to start considering one. We hope that institutions with working policies will share their experience and recommendations, and that organizers of Open Access Week events will link to the guide and bring it to the attention of their participants.
Good practices for university open-access policies
http://cyber.law.harvard.edu/hoap/Good_practices_for_university_open-access_policies
Stuart Shieber
Professor of Computer Science and Director of the Office for Scholarly Communication, Harvard University
http://www.seas.harvard.edu/~shieber
Peter Suber
Director of the Harvard Open Access Project, Special Advisor to the Harvard Office for Scholarly Communication, and Fellow at the Berkman Center for Internet & Society
psuber@cyber.law.harvard.edu
Harvard Open Access Project
http://cyber.law.harvard.edu/hoap
| Karen Spärck Jones, 1935-2007 |
In honor of Ada Lovelace Day 2012, I write about the only female winner of the Lovelace Medal awarded by the British Computer Society for “individuals who have made an outstanding contribution to the understanding or advancement of Computing”. Karen Spärck Jones was the 2007 winner of the medal, awarded shortly before her death. She also happened to be a leader in my own field of computational linguistics, a past president of the Association for Computational Linguistics. Because we shared a research field, I had the honor of knowing Karen and the pleasure of meeting her on many occasions at ACL meetings.
One of her most notable contributions to the field of information retrieval was the idea of inverse document frequency. Well before search engines were a “thing”, Karen was among the leaders in figuring out how such systems should work. Already in the 1960s there had arisen the idea of keyword searching within sets of documents, and the notion that the more “hits” a document receives, the higher ranked it should be. Karen noted in her seminal 1972 paper “A statistical interpretation of term specificity and its application in retrieval” that not all hits should be weighted equally. For terms that are broadly distributed throughout the corpus, their occurrence in a particular document is less telling than occurrence of terms that occur in few documents. She proposed weighting each term by its “inverse document frequency” (IDF), which she defined as log(N/(n + 1)) where N is the number of documents and n the number of documents containing the keyword under consideration. When the keyword occurs in all documents, IDF approaches 0 for large N, but as the keyword occurs in fewer and fewer documents (making it a more specific and presumably more important keyword), IDF rises. The two notions of weighting (frequency of occurrence of the keyword together with its specificity as measured by inverse document frequency) are combined multiplicatively in the by now standard tf*idf metric; tf*idf or its successors underlie essentially all information retrieval systems in use today.
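To see the weighting at work, here is a minimal sketch in Python of the formula as stated above, log(N/(n + 1)), together with a raw-count tf*idf. The function names, the toy corpus, and the choice of natural logarithm are my assumptions for illustration, not details from Karen’s paper or from any production system.

```python
import math


def idf(num_docs, docs_with_term):
    """Inverse document frequency as stated in the text: log(N / (n + 1)).

    The natural log is an assumption; the text does not fix a base.
    """
    return math.log(num_docs / (docs_with_term + 1))


def tf_idf(term, document, corpus):
    """Raw term count in `document` times the term's IDF over `corpus`."""
    tf = document.count(term)
    n = sum(1 for doc in corpus if term in doc)
    return tf * idf(len(corpus), n)


# A hypothetical toy corpus; each document is a list of tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
    ["the", "retrieval", "of", "documents"],
]

query_doc = corpus[3]
common_score = tf_idf("the", query_doc, corpus)        # term in every document
rare_score = tf_idf("retrieval", query_doc, corpus)    # term in one document

print(common_score, rare_score)
```

The ubiquitous term “the” scores lower than the specific term “retrieval”, as intended. With a corpus this tiny, the +1 in the denominator pushes the IDF of a term found in every document slightly below zero; the limit of 0 is only approached as N grows.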
In Karen’s interview for the Lovelace Medal, she opined that “Computing is too important to be left to men.” Ada Lovelace would have agreed.
"Besides getting more data, faster, we also now use much more sophisticated learning algorithms. For instance, algorithms based on logistic regression and that support vector machines can reduce by half the amount of spam that evades filtering, compared to Naive Bayes." (Emphasis added.)
Joshua Goodman, Gordon V. Cormack, and David Heckerman. 2007. Spam and the ongoing battle for the inbox. Communications of the Association for Computing Machinery, volume 50, number 2, page 27.
A common source of run-on sentences is the inclusion of a parenthetical full sentence at the end of another sentence, for instance,
This is an example (there may be others).
This construction is always wrong. Separate the two sentences, as
This is an example. (There may be others.)
or coordinate or subordinate the two, as
This is an example (though there may be others).
or
This is an example (and there may be others).
The following is not correct:
This is an example (however, there may be others).
“However” is an adverb, not a subordinating conjunction.
Writers using MS Word tend to make certain standard errors in their typesetting. For instance, they use hyphens instead of em-dashes (ctrl-alt-hyphen or option-shift-hyphen). Mathematical typesetting is especially bad. There is essentially no way to typeset mathematics well in MS Word. The best solution: LaTeX.
For a while, I've been meaning to comment on the "that"/"which" controversy, the claim that "which" should not be used with restrictive relative clauses, nor "that" for nonrestrictive. From a linguistic point of view, it seems clear that this view is descriptively barren. Geoff Pullum provides a convincing and entertaining argument on Language Log, based on the sentence "The key point, that all the popular reports missed, is that FOXP2 is a transcription factor...". The rarity of sentences like these, in which "that" is used for a nonrestrictive relative clause, leads Pullum to refer to it as "ivory-billed".
I suppose, and am happy to stipulate for the purposes of discussion, that the use of "which" for restrictive relative clauses and "that" for nonrestrictive (or supplemental, as Pullum prefers) is grammatical. Nonetheless, the overwhelming preponderance of occurrences of "which" for nonrestrictive clauses means that the use of "that" in that context is much more likely to give pause to the reader, a kind of cognitive setback. For that reason, a charitable writer (and shouldn't we all strive to be one of those?) ought to use "which" for nonrestrictive relative clauses -- not because it is "wrong" to use "that", or ungrammatical, but because the use of "that" is likely to be jarring to a significant fraction of one's readers. (And I don't only mean the Fowler-type prescriptivist readers, though I suppose there's no reason to be jarring them needlessly either.) An excellent point of evidence is the fact that Pullum had to ask the author directly which meaning he had intended in the ivory-billed sentence; had he used a "which", no clarification would have been needed.
In the particular case of the sentence quoted above, there is no concomitant advantage to using "that" over "which" that would compensate for the negative effect of jarring or confusing the reader. Thus, its use should be prescriptively deprecated. (This issue of compensation allows me to avoid proscriptions against splitting infinitives or dangling prepositions, the slavish following of which leads to circumlocutions and semantic errors. Avoiding these negative effects clearly compensates for the oh so very slight jarring effect on some small fraction of true-believing Fowlerians.) By a similar argument, the use of "which" for restrictive relatives should be deprecated as well in formal writing.
What I am arguing is that even though the language does not enforce the distinction between nonrestrictive and restrictive in terms of "which" versus "that" (and commas versus none), respectively, there is still a good reason to write as if it did. There was nothing wrong in the quoted sentence even under the intended interpretation, just something infelicitous.
Am I trying to have my cake and eat it too? To be able to rail prescriptively while keeping my linguistic descriptivist moral stance? Yes.
Different people have different styles for overall organization of a
technical paper. There is the "continental" style, in which one states the
solution with as little introduction or motivation as possible, sometimes
not even saying what the problem was. Papers in this style tend to start
like this: "Consider a seven-dimensional manifold Q, and define its
hyper-diagonal as the ...." This style is designed to convince the reader
that the author is very smart; how else could he or she have come up with
the answer out of the blue? Readers will have no clue as to whether you
are right or not without incredible efforts in close reading of the paper,
but at least they'll think you're a genius.
Of course, the author didn't come up with the solution out of the blue.
There was a whole history of false starts, wrong attempts, near misses,
redefinitions of the problem. The "historical" style involves
recapitulating all of this history in chronological order. "First I tried
this. That didn't work because of this, so I tried this other way. That
turned out to be stupid. Then I tried this other way...." This is much
better, because a careful reader can probably follow the line of reasoning
that the author went through, and use this as motivation. But the reader
will probably think you are a bit addle-headed. Why would you even think
of trying half the stuff you talked about?
The ideal style is the "rational reconstruction" style. In this style, you
don't present the actual history that you went through, but rather an
idealized history that perfectly motivates each step in the solution. "We
consider the problem of XXX. The obvious thing to try is X. But
such-and-such a pithy example shows that that fails miserably.
Nonetheless, the example points the way naturally to solution Y. This
works better, except for such-and-such an obscure case. We patch solution
Y to handle this case, forming solution Z. Voila." Of course, the author
doesn't tell you that he came up with solution Y before solution X, which
only occurred to him after he came up with solution Z, and he skips
solutions A, B, and C because, in retrospect, they are nowhere on the
natural path to Z, even though at the time he was completely convinced they
were on the right track. The goal in pursuing the rational reconstruction
style is not to convince the reader that you are brilliant (or addle-headed
for that matter) but that your solution is trivial. It takes a certain
strength of character to take that as one's goal. But the advantage of the
reader thinking your solution is trivial or obvious is that it necessarily
comes along with the notion that you are correct.
I've just discovered James Pryor's "Guidelines on Writing a Philosophy Paper". Despite the ostensible limited goal of the guidelines, they are much more broadly applicable than just to philosophy papers. I especially like the characterization of readers as "lazy, stupid, and mean".
People seem to fall prey to adverbials like "however" and "rather" seducing them into running on sentences.
This type of approach has been used in previous models, however, the presented algorithm adopts a different foundation.
But these words are not conjunctions, subordinating or otherwise. They are adverbs, like "on the other hand" or "unfortunately". The following is, presumably, clearly infelicitous.
This type of approach has been used in previous models, unfortunately, the presented algorithm adopts a different foundation.
By the same token, so is the sentence with "however". It is easily corrected:
This type of approach has been used in previous models; however, the presented algorithm adopts a different foundation.
or
This type of approach has been used in previous models. The presented algorithm, however, adopts a different foundation.
Email messages should be treated as personal letters. You wouldn't write a handwritten letter with misspellings, would you? Or a typewritten letter in which you didn't bother to use the shift key? Then you shouldn't do that in an email. Doing so implies to many readers that you don't respect them enough to bother with such "niceties".
On a related topic, by convention, words in all caps in email messages are to be read as if the author were shouting them. This is typically not the intended interpretation. According to RFC 1855:
Use symbols for emphasis. That *is* what I meant. Use underscores for underlining. _War and Peace_ is my favorite book.
The use of the pronoun "he" as a bound pronoun of neutral gender is problematic on two grounds. First, its use is blatantly sexist (although the sexism is of a historical nature, so that those who continue to use "he" in this way have a defensible position). Second, and more importantly, many readers confronted with such a use of "he", including myself, tend to find that it causes a jarring effect as they stop to wonder whether or not the writer intended to imply that the referent of the pronoun is male. Anything that causes a jarring effect like this on a substantial portion of your readers should be avoided, as it serves only to distract them from the important substance of your writing.
Now, I turn to a more recent variant of the same problem. The use of the pronoun "she" as a bound pronoun of neutral gender is problematic on two grounds. First, its use is blatantly sexist (although the sexism is of an anti-historical nature, so that those who continue to use "she" in this way have a defensible position). Second, and more importantly, many readers confronted with such a use of "she", including myself, tend to find that it causes a jarring effect as they stop to wonder whether or not the writer intended to imply that the referent of the pronoun is female. Anything that causes a jarring effect like this on a substantial portion of your readers should be avoided, as it serves only to distract them from the important substance of your writing.
But what alternatives are there? In everyday speech, "they" or "them" is used for this purpose, but this disturbs the sensibilities of prescriptivists, who, I should remind you, are a substantial portion of your readers. And anything that causes a jarring effect like this on a substantial portion of your readers....
Rewriting the sentence is the only practicable alternative. Do it and be done with it.
Pat Winston in his lecture on How to Speak notes that covering up parts of overhead transparencies and revealing them slowly like a strip-tease artist is a technique that drives 10 per cent of your audience nuts. I am in that 10 per cent. The desire to use this technique means only one thing: There is too much information on the slide. Split it into multiple slides. Winston recommends using overlays instead, but overlays are really a different and specialized overhead technique, and are not typically necessary for remedying this problem.
By the way, if you make slides using computerized means and want to use an overlay, consider "implicit" overlays instead. An implicit overlay is a series of separate slides each of which includes the contents of a different prefix of the overlay slides. Implicit overlays have the advantage that no Scotch taping of slide material is required, and no fumbling with the overlay pieces is needed. One just continues placing single sheets on the projector as usual, but each one in the overlay series has some additional material added to the previous one.
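For those who generate slides programmatically, the "prefix" idea can be sketched in a few lines. This is only an illustration: the function name `implicit_overlay` and the layer names are made up for the example and are not part of any slide tool.

```python
def implicit_overlay(layers):
    """Return one slide per layer, where slide k holds layers 1..k.

    `layers` is an ordered list of the pieces that would otherwise be
    physical overlay sheets (hypothetical names, for illustration only).
    """
    return [layers[:k] for k in range(1, len(layers) + 1)]


if __name__ == "__main__":
    slides = implicit_overlay(["axes", "data points", "fitted line"])
    for number, slide in enumerate(slides, start=1):
        # Each slide repeats everything shown so far and adds one new piece.
        print(f"Slide {number}: " + " + ".join(slide))
```

Each slide stands alone, so the presenter only ever puts down a single sheet (or advances a single frame).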
A citation is not a first-class participant in a sentence; it cannot serve as a noun phrase. Rather it is a parenthetical -- that is why it appears in parentheses -- and like all parentheticals should be removable without changing the well-formedness of the sentence in which it appears. Thus, the following sentences are ill-formed. (Try reading them without the material in parentheses.)
I have no Facebook account, as explained below, so you can't "friend" me. But you can contact me, via the contact link at left, or the various other methods at my Harvard web page.
Here's why I have no Facebook page. As stated in the Facebook Terms of Service, "you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook ('IP License')." This means that Facebook can do anything they want with information I would post in my Facebook account, without restriction. Of course, they might restrict their actions to only entirely reasonable ones, like obeying my Privacy Settings for instance, but they don't need to, especially since they can change the rules at any time. ("We can change this Statement if we provide you notice (by posting the change on the Facebook Site Governance Page) and an opportunity to comment.") Should I need to provide a blanket use license to my social network provider? Nope. Do I want to? Nope.
In 2006, in Excellence Without a Soul, I noted that the first "Take Back the Night" rally at Harvard took place in 1980. I continued,
From this point on, the issue of rape flared up on a schedule approximating the four-year cycle of college generations—sometimes emerging after three years in the background, sometimes after five, but not every year. Different circumstances bring the issue to the fore in different years, and each time the college community starts from a different place in responding.
Right on schedule, it's back. According to the Crimson, the University "recently appointed student representatives to a special committee to review the sexual misconduct policies of the Office of Sexual Assault Prevention and Response." I am not quite sure what to make of that sentence. Quite possibly I missed the announcement and news reporting on the creation of the committee, but this is the first I have seen of it in either Harvard announcements or the student press. In any case, I seriously doubt that it is the OSAPR itself that creates sexual misconduct policies, de jure anyway (I thought it was the Faculty). Be that as it may, the revival of the "what's rape?" issue seems to be due to the series ("slew," in the Crimson's scrupulously objective journalese) of Title IX complaints against universities, including the Harvard Law School.
… said that she and other students on the committee hoped to push the University instead toward an “enthusiastic consent” model, in which an incident can be called rape in the absence of affirmative agreement.
“The only people who lose out in this model are the rapists,” said [another student], who had also intended to serve on the committee.
[The first student] said that she plans to discuss the stay on student involvement with Rankin, but she might eventually consider leading a “student protest” or “something more radical” than acting through administration-approved channels if she feels that student voices on this issue are not being heard.
TinEye is a reverse image search engine built by Idée currently in beta. Give it an image and it will tell you where the image appears on the web.
My new go-to temporary file sharing service replacement for drop.io and share1t, both defunct.
John Kendall, who taught violin at the college level for more than 50 years and who made "Suzuki" a household name in America, died Jan. 6. He spent more than three decades teaching music at Southern Illinois University Edwardsville, where he founded the Lincoln Quartet and the SIUE Suzuki Program. Under his leadership, the program became an international training center for teachers.
Voting online for public office is a terrifying proposition to most security experts. The paths to subversion or failure are many:
So, terrifying. And yet, I’m now pretty sure it is inevitable.
Today, we bank online, deposit checks and even pay vendors with our smart phones. We can change our mailing address with the postal service and pay parking tickets with our local governments online. We can shop online, socialize online, and debate with our Presidential candidates online. Newt Gingrich announced his Presidential campaign on Twitter.
Just about everyone now carries an Internet-connected personal device. The Internet is everywhere you want it, and just about everywhere you don’t. People are starting to experience the world through augmented reality, using online maps and satellite overlays matched with your current location. The Internet is only going to become more omnipresent, faster. Within a few years, it’s hard to imagine any human activity that doesn’t involve the Internet.
And yet, somehow, we expect people to still be voting in person, on paper? We can’t even get users to take SSL certificate warnings seriously, but we’re going to convince them that voting is so special it has to be done in person? I don’t think so.
I’m not arguing that this is how it should be. I’m definitely not saying that we can secure online voting just like we can secure online banking. In fact I’ve made many of the original arguments, in my dissertation and on this blog, shooting down the bogus arguments that go something like “hey, we can secure online banking, surely we can secure online voting!” No, we don’t know how to do that.
What I’m saying is that, regardless of the state of online voting security, I think it’s a losing battle to expect voting to remain the only activity we still do in person and on paper. With the Oscars moving to online voting, the Federal Voting Assistance Program making $15M available in grants for activities related to online voting (even if it supposedly doesn’t involve online vote casting), parts of Canada moving to online voting, France considering online voting for its 2M+ expats (more than the margin of victory in the last Presidential election), what you’re hearing is the sound of inevitability.
There’s another interesting issue, when you think about problem (4): even if we keep voting on paper in person, voting requires enforced privacy: we have to make sure it’s just you in the voting booth, not you plus a coercer. That’s great. Now, how many ballots do you think we’re going to see next year published on Instagram?
We have a deeper problem here due to the now omnipresent Internet. Voluntary privacy is not dead, since users can choose to isolate themselves. But enforced privacy, privacy imposed on the voter, the kind needed to prevent coercion, that’s quite dead. I’m very concerned about what that means for democracy. But again, this is inevitable.
So, if it’s inevitable, maybe the best we can do is make online voting as secure as possible. We’ll probably have a few disasters, maybe even a few thrown elections. So we’d better start now on the problems we have.
I think we can solve Problem (2) with open-audit, end-to-end voting systems like Helios (but not only Helios, there are others.) I think we can minimize the risk of Problem (1) by moving to a longer voting period (1 week instead of 1 day). I suspect we have to eventually give up on some aspects of (4), whether or not we do online voting, though some technical tricks might make voter coercion a good bit more difficult (it’s never completely impossible). The hardest problem is (3): we have no way of ensuring that people are using trustworthy software that captures their intent properly.
Again, I’m not endorsing online voting for public office. I’m saying it’s inevitable, and it’s time to face that inevitability.
This issue of trustworthy user software is a much larger problem than voting. As human activity increasingly moves online, the central question is: what software is truly on the side of the user? How does the user know for sure that the software they’re using is their true agent? There’s only one piece of Internet architecture today that can be the user’s true agent, and that’s the Web browser (which technologists call the User Agent, unsurprisingly.) And, among the web browsers, there’s one that particularly stands out as the ultimate user agent, backed by a company whose mission is focused on the user and only the user.
That’s why I joined Mozilla. Because for voting and beyond, everything people do is online or soon to be online, and users better have an agent on their side. The best agent users can get today is Firefox, and I hope to contribute to making it an even better user agent in the next few years.
[It's worth noting that Mozilla has no intention of getting into the voting business, that's just my personal interest.]
OK, you may now get out your pitchfork.
"I will convert your Excel data into one of several web-friendly formats, including HTML, JSON and XML."
Tokyo University researchers develop scanner that can capture 200 pages in one minute
Yale University has adopted an open access policy for digitized images from its museums, archives, and libraries. Yale has also launched the Discover Yale Digital Commons, which has over 250,000 images.
Here's an excerpt from the announcement:
The goal of the new policy is to make high quality digital images of Yale's vast cultural heritage collections in the public domain openly and freely available.
As works in these collections become digitized, the museums and libraries will make those images that are in the public domain freely accessible. In a departure from established convention, no license will be required for the transmission of the images and no limitations will be imposed on their use. The result is that scholars, artists, students, and citizens the world over will be able to use these collections for study, publication, teaching and inspiration.
On page 287 of Blown to Bits, we discuss the incestuous relationships between the regulators and the regulated in the world of information flows.
And then there is the revolving door. Most communications jobs are in the private sector. FCC employees know that their future lies in the commercial use of the spectrum. Hundreds of FCC staff and officials, including all eight past FCC chairmen, have gone to work for or represented the businesses they regulated. These movements from government to private employment violate no government ethics rules. But FCC officials can be faced with a choice between angering a large incumbent that is a potential employer, and disappointing a marginal start-up or a public interest non-profit. It is not surprising that they remember that they will have to earn a living after leaving the FCC.
Even by historical standards, today's news is appalling. FCC Commissioner Attwell Baker is leaving the FCC to become a lobbyist for Comcast, just four months after voting to approve the controversial merger of Comcast with NBC Universal. We are, once again, going down the path to information monopoly. We have been there before; indeed, we were there already in the late 19th century.
Development work continues, and we’ve got a nice section of the publishing workflow and permissions set completed.
The staging server is set up (private access for now, sorry), and some initial code has been committed to Github.
And, we’ve got a new logo!
There’s also a thumbnail version.
Up for next week: finishing the workflow and permissions, and starting on the authoring piece.
Neo-Riemannian Cycle Detection with Weighted Finite-State Transducers Shieber, Stuart M.; Bragg, Jonathan; Chew, Elaine This paper proposes a finite-state model for detecting harmonic cycles as described by neo-Riemannian theorists. Given a string of triads representing a harmonic analysis of a piece, the task is to identify and label all substrings corresponding to these cycles with high accuracy. The solution method uses a noisy channel model implemented with weighted finite-state transducers. On a dataset of four works by Franz Schubert, our model predicted cycles in the same regions as cycles in the ground truth with a precision of 0.18 and a recall of 1.0. The recalled cycles had an average edit distance of 3.2 insertions or deletions from the ground truth cycles, which average 6.4 labeled triads in length. We suggest ways in which our model could be used to contribute to current work in music theory, and be generalized to other music pattern-finding applications.
The Case for the Journal’s Use of a CC-BY License Shieber, Stuart M. Journal of Language Modelling provides its articles under a Creative Commons CC-BY license. We discuss why this is the appropriate choice for the journal.
Reconciling Abstract Structure and Concrete Data in Statistical Natural-Language Processing Shieber, Stuart M.
Lexical Chaining and Word-Sense-Disambiguation Nelken, Rani; Shieber, Stuart M. Lexical chains algorithms attempt to find sequences of words in a document that are closely related semantically. Such chains have been argued to provide a good indication of the topics covered by the document without requiring a deeper analysis of the text, and have been proposed for many NLP tasks. Different underlying lexical semantic relations based on WordNet have been used for this task. Since links in WordNet connect synsets rather than words, open word-sense disambiguation becomes a necessary part of any chaining algorithm, even if the intended application is not disambiguation. Previous chaining algorithms have combined the tasks of disambiguation and chaining by choosing those word senses that maximize chain connectivity, a strategy which yields poor disambiguation accuracy in practice.
We present a novel probabilistic algorithm for finding lexical chains. Our algorithm explicitly balances the requirements of maximizing chain connectivity with the choice of probable word-senses. The algorithm achieves better disambiguation results than all previous ones, but under its optimal settings shifts this balance totally in favor of probable senses, essentially ignoring the chains. This model points to an inherent conflict between chaining and word-sense disambiguation. By establishing an upper bound on the disambiguation potential of lexical chains, we show that chaining is theoretically highly unlikely to achieve accurate disambiguation.
Moreover, by defining a novel intrinsic evaluation criterion for lexical chains, we show that poor disambiguation accuracy also implies poor chain accuracy. Our results have crucial implications for chaining algorithms. At the very least, they show that disentangling disambiguation from chaining significantly improves chaining accuracy. The hardness of all-words disambiguation, however, implies that finding accurate lexical chains is harder than suggested by the literature.
Statement of Stuart M. Shieber before the Committee on Science, Space, and Technology Subcommittee on Investigations and Oversight, U.S. House of Representatives Shieber, Stuart M.
Plan Recognition in Exploratory Domains Gal, Ya'akov; Reddy, Swapna; Shieber, Stuart M.; Rubin, Andee; Grosz, Barbara J. This paper describes a challenging plan recognition problem that arises in environments in which agents engage widely in exploratory behavior, and presents new algorithms for effective plan recognition in such settings. In exploratory domains, agents’ actions map onto logs of behavior that include switching between activities, extraneous actions, and mistakes. Flexible pedagogical software, such as the application considered in this paper for statistics education, is a paradigmatic example of such domains, but many other settings exhibit similar characteristics. The paper establishes the task of plan recognition in exploratory domains to be NP-hard and compares several approaches for recognizing plans in these domains, including new heuristic methods that vary the extent to which they employ backtracking, as well as a reduction to constraint-satisfaction problems. The algorithms were empirically evaluated on people’s interaction with flexible, open-ended statistics education software used in schools. Data was collected from adults using the software in a lab setting as well as middle school students using the software in the classroom. The constraint satisfaction approaches were complete, but were an order of magnitude slower than the heuristic approaches. In addition, the heuristic approaches were able to perform within 4% of the constraint satisfaction approaches on student data from the classroom, which reflects the intended user population of the software. These results demonstrate that the heuristic approaches offer a good balance between performance and computation time when recognizing people’s activities in the pedagogical domain of interest.
Inverting the Turing Test [review of The Most Human Human by Brian Christian] Shieber, Stuart M. In his book The Most Human Human, Brian Christian extrapolates from his experiences at the 2009 Loebner Prize competition, a competition among chatbots (computer programs that engage in conversation with people) to see which is "most human." In doing so, he demonstrates once again that the human being may be the only animal that overinterprets.
A Simple Language for Novel Visualizations of Information Shieber, Stuart M.; Lucas, Wendy While information visualization tools support the representation of abstract data, their ability to enhance one’s understanding of complex relationships can be hindered by a limited set of predefined charts. To enable novel visualization over multiple variables, we propose a declarative language for specifying informational graphics from first principles. The language maps properties of generic objects to graphical representations based on scaled interpretations of data values. An iterative approach to constraint solving that involves user advice enables the optimization of graphic layouts. The flexibility and expressiveness of a powerful but relatively easy to use grammar supports the expression of visualizations ranging from the simple to the complex.
A Language for Specifying Informational Graphics from First Principles Shieber, Stuart M.; Lucas, Wendy Informational visualization tools, such as commercial charting packages, provide a standard set of visualizations for tabular data, including bar charts, scatter plots, pie charts, and the like. For some combinations of data and task, these are suitable visualizations. For others, however, novel visualizations over multiple variables would be preferred but are unavailable in the fixed list of standard options. To allow for these cases, we introduce a declarative language for specifying visualizations on the basis of the first principles on which (a subset of) informational graphics are built. The functionality we aim to provide with this language is presented by way of example, from simple scatter plots to versions of two quite famous visualizations: Minard’s depiction of troop strength during Napoleon’s march on Moscow and a map of the early ARPAnet from the ancient history of the Internet. Benefits of our approach include flexibility and expressiveness for specifying a range of visualizations that cannot be rendered with standard commercial systems.
Synchronous Vector TAG for Syntax and Semantics: Control Verbs, Relative Clauses, and Inverse Linking Shieber, Stuart M.; Nesson, Rebecca Recent work has used the synchronous tree-adjoining grammar (STAG) formalism to demonstrate that many of the cases in which syntactic and semantic derivations appeared to be divergent could be handled elegantly through synchronization. This research has provided syntax and semantics for diverse and complex linguistic phenomena. However, certain hard cases push the STAG formalism to its limits, requiring awkward analyses or leaving no clear solution at all. In this paper we present a new variant of STAG, synchronous vector TAG (SV-TAG), and demonstrate that it has the potential to handle hard cases such as control verbs, relative clauses, and inverse linking, while maintaining the simplicity of previous STAG syntax-semantics analyses.
Recognition of Users' Activities using Constraint Satisfaction Reddy, Swapna Cherukupalli; Gal, Ya'akov; Shieber, Stuart M. Ideally designed software allows users to explore and pursue interleaving plans, making it challenging to automatically recognize user interactions. The recognition algorithms presented use constraint satisfaction techniques to compare user interaction histories to a set of ideal solutions. We evaluate these algorithms on data obtained from user interactions with commercially available pedagogical software, and find that these algorithms identified users’ activities with 93% accuracy.
Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression Yamangil, Elif; Shieber, Stuart M. We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing. These translation tasks are characterized by the relative ability to commit to parallel parse trees and availability of word alignments, yet the unavailability of large-scale data, calling for a Bayesian tree-to-tree formalism. We formalize nonparametric Bayesian STSG with epsilon alignment in full generality, and provide a Gibbs sampling algorithm for posterior inference tailored to the task of extractive sentence compression. We achieve improvements against a number of baselines, including expectation maximization and variational Bayes training, illustrating the merits of nonparametric inference over the space of grammars as opposed to sparse parametric inference with a fixed grammar.
Complexity, Parsing, and Factorization of Tree-Local Multi-Component Tree-Adjoining Grammar Shieber, Stuart M.; Satta, Giorgio; Nesson, Rebecca Tree-Local Multi-Component Tree-Adjoining Grammar (TL-MCTAG) is an appealing formalism for natural language representation because it arguably allows the encapsulation of the appropriate domain of locality within its elementary structures. Its multicomponent structure allows modeling of lexical items that may ultimately have elements far apart in a sentence, such as quantifiers and Wh-words. When used as the base formalism for a synchronous grammar, its flexibility allows it to express both the close relationships and the divergent structure necessary to capture the links between the syntax and semantics of a single language or the syntax of two different languages. Its limited expressivity provides constraints on movement and, we posit, may have generated additional popularity based on a misconception about its parsing complexity. Although TL-MCTAG was shown to be equivalent in expressivity to TAG when it was first introduced (Weir 1988), the complexity of TL-MCTAG is still not well-understood. This paper offers a thorough examination of the problem of TL-MCTAG recognition, showing that even highly restricted forms of TL-MCTAG are NP-complete to recognize. However, in spite of the provable difficulty of the recognition problem, we offer several algorithms that can substantially improve processing efficiency. First, we present a parsing algorithm that improves on the baseline parsing method and runs in polynomial time when both the fan-out and rank of the input grammar are bounded. Second, we offer an optimal, efficient algorithm for factorizing a grammar to produce a strongly-equivalent TL-MCTAG grammar with the rank of the grammar minimized.
Identifying Uncertain Words within an Utterance via Prosodic Features Pon-Barry, Heather Roberta; Shieber, Stuart M. We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate target word segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.
The Importance of Sub-Utterance Prosody in Predicting Level of Certainty Shieber, Stuart M.; Pon-Barry, Heather Roberta We present an experiment aimed at understanding how to optimally use acoustic and prosodic information to predict a speaker's level of certainty. With a corpus of utterances where we can isolate a single word or phrase that is responsible for the speaker's level of certainty we use different sets of sub-utterance prosodic features to train models for predicting an utterance's perceived level of certainty. Our results suggest that using prosodic features of the word or phrase responsible for the level of certainty and of its surrounding context improves the prediction accuracy without increasing the total number of features when compared to using only features taken from the utterance as a whole.
Criteria for Designing Computer Facilities for Linguistic Analysis Shieber, Stuart M. Abstract: In the natural-language-processing research community, the usefulness of computer tools for testing linguistic analyses is often taken for granted. Linguists, on the other hand, have generally been unaware of or ambivalent about such devices. We discuss several aspects of computer use that are preeminent in establishing the utility for linguistic research of computer tools and describe several factors that must be considered in designing such computer tools to aid in testing linguistic analyses of grammatical phenomena. A series of design alternatives, some theoretically and some practically motivated, is then based on the resultant criteria. We present one way of pinning down these choices which culminates in a description of a particular grammar formalism for use in computer linguistic tools. The PATR-II formalism thus serves to exemplify our general perspective.
Agent Decision-Making in Open Mixed Networks Gal, Ya'akov; Grosz, Barbara J.; Kraus, Sarit; Shieber, Stuart M. Computer systems increasingly carry out tasks in mixed networks, that is in group settings in which they interact both with other computer systems and with people. Participants in these heterogeneous human-computer groups vary in their capabilities, goals, and strategies; they may cooperate, collaborate, or compete. The presence of people in mixed networks raises challenges for the design and the evaluation of decision-making strategies for computer agents. This paper describes several new decision-making models that represent, learn and adapt to various social attributes that influence people's decision-making and presents a novel approach to evaluating such models. It identifies a range of social attributes in an open-network setting that influence people's decision-making and thus affect the performance of computer-agent strategies, and establishes the importance of learning and adaptation to the success of such strategies. The settings vary in the capabilities, goals, and strategies that people bring into their interactions. The studies deploy a configurable system called Colored Trails (CT) that generates a family of games. CT is an abstract, conceptually simple but highly versatile game in which players negotiate and exchange resources to enable them to achieve their individual or group goals. It provides a realistic analogue to multi-agent task domains, while not requiring extensive domain modeling. It is less abstract than payoff matrices, and people exhibit less strategic and more helpful behavior in CT than in the identical payoff matrix decision-making context. By not requiring extensive domain modeling, CT enables agent researchers to focus their attention on strategy design, and it provides an environment in which the influence of social factors can be better isolated and studied.
Recognizing Uncertainty in Speech Shieber, Stuart M.; Pon-Barry, Heather Roberta We address the problem of inferring a speaker’s level of certainty based on prosodic information in the speech signal, which has application in speech-based dialogue systems. We show that using phrase-level prosodic features centered around the phrases causing uncertainty, in addition to utterance-level prosodic features, improves our model’s level of certainty classification. In addition, our models can be used to predict which phrase a person is uncertain about. These results rely on a novel method for eliciting utterances of varying levels of certainty that allows us to compare the utility of contextually-based feature sets. We elicit level of certainty ratings from both the speakers themselves and a panel of listeners, finding that there is often a mismatch between speakers’ internal states and their perceived states, and highlighting the importance of this distinction.
Equity for Open-Access Journal Publishing Shieber, Stuart M. Scholars write articles to be read--the more access to their articles the better--so one might think that the open-access approach to publishing, in which articles are freely available online to all without interposition of an access fee, would be an attractive competitor to traditional subscription-based journal publishing. But open-access journal publishing is currently at a systematic disadvantage relative to the traditional model. I propose a simple, cost-effective remedy to this inequity that would put open-access publishing on a path to become a sustainable, efficient system, allowing the two journal publishing systems to compete on a more level playing field. The issue is important, first, because academic institutions shouldn’t perpetuate barriers to an open-access business model on principle and, second, because the subscription-fee business model has manifested systemic dysfunctionalities in practice. After describing the problem with the subscription-fee model, I turn to the proposal for providing equity for open-access journal publishing--the open-access compact.
Efficiently Parsable Extensions to Tree-Local Multicomponent TAG Nesson, Rebecca; Shieber, Stuart M. Recent applications of Tree-Adjoining Grammar (TAG) to the domain of semantics as well as new attention to syntactic phenomena have given rise to increased interest in more expressive and complex multicomponent TAG formalisms (MCTAG). Although many constructions can be modeled using tree-local MCTAG (TL-MCTAG), certain applications require even more flexibility. In this paper we suggest a shift in focus from constraining locality and complexity through tree- and set-locality to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation. We examine three formalisms, restricted NS-MCTAG, restricted Vector-TAG and delayed TL-MCTAG, that use notions of derivational distance to constrain locality and demonstrate how they permit additional expressivity beyond TL-MCTAG without increasing complexity to the level of set-local MCTAG.
Restricting the weak-generative capacity of synchronous tree-adjoining grammars Shieber, Stuart The formalism of synchronous tree-adjoining grammars, a variant of standard tree-adjoining grammars (TAG), was intended to allow the use of TAGs for language transduction in addition to language specification. In previous work, the definition of the transduction relation defined by a synchronous TAG was given by appeal to an iterative rewriting process. The rewriting definition of derivation is problematic in that it greatly extends the expressivity of the formalism and makes the design of parsing algorithms difficult if not impossible. We introduce a simple, natural definition of synchronous tree-adjoining derivation, based on isomorphisms between standard tree-adjoining derivations, that avoids the expressivity and implementability problems of the original rewriting definition. The decrease in expressivity, which would otherwise make the method unusable, is offset by the incorporation of an alternative definition of standard tree-adjoining derivation, previously proposed for completely separate reasons, thereby making it practical to entertain using the natural definition of synchronous derivation. Nonetheless, some remaining problematic cases call for yet more flexibility in the definition; the isomorphism requirement may have to be relaxed. It remains for future research to tune the exact requirements on the allowable mappings.
Optimal k-arization of synchronous tree-adjoining grammar Nesson, Rebecca; Shieber, Stuart; Satta, Giorgio Synchronous Tree-Adjoining Grammar (STAG) is a promising formalism for syntax-aware machine translation and simultaneous computation of natural-language syntax and semantics. Current research in both of these areas is actively pursuing its incorporation. However, STAG parsing is known to be NP-hard due to the potential for intertwined correspondences between the linked nonterminal symbols in the elementary structures. Given a particular grammar, the polynomial degree of efficient STAG parsing algorithms depends directly on the rank of the grammar: the maximum number of correspondences that appear within a single elementary structure. In this paper we present a compile-time algorithm for transforming a STAG into a strongly-equivalent STAG that optimally minimizes the rank, k, across the grammar. The algorithm performs in O( |G| + |Y| · (L_G)^3 ) time where L_G is the maximum number of links in any single synchronous tree pair in the grammar and Y is the set of synchronous tree pairs of G.
Formal constraints on metarules Shieber, Stuart; Robinson, Jane J.; Stucky, Susan U.; Uszkoreit, Hans Metagrammatical formalisms that combine context-free phrase structure rules and metarules (MPS grammars) allow concise statement of generalizations about the syntax of natural languages. Unconstrained MPS grammars, unfortunately, are not computationally "safe." We evaluate several proposals for constraining them, basing our assessment on computational tractability and explanatory adequacy. We show that none of them satisfies both criteria, and suggest new directions for research on alternative metagrammatical formalisms.
The design of a computer language for linguistic information Shieber, Stuart A considerable body of accumulated knowledge about the design of languages for communicating information to computers has been derived from the subfields of programming language design and semantics. It has been the goal of the PATR group at SRI to utilize a relevant portion of this knowledge in implementing tools to facilitate communication of linguistic information to computers. The PATR-II formalism is our current computer language for encoding linguistic information. This paper, a brief overview of that formalism, attempts to explicate our design decisions in terms of a set of properties that effective computer languages should incorporate.
The semantics of grammar formalisms seen as computer languages Shieber, Stuart; Pereira, Fernando C. N. The design, implementation, and use of grammar formalisms for natural language have constituted a major branch of computational linguistics throughout its development. By viewing grammar formalisms as just a special case of computer languages, we can take advantage of the machinery of denotational semantics to provide a precise specification of their meaning. Using Dana Scott's domain theory, we elucidate the nature of the feature systems used in augmented phrase-structure grammar formalisms, in particular those of recent versions of generalized phrase structure grammar, lexical functional grammar and PATR-II, and provide a denotational semantics for a simple grammar formalism. We find that the mathematical structures developed for this purpose contain an operation of feature generalization, not available in those grammar formalisms, that can be used to give a partial account of the effect of coordination on syntactic features.
Sentence disambiguation by a shift-reduce parsing technique Shieber, Stuart Native speakers of English show definite and consistent preferences for certain readings of syntactically ambiguous sentences. A user of a natural-language-processing system would naturally expect it to reflect the same preferences. Thus, such systems must model in some way the linguistic performance as well as the linguistic competence of the native speaker. We have developed a parsing algorithm---a variant of the LALR(1) shift-reduce algorithm---that models the preference behavior of native speakers for a range of syntactic preference phenomena reported in the psycholinguistic literature, including the recent data on lexical preferences. The algorithm yields the preferred parse deterministically, without building multiple parse trees and choosing among them. As a side effect, it displays appropriate behavior in processing the much discussed garden-path sentences. The parsing algorithm has been implemented and has confirmed the feasibility of our approach to the modeling of these phenomena.
Synchronous tree-adjoining grammars Schabes, Yves; Shieber, Stuart The unique properties of tree-adjoining grammars (TAG) present a challenge for the application of TAGs beyond the limited confines of syntax, for instance, to the task of semantic interpretation or automatic translation of natural language. We present a variant of TAGs, called synchronous TAGs, which characterize correspondences between languages. The formalism's intended usage is to relate expressions of natural languages to their associated semantics represented in a logical form language, or to their translates in another natural language; in summary, we intend it to allow TAGs to be used beyond their role in syntax proper. We discuss the application of synchronous TAGs to concrete examples, mentioning primarily in passing some computational issues that arise in its interpretation.
A viewer for PostScript documents Shieber, Stuart; Marks, Joe; Ginsburg, Adam We describe a PostScript viewer that provides navigation and annotation functionality similar to that of paper documents using simple unified user-interface techniques.
An interactive system for drawing graphs Marks, Joe; Shieber, Stuart; Ryall, Kathy In spite of great advances in the automatic drawing of medium and large graphs, the tools available for drawing small graphs exquisitely (that is, with the aesthetics commonly found in professional publications and presentations) are still very primitive. Commercial tools, e.g., Claris Draw, provide minimal support for aesthetic graph layout. At the other extreme, research prototypes based on constraint methods are overly general for graph drawing. Our system improves on general constraint-based approaches to drawing and layout by supporting only a small set of “macro” constraints that are specifically suited to graph drawing. These constraints are enforced by a generalized spring algorithm. The result is a usable and useful tool for drawing small graphs easily and nicely.
Semi-automatic Delineation of Regions in Floor Plans Shieber, Stuart; Mazer, Murray; Marks, Joe; Ryall, Kathy We propose a technique that uses a proximity metric for delineating partially or fully bounded regions of a scanned bitmap that depicts a building floor plan. A proximity field is defined over the bitmap, which is used both to identify the centers of subjective regions in the image and to assign pixels to regions by proximity. The region boundaries generated by the method tend to match well the subjective boundaries of regions in the image. We discuss incorporation of the technique in a semi-automated interactive system for region identification in floor plans. In contrast to area-filling techniques for delineating areal regions of images, our approach works robustly for partially bounded regions. Furthermore, the frailties of the method that do remain, unlike those of alternative techniques, are well-moderated by simple human intervention.
A simple reconstruction of GPSG Shieber, Stuart Like most linguistic theories, the theory of generalized phrase structure grammar (GPSG) has described language axiomatically, that is, as a set of universal and language-specific constraints on the well-formedness of linguistic elements of some sort. The coverage and detailed analysis of English grammar in the ambitious recent volume by Gazdar, Klein, Pullum, and Sag entitled Generalized Phrase Structure Grammar are impressive, in part because of the complexity of the axiomatic system developed by the authors. In this paper, we examine the possibility that simpler descriptions of the same theory can be achieved through a slightly different, albeit still axiomatic, method. Rather than characterize the well-formed trees directly, we progress in two stages by procedurally characterizing the well-formedness axioms themselves, which in turn characterize the trees.
A uniform architecture for parsing and generation Shieber, Stuart The use of a single grammar for both parsing and generation is an idea with a certain elegance, the desirability of which several researchers have noted. In this paper, we discuss a more radical possibility: not only can a single grammar be used by different processes engaged in various "directions" of processing, but one and the same language-processing architecture can be used for processing the grammar in the various modes. In particular, parsing and generation can be viewed as two processes engaged in by a single parameterized theorem prover for the logical interpretation of the formalism. We discuss our current implementation of such an architecture, which is parameterized in such a way that it can be used for either purpose with grammars written in the PATR formalism. Furthermore, the architecture allows fine tuning to reflect different processing strategies, including parsing models intended to mimic psycholinguistic phenomena. This tuning allows the parsing system to operate within the same realm of efficiency as previous architectures for parsing alone, but with much greater flexibility for engaging in other processing regimes.
Design Galleries: A general approach to setting parameters for computer graphics and animation Gibson, Sarah; Beardsley, Paul; Ruml, Wheeler; Kang, Thomas; Mirtich, Brian; Seims, Joshua; Freeman, William; Hodgins, Jessica; Pfister, Hanspeter; Marks, Joe; Andalman, Brad; Shieber, Stuart Image rendering maps scene parameters to output pixel values; animation maps motion-control parameters to trajectory values. Because these mapping functions are usually multidimensional, nonlinear, and discontinuous, finding input parameters that yield desirable output values is often a painful process of manual tweaking. Interactive evolution and inverse design are two general methodologies for computer-assisted parameter setting in which the computer plays a prominent role. In this paper we present another such methodology: Design Gallery™ (DG) interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different graphics or animations that can be produced by varying a given input-parameter vector. The principal technical challenges posed by the DG approach are dispersion, finding a set of input-parameter vectors that optimally disperses the resulting output-value vectors, and arrangement, organizing the resulting graphics for easy and intuitive browsing by the user. We describe the use of DGs for several parameter-setting problems: light selection and placement for image rendering, both standard and image-based; opacity and color transfer-function specification for volume rendering; and motion control for particle-system and articulated-figure animation.
Design gallery browsers based on 2D and 3D graph drawing Marks, Joe; Ruml, Wheeler; Andalman, Brad; Ryall, Kathy; Shieber, Stuart Many problems in computer-aided design and graphics involve the process of setting and adjusting input parameters to obtain desirable output values. Exploring different parameter settings can be a difficult and tedious task in most such systems. In the Design Gallery™ (DG) approach, parameter setting is made easier by dividing the task more equitably between user and computer. DG interfaces present the user with the broadest selection, automatically generated and organized, of perceptually different designs that can be produced by varying a given set of input parameters. The DG approach has been applied to several difficult parameter-setting tasks from the field of computer graphics: light selection and placement for image rendering; opacity and color transfer-function specification for volume rendering; and motion control for articulated-figure and particle-system animation. The principal technical challenges posed by the DG approach are dispersion (finding a set of input-parameter vectors that optimally disperses the resulting output values) and arrangement (arranging the resulting designs for easy browsing by the user). We show how effective arrangement can be achieved with 2D and 3D graph drawing. While navigation is easier in the 2D interface, the 3D interface has proven to be surprisingly usable, and the 3D drawings sometimes provide insights that are not so obvious in the 2D drawings.
Induction of probabilistic synchronous tree-insertion grammars for machine translation. Nesson, Rebecca; Rush, Alexander; Shieber, Stuart The more expressive and flexible a base formalism for machine translation is, the less efficient parsing of it will be. However, even among formalisms with the same parse complexity, some formalisms better realize the desired characteristics for machine translation formalisms than others. We introduce a particular formalism, probabilistic synchronous tree-insertion grammar (PSTIG), that we argue satisfies the desiderata optimally within the class of formalisms that can be parsed no less efficiently than context-free grammars and demonstrate that it outperforms state-of-the-art word-based and phrase-based finite-state translation models on training and test data taken from the EuroParl corpus (Koehn, 2005). We then argue that a higher level of translation quality can be achieved by hybridizing our induced model with elementary structures produced using supervised techniques such as those of Groves et al. (2004).
A seed-growth heuristic for graph bisection Ngo, J. Thomas; Shieber, Stuart; Ruml, Wheeler; Marks, Joe We present a new heuristic algorithm for graph bisection, based on an implicit notion of clustering. We describe how the heuristic can be combined with stochastic search procedures and a postprocess application of the Kernighan-Lin algorithm. In a series of time-equated comparisons with large-sample runs of pure Kernighan-Lin, the new algorithm demonstrates significant superiority in terms of the best bisections found.
Generation and synchronous tree-adjoining grammars Schabes, Yves; Shieber, Stuart Tree-adjoining grammars (TAG) have been proposed as a formalism for generation based on the intuition that the extended domain of syntactic locality that TAGs provide should aid in localizing semantic dependencies as well, in turn serving as an aid to generation from semantic representations. We demonstrate that this intuition can be made concrete by using the formalism of synchronous tree-adjoining grammars. The use of synchronous TAGs for generation provides solutions to several problems with previous approaches to TAG generation. Furthermore, the semantic monotonicity requirement previously advocated for generation grammars as a computational aid is seen to be an inherent property of synchronous TAGs.
An interactive constraint-based system for drawing graphs Marks, Joe; Ryall, Kathy; Shieber, Stuart The glide system is an interactive constraint-based editor for drawing small- and medium-sized graphs (50 nodes or fewer) that organizes the interaction in a more collaborative manner than in previous systems. Its distinguishing features are a vocabulary of specialized constraints for graph drawing, and a simple constraint-satisfaction mechanism that allows the user to manipulate the drawing while the constraints are active. These features result in a graph-drawing editor that is superior in many ways to those based on more general and powerful constraint-satisfaction methods.
Empirical testing of algorithms for variable-sized label placement Marks, Joe; Friedman, Stacy; Christensen, Jon; Shieber, Stuart We report an empirical comparison of different heuristic techniques for variable-sized point-feature label placement.
The LinGO redwoods treebank: Motivation and preliminary applications Brants, Thorsten; Flickinger, Dan; Manning, Christopher; Shieber, Stuart; Toutanova, Kristina; Oepen, Stephan The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. While several medium- to large-scale treebanks exist for English (and for other major languages), pre-existing publicly available resources exhibit the following limitations: (i) annotation is mono-stratal, either encoding topological (phrase structure) or tectogrammatical (dependency) information, (ii) the depth of linguistic information recorded is comparatively shallow, (iii) the design and format of linguistic representation in the treebank hard-wires a small, predefined range of ways in which information can be extracted from the treebank, and (iv) representations in existing treebanks are static and over the (often year- or decade-long) evolution of a large-scale treebank tend to fall behind the development of the field. LinGO Redwoods aims at the development of a novel treebanking methodology, rich in nature and dynamic both in the ways linguistic data can be retrieved from the treebank in varying granularity and in the constant evolution and regular updating of the treebank itself. Since October 2001, the project has been working to build the foundations for this new type of treebank, to develop a basic set of tools for treebank construction and maintenance, and to construct an initial set of 10,000 annotated trees to be distributed together with the tools under an open-source license.
Abbreviated text input Shieber, Stuart; Baker, Ellie We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on PDAs or cell phones or by disabled users) by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system's predictions. We propose taking advantage of the duality between prediction and compression. We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by about 30% yet is simple enough that it can be learned easily and generated relatively fluently. Using statistical language processing techniques, we can decode the abbreviated text with a residual word error rate of about 3%, and we expect that simple adaptive methods can improve this to about 1.5%. Because the system's operation is completely independent from the user's, the overhead from cognitive task switching and attending to the system's actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not.
Unifying annotated discourse hierarchies to create a gold standard Carbone, Marco; Shieber, Stuart; Gal, Ya'akov Kobi Human annotation of discourse corpora typically results in segmentation hierarchies that vary in their degree of agreement. This paper presents several techniques for unifying multiple discourse annotations into a single hierarchy, deemed a “gold standard”: the segmentation that best captures the underlying linguistic structure of the discourse. It proposes and analyzes methods that consider the level of embeddedness of a segmentation as well as methods that do not. A corpus containing annotated hierarchical discourses, the Boston Directions Corpus, was used to evaluate the “goodness” of each technique, by comparing the similarity of the segmentation it derives to the original annotations in the corpus. Several metrics of similarity between hierarchical segmentations are computed: precision/recall of matching utterances, pairwise inter-reliability scores (κ), and non-crossing-brackets. A novel method for unification that minimizes conflicts among annotators outperforms methods that require consensus among a majority for the κ and recall metrics, while capturing much of the structure of the discourse. When higher recall is preferred, methods requiring a majority are preferable to those that demand full consensus among annotators.
Arabic diacritization using weighted finite-state transducers Shieber, Stuart; Nelken, Rani Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.
Unifying synchronous tree-adjoining grammars and tree transducers via bimorphisms. Shieber, Stuart We place synchronous tree-adjoining grammars and tree transducers in the single overarching framework of bimorphisms, continuing the unification of synchronous grammars and tree transducers initiated by Shieber (2004). Along the way, we present a new definition of the tree-adjoining grammar derivation relation based on a novel direct inter-reduction of TAG and monadic macro tree transducers.
Referring-expression generation using a transformation-based learning approach Nickerson, Jill; Shieber, Stuart; Grosz, Barbara A natural language generation system must generate expressions that allow a reader to identify the entities to which they refer. This paper describes the creation of referring-expression (RE) generation models developed using a transformation-based learning approach. We present an evaluation of the learned models and compare their performance to the performance of a baseline system, which always generates full noun phrase REs. When compared to the baseline system, the learned models produce REs that lead to more coherent natural language documents and are more accurate and closer in length to those that people use.
Probabilistic synchronous tree-adjoining grammars for machine translation: The argument from bilingual dictionaries. Shieber, Stuart We provide a conceptual basis for thinking of machine translation in terms of synchronous grammars in general, and probabilistic synchronous tree-adjoining grammars in particular. Evidence for the view is found in the structure of bilingual dictionaries of the last several millennia.
Practical secrecy-preserving, verifiably correct and trustworthy auctions. Shieber, Stuart; Parkes, David; Rabin, Michael; Thorpe, Christopher We present a practical system for conducting sealed-bid auctions that preserves the secrecy of the bids while providing for verifiable correctness and trustworthiness of the auction. The auctioneer must accept all bids submitted and follow the published rules of the auction. No party receives any useful information about bids before the auction closes and no bidder is able to change or repudiate her bid. Our solution uses Paillier's homomorphic encryption scheme [25] for zero knowledge proofs of correctness. Only minimal cryptographic technology is required of bidders; instead of employing complex interactive protocols or multi-party computation, the single auctioneer computes optimal auction results and publishes proofs of the results' correctness. Any party can check these proofs of correctness via publicly verifiable computations on encrypted bids. The system is illustrated through application to first-price, uniform-price and second-price auctions, including multi-item auctions. Our empirical results demonstrate the practicality of our method: auctions with hundreds of bidders are within reach of a single PC, while a modest distributed computing network can accommodate auctions with thousands of bids.
Towards collaborative intelligent tutors: Automated recognition of users' strategies. Grosz, Barbara; Rubin, Andee; Yamangil, Elif; Shieber, Stuart; Gal, Ya'akov Kobi This paper addresses the problem of inferring students’ strategies when they interact with data-modeling software used for pedagogical purposes. The software enables students to learn about statistical data by building and analyzing their own models. Automatic recognition of students’ activities when interacting with pedagogical software is challenging. Students can pursue several plans in parallel and interleave the execution of these plans. The algorithm presented in this paper decomposes students’ complete interaction histories with the software into hierarchies of interdependent tasks that may be subsequently compared with ideal solutions. This algorithm is evaluated empirically using commercial software that is used in many schools. Results indicate that the algorithm is able to (1) identify the plans students use when solving problems using the software; (2) distinguish between those actions in students’ plans that play a salient part in their problem-solving and those representing exploratory actions and mistakes; and (3) capture students’ interleaving and free-order action sequences.
Parse disambiguation for a rich HPSG grammar Oepen, Stephan; Flickinger, Dan; Manning, Christopher; Shieber, Stuart; Toutanova, Kristina
The influence of task contexts on the decision-making of humans and computers. Gal, Ya'akov Kobi; Allain, Alex; Grosz, Barbara; Pfeffer, Avrom; Shieber, Stuart Many environments in which people and computer agents interact involve deploying resources to accomplish tasks and satisfy goals. This paper investigates the way that the context in which decisions are made affects the behavior of people and the performance of computer agents that interact with people in such environments. It presents experiments that measured negotiation behavior in two different types of settings. One setting was a task context that made explicit the relationships among goals, (sub)tasks and resources. The other setting was a completely abstract context in which only the payoffs for the decision choices were listed. Results show that people are more helpful, less selfish, and less competitive when making decisions in task contexts than when making them in completely abstract contexts. Further, their overall performance was better in task contexts. A predictive computational model that was trained on data obtained in the task context outperformed a model that was trained under the abstract context. These results indicate that taking context into account is essential for the design of computer agents that will interact well with people.
Extraction phenomena in synchronous TAG syntax and semantics. Shieber, Stuart; Nesson, Rebecca We present a proposal for the structure of noun phrases in Synchronous Tree-Adjoining Grammar (STAG) syntax and semantics that permits an elegant and uniform analysis of a variety of phenomena, including quantifier scope and extraction phenomena such as wh-questions with both moved and in-place wh-words, pied-piping, stranding of prepositions, and topicalization. The tight coupling between syntax and semantics enforced by the STAG helps to illuminate the critical relationships and filter out analyses that may be appealing for either syntax or semantics alone but do not allow for a meaningful relationship between them.
Colored trails: A multiagent system testbed for decision-making research (demonstration) Shieber, Stuart; Grosz, Barbara; Pfeffer, Avrom; Ficici, Sevan; Gal, Ya'akov Kobi With increasing frequency, computer agents participate in collaborative and competitive multiagent domains in which humans reason strategically to make decisions. The deployment of computer agents in such domains requires that the agents understand something about human behavior so that they can interact successfully with people; the computer agents must be sensitive to how people reason in strategic settings as well as to the social utilities people employ to inform their reasoning. To date, these design requirements for computer agents have received relatively little attention. To further research in this area, we are developing the Colored Trails (CT) testbed [5], a configurable and extensible open-source system for use by the research community at large to investigate multiagent decision making.
A writer's collaborative assistant Babaian, Tamara; Shieber, Stuart; Grosz, Barbara In traditional human-computer interfaces, a human master directs a computer system as a servant, telling it not only what to do, but also how to do it. Collaborative interfaces attempt to realign the roles, making the participants collaborators in solving the person's problem. This paper describes Writer's Aid, a system that deploys AI planning techniques to enable it to serve as an author's collaborative assistant. Writer's Aid differs from previous collaborative interfaces in both the kinds of actions the system partner takes and the underlying technology it uses to do so. While an author writes a document, Writer's Aid helps in identifying and inserting citation keys and by autonomously finding and caching potentially relevant papers and their associated bibliographic information from various on-line sources. This autonomy, enabled by the use of a planning system at the core of Writer's Aid, distinguishes this system from other collaborative interfaces. The collaborative design and its division of labor result in more efficient operation: faster and easier writing on the user's part and more effective information gathering on the part of the system. Subjects in our laboratory user study found the system effective and the interface intuitive and easy to use.
Partially ordered multiset context-free grammars and ID/LP parsing Nederhof, Mark-Jan; Shieber, Stuart; Satta, Giorgio We present a new formalism, partially ordered multiset context-free grammars (poms-CFG), along with an Earley-style parsing algorithm. The formalism, which can be thought of as a generalization of context-free grammars with partially ordered right-hand sides, is of interest in its own right, and also as infrastructure for obtaining tighter complexity bounds for more expressive context-free formalisms intended to express free or multiple word-order, such as ID/LP grammars. We reduce ID/LP grammars to poms-grammars, thereby getting finer-grained bounds on the parsing complexity of ID/LP grammars. We argue that in practice, the width of attested ID/LP grammars is small, yielding effectively polynomial time complexity for ID/LP grammar parsing.
A learning approach to improving sentence-level MT evaluation Shieber, Stuart; Kulesza, Alex The problem of evaluating machine translation (MT) systems is more challenging than it may first appear, as diverse translations can often be considered equally correct. The task is even more difficult when practical circumstances require that evaluation be done automatically over short texts, for instance, during incremental system development and error analysis. While several automatic metrics, such as BLEU, have been proposed and adopted for large-scale MT system discrimination, they all fail to achieve satisfactory levels of correlation with human judgments at the sentence level. Here, a new class of metrics based on machine learning is introduced. A novel method involving classifying translations as machine or human-produced rather than directly predicting numerical human judgments eliminates the need for labor-intensive user studies as a source of training data. The resulting metric, based on support vector machines, is shown to significantly improve upon current automatic metrics, increasing correlation with human judgments at the sentence level halfway toward that achieved by an independent human evaluator.
Towards robust context-sensitive sentence alignment for monolingual corpora Shieber, Stuart; Nelken, Rani Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm. Our approach provides a simpler and more robust solution achieving a substantial improvement in accuracy over existing systems.
Does the Turing Test demonstrate intelligence or not? Shieber, Stuart The Turing Test has served as a defining inspiration throughout the early history of artificial intelligence research. Its centrality arises in part because verbal behavior indistinguishable from that of humans seems like an incontrovertible criterion for intelligence, a "philosophical conversation stopper" as Dennett says. On the other hand, from the moment Turing's seminal Mind article was published, the conversation hasn't stopped; the appropriateness of the Test has been continually questioned, and current philosophical wisdom holds that the Turing Test is hopelessly flawed as a sufficient condition for attributing intelligence. In this short article, I summarize for an artificial intelligence audience an argument that I have presented at length for a philosophical audience that attempts to reconcile these two mutually contradictory but well-founded attitudes towards the Turing Test that have been under constant debate since 1950.
Simpler TAG semantics through synchronization Shieber, Stuart; Nesson, Rebecca In recent years Laura Kallmeyer, Maribel Romero, and their collaborators have led research on TAG semantics through a series of papers refining a system of TAG semantics computation. Kallmeyer and Romero bring together the lessons of these attempts with a set of desirable properties that such a system should have. First, computation of the semantics of a sentence should rely only on the relationships expressed in the TAG derivation tree. Second, the generated semantics should compactly represent all valid interpretations of the input sentence, in particular with respect to quantifier scope. Third, the formalism should not, if possible, increase the expressivity of the TAG formalism. We revive the proposal of using synchronous TAG (STAG) to simultaneously generate syntactic and semantic representations for an input sentence. Although STAG meets the three requirements above, no serious attempt had previously been made to determine whether it can model the semantic constructions that have proved difficult for other approaches. In this paper we begin exploration of this question by proposing STAG analyses of many of the hard cases that have spurred the research in this area. We reframe the TAG semantics problem in the context of the STAG formalism and in the process present a simple, intuitive base for further exploration of TAG semantics. We provide analyses that demonstrate how STAG can handle quantifier scope, long-distance WH-movement, interaction of raising verbs and adverbs, attitude verbs and quantifiers, relative clauses, and quantifiers within prepositional phrases.
Comma restoration using constituency information Tao, Xiaopeng; Shieber, Stuart Automatic restoration of punctuation from unpunctuated text has application in improving the fluency and applicability of speech recognition systems. We explore the possibility that syntactic information can be used to improve the performance of an HMM-based system for restoring punctuation (specifically, commas) in text. Our best methods reduce sentence error rate substantially - by some 20%, with an additional 8% reduction possible given improvements in extraction of the requisite syntactic information.
An alternative conception of tree-adjoining derivation Shieber, Stuart; Schabes, Yves The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable, through a compilation to linear indexed grammars, and computationally operational, by virtue of an efficient algorithm for recognition and parsing.
Computing the communication costs of item allocation Rauenbusch, Timothy W.; Grosz, Barbara; Shieber, Stuart Multiagent systems require techniques for effectively allocating resources or tasks among agents in a group. Auctions are one method for structuring communication of agents’ private values for the resource or task to a central decision maker. Different auction methods vary in their communication requirements. This paper makes three contributions to understanding the types of group decision making for which auctions are appropriate methods. First, it shows that entropy is the best measure of communication bandwidth used by an auction in messages bidders send and receive. Second, it presents a method for measuring bandwidth usage; the dialogue trees used for this computation are a new and compact representation of the probability distribution of every possible dialogue between two agents. Third, it presents new guidelines for choosing the best auction, guidelines which differ significantly from recommendations in prior work. The new guidelines are based on detailed analysis of the communication requirements of Sealed-bid, Dutch, Staged, Japanese, and Bisection auctions. In contradistinction to previous work, the guidelines show that the auction that minimizes bandwidth depends on both the number of bidders and the sample space from which bidders’ valuations are drawn.
Using restriction to extend parsing algorithms for complex-feature-based formalisms Shieber, Stuart Grammar formalisms based on the encoding of grammatical information in complex-valued feature systems enjoy some currency both in linguistics and natural-language-processing research. Such formalisms can be thought of by analogy to context-free grammars as generalizing the notion of nonterminal symbol from a finite domain of atomic elements to a possibly infinite domain of directed graph structures of a certain sort. Unfortunately, in moving to an infinite nonterminal domain, standard methods of parsing may no longer be applicable to the formalism. Typically, the problem manifests itself as gross inefficiency or even nontermination of the algorithms. In this paper, we discuss a solution to the problem of extending parsing algorithms to formalisms with possibly infinite nonterminal domains, a solution based on a general technique we call restriction. As a particular example of such an extension, we present a complete, correct, terminating extension of Earley's algorithm that uses restriction to perform top-down filtering. Our implementation of this algorithm demonstrates the drastic elimination of chart edges that can be achieved by this technique. Finally, we describe further uses for the technique---including parsing other grammar formalisms, including definite-clause grammars; extending other parsing algorithms, including LR methods and syntactic preference modeling algorithms; and efficient indexing.
Translating English into logical form Rosenschein, Stanley J.; Shieber, Stuart A scheme for syntax-directed translation that mirrors compositional model-theoretic semantics is discussed. The scheme is the basis for an English translation system called PATR and was used to specify a semantically interesting fragment of English, including such constructs as tense, aspect, modals, and various lexically controlled verb complement structures. PATR was embedded in a question-answering system that replied appropriately to questions requiring the computation of logical entailments.
A general cartographic labeling algorithm Edmondson, Shawn; Shieber, Stuart; Christensen, Jon; Marks, Joe Some apparently powerful algorithms for automatic label placement on maps use heuristics that capture considerable cartographic expertise but are hampered by provably inefficient methods of search and optimization. On the other hand, no approach to label placement that is based on an efficient optimization technique has been applied to the production of general cartographic maps - those with labeled point, line, and area features - and shown to generate labelings of acceptable quality. We present an algorithm for label placement that achieves the twin goals of practical efficiency and high labeling quality by combining simple cartographic heuristics with effective stochastic optimization techniques.
A semantic-head-driven generation algorithm for unification-based formalisms Pereira, Fernando C. N.; Moore, Robert C.; van Noord, Gertjan; Shieber, Stuart We present an algorithm for generating strings from logical form encodings that improves upon previous algorithms in that it places fewer restrictions on the class of grammars to which it is applicable. In particular, unlike an Earley deduction generator (Shieber, 1988), it allows use of semantically nonmonotonic grammars, yet unlike top-down methods, it also permits left-recursion. The enabling design feature of the algorithm is its implicit traversal of the analysis tree for the string being generated in a semantic-head-driven fashion.
An empirical study of algorithms for point feature label placement Christensen, Jon; Shieber, Stuart; Marks, Joe A major factor affecting the clarity of graphical displays that include text labels is the degree to which labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram so as to maximize legibility. This problem occurs frequently in the production of many types of informational graphics, though it arises most often in automated cartography. In this paper we present a comprehensive treatment of the PFLP problem, viewed as a type of combinatorial optimization problem. Complexity analysis reveals that the basic PFLP problem and most interesting variants of it are NP-hard. These negative results help inform a survey of previously reported algorithms for PFLP; not surprisingly, all such algorithms either have exponential time complexity or are incomplete. To solve the PFLP problem in practice, then, we must rely on good heuristic methods. We propose two new methods, one based on a discrete form of gradient descent, the other on simulated annealing, and report on a series of empirical tests comparing these and the other known algorithms for the problem. Based on this study, the first to be conducted, we identify the best approaches as a function of available computation time.
Lessons from a restricted Turing test Shieber, Stuart We report on the recent Loebner prize competition inspired by Turing's test of intelligent behavior. The presentation covers the structure of the competition and the outcome of its first instantiation in an actual event, and an analysis of the purpose, design, and appropriateness of such a competition. We argue that the competition has no clear purpose, that its design prevents any useful outcome, and that such a competition is inappropriate given the current level of technology. We then speculate as to suitable alternatives to the Loebner prize.
The problem of logical-form equivalence Shieber, Stuart
An Alternative Conception of Tree-Adjoining Derivation Shieber, Stuart; Schabes, Yves The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable through a definition of TAG derivations as equivalence classes of ordered derivation trees, and computationally operational, by virtue of a compilation to linear indexed grammars together with an efficient algorithm for recognition and parsing according to the compiled grammar.
Variations on incremental interpretation Johnson, Mark; Shieber, Stuart The strict competence hypothesis has sparked a small dialogue among several researchers attempting to understand its ramifications for human sentence processing and incremental interpretation in particular. In this paper, we review the dialogue, reconstructing the arguments in an attempt to make them more uniform and crisp, and provide our own analyses of certain of the issues that arise. We argue that strict competence, because it requires a synchronous computation mechanism, may actually lead to more complex, rather than simpler, models of incremental interpretation. Asynchronous computation, which is arguably both psychologically more plausible and conceptually more basic, allows for incremental interpretation to fall out naturally, without additional machinery for interpreting partial constituents. We show that this is true regardless of whether the presumed interpretation mechanism is top-down or bottom-up, contra previous conclusions in the literature, and propose a particular implementation of some of these ideas using a novel representation based on tree-adjoining grammars. The research in this paper was supported in part by grant IRI-9157996 from the National Science Foundation to the first author. The authors would like to thank Fernando Pereira, Edward Stabler, and Mark Steedman for discussions on the topic of this paper and for their comments on previous drafts.
A call for collaborative interfaces Shieber, Stuart In this note, I call for a move towards viewing interfaces as means for people and computers to collaborate on solving problems rather than means for people to control computers. This collaborative perspective on user interfaces can apply quite broadly, and not only provides a source for novel interface techniques but serves as a good tool for analyzing existing interfaces. The view affects thinking on interfaces primarily by motivating a different split in the roles and responsibilities of the two participants in problem-solving, the computer and the user.
Automating the layout of network diagrams with specified visual organization. Kosak, Corey; Shieber, Stuart; Marks, Joe Network diagrams are a familiar graphic form that can express many different kinds of information. The problem of automating network-diagram layout has therefore received much attention. Previous research on network-diagram layout has focused on the problem of aesthetically optimal layout, using such criteria as the number of link crossings, the sum of all link lengths, and total diagram area. In this paper the authors propose a restatement of the network-diagram layout problem in which layout-aesthetic concerns are subordinated to perceptual-organization concerns. The authors present a notation for describing the visual organization of a network diagram. This notation is used in reformulating the layout task as a constrained-optimization problem in which constraints are derived from a visual-organization specification and optimality criteria are derived from layout-aesthetic considerations. Two new heuristic algorithms are presented for this version of the layout problem: one algorithm uses a rule-based strategy for computing a layout; the other is a massively parallel genetic algorithm. The authors demonstrate the capabilities of the two algorithms by testing them on a variety of network-diagram layout problems.
Restricting the weak-generative capacity of synchronous tree-adjoining grammars. Shieber, Stuart The formalism of synchronous tree-adjoining grammars, a variant of standard tree-adjoining grammars (TAG), was intended to allow the use of TAGs for language transduction in addition to language specification. In previous work, the definition of the transduction relation defined by a synchronous TAG was given by appeal to an iterative rewriting process. The rewriting definition of derivation is problematic in that it greatly extends the expressivity of the formalism and makes the design of parsing algorithms difficult if not impossible. We introduce a simple, natural definition of synchronous tree-adjoining derivation, based on isomorphisms between standard tree-adjoining derivations, that avoids the expressivity and implementability problems of the original rewriting definition. The decrease in expressivity, which would otherwise make the method unusable, is offset by the incorporation of an alternative definition of standard tree-adjoining derivation, previously proposed for completely separate reasons, thereby making it practical to entertain using the natural definition of synchronous derivation. Nonetheless, some remaining problematic cases call for yet more flexibility in the definition; the isomorphism requirement may have to be relaxed. It remains for future research to tune the exact requirements on the allowable mappings.
Predicting individual book use for off-site storage using decision trees Shieber, Stuart; Silverstein, Craig We explore various methods for predicting library book use, as measured by circulation records. Accurate prediction is invaluable when choosing titles to be stored in an off-site location. Previous researchers in this area concluded that past use information provides by far the most reliable predictor of future use. Because of the computerization of library data, it is now possible not only to reproduce these earlier experiments with a more substantial data set, but also to compare their algorithms with more sophisticated decision methods. We have found that while previous use is indeed an excellent predictor of future use, it can be improved upon by combining previous use information with bibliographic information in a technique that can be customized for individual collections. This has immediate application for libraries that are short on storage space and wish to identify low-demand titles to move to remote storage. For instance, simulations show that the best prediction method we develop, when used as the off-site storage selection method for the Harvard College Library, would have generated only a fifth as many off-site accesses as compared to a method based on previous use.
Easily searched encodings for number partitioning Shieber, Stuart; Marks, Joe; Ngo, J. Thomas; Ruml, Wheeler Can stochastic search algorithms outperform existing deterministic heuristics for the NP-hard problem Number Partitioning if given a sufficient, but practically realizable amount of time? In a thorough empirical investigation using a straightforward implementation of one such algorithm, simulated annealing, Johnson et al. (Ref. 1) concluded tentatively that the answer is negative. In this paper, we show that the answer can be positive if attention is devoted to the issue of problem representation (encoding). We present results from empirical tests of several encodings of Number Partitioning with problem instances consisting of multiple-precision integers drawn from a uniform probability distribution. With these instances and with an appropriate choice of representation, stochastic and deterministic searches can—routinely and in a practical amount of time—find solutions several orders of magnitude better than those constructed by the best heuristic known (Ref. 2), which does not employ searching. We thank David S. Johnson of AT&T Bell Labs for generously and promptly sharing his test instances. For stimulating discussions, we thank members of the Harvard Animation/Optimization Group (especially Jon Christensen), the Computer Science Department at the University of New Mexico, the Santa Fe Institute, and the Berkeley CAD Group. The anonymous referees made numerous constructive suggestions. We thank Rebecca Hayes for comments concerning the figures. The second author is grateful for a Graduate Fellowship from the Fannie and John Hertz Foundation. We thank the Free Software Foundation for making the GNU Multiple Precision package available. The research described in this paper was conducted mostly while the third author was at Digital Equipment Corporation Cambridge Research Lab. This work was supported in part by the National Science Foundation, principally under Grants IRI-9157996 and IRI-9350192 to the fourth author, and by matching grants from Digital Equipment Corporation and Xerox Corporation.
Principles and implementation of deductive parsing Shieber, Stuart; Pereira, Fernando C. N.; Schabes, Yves We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars.
Interactions of scope and ellipsis Pereira, Fernando C. N.; Dalrymple, Mary; Shieber, Stuart Systematic semantic ambiguities result from the interaction of the two operations that are involved in resolving ellipsis in the presence of scoping elements such as quantifiers and intensional operators: scope determination for the scoping elements and resolution of the elided relation. A variety of problematic examples previously noted - by Sag, Hirschbühler, Gawron and Peters, Harper, and others - all have to do with such interactions. In previous work, we showed how ellipsis resolution can be stated and solved in equational terms. Furthermore, this equational analysis of ellipsis provides a uniform framework in which interactions between ellipsis resolution and scope determination can be captured. As a consequence, an account of the problematic examples follows directly from the equational method. The goal of this paper is merely to point out this pleasant aspect of the equational analysis, through its application to these cases. No new analytical methods or associated formalism are presented, with the exception of a straightforward extension of the equational method to intensional logic.
Anaphoric dependencies in ellipsis Shieber, Stuart; Kehler, Andrew
Automatic yellow-pages pagination and layout Marks, Joe; Shieber, Stuart; Johari, Ramesh; Partovi, Ali The compact and harmonious layout of ads and text is a fundamental and costly step in the production of commercial telephone directories (“Yellow Pages”). We formulate a canonical version of Yellow-Pages pagination and layout (YPPL) as an optimization problem in which the task is to position ads and text-stream segments on sequential pages so as to minimize total page length and maximize certain layout aesthetics, subject to constraints derived from page-format requirements and positional relations between ads and text. We present a heuristic-search approach to the YPPL problem. Our algorithm has been applied to a sample of real telephone-directory data, and produces solutions that are significantly shorter and better than the published ones.
Machine learning theory and practice as a source of insight into universal grammar. Shieber, Stuart; Lappin, Shalom In this paper, we explore the possibility that machine learning approaches to natural-language processing being developed in engineering-oriented computational linguistics may be able to provide specific scientific insights into the nature of human language. We argue that, in principle, machine learning results could inform basic debates about language, in one area at least, and that in practice, existing results may offer initial tentative support for this prospect. Further, results from computational learning theory can inform arguments carried on within linguistic theory as well.
Practical secrecy-preserving, verifiably correct and trustworthy auctions Thorpe, Christopher; Shieber, Stuart; Rabin, Michael; Parkes, David We present a practical protocol based on homomorphic cryptography for conducting provably fair sealed-bid auctions. The system preserves the secrecy of the bids, even after the announcement of auction results, while also providing for public verifiability of the correctness and trustworthiness of the outcome. No party, including the auctioneer, receives any information about bids before the auction closes, and no bidder is able to change or repudiate any bid. The system is illustrated through application to first-price, uniform-price and second-price auctions, including multi-item auctions. Empirical results based on an analysis of a prototype demonstrate the practicality of our protocol for real-world applications.
Representation in stochastic search for phylogenetic tree reconstruction Shieber, Stuart; Ohno-Machado, Lucila; Weber, Griffin Phylogenetic tree reconstruction is a process in which the ancestral relationships among a group of organisms are inferred from their DNA sequences. For all but trivial sized data sets, finding the optimal tree is computationally intractable. Many heuristic algorithms exist, but the branch-swapping algorithm used in the software package PAUP is the most popular. This method performs a stochastic search over the space of trees, using a branch-swapping operation to construct neighboring trees in the search space. This study introduces a new stochastic search algorithm that operates over an alternative representation of trees, namely as permutations of taxa giving the order in which they are processed during stepwise addition. Experiments on several data sets suggest that this algorithm for generating an initial tree, when followed by branch-swapping, can produce better trees for a given total amount of time.
Direct parsing of ID/LP grammars Shieber, Stuart The Immediate Dominance/Linear Precedence (ID/LP) formalism is a recent extension of Generalized Phrase Structure Grammar (GPSG) designed to perform some of the tasks previously assigned to metarules--for example, modeling the word-order characteristics of so-called free-word-order languages. It allows a simple specification of classes of rules that differ only in constituent order. ID/LP grammars (as well as metarule grammars) have been proposed for use in parsing by expanding them into equivalent context-free grammars. We develop a parsing algorithm, based on the algorithm of Earley, for parsing ID/LP grammars directly, circumventing the initial expansion phase. A proof of correctness is supplied. We also discuss some aspects of the time complexity of the algorithm and some formal properties associated with ID/LP grammars and their relationship to context-free grammars.
Abbreviated text input using language modeling. Shieber, Stuart; Nelken, Rani We address the problem of improving the efficiency of natural language text input under degraded conditions (for instance, on mobile computing devices or by disabled users), by taking advantage of the informational redundancy in natural language. Previous approaches to this problem have been based on the idea of prediction of the text, but these require the user to take overt action to verify or select the system’s predictions. We propose taking advantage of the duality between prediction and compression. We allow the user to enter text in compressed form, in particular, using a simple stipulated abbreviation method that reduces characters by 26.4%, yet is simple enough that it can be learned easily and generated relatively fluently. We decode the abbreviated text using a statistical generative model of abbreviation, with a residual word error rate of 3.3%. The chief component of this model is an n-gram language model. Because the system’s operation is completely independent from the user’s, the overhead from cognitive task switching and attending to the system’s actions online is eliminated, opening up the possibility that the compression-based method can achieve text input efficiency improvements where the prediction-based methods have not. We report the results of a user study evaluating this method.
The Turing test as interactive proof Shieber, Stuart In 1950, Alan Turing proposed his eponymous test based on indistinguishability of verbal behavior as a replacement for the question "Can machines think?" Since then, two mutually contradictory but well-founded attitudes towards the Turing Test have arisen in the philosophical literature. On the one hand is the attitude that has become philosophical conventional wisdom, viz., that the Turing Test is hopelessly flawed as a sufficient condition for intelligence, while on the other hand is the overwhelming sense that were a machine to pass a real live full-fledged Turing Test, it would be a sign of nothing but our orneriness to deny it the attribution of intelligence. The arguments against the sufficiency of the Turing Test for determining intelligence rely on showing that some extra conditions are logically necessary for intelligence beyond the behavioral properties exhibited by an agent under a Turing Test. Therefore, it cannot follow logically from passing a Turing Test that the agent is intelligent. I argue that these extra conditions can be revealed by the Turing Test, so long as we allow a very slight weakening of the criterion from one of logical proof to one of statistical proof under weak realizability assumptions. The argument depends on the notion of interactive proof developed in theoretical computer science, along with some simple physical facts that constrain the information capacity of agents. Crucially, the weakening is so slight as to make no conceivable difference from a practical standpoint. Thus, the Gordian knot between the two opposing views of the sufficiency of the Turing Test can be cut.
Generation and synchronous tree-adjoining grammars Shieber, Stuart; Schabes, Yves Tree-adjoining grammars (TAG) have been proposed as a formalism for generation based on the intuition that the extended domain of syntactic locality that TAGs provide should aid in localizing semantic dependencies as well, in turn serving as an aid to generation from semantic representations. We demonstrate that this intuition can be made concrete by using the formalism of synchronous tree-adjoining grammars. The use of synchronous TAGs for generation provides solutions to several problems with previous approaches to TAG generation. Furthermore, the semantic monotonicity requirement previously advocated for generation grammars as a computational aid is seen to be an inherent property of synchronous TAGs.
Ellipsis and higher-order unification Pereira, Fernando C. N.; Dalrymple, Mary; Shieber, Stuart We present a new method for characterizing the interpretive possibilities generated by elliptical constructions in natural language. Unlike previous analyses, which postulate ambiguity of interpretation or derivation in the full clause source of the ellipsis, our analysis requires no such hidden ambiguity. Further, the analysis follows relatively directly from an abstract statement of the ellipsis interpretation problem. It predicts correctly a wide range of interactions between ellipsis and other semantic phenomena such as quantifier scope and bound anaphora. Finally, although the analysis itself is stated nonprocedurally, it admits of a direct computational method for generating interpretations.
Semantic-head-driven generation Moore, Robert C.; Pereira, Fernando C. N.; van Noord, Gertjan; Shieber, Stuart We present an algorithm for generating strings from logical form encodings that improves upon previous algorithms in that it places fewer restrictions on the class of grammars to which it is applicable. In particular, unlike a previous bottom-up generator, it allows use of semantically nonmonotonic grammars, yet unlike top-down methods, it also permits left-recursion. The enabling design feature of the algorithm is its implicit traversal of the analysis tree for the string being generated in a semantic-head-driven fashion.
An algorithm for generating quantifier scopings Hobbs, Jerry; Shieber, Stuart The syntactic structure of a sentence often manifests quite clearly the predicate-argument structure and relations of grammatical subordination. But scope dependencies are not so transparent. As a result, many systems for representing the semantics of sentences have ignored scoping or generated scopings with mechanisms that have often been inexplicit as to the range of scopings they choose among or profligate in the scopings they allow. In this paper, we present an algorithm, along with proofs of some of its important properties, that generates scoped semantic forms from unscoped expressions encoding predicate-argument structure. The algorithm is not profligate as are those based on permutation of quantifiers, and it can provide a solid foundation for computational solutions where completeness is sacrificed for efficiency and heuristic efficacy.
Evidence against the context-freeness of natural language Shieber, Stuart
Synchronous grammars as tree transducers Shieber, Stuart Tree transducer formalisms were developed in the formal language theory community as generalizations of finite-state transducers from strings to trees. Independently, synchronous tree-substitution and -adjoining grammars arose in the computational linguistics community as a means to augment strictly syntactic formalisms to provide for parallel semantics. We present the first synthesis of these two independently developed approaches to specifying tree relations, unifying their respective literatures for the first time, by using the framework of bimorphisms as the generalizing formalism in which all can be embedded. The central result is that synchronous tree-substitution grammars are equivalent to bimorphisms where the component homomorphisms are linear and complete.