Scholarship

Now You’re Just Messing With Me

At some point in 2014, without any warning so far as I can tell, the Japan Patent Office changed the file-naming convention for its digital archives. Whereas before the archives were stored under a filename such as “T2014-20(01-01)20150114.ISO”, thereafter they were stored under a filename such as “T2014-21(01_01)20150121.ISO”.

Screen Shot 2016-07-06 at 10.05.29 AM

Catch the difference? Yeah, I didn’t either. Until I let my code–which was based on the old naming convention–run all day. Then I found out the last two years’ data had corrupted all my output files, wiping out 7 GB of data. More fun after the jump…
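For the curious, the entire change amounts to one character: the hyphen inside the parentheses became an underscore. The defensive fix is to accept either separator so a batch job doesn’t silently skip the newer files. A hypothetical sketch in Python (not the code I was actually running):

```python
import re

# Old style: T2014-20(01-01)20150114.ISO  (hyphen inside the parentheses)
# New style: T2014-21(01_01)20150121.ISO  (underscore inside the parentheses)
ISO_NAME = re.compile(
    r"T(?P<year>\d{4})-(?P<issue>\d+)"        # series, e.g. T2014-20
    r"\((?P<part>\d{2})[-_](?P<sub>\d{2})\)"  # (01-01) or (01_01)
    r"(?P<date>\d{8})\.ISO$"                  # archive date, e.g. 20150114
)

def parse_iso_name(name):
    """Return the filename's components, or None if it doesn't match."""
    m = ISO_NAME.match(name)
    return m.groupdict() if m else None
```

The `[-_]` character class is the whole fix: one extra character in the pattern versus a day of corrupted output.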

Think Different: More Translation Hijinx

I’ve been trying for about three days to figure out why one of the scripts I was given to parse all this government data has been failing when I try to run it. Because the researchers who gave me the scripts commissioned them from some outside programmers, they can’t help me debug it. So I’ve been going line-by-line through the code and cross-referencing every command, option, and character with online manuals and forums.

My best guess is that my problem is (probably) once again a failure of translation. The code I’ve been given was written for Linux, a (relatively) open UNIX platform. Mac OS X–which I use–is also built on top of a UNIX architecture, which power users can access via the built-in Terminal application. But Apple ships an idiosyncratic and somewhat dated version of the UNIX shell environment–an old release of the shell itself, plus BSD rather than GNU versions of many of the core utilities. This is the computer language you can use to tell the computer to do stuff with the files stored on it. (“Think different” indeed.) There are tons of tiny differences between Apple’s shell tools and the ones that ship with Linux, and any one of them could be responsible for causing my code to fail. I spent the better part of two days tweaking individual characters, options, and commands in this script, to no avail. Then I tried updating Apple’s shell tools to more closely mirror the ones used by Linux. Still no luck. And three days of my precious seven-week residency in Tokyo gone.

So I gave up. I’ll write my own code instead.

The script I’ve been trying to debug is one of a series of algorithms used to collate and deduplicate several years’ worth of parsed data. But I can create those kinds of algorithms myself, once I know how the parsed data is structured. The hard part was parsing the data in the first place to extract it from its arcane government archive format–and the scripts that do that worked a treat, once I figured out how they function. Besides which, the deduplication strategy used by the researchers who gave me these troublesome scripts is a bit more heavy-handed than I’d use if I were starting from scratch. Which I just did–in Stata, the statistical software package I’ll use to analyze the data, which uses a native scripting language I’m much more familiar with.
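The core of the deduplication idea is simple enough to sketch. My actual script is in Stata, but as a hypothetical illustration (in Python, with invented field names), the logic is just: keep the most recent record for each application.

```python
def dedupe_latest(records, key="app_number", date_field="date"):
    """Keep only the most recent record for each key.

    `records` is a list of dicts, as you might get from parsed SGML rows.
    Later archive dates win, so corrected or re-filed records replace
    earlier versions instead of appearing twice in the final dataset.
    """
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[date_field] > latest[k][date_field]:
            latest[k] = rec
    return list(latest.values())
```

In Stata itself the equivalent idea is roughly `bysort app_number (date): keep if _n == _N`–sort each application’s records by date and keep the last one.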

Screen Shot 2016-07-05 at 1.48.31 PM

This new script seems to be working; now I just need a good solid stretch of time to allow my home-brewed code to process the several gigabytes of data I’m feeding it. Unfortunately, time is in short supply–I’m in week 3 of my 7-week stay, and I’m supposed to present my findings to my hosts during my last week here. So from here on out, days are for coding and nights are for processing.

It’ll get done. Somehow.

Flying Away on a Wing and a Prayer

If you’re roughly my age, you’ll remember this guy:

3978293-granhéroe

If not, meet Ralph Hinkley, The Greatest American Hero. Ralph (played by William Katt) is the protagonist of a schlocky 1980s sitcom with the greatest television theme song ever written. The premise of the show is that Ralph was driving through the desert one evening when some aliens decided to give him a supersuit that gives him superpowers. Unfortunately, Ralph lost the instruction manual for the suit, so he can never get it to work quite right. He nevertheless attempts to use the suit’s powers for good, and hilarity–or what passed for it on early-80s network television–ensues. In one episode, a replacement copy of the suit’s instruction manual is found, but it’s written in an indecipherable alien language. What could have been a tremendous force for good becomes a frustrating reminder of one’s own shortcomings.

As you know if you’ve been following my recent posts, I’m currently working with a treasure trove of Japanese government data. I’ve been given a helpful translation of the introductory chapters of the data specification. I’ve been given an incredibly helpful set of computer scripts to parse the data and I’ve gotten them to (mostly) work. And now that I’m at the point where I’m about ready to start revising the computer scripts to extract more and different data, I’ve got to start deciphering the various alphanumeric codes that stand in as symbols for more complex data. I’m nearly two weeks in to a seven-week research residency, and I feel like I’m finally approaching the point where I can actually start doing something instead of just getting my bearings. It’s exciting. But then, well…

Up to this point, I’ve been working with a map of the data structure that is organized (like the data itself) with English-language SGML tags (if you know anything about XML, this will look familiar):

Screen Shot 2016-06-30 at 2.55.41 PM

See that column that says “Index”? The five-character sequences in that column map to a list of codes that correspond to definitions for the types of data in this archive. These definitions–set forth in a series of tables–allow the data to be stored using compact sequences that can then be expanded and explained by reference to the code definition tables. When you’re dealing with millions of data records, compactness is pretty important.

So, “B0010” tells you that the data inside this tag (the “application-number” tag) is encoded according to definition table B0010. So I’ll just flip through the code list and…

Screen Shot 2016-06-30 at 2.50.57 PM

Uh… Hmm.

Well, that’s not so bad; I can just search this document for “B0010” (it would be a sight easier if the codes were in order!) and then just copy and paste the corresponding cell from the first column into Google Translate (it’s not a terrifically accurate translator, but it’ll do in a pinch.) The description corresponding to B0010 is “出願番号,” which Google translates to “Application Number.” That makes sense; after all the code is used for data appearing inside the <application-number> SGML tag. So now I just need to look up the code table for 出願番号/B0010 to learn how to decipher the data inside the <application-number> tag, and…

Screen Shot 2016-06-30 at 3.09.54 PM.png

Hmm.

This one actually makes some sense to me. It looks like data in code B0010 consists of a 10-character sequence, in which the first four characters correspond to an application year and the last six characters correspond to an application serial number. Simple, really.
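Unpacking a B0010 value is correspondingly trivial. A hypothetical Python sketch, based on my reading of the table (the real spec may impose additional rules I haven’t deciphered yet):

```python
def parse_b0010(value):
    """Split a B0010 application-number value into its two fields:
    a 4-digit application year followed by a 6-digit serial number."""
    if len(value) != 10 or not value.isdigit():
        raise ValueError(f"unexpected B0010 value: {value!r}")
    return {"year": value[:4], "serial": value[4:]}
```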

Of course, there are dozens of these codes in the data. And not all of them are so obvious. Some of them map to other codes that are more obscure still. For example, code A0050–which appears all over this data–is described as “中間記録”. Google translates this as “Intermediate Record”. Code table A0050, in turn, maps to three other code tables–C0840, C0850, and C0870. The code table for C0840 is basically eleven pages of this:

Screen Shot 2016-06-30 at 3.26.00 PM.png

(Sigh.)
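At least the shape of the problem is tractable, even where the content isn’t: codes point to tables, and tables point to more tables, so fully decoding a field means chasing a chain of references. A toy sketch in Python–only the A0050 → C0840/C0850/C0870 mapping comes from the actual spec; the labels are my Google-Translate glosses and the rest is hypothetical:

```python
# Miniature model of the code-table reference graph.
CODE_TABLES = {
    "B0010": {"label": "application number", "refs": []},
    "A0050": {"label": "intermediate record", "refs": ["C0840", "C0850", "C0870"]},
    "C0840": {"label": "(untranslated)", "refs": []},
    "C0850": {"label": "(untranslated)", "refs": []},
    "C0870": {"label": "(untranslated)", "refs": []},
}

def resolve(code, seen=None):
    """Return the set of every table you'd need to read to decode `code`,
    following reference chains and skipping tables already visited."""
    seen = seen or set()
    if code in seen:
        return seen
    seen.add(code)
    for ref in CODE_TABLES.get(code, {}).get("refs", []):
        resolve(ref, seen)
    return seen
```

The recursion is the easy part; translating eleven pages of each referenced table is another matter.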

In every episode of The Greatest American Hero, there’s a point where Ralph’s malfunctioning suit starts getting in the way–hurting more than helping. Like, he tries to fly in to rescue someone from evil kidnappers and ends up crash-landing, knocking himself out, and making himself a hostage. Nevertheless, with good intentions, persistence, ingenuity, and the help of his friends, he always managed to dig himself out of whatever mess he’d gotten himself into and save the day.

So… yeah. I’m going to figure this one out.

 

Forget My Name

I am incredibly fortunate that a group of Japanese researchers has already done much of the hard work of figuring out how to turn the hundreds of gigabytes of SGML documents I’m working with into a nice handy database, and moreover has given me the code they used to do it. Instead of figuring out how to do what they did on my own, I simply have to figure out what they did, and then decide what I’d like to do differently. As with everything else this summer, this task highlights important cultural differences.

Today I’ve been going through the specification for the raw government data I’ve been given and comparing it to the code given to me by the Japanese researchers whose work I’m building on, to see what they included in their dataset and what they left out. The raw government data includes a significant amount of low-sensitivity personally identifiable information. This is mainly name (and sometimes address) information about individuals and firms who have applied for trademark registrations, about the attorneys who represent them, and about the examiners–government employees all–who consider their applications.

Similar information appears in the US government’s data on trademark applications. The US government released all this data to the public several years ago, and continues to update it on a regular basis, which means that the names and addresses of applicants and their attorneys, and the names of examining attorneys and their supervisors, are all part of the public record–freely available to anyone with the interest and wherewithal to find them.

I know a few people who were pretty shocked to learn that you could search the USPTO’s free, public trademarks dataset by examiner name, and find any examiner’s entire work history–how much of a pushover they are, how quickly they work, how long they’ve been on the job, how often they’ve been overruled, etc. I’m sure there are lots of USPTO examining attorneys who would be shocked to learn that fact. But my sense is that in the US that kind of openness about government employees, and low-sensitivity personally identifiable information about individuals who access government functions, is pretty standard. And in most of the rest of the world, it’s just not.

The Japanese researchers whose work I am building on did not include information about the examiners who reviewed applications in their dataset at all–they never even retrieved it during processing. I happen to think that correlating application outcomes by examiner is interesting and potentially useful, and I’m going to modify the code I’ve been given to extract that information from the raw data. But in deference to what I take to be cultural norms regarding the privacy of personally identifiable information–norms that I know many of my compatriots would like to import into the US–I think I will probably anonymize the examiner data before reporting my results.

Lost in Translation

Yes, the title of this post is a cliché. It was bound to happen at some point on this trip. But I promise it’s really appropriate to today’s post.

I’m finishing out my first week in Japan, and I have been overwhelmed by the generosity and support of everyone I’ve met. Everyone I’ve interacted with in a professional capacity has underpromised and overdelivered. For example:

Summer in Japan: Early Observations on Data and Culture

I arrived in Tokyo two days ago, and have already begun work at the Institute of Intellectual Property, digging in to the Japan Patent Office’s (JPO’s) trademark registration data. I’ve worked with several countries’ intellectual property data systems by now, and I’m starting to think they may provide a window into the societies that produced them–though I’m still too jet-lagged to thoughtfully analyze the connection. Besides which, any analysis purporting to draw such a connection would inevitably be reductive and probably chauvinistic. So, purely by way of observation:


Going to Tokyo: I’ve Been Appointed an “Invited Researcher” by Japan’s Institute of Intellectual Property

I’m very excited to announce that the Institute of Intellectual Property in Tokyo has invited me to participate in its Invited Overseas Researcher Program this coming summer. Under an agreement with the Japan Patent Office, each year IIP invites a small number of foreign researchers to come to Tokyo to study Japan’s industrial property system. (Past researchers can be found here.) I’ll be spending several weeks in Tokyo this summer doing empirical research into Japan’s trademark registration system (as a foundation for the kind of work discussed in this post). Many thanks to Kevin Collins (who did this program last year) for flagging this opportunity, and to Barton Beebe, Graeme Dinwoodie, and Jay Kesan (also a previous participant in the IIP program) for their support.

Progress for Future Persons: WIPIP Slide Deck and Discussion Points

Following up on yesterday’s post, here are the slides from my WIPIP talk on Progress for Future Persons. Another take on the talk is available in Rebecca Tushnet’s summary of my panel’s presentations.

A couple of interesting points emerged from the Q&A:

  • One of the reasons why rights-talk may be more helpful in the environmental context than in the knowledge-creation context is that rights are often framed in terms of setting a floor: whatever people may come into existence in the future, we want to ensure that they enjoy certain minimum standards of human dignity and opportunity. This makes sense where the legal regime in question is trying to guard against depletion of resources, as in environmental law. It’s less obviously relevant in the knowledge-creation context, where our choices are largely about increasing (and then distributing) available resources–including cultural resources and the resources and capacities made possible by innovation.
  • One of the problems with valuing future states of the world is uncertainty: we aren’t sure what consequences will flow from our current choices. This is true, but it’s not the theoretical issue I’m concerned with in this chapter. In fact, if we were certain what consequences would flow from our current choices, that would in a sense make the problem of future persons worse, if only by presenting it more squarely. That is, under certainty, the only question to deal with in normatively evaluating future states of the world would be choosing among the identities of future persons and of the resources they will enjoy.

Slides: Progress for Future Persons WIPIP 2016

Zika, the Pope, and the Non-Identity Problem

I’m in Seattle for the Works-In-Progress in Intellectual Property Conference (WIPIP […WIPIP good!]), where I’ll be presenting a new piece of my long-running book project, Valuing Progress. This presentation deals with issues I take up in a chapter on “Progress for Future Persons.” And almost on cue, we have international news that highlights exactly the same issues.

In light of the potential risk of serious birth defects associated with the current outbreak of the Zika virus in Latin America, Pope Francis has suggested in informal comments that Catholics might be justified in avoiding pregnancy until the danger passes–a position that some are interpreting to be in tension with Church teachings on contraception. The moral issue the Pope is responding to here is actually central to an important debate in moral philosophy over the moral status of future persons, and it is this debate that I’m leveraging in my own work to discuss whether and how we ought to take account of future persons in designing our policies regarding knowledge creation. This debate centers on a puzzle known as the Non-Identity Problem.

First: the problem in a nutshell. Famously formulated by Derek Parfit in his 1984 opus Reasons and Persons, the Non-Identity Problem presents a contradiction in three moral intuitions many of us share: (1) that an act is only wrong if it wrongs (or perhaps harms) some person; (2) that it is not wrong to bring someone into existence so long as their life remains worth living; and (3) that a choice which forgoes the creation of one life and induces the creation of a different, happier life is morally correct. The problem Parfit pointed out is that many real-world cases require us to reject one of these three propositions. The Pope’s comments on Zika present exactly this kind of case.

The choice facing potential mothers in Zika-affected regions today is essentially the choice contemplated by Proposition 3. They could delay their pregnancies until after the epidemic passes in the hopes of avoiding the birth defects potentially associated with Zika. Or they could become pregnant and potentially give birth to a child who will suffer from some serious life-long health problems, but still (we might posit) have a life worth living. And if we think–as the reporter who elicited Pope Francis’s news-making comments seemed to think–that delaying pregnancy in this circumstance is “the lesser of two evils,” we must reject either Proposition 1 or Proposition 2. That is, a mother’s choice to give birth to a child who suffers from some birth defect that nevertheless leaves that child’s life worth living cannot be wrong on grounds that it wrongs that child, because the alternative is for that child not to exist at all. And it is a mistake to equate that child with the different child who might be born later–and healthier–if the mother waits to conceive until after the risk posed by Zika has passed. They are, after all, different (potential future) people.

So what does this have to do with Intellectual Property? Well, quite a bit–or so I will argue. Parfit’s point about future people can be generalized to future states of the world, in at least two ways.

One way has resonances with the incommensurability critique of welfarist approaches to normative evaluation: if our policies lead to creation of certain innovations, and certain creative or cultural works, and the non-creation of others, we can certainly say that the future state of the world will be different as a result of our policies than it would have been under alternative policies. But it is hard for us to say in the abstract that this difference has a normative valence: that the world will be better or worse for the creation of one quantum of knowledge rather than another. This is particularly true for cultural works.

The second and more troubling way of generalizing the Non-Identity Problem was in fact taken up by Parfit himself (Reasons and Persons at 361):

Screen Shot 2016-02-19 at 9.01.10 AM

What happens if we try to compare these two states of the world–and future populations–created by our present policies? Assuming that we do not reject Proposition 3–that is, that the difference in identity between future persons determined by our present choices does not prevent us from imbuing that choice with moral content–we ought to be able to extend the same evaluation to entire future populations. All we need is some metric for what makes life worth living, and some way of aggregating that metric across populations. Parfit called this approach to normative evaluation of states of the world the “Impersonal Total Principle,” and he built out of it a deep challenge to consequentialist moral theory at the level of populations, encapsulated in what he called the Repugnant Conclusion (Reasons and Persons, at 388):

Screen Shot 2016-02-19 at 9.09.57 AM

If, like Parfit, we find this conclusion repugnant, it may be that we must reject Proposition 2–the reporter’s embedded assumption about the Pope’s views on contraception in the age of Zika. This, in turn, requires us to take Propositions 1 and 3–and the Non-Identity Problem in general–more seriously. It may, in fact, require us to find some basis other than aggregate welfare (or some hypothesized “Impersonal Total”) to normatively evaluate future states of the world, and determine moral obligations in choosing among those future states.

The Repugnant Conclusion is especially relevant to policy choices we make around medical innovations. Many of the choices we make when setting policies in this area have determinative effects on what people may come into existence in the future, and what the quality of their lives will be. But we lack any coherent account of how we ought to weigh the interests of these future people, and as Parfit’s work suggests, such a coherent account may not in fact be available. For example, if we have to choose between directing resources toward curing one of two life-threatening diseases, the compounding effects of such a cure over the course of future generations will result in the non-existence of many people who could have been brought into being had we chosen differently (and conversely, the existence of many people who would not have existed but for our policy choice). If we take the non-identity problem seriously, and fear the repugnant conclusion, identifying plausible normative criteria for guiding such a policy choice is a pressing concern.

I don’t think the extant alternatives are especially promising. The typical welfarist approach to the problem avoids the repugnant conclusion by essentially assuming that future persons don’t matter relative to present persons. The mechanism for this assumption is the discount rate incorporated into most social welfare functions, according to which the well-being of future people quickly and asymptotically approaches zero in our calculation of aggregate welfare. Parfit himself noted that such discounting leads to morally implausible results–for example, it would lead us to conclude we should generate a small amount of energy today through a cheap process that generates toxic waste that will kill billions of people hundreds of years from now. (Reasons and Persons, appx. F)

Another alternative, adopted by many in the environmental policy community (which has been far better at incorporating the insights of the philosophical literature on future persons than the intellectual property community, even though we both deal with social phenomena that are inherently oriented toward the relatively remote future), is that we ought to adopt an independent norm of conservation. This approach is sometimes justified with rights-talk: it posits that whatever future persons come into being, they have a right to a certain basic level of resources, health, or opportunity. When dealing with a policy area that deals with potential depletion of resources to the point where human life becomes literally impossible, such rights-talk may indeed be helpful. But when weighing trade-offs with less-than-apocalyptic effects on future states of the world, such as most of the trade-offs we face in knowledge-creation policy, rights-talk does a lot less work.

The main approach adopted by those who consider medical research policy–quantification of welfare effects according to Quality-Adjusted Life Years (QALYs)–attempts to soften the sharp edge of the repugnant conclusion by considering not only the marginal quantity of life that results from a particular policy intervention (as compared with available alternatives), but also the quality of that added life. This is, for example, the approach of Terry Fisher and Talha Syed in their forthcoming work on medical funding for populations in developing countries. But there is reason to believe that such quality-adjustment, while practically necessary, is theoretically suspect. In particular, Parfit’s student Larry Temkin has made powerful arguments that we lack a coherent basis to compare the relative effects on welfare of a mosquito bite and a course of violent torture, to say nothing of the relative effects of two serious medical conditions. If Temkin is right, then what is intended as an effort to account for quality of future lives in policymaking begins to look more like an exercise in imposing the normative commitments of policymakers on the future state of the world.

I actually embrace this conclusion. My own developing view is that theory runs out very quickly when evaluating present policies based on their effect on future states of the world. If this is right–that a coherent theoretical account of our responsibility to future generations is simply not possible–then whatever normative content informs our consideration of policies with respect to their effects on future states of the world is probably going to be exogenous to normative or moral theory–that is, it will be based on normative or moral preferences (or, to be more charitable, commitments or axioms). This does not strike me as necessarily a bad thing, but it does require us to be particularly attentive to how we resolve disputes among holders of inconsistent preferences. This is especially true because the future has no way to communicate its preferences to us: as I argued in an earlier post, there is no market for human flourishing. It may be that we have to choose among future states of the world according to idiosyncratic and contestable normative commitments; if that’s true then it is especially important that the social choice institutions to which we entrust such choices reflect appropriate allocations of authority. Representing the interests of future persons in those institutions is a particularly difficult problem: it demands that we in the present undertake difficult other-regarding deliberation in formulating and expressing our own normative commitments, and that the institutions themselves facilitate and respond to the results of that deliberation. Suffice it to say, I have serious doubts that intellectual property regimes–which at their best incentivize knowledge-creation in response to the predictable demands of relatively better-resourced members of society over a relatively short time horizon–satisfy these conditions.

Trademarks and Economic Activity

There’s an increasing amount of empirical data available on trademark registration systems. The USPTO released a comprehensive dataset three years ago, and there are less complete and less user-friendly data sources available from other national and regional offices–though some offices make it a bit tricky to get their data, and others restrict access or charge for their data products. As with most trends in legal scholarship, the empirical turn has come late to the study of trademarks. Part of this is because the scholarly community is small, and not as quantitatively-minded as other disciplines. Part of it is because it’s not clear what questions regarding trademarks we might look to empirical evidence to answer. I’ve published a study of the impact of the federal antidilution statute on federal registration (spoiler alert: it adds to the cost of registration but doesn’t seem to affect outcomes), but that’s a pretty narrow issue. What else could we learn from this kind of data?

One possibility is to examine the link between trademarks and economic activity. People who make a living from commerce involving intellectual property like to emphasize how important IP protection is to the economy, though the numbers they throw around are a bit dubious. But if we were serious about it, could we rigorously draw some link between trademarks–which are the most common and ubiquitous form of intellectual property in the economy–and economic performance?

I’ve been thinking about how we might do so, and brought my modest quantitative analytical skills to bear on the best data currently available: the USPTO’s dataset. I thought I’d just look to see whether there is any relationship between trademark activity (in this case, applications for federal trademark registrations) and economic activity (in this case, real GDP). And it seems that there is one…kind of.

TM Apps vs USGDP

The GDP data from the St. Louis Fed is reported quarterly and seasonally adjusted; I compiled the trademark application data on a quarterly basis and calculated a 4-quarter moving average as a seasonal smoothing kludge. We see that trademark application activity is strongly seasonal, and that it tends to roughly track GDP trends–perhaps with a bit of a lag. The lag is interesting if more rigorous analysis bears it out: it seems to suggest that trademarks, rather than driving economic activity, are merely a lagging indicator of that activity.
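For anyone who wants to replicate the smoothing step: a trailing four-quarter moving average just averages each quarter with the three before it, so every window spans exactly one year of the seasonal cycle. A minimal hypothetical sketch of the calculation in Python (I make no claim this is how the chart above was actually built):

```python
def moving_average(series, window=4):
    """Trailing moving average over a list of quarterly values.

    With window=4 each average covers one full year, which smooths
    out the within-year seasonality in trademark filing counts.
    Returns None for the first window-1 quarters, where a full
    window of data isn't yet available.
    """
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i - window + 1 : i + 1]) / window)
    return out
```

It’s a kludge compared to proper seasonal adjustment (the GDP series is already adjusted at the source), but it’s enough to make the two series visually comparable.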

The big exception is the late 1990s to the early 2000s. As Barton Beebe documented in his first look at USPTO data, this spike in trademark activity seems to correspond with the dot-com boom and bust. (Registration rates also dropped during this period–lots of these applications were low-quality or quickly abandoned.) It’s interesting to see that this huge discontinuity in trademark application activity doesn’t correlate with anywhere near as big an impact in the overall economy. We could speculate about why that might be–it probably has something to do with the “gold-rush” scramble to occupy a new, untapped field of commerce, and I suspect it also reflects (poorly) on the value of the early web to the overall economy.

This is an example of the kind of analysis these new data sources might be useful for–and it’s not that tricky to carry out. Building this chart was a couple hours’ work, and I’m no expert. A more rigorous econometric model is beyond my expertise, but I’m sure it could be done (I’m less sure what we could learn from it). What other kinds of questions might we look to trademark data to answer?