4,518,184 unique applications. From four different data sources. 74.71GB. All in Stata.

Screen Shot 2016-07-14 at 5.49.48 PM

Now I just have to figure out what it all means. And I have two weeks to do it.

Turning a Corner


It has been a rough week of coding, processing, and de-bugging. But at long last, tonight I’m running the last two scripts I need to run to parse the last of the 330+ Gigabytes of data I received three weeks ago, and I’ve already tested them so I’m pretty confident they’ll work. By tomorrow, if all goes as planned, I’ll have all the data I’m going to be using on this project (a lean 60 GB or so) imported into Stata, where I can slice and dice it however I please. At exactly the halfway point of my residency in Tokyo, this is a major milestone.

The next step is some finer-grained cleaning and de-duplicating of this data, followed by some additional coding to structure it in a useful way (as you can see in the photo, I’ve already started sketching out my file trees). Then I’ll be able to describe and analyze what I’ve built. All of this will take time–the most primitive observation identifier in my data is the individual trademark application number, and it looks like I’ll be dealing with about 4.5 million of them, give or take. And each application number will have multiple records associated with it to capture lots of nitty-gritty trademark-y information like changing ownership and legal representation, renewals, divisional applications, goods and services classifications, foreign and international priority claims, and so on. Processing all that information takes time, and requires a lot of attention to detail. But today I’m feeling good. Today, I feel as confident as I ever have that this project is going to succeed.

So here is a first fruit of my research. The data I’m working with only goes back 15 years, but for any trademark registrations that have still been in force during those past 15 years, I have a fair amount of historical data. The earliest application date I’ve found in the data I’ve imported so far is July 31, 1890. That application– which became Japan Trademark Registration Number 521–is for the mark “重九”, which means literally nothing to me. But I asked around the office, and fortunately I have a colleague from Beijing here in Tokyo, who tells me 重九 is actually a Chinese brand–for cigarettes:

重九 translates roughly to “double-nine”, and the additional character (which apparently always accompanies the mark in its current use) translates roughly to “big” (i.e., “Big Double-Nine” Cigarettes). The mark was last renewed in Japan on March 28, 2015. Given that I’m here to study international aspects of intellectual property as they pertain to Japan, the fact that the earliest mark on record appears to be foreign is an interesting development.

Now You’re Just Messing With Me


At some point in 2014, without any warning so far as I can tell, the Japan Patent Office changed the file-naming convention for their digital archives. Whereas before the archives would be stored under a filename such as “T2014-20(01-01)20150114.ISO”, hereafter they will be stored under a file name such as “T2014-21(01_01)20150121.ISO”.

Screen Shot 2016-07-06 at 10.05.29 AMCatch the difference?  Yeah, I didn’t either. Until I let my code–which was based on the old naming convention–run all day. Then I found out the last two years’ data had corrupted all my output files, wiping out 7 GB of data. More fun after the jump…
Continue reading

Think Different: More Translation Hijinx


I’ve been trying for about three days to figure out why one of the scripts I was given to parse all this government data has been failing when I try to run it. Because the researchers who gave me the scripts commissioned them from some outside programmers, they can’t help me debug it. So I’ve been going line-by-line through the code and cross-referencing every command, option, and character with online manuals and forums.

My best guess is that my problem is (probably) once again a failure of translation. The code I’ve been given was written for LINUX, a (relatively) open UNIX platform. Mac OSX–which I use–is also built on top of a UNIX architecture, which power users can access via the built-in Terminal application. But Apple uses an idiosyncratic and somewhat dated version of the UNIX shell scripting language–this is the computer language you can use to tell the computer to do stuff with the files stored on it. (“Think different” indeed.) There are tons of tiny differences between Apple’s shell language and the open standard implemented in LINUX, and any one of them could be responsible for causing my code to fail. I spent the better part of two days tweaking individual characters, options, and commands in this script, to no avail. Then I tried a patch to update Apple’s scripting language to more closely mirror the one used by LINUX. Still no luck. And three days of my precious seven-week residency in Tokyo gone.

So I gave up. I’ll write my own code instead.

The script I’ve been trying to debug is one of a series of algorithms used to collate and deduplicate several years’ worth of parsed data. But I can create those kinds of algorithms myself, once I know how the parsed data is structured. The hard part was parsing the data in the first place to extract it from its arcane government archive format–and the scripts that do that worked a treat, once I figured out how they function. Besides which, the deduplication strategy used by the researchers who gave me these troublesome scripts is a bit more heavy-handed than I’d use if I were starting from scratch. Which I just did–in Stata, the statistical software package I’ll use to analyze the data, which uses a native scripting language I’m much more familiar with.

Screen Shot 2016-07-05 at 1.48.31 PM

This new script seems to be working; now I just need a good solid stretch of time to allow my home-brewed code to process the several gigabytes of data I’m feeding it. Unfortunately, time is in short supply–I’m in week 3 of my 7-week stay, and I’m supposed to present my findings to my hosts during my last week here. So from here on out, days are for coding and nights are for processing.

It’ll get done. Somehow.

Flying Away on a Wing and a Prayer


If you’re roughly my age, you’ll remember this guy:


If not, meet Ralph Hinkley, The Greatest American Hero. Ralph (played by William Katt) is the protagonist of a schlocky 1980s sitcom with the greatest television theme song ever written. The premise of the show is that Ralph was driving through the desert one evening when some aliens decided to give him a supersuit that gives him superpowers. Unfortunately, Ralph lost the instruction manual for the suit, so he can never get it to work quite right. He nevertheless attempts to use the suit’s powers for good, and hilarity–or what passed for it on early-80s network television–ensues. In one episode, a replacement copy of the suit’s instruction manual is found, but it’s written in an indecipherable alien language. What could have been a tremendous force for good becomes a frustrating reminder of one’s own shortcomings.

As you know if you’ve been following my recent posts, I’m currently working with a treasure trove of Japanese government data. I’ve been given a helpful translation of the introductory chapters of the data specification. I’ve been given an incredibly helpful set of computer scripts to parse the data and I’ve gotten them to (mostly) work. And now that I’m at the point where I’m about ready to start revising the computer scripts to extract more and different data, I’ve got to start deciphering the various alphanumeric codes that stand in as symbols for more complex data. I’m nearly two weeks in to a seven-week research residency, and I feel like I’m finally approaching the point where I can actually start doing something instead of just getting my bearings. It’s exciting. But then, well…

Up to this point, I’ve been working with a map of the data structure that is organized (like the data itself) with english-language SGML tags (if you know anything about XML, this will look familiar):

Screen Shot 2016-06-30 at 2.55.41 PM

See that column that says “Index”? The five-character sequences in that column map to a list of codes that correspond to definitions for the types of data in this archive. These definitions–set forth in a series of tables–allow the data to be stored using compact sequences that can then be expanded and explained by reference to the code definition tables.When you’re dealing with millions of data records, compactness is pretty important.

So, “B0010” tells you that the data inside this tag (the “application-number” tag) is encoded according to definition table B0010. So I’ll just flip through the code list and…

Screen Shot 2016-06-30 at 2.50.57 PM

Uh… Hmm.

Well, that’s not so bad; I can just search this document for “B0010” (it would be a sight easier if the codes were in order!) and then just copy and paste the corresponding cell from the first column into Google Translate (it’s not a terrifically accurate translator, but it’ll do in a pinch.) The description corresponding to B0010 is “出願番号,” which Google translates to “Application Number.” That makes sense; after all the code is used for data appearing inside the <application-number> SGML tag. So now I just need to look up the code table for 出願番号/B0010 to learn how to decipher the data inside the <application-number> tag, and…

Screen Shot 2016-06-30 at 3.09.54 PM.png


This one actually makes some sense to me. It looks like data in code B0010 consists of a 10-character sequence, in which the first four characters correspond to an application year and the last six characters correspond to an application serial number. Simple, really.

Of course, there are dozens of these codes in the data. And not all of them are so obvious. Some of them even map to other codes that are even more obscure. For example, code A0050–which appears all over this data–is described as “中間記録”. Google translates this as “Intermediate Record”. Code table A0050, in turn, maps to three other code tables–C0840, C0850, and C0870. The code table for C0840 is basically eleven pages of this:

Screen Shot 2016-06-30 at 3.26.00 PM.png


In every episode of The Greatest American Hero, there’s a point where Ralph’s malfunctioning suit starts getting in the way–hurting more than helping. Like, he tries to fly in to rescue someone from evil kidnappers and ends up crash-landing, knocking himself out, and making himself a hostage. Nevertheless, with good intentions, persistence, ingenuity, and the help of his friends, he always managed to dig himself out of whatever mess he’d gotten himself into and save the day.

So… yeah. I’m going to figure this one out.


Forget My Name


I am incredibly fortunate that a group of Japanese researchers has already done much of the hard work of figuring out how to turn the hundreds of gigabytes of SGML documents I’m working with into a nice handy database, and moreover has given me the code they used to do it. Instead of figuring out how to do what they did on my own, I simply have to figure out what they did, and then decide what I’d like to do differently. As with everything else this summer, this task highlights important cultural differences.

Today I’ve been going through the specification for the raw government data I’ve been given and comparing it to the code given to me by the Japanese researchers whose work I’m building on, to see what they included in their dataset and what they left out. The raw government data includes a significant amount of low-sensitivity personally identifiable information. This is mainly name (and sometimes address) information about individuals and firms who have applied for trademark registrations, about the attorneys who represent them, and about the examiners–government employees all–who consider their applications.

Similar information appears in the US government’s data on trademark applications. The US government released all this data to the public several years ago, and continues to update it on a regular basis, which means that the names and addresses of applicants and their attorneys, and the names of examining attorneys and their supervisors, are all part of the public record–freely available to anyone with the interest and wherewithal to find them.

I know a few people who were pretty shocked to learn that you could search the USPTO’s free, public trademarks dataset by examiner name, and find any examiner’s entire work history–how much of a pushover they are; how quickly they work, how long they’ve been on the job, how often they’ve been overruled, etc. I’m sure there are lots of USPTO examining attorneys who would be shocked to learn that fact. But my sense is that in the US that kind of openness about government employees, and low-sensitivity personally identifiable information about individuals who access government functions, is pretty standard. And in most of the rest of the world, it’s just not.

The Japanese researchers whose work I am building on did not include information about the examiners who reviewed applications in their dataset at all–they never even retrieved it during processing. I happen to think that correlating application outcomes by examiner is interesting and potentially useful, and I’m going to modify the code I’ve been given to extract that information from the raw data. But in deference to what I take to be cultural norms regarding the privacy of personally identifiable information–norms that I know many of my compatriots would like to import into the US–I think I will probably anonymize the examiner data before reporting my results.

Lost in Translation


Yes, the title of this post is a cliché. It was bound to happen at some point on this trip. But I promise it’s really appopriate to today’s post.

I’m finishing out my first week in Japan, and I have been overwhelmed by the generosity and support of everyone I’ve met. Everyone I’ve interacted with in a professional capacity has underpromised and overdelivered. For example: Continue reading