Today was the deadline for me to submit a draft presentation on the research I’ve been doing in Japan for the past six weeks. The deadline pressure explains why I haven’t posted here in a while. The good news is that I was able to browbeat my new (and still growing) dataset into sufficient shape to generate some interesting insights, which I will share with my generous sponsors here at the Institute for Intellectual Property next week, before heading home to New York.
I am not at liberty to share my slide deck right now, but I can’t help but post on a couple of interesting tidbits from my research. The first is a follow-up on my earlier post about the oldest Japanese trademark. I had been persuaded that the two-character mark 重九 was in fact a form of the three-character mark (大重九) a brand of Chinese cigarette. Turns out I was wrong. It is, in fact, the brand of a centuries-old brewer of mirin–a sweet rice wine used in cooking. (The cigarette brand is also registered in Japan, as of 2007–which says something about the likelihood-of-confusion standard in Japanese trademark law). And as I found out, there’s some question as to whether this mark (which, read right to left, reads “Kokonoe”) really is the oldest Japanese trademark. There’s competition from the hair-products company, Yanagiya, which traces its lineage back 400 years to the court physician of the first Tokugawa Shogun; and also from a sake brewer from Kobe prefecture who sells under the “Jukai” label. Which is the oldest depends on how you count: by registration number, by registration date, or by application date. Anyway all of them would have taken a backseat to that historic American brand, Singer–but the company allowed its oldest Japanese trademark registration to lapse six years ago.
The other tidbit is my first attempt at a map-based data visualization, which I built using Tableau, a surprisingly handy software tool with a free public build. I used it to visualize how trademark owners from outside Japan try to protect their marks in Japan–specifically, whether they seek registrations via Japan’s domestic registration system, or via the international registration system established by the Madrid Protocol. Here’s what I’ve found:
The size of each circle represents an estimate of the number of applications for Japanese trademark registrations from each country between 2001 and 2014. The color represents the proportion of those applications that were filed via the Madrid Protocol (dark blue is all Madrid Protocol; dark red is all domestic applications; paler colors are a mix). The visualization isn’t perfect because not all countries acceded to the Madrid Protocol at the same time–some acceded in the middle of the data collection period, and many have never acceded. (When I have more time maybe I’ll try to figure out how to add a time-lapse animation to bring an extra dimension to the visualization.) Still, it’s a nice, rich, dense presentation of a large and complex body of data.
4,518,184 unique applications. From four different data sources. 74.71GB. All in Stata.
Now I just have to figure out what it all means. And I have two weeks to do it.
It has been a rough week of coding, processing, and de-bugging. But at long last, tonight I’m running the last two scripts I need to run to parse the last of the 330+ Gigabytes of data I received three weeks ago, and I’ve already tested them so I’m pretty confident they’ll work. By tomorrow, if all goes as planned, I’ll have all the data I’m going to be using on this project (a lean 60 GB or so) imported into Stata, where I can slice and dice it however I please. At exactly the halfway point of my residency in Tokyo, this is a major milestone.
The next step is some finer-grained cleaning and de-duplicating of this data, followed by some additional coding to structure it in a useful way (as you can see in the photo, I’ve already started sketching out my file trees). Then I’ll be able to describe and analyze what I’ve built. All of this will take time–the most primitive observation identifier in my data is the individual trademark application number, and it looks like I’ll be dealing with about 4.5 million of them, give or take. And each application number will have multiple records associated with it to capture lots of nitty-gritty trademark-y information like changing ownership and legal representation, renewals, divisional applications, goods and services classifications, foreign and international priority claims, and so on. Processing all that information takes time, and requires a lot of attention to detail. But today I’m feeling good. Today, I feel as confident as I ever have that this project is going to succeed.
So here is a first fruit of my research. The data I’m working with only goes back 15 years, but for any trademark registrations that have still been in force during those past 15 years, I have a fair amount of historical data. The earliest application date I’ve found in the data I’ve imported so far is July 31, 1890. That application– which became Japan Trademark Registration Number 521–is for the mark “重九”, which means literally nothing to me. But I asked around the office, and fortunately I have a colleague from Beijing here in Tokyo, who tells me 重九 is actually a Chinese brand–for cigarettes:
重九 translates roughly to “double-nine”, and the additional character (which apparently always accompanies the mark in its current use) translates roughly to “big” (i.e., “Big Double-Nine” Cigarettes). The mark was last renewed in Japan on March 28, 2015. Given that I’m here to study international aspects of intellectual property as they pertain to Japan, the fact that the earliest mark on record appears to be foreign is an interesting development.
At some point in 2014, without any warning so far as I can tell, the Japan Patent Office changed the file-naming convention for their digital archives. Whereas before the archives would be stored under a filename such as “T2014-20(01-01)20150114.ISO”, hereafter they will be stored under a file name such as “T2014-21(01_01)20150121.ISO”.
Catch the difference? Yeah, I didn’t either. Until I let my code–which was based on the old naming convention–run all day. Then I found out the last two years’ data had corrupted all my output files, wiping out 7 GB of data. More fun after the jump…
I’ve been trying for about three days to figure out why one of the scripts I was given to parse all this government data has been failing when I try to run it. Because the researchers who gave me the scripts commissioned them from some outside programmers, they can’t help me debug it. So I’ve been going line-by-line through the code and cross-referencing every command, option, and character with online manuals and forums.
My best guess is that my problem is (probably) once again a failure of translation. The code I’ve been given was written for LINUX, a (relatively) open UNIX platform. Mac OSX–which I use–is also built on top of a UNIX architecture, which power users can access via the built-in Terminal application. But Apple uses an idiosyncratic and somewhat dated version of the UNIX shell scripting language–this is the computer language you can use to tell the computer to do stuff with the files stored on it. (“Think different” indeed.) There are tons of tiny differences between Apple’s shell language and the open standard implemented in LINUX, and any one of them could be responsible for causing my code to fail. I spent the better part of two days tweaking individual characters, options, and commands in this script, to no avail. Then I tried a patch to update Apple’s scripting language to more closely mirror the one used by LINUX. Still no luck. And three days of my precious seven-week residency in Tokyo gone.
So I gave up. I’ll write my own code instead.
The script I’ve been trying to debug is one of a series of algorithms used to collate and deduplicate several years’ worth of parsed data. But I can create those kinds of algorithms myself, once I know how the parsed data is structured. The hard part was parsing the data in the first place to extract it from its arcane government archive format–and the scripts that do that worked a treat, once I figured out how they function. Besides which, the deduplication strategy used by the researchers who gave me these troublesome scripts is a bit more heavy-handed than I’d use if I were starting from scratch. Which I just did–in Stata, the statistical software package I’ll use to analyze the data, which uses a native scripting language I’m much more familiar with.
This new script seems to be working; now I just need a good solid stretch of time to allow my home-brewed code to process the several gigabytes of data I’m feeding it. Unfortunately, time is in short supply–I’m in week 3 of my 7-week stay, and I’m supposed to present my findings to my hosts during my last week here. So from here on out, days are for coding and nights are for processing.
It’ll get done. Somehow.