Empirical Research

New and Improved: The Canada Trademarks Dataset 2.0

Today I released a revised and updated version of the Canada Trademarks Dataset (v.2.0): an open-access, individual-application-level dataset including records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.9 million application records.

The original dataset, released on March 2, 2021, and described in my article in the Journal of Empirical Legal Studies, was constructed from the historical trademark applications backfile of the Canadian Intellectual Property Office, current through October 4, 2019, and comprising 1.6 million application records. The revised dataset represents a substantial advancement over this original dataset. In particular, I have rewritten the code used to construct the dataset, which will now build and maintain a MySQL database as a local repository of the dataset’s contents. This local database can be periodically updated and exported to .csv and/or .dta files as users see fit, using the Python scripts accompanying the dataset release. Interested users can thus keep their installation of the dataset current with weekly updates from the Canadian Intellectual Property Office. The .csv, .dta, and .sql files published in the new release include these weekly updates since the closing date of the historical backfile, and are current through January 24, 2023.

Full details are available at the Version 2.0 release page on Zenodo.

The Canada Trademarks Dataset

I’m happy to announce the publication (on open-access terms) of a new dataset I’ve been constructing over the past few months. The Canada Trademarks Dataset is now available for download on Zenodo, and a pre-publication draft of the paper describing it (forthcoming in the Journal of Empirical Legal Studies) is available on SSRN.

As I’m not the first to point out, doing any productive scholarly work during the pandemic has been hard, especially while caring for two young kids and teaching a combination of hastily designed remote classes and in-person classes under disruptive public health restrictions. I have neglected other, more theoretical projects during the past year and a half because I simply could not find the sustained time for contemplation and for working out the big, complex problems that such projects require. But building a dataset like this one is not a big, complex problem so much as a thousand tiny puzzles, each of which can be worked out in a relatively short burst of effort. In other words, it was exactly the kind of project to take on when you could never be assured of having more than a 20-minute stretch of uninterrupted time to work. I’m very grateful to JELS for publishing the fruits of these fleeting windows of productivity.

More generally, the experience of having to prioritize certain research projects over others in the face of external constraints has made me grateful that I can count myself among the foxes rather than the hedgehogs of the legal academy. Methodological and ideological omnivorousness (or, perhaps, promiscuity) may not be the best way to make a big name for yourself as a scholar–to win followers and allies, to become the “go-to” person on a particular area of expertise, or to draw the attention of rivals and generate productive controversy. But it does help smooth out the peaks and troughs of professional life for those of us who just want to keep pushing our stone uphill using what skills we possess, hopeful that in the process we will leave behind knowledge from which others may benefit. That’s always been my preferred view of what I do for a living anyway: Il faut cultiver notre jardin.

Trademark Clutter at Northwestern Law REMIP

I’m in Chicago at Northwestern Law today to present an early-stage empirical project at the Roundtable on Empirical Methods in Intellectual Property (#REMIP). My project will use Canada’s pending change to its trademark registration system as a natural experiment to investigate the role national IP offices play in reducing “clutter”–registrations for marks that go unused, raising clearance costs and depriving competitors and the public of potentially valuable source identifiers.

Slides for the presentation are available here.

Thanks to Dave Schwartz of Northwestern, Chris Buccafusco of Cardozo, and Andrew Toole of the US Patent and Trademark Office for organizing this conference.

The Japan Trademarks Dataset: Presentation Slides

The Institute of Intellectual Property has graciously allowed me to share the slide deck from my summer research project on Japan’s trademark registration system. The slide deck includes the text of the presentation in the presenter notes, and you can download it here.

The photo leading this post was taken during my presentation at IIP in Tokyo. It shows me with my favorite visual aid: a bottle of (excellent) mirin bearing one of the contenders for Japan’s oldest registered trademark, Kokonoe Sakura.

Home Stretch

Today was the deadline for me to submit a draft presentation on the research I’ve been doing in Japan for the past six weeks. The deadline pressure explains why I haven’t posted here in a while. The good news is that I was able to browbeat my new (and still growing) dataset into sufficient shape to generate some interesting insights, which I will share with my generous sponsors here at the Institute of Intellectual Property next week, before heading home to New York.

I am not at liberty to share my slide deck right now, but I can’t help but post on a couple of interesting tidbits from my research. The first is a follow-up on my earlier post about the oldest Japanese trademark. I had been persuaded that the two-character mark 重九 was in fact a form of the three-character mark (大重九), a brand of Chinese cigarette. Turns out I was wrong. It is, in fact, the brand of a centuries-old brewer of mirin–a sweet rice wine used in cooking. (The cigarette brand is also registered in Japan, as of 2007–which says something about the likelihood-of-confusion standard in Japanese trademark law.) And as I found out, there’s some question as to whether this mark (which, read right to left, reads “Kokonoe”) really is the oldest Japanese trademark. There’s competition from the hair-products company, Yanagiya, which traces its lineage back 400 years to the court physician of the first Tokugawa Shogun; and also from a sake brewer from Kobe who sells under the “Jukai” label. Which is the oldest depends on how you count: by registration number, by registration date, or by application date. Anyway, all of them would have taken a backseat to that historic American brand, Singer–but the company allowed its oldest Japanese trademark registration to lapse six years ago.

The other tidbit is my first attempt at a map-based data visualization, which I built using Tableau, a surprisingly handy software tool with a free public build. I used it to visualize how trademark owners from outside Japan try to protect their marks in Japan–specifically, whether they seek registrations via Japan’s domestic registration system, or via the international registration system established by the Madrid Protocol. Here’s what I’ve found:

MadridMap
The size of each circle represents an estimate of the number of applications for Japanese trademark registrations from each country between 2001 and 2014. The color represents the proportion of those applications that were filed via the Madrid Protocol (dark blue is all Madrid Protocol; dark red is all domestic applications; paler colors are a mix). The visualization isn’t perfect because not all countries acceded to the Madrid Protocol at the same time–some acceded in the middle of the data collection period, and many have never acceded. (When I have more time maybe I’ll try to figure out how to add a time-lapse animation to bring an extra dimension to the visualization.) Still, it’s a nice, rich, dense presentation of a large and complex body of data.

 

Turning a Corner

It has been a rough week of coding, processing, and debugging. But at long last, tonight I’m running the last two scripts I need to run to parse the last of the 330+ gigabytes of data I received three weeks ago, and I’ve already tested them so I’m pretty confident they’ll work. By tomorrow, if all goes as planned, I’ll have all the data I’m going to be using on this project (a lean 60 GB or so) imported into Stata, where I can slice and dice it however I please. At exactly the halfway point of my residency in Tokyo, this is a major milestone.

The next step is some finer-grained cleaning and de-duplicating of this data, followed by some additional coding to structure it in a useful way (as you can see in the photo, I’ve already started sketching out my file trees). Then I’ll be able to describe and analyze what I’ve built. All of this will take time–the most primitive observation identifier in my data is the individual trademark application number, and it looks like I’ll be dealing with about 4.5 million of them, give or take. And each application number will have multiple records associated with it to capture lots of nitty-gritty trademark-y information like changing ownership and legal representation, renewals, divisional applications, goods and services classifications, foreign and international priority claims, and so on. Processing all that information takes time, and requires a lot of attention to detail. But today I’m feeling good. Today, I feel as confident as I ever have that this project is going to succeed.

So here is a first fruit of my research. The data I’m working with only goes back 15 years, but for any trademark registrations that have still been in force during those past 15 years, I have a fair amount of historical data. The earliest application date I’ve found in the data I’ve imported so far is July 31, 1890. That application–which became Japan Trademark Registration Number 521–is for the mark “重九”, which means literally nothing to me. But I asked around the office, and fortunately I have a colleague from Beijing here in Tokyo, who tells me 重九 is actually a Chinese brand–for cigarettes:

http://www.etmoc.com/eWebEditor/2011/2011090515413693.jpg

重九 translates roughly to “double-nine”, and the additional character (which apparently always accompanies the mark in its current use) translates roughly to “big” (i.e., “Big Double-Nine” Cigarettes). The mark was last renewed in Japan on March 28, 2015. Given that I’m here to study international aspects of intellectual property as they pertain to Japan, the fact that the earliest mark on record appears to be foreign is an interesting development.

Now You’re Just Messing With Me

At some point in 2014, without any warning so far as I can tell, the Japan Patent Office changed the file-naming convention for their digital archives. Whereas before the archives would be stored under a filename such as “T2014-20(01-01)20150114.ISO”, hereafter they will be stored under a file name such as “T2014-21(01_01)20150121.ISO”.

Screen Shot 2016-07-06 at 10.05.29 AM

Catch the difference? Yeah, I didn’t either. Until I let my code–which was based on the old naming convention–run all day. Then I found out the last two years’ data had corrupted all my output files, wiping out 7 GB of data. More fun after the jump…
Continue reading…
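
One way to guard against this kind of silent change is to match filenames against a pattern that accepts both separators and to stop immediately on any name that does not fit. The following is a minimal Python sketch built only from the two example filenames above. It is not the code used in this project, and the field names (year, issue, part, total, date) are guesses at what each piece of the name means.

```python
import re

# Pattern inferred from the two example names only. The group names are
# assumptions about what each field means, not taken from any specification.
ARCHIVE_NAME = re.compile(
    r"^T(?P<year>\d{4})-(?P<issue>\d+)"       # e.g. "T2014-20"
    r"\((?P<part>\d+)[-_](?P<total>\d+)\)"    # "(01-01)" (old) or "(01_01)" (new)
    r"(?P<date>\d{8})\.ISO$"                  # e.g. "20150114.ISO"
)


def parse_archive_name(filename):
    """Return the filename's fields as a dict of strings.

    Raises ValueError on any name that does not match, so that an unexpected
    naming change halts the run instead of corrupting the output files.
    """
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized archive filename: {filename!r}")
    return match.groupdict()


old_style = parse_archive_name("T2014-20(01-01)20150114.ISO")
new_style = parse_archive_name("T2014-21(01_01)20150121.ISO")
```

Failing loudly on an unrecognized name is the point of the sketch: a day-long batch job that aborts on the first surprise is cheaper than one that finishes with bad output.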

Think Different: More Translation Hijinx

I’ve been trying for about three days to figure out why one of the scripts I was given to parse all this government data has been failing when I try to run it. Because the researchers who gave me the scripts commissioned them from some outside programmers, they can’t help me debug it. So I’ve been going line-by-line through the code and cross-referencing every command, option, and character with online manuals and forums.

My best guess is that my problem is (probably) once again a failure of translation. The code I’ve been given was written for Linux, a (relatively) open UNIX platform. Mac OS X–which I use–is also built on top of a UNIX architecture, which power users can access via the built-in Terminal application. But Apple uses an idiosyncratic and somewhat dated version of the UNIX shell scripting language–this is the computer language you can use to tell the computer to do stuff with the files stored on it. (“Think different” indeed.) There are tons of tiny differences between Apple’s shell language and the open standard implemented in Linux, and any one of them could be responsible for causing my code to fail. I spent the better part of two days tweaking individual characters, options, and commands in this script, to no avail. Then I tried a patch to update Apple’s scripting language to more closely mirror the one used by Linux. Still no luck. And three days of my precious seven-week residency in Tokyo gone.

So I gave up. I’ll write my own code instead.

The script I’ve been trying to debug is one of a series of algorithms used to collate and deduplicate several years’ worth of parsed data. But I can create those kinds of algorithms myself, once I know how the parsed data is structured. The hard part was parsing the data in the first place to extract it from its arcane government archive format–and the scripts that do that worked a treat, once I figured out how they function. Besides which, the deduplication strategy used by the researchers who gave me these troublesome scripts is a bit more heavy-handed than I’d use if I were starting from scratch. Which I just did–in Stata, the statistical software package I’ll use to analyze the data, which uses a native scripting language I’m much more familiar with.

Screen Shot 2016-07-05 at 1.48.31 PM

This new script seems to be working; now I just need a good solid stretch of time to allow my home-brewed code to process the several gigabytes of data I’m feeding it. Unfortunately, time is in short supply–I’m in week 3 of my 7-week stay, and I’m supposed to present my findings to my hosts during my last week here. So from here on out, days are for coding and nights are for processing.

It’ll get done. Somehow.

Flying Away on a Wing and a Prayer

If you’re roughly my age, you’ll remember this guy:

3978293-granhc3a9roe

If not, meet Ralph Hinkley, The Greatest American Hero. Ralph (played by William Katt) is the protagonist of a schlocky 1980s sitcom with the greatest television theme song ever written. The premise of the show is that Ralph was driving through the desert one evening when some aliens decided to give him a supersuit that gives him superpowers. Unfortunately, Ralph lost the instruction manual for the suit, so he can never get it to work quite right. He nevertheless attempts to use the suit’s powers for good, and hilarity–or what passed for it on early-80s network television–ensues. In one episode, a replacement copy of the suit’s instruction manual is found, but it’s written in an indecipherable alien language. What could have been a tremendous force for good becomes a frustrating reminder of one’s own shortcomings.

As you know if you’ve been following my recent posts, I’m currently working with a treasure trove of Japanese government data. I’ve been given a helpful translation of the introductory chapters of the data specification. I’ve been given an incredibly helpful set of computer scripts to parse the data, and I’ve gotten them to (mostly) work. And now that I’m at the point where I’m about ready to start revising the computer scripts to extract more and different data, I’ve got to start deciphering the various alphanumeric codes that stand in as symbols for more complex data. I’m nearly two weeks into a seven-week research residency, and I feel like I’m finally approaching the point where I can actually start doing something instead of just getting my bearings. It’s exciting. But then, well…

Up to this point, I’ve been working with a map of the data structure that is organized (like the data itself) with English-language SGML tags (if you know anything about XML, this will look familiar):

Screen Shot 2016-06-30 at 2.55.41 PM

See that column that says “Index”? The five-character sequences in that column map to a list of codes that correspond to definitions for the types of data in this archive. These definitions–set forth in a series of tables–allow the data to be stored using compact sequences that can then be expanded and explained by reference to the code definition tables. When you’re dealing with millions of data records, compactness is pretty important.

So, “B0010” tells you that the data inside this tag (the “application-number” tag) is encoded according to definition table B0010. So I’ll just flip through the code list and…

Screen Shot 2016-06-30 at 2.50.57 PM

Uh… Hmm.

Well, that’s not so bad; I can just search this document for “B0010” (it would be a sight easier if the codes were in order!) and then just copy and paste the corresponding cell from the first column into Google Translate (it’s not a terrifically accurate translator, but it’ll do in a pinch). The description corresponding to B0010 is “出願番号,” which Google translates to “Application Number.” That makes sense; after all, the code is used for data appearing inside the <application-number> SGML tag. So now I just need to look up the code table for 出願番号/B0010 to learn how to decipher the data inside the <application-number> tag, and…

Screen Shot 2016-06-30 at 3.09.54 PM.png

Hmm.

This one actually makes some sense to me. It looks like data in code B0010 consists of a 10-character sequence, in which the first four characters correspond to an application year and the last six characters correspond to an application serial number. Simple, really.
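
That reading of B0010 can be sketched in a few lines of Python. This is only an illustration of the layout described above: it assumes all ten characters are digits, which the code table may or may not guarantee, and the sample value is made up.

```python
def parse_application_number(raw):
    """Split a 10-character B0010 value into (application year, serial number).

    Assumes the value is exactly ten digits: four for the year followed by
    six for the serial number. That assumption comes from the description
    above, not from the official specification.
    """
    value = raw.strip()
    if len(value) != 10 or not value.isdigit():
        raise ValueError(f"not a 10-digit application number: {raw!r}")
    return int(value[:4]), int(value[4:])


# Hypothetical sample value, not taken from the real data.
year, serial = parse_application_number("2014012345")
print(year, serial)  # prints: 2014 12345
```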

Of course, there are dozens of these codes in the data. And not all of them are so obvious. Some of them even map to other codes that are even more obscure. For example, code A0050–which appears all over this data–is described as “中間記録”. Google translates this as “Intermediate Record”. Code table A0050, in turn, maps to three other code tables–C0840, C0850, and C0870. The code table for C0840 is basically eleven pages of this:

Screen Shot 2016-06-30 at 3.26.00 PM.png

(Sigh.)
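
The chain of references can at least be tracked mechanically. Below is a minimal Python sketch that records which code tables point to which others and lists everything a given code depends on. The dictionary holds only the codes and descriptions quoted above. The C-series descriptions are left empty because they are not given here, and whether those tables refer onward to still more tables is unknown.

```python
# Only the codes mentioned above. Descriptions for the C-series tables are
# not given in the text, so they are left as None rather than guessed.
CODE_TABLES = {
    "B0010": {"description": "出願番号 (Application Number)", "refers_to": []},
    "A0050": {
        "description": "中間記録 (Intermediate Record)",
        "refers_to": ["C0840", "C0850", "C0870"],
    },
    "C0840": {"description": None, "refers_to": []},
    "C0850": {"description": None, "refers_to": []},
    "C0870": {"description": None, "refers_to": []},
}


def resolve(code, tables, seen=None):
    """List a code and every table it depends on, depth-first, each once."""
    if seen is None:
        seen = []
    if code in seen:
        return seen  # already visited; also protects against circular references
    if code not in tables:
        raise KeyError(f"no definition table recorded for {code}")
    seen.append(code)
    for child in tables[code]["refers_to"]:
        resolve(child, tables, seen)
    return seen


print(resolve("A0050", CODE_TABLES))
# prints: ['A0050', 'C0840', 'C0850', 'C0870']
```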

In every episode of The Greatest American Hero, there’s a point where Ralph’s malfunctioning suit starts getting in the way–hurting more than helping. Like, he tries to fly in to rescue someone from evil kidnappers and ends up crash-landing, knocking himself out, and making himself a hostage. Nevertheless, with good intentions, persistence, ingenuity, and the help of his friends, he always manages to dig himself out of whatever mess he’s gotten himself into and save the day.

So… yeah. I’m going to figure this one out.