I am incredibly fortunate that a group of Japanese researchers has already done much of the hard work of figuring out how to turn the hundreds of gigabytes of SGML documents I’m working with into a nice handy database, and moreover has given me the code they used to do it. Instead of figuring out how to do what they did on my own, I simply have to figure out what they did, and then decide what I’d like to do differently. As with everything else this summer, this task highlights important cultural differences.
Today I’ve been going through the specification for the raw government data and comparing it to the researchers’ code, to see what they included in their dataset and what they left out. The raw government data includes a significant amount of low-sensitivity personally identifiable information. This is mainly name (and sometimes address) information about individuals and firms who have applied for trademark registrations, about the attorneys who represent them, and about the examiners (government employees all) who consider their applications.
Similar information appears in the US government’s data on trademark applications. The US government released all this data to the public several years ago, and continues to update it on a regular basis, which means that the names and addresses of applicants and their attorneys, and the names of examining attorneys and their supervisors, are all part of the public record, freely available to anyone with the interest and wherewithal to find them.
I know a few people who were pretty shocked to learn that you could search the USPTO’s free, public trademarks dataset by examiner name, and find any examiner’s entire work history: how much of a pushover they are, how quickly they work, how long they’ve been on the job, how often they’ve been overruled, and so on. I’m sure there are lots of USPTO examining attorneys who would be shocked to learn that fact. But my sense is that in the US that kind of openness about government employees, and low-sensitivity personally identifiable information about individuals who access government functions, is pretty standard. And in most of the rest of the world, it’s just not.
The Japanese researchers whose work I am building on did not include information about the examiners who reviewed applications in their dataset at all; they never even retrieved it during processing. I happen to think that correlating application outcomes by examiner is interesting and potentially useful, and I’m going to modify the code I’ve been given to extract that information from the raw data. But in deference to what I take to be cultural norms regarding the privacy of personally identifiable information (norms that I know many of my compatriots would like to import into the US), I think I will probably anonymize the examiner data before reporting my results.
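A minimal sketch of what that anonymization might look like, in Python. It replaces each examiner name with a stable pseudonym derived from a keyed hash (HMAC-SHA256), so outcomes can still be grouped by examiner without the name appearing in the results. Everything here is an assumption made for illustration: the function name `pseudonymize`, the `EX-` prefix, the record fields, and the sample names are hypothetical, and nothing below reflects the actual structure of the raw data or the researchers’ code.

```python
import hashlib
import hmac
import unicodedata
from collections import Counter


def pseudonymize(name: str, secret_key: bytes, length: int = 12) -> str:
    """Map an examiner name to a stable pseudonym (hypothetical helper).

    The same name always yields the same pseudonym under the same key,
    so per-examiner statistics survive, but the name itself does not.
    """
    # Normalize so that full-width/half-width variants, case, and stray
    # whitespace do not split one examiner into several pseudonyms.
    normalized = unicodedata.normalize("NFKC", name)
    normalized = " ".join(normalized.split()).casefold()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return "EX-" + digest.hexdigest()[:length]


# Assumption: in practice the key would be generated once (for example with
# secrets.token_bytes(32)), kept private, and never published with the results.
secret_key = b"illustrative-key-do-not-reuse"

# Made-up records; the field names are hypothetical.
records = [
    {"application": "A-001", "examiner": "Yamada Taro", "outcome": "registered"},
    {"application": "A-002", "examiner": "Suzuki Hanako", "outcome": "refused"},
    {"application": "A-003", "examiner": "yamada  taro", "outcome": "refused"},
]

# Replace the name with its pseudonym before any reporting step.
anonymized = [
    {
        "application": r["application"],
        "examiner": pseudonymize(r["examiner"], secret_key),
        "outcome": r["outcome"],
    }
    for r in records
]

# Outcomes can still be tallied per (pseudonymous) examiner.
outcomes_by_examiner = Counter((r["examiner"], r["outcome"]) for r in anonymized)
```

A keyed hash is used rather than a plain hash because examiner names come from a small, guessable pool: with an unkeyed hash, anyone could hash candidate names and match them against the published pseudonyms. Even so, pseudonyms alone may not prevent re-identification if an examiner's caseload is distinctive, so this is a starting point rather than a guarantee.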