Monday, June 12, 2017

Book Chapter on "Digital Advertising in Libraries"

I've written a chapter for a book, edited by Peter Fernandez and Kelly Tilton, to be published by ACRL. The book is tentatively titled Applying Library Values to Emerging Technology: Tips and Techniques for Advancing within Your Mission.

Digital Advertising in Libraries: or... How Libraries are Assisting the Ecosystem that Pays for Fake News

To understand the danger that digital advertising poses to user privacy in libraries, you first have to understand how websites of all stripes make money. And to understand that, you have to understand how advertising works on the Internet today.


The goal of advertising is simple and is quite similar to that of libraries. Advertisers want to provide information, narratives, and motivations to potential customers, in the hope that business and revenue will result. The challenge for advertisers has always been to figure out how to present the right information to the right reader at the right time. Since libraries are popular sources of information, they have long provided a useful context for many types of ads. Where better to place an ad for a new romance novel than at the end of a similar romance novel? Where better to advertise a new industrial vacuum pump than in the Journal of Vacuum Science and Technology? These types of ads have long existed without problems in printed library resources. In many cases the advertising, archived in libraries, provides a unique view into cultural history. In theory at least, the advertising revenue lowers the acquisition costs for resources that include the advertising.

On the Internet, advertising has evolved into a powerful revenue engine for free resources because of digital systems that efficiently match advertising to readers. Google's Adwords service is an example of such a system. Advertisers can target text-based ads to users based on their search terms, and they only have to pay if the user clicks on their ad. Google decides which ad to show by optimizing revenue—the price that the advertiser has bid times the rate at which the ad is clicked on. In 2016, Search Engine Watch reported that some search terms were selling for almost a thousand dollars per click. [Chris Lake, “The most expensive 100 Google Adwords keywords in the US,” Search Engine Watch (May 31, 2016).] Other types of advertising, such as display ads, video ads, and content ads, are placed by online advertising networks. In 2016, advertisers were projected to spend almost $75 billion on display ads; [Ingrid Lunden, “Internet Ad Spend To Reach $121B In 2014, 23% Of $537B Total Ad Spend, Ad Tech Boosts Display,” TechCrunch, (April 27, 2014).] Google's Doubleclick network alone is found on over a million websites. [“DoubleClick.Net Usage Statistics,” BuiltWith (accessed May 12, 2017). ]
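
The selection rule described above (the advertiser's bid times the rate at which the ad is clicked) can be sketched in a few lines of Python. This is a simplified illustration only: the `Ad` class, the advertiser names, and all of the numbers are hypothetical, and a real ad auction takes more into account than this sketch models.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str        # hypothetical advertiser name
    bid_per_click: float   # dollars the advertiser has bid per click
    click_rate: float      # estimated fraction of impressions that get clicked


def expected_revenue(ad: Ad) -> float:
    """Revenue expected from showing the ad once: bid times click rate."""
    return ad.bid_per_click * ad.click_rate


def choose_ad(candidates: list[Ad]) -> Ad:
    """Pick the candidate with the highest expected revenue per impression."""
    return max(candidates, key=expected_revenue)


ads = [
    Ad("high_bidder", bid_per_click=50.00, click_rate=0.001),
    Ad("popular_ad", bid_per_click=2.00, click_rate=0.05),
]
winner = choose_ad(ads)
print(winner.advertiser)
```

In this made-up example the ad with the much lower bid wins, because it is clicked often enough to earn more per impression than the expensive ad that is rarely clicked.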

Matching a user to a display ad is more difficult than matching a search-driven ad. Without a search term to indicate what the user wants, the ad networks need demographic information about the user. Different ads (at different prices) can be shown to an eighteen-year-old white male resident of Tennessee interested in sports and a sixty-year-old black woman from Chicago interested in fashion, or a pregnant thirty-year-old woman anywhere. To earn a premium price on ad placements, the ad networks need to know as much as possible about the users: age, race, sex, ethnicity, where they live, what they read, what they buy, who they voted for. Luckily for the ad networks, this sort of demographic information is readily available, thanks to user tracking.

Internet users are tracked using cookies. Typically, an invisible image element, sometimes called a "web bug," is placed on the web page. When the page is loaded, the user's web browser requests the web bug from the tracking company. The first time the tracking company sees a user, a cookie with a unique ID is set. From then on, the tracking company can record the user's web usage for every website that is cooperating with the tracking company. This record of website visits can be mined to extract demographic information about the user. A weather website can tell the tracking company where the user is. A visit to a fashion blog can indicate a user's gender and age. A purchase of scent-free lotion can indicate a user's pregnancy. [Charles Duhigg, “How Companies Learn Your Secrets,” The New York Times Magazine, (February 16, 2012).] The more information collected about a user, the more valuable a tracking company's data will be to an ad network.
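
The cookie mechanism described above can be simulated without any networking. The sketch below is a toy model, not the code of any real tracking company: the `Tracker` class, its method name, and the example page URLs are all hypothetical.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Toy model of a tracking company that serves web bugs."""

    def __init__(self):
        # Maps each tracking ID to the list of pages where the bug was loaded.
        self.visits = defaultdict(list)

    def serve_web_bug(self, cookie, referring_page):
        """Handle one request for the invisible image.

        `cookie` is the tracking ID the browser sent, or None on first contact.
        Returns the ID the browser should store and send back next time.
        """
        if cookie is None:
            cookie = uuid.uuid4().hex  # first sighting: assign a unique ID
        self.visits[cookie].append(referring_page)
        return cookie


tracker = Tracker()
browser_cookie = None  # a fresh browser has no tracking cookie yet
pages = [
    "https://weather.example/tennessee",
    "https://fashion.example/blog",
    "https://shop.example/scent-free-lotion",
]
for page in pages:
    browser_cookie = tracker.serve_web_bug(browser_cookie, page)

print(tracker.visits[browser_cookie])
```

Each cooperating website only has to embed the bug; the browsing record for the ID accumulates on the tracker's side.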

Many websites unknowingly place web bugs from tracking companies on their websites, even when they don't place advertising themselves. Companies active in the tracking business include AddThis, ShareThis, and Disqus, which provide functionality to websites in exchange for website placement. Other companies, such as Facebook, Twitter, and Google, similarly track users to benefit their own advertising networks. Services provided by these companies are often placed on library websites. For example, Facebook’s “like” button is a tracker that records user visits to pages offering users the opportunity to “like” a webpage. Google’s “Analytics” service helps many libraries understand the usage of their websites, but is often configured to collect demographic information using web bugs from Google’s DoubleClick service. [“How to Enable/Disable Privacy Protection in Google Analytics (It's Easy to Get Wrong!)” Go To Hellman (February 2, 2017).]

Cookies are not the only way that users are tracked. One problem that advertisers have with cookies is that they are restricted to a single browser. If a user has an iPhone, the ID cookie on the iPhone will be different from the cookie on the user's laptop, and the user will look like two separate users. Advanced tracking networks are able to connect these two cookies by matching browsing patterns. For example, if two different cookies track their users to a few low-traffic websites, chances are that the two cookies are tracking the same user. Another problem for advertisers occurs when a user flushes their cookies. The dead tracking ID can be revived by using "fingerprinting" techniques that depend on the details of browser configurations. [Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz, “The Web Never Forgets: Persistent Tracking Mechanisms in the Wild.” In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 674-689. DOI] Websites like Google, Facebook, and Twitter are able to connect tracking IDs across devices based on logins. 
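
The browsing-pattern matching mentioned above can be illustrated with a small sketch. It assumes the tracker knows roughly how many visitors each site gets and treats two cookies as the same person when they share enough low-traffic sites. The function names, thresholds, and site names are hypothetical, and real tracking networks presumably use far more elaborate statistics.

```python
def rare_sites(history, visitor_counts, max_visitors=100):
    """Keep only the sites with few visitors; unknown sites count as rare."""
    return {site for site in history if visitor_counts.get(site, 0) <= max_visitors}


def likely_same_user(history_a, history_b, visitor_counts, min_shared=2):
    """Guess that two cookies belong to one user if they share rare sites."""
    shared = rare_sites(history_a, visitor_counts) & rare_sites(history_b, visitor_counts)
    return len(shared) >= min_shared


# Hypothetical visitor counts per site, as seen by the tracker.
visitor_counts = {
    "bigsearch.example": 1_000_000,
    "knitting-club.example": 40,
    "smalltown-news.example": 75,
}

phone_cookie = {"bigsearch.example", "knitting-club.example", "smalltown-news.example"}
laptop_cookie = {"knitting-club.example", "smalltown-news.example"}
stranger_cookie = {"bigsearch.example"}

print(likely_same_user(phone_cookie, laptop_cookie, visitor_counts))
print(likely_same_user(phone_cookie, stranger_cookie, visitor_counts))
```

Sharing a hugely popular site says almost nothing about identity, which is why the sketch ignores high-traffic sites entirely.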

Once a demographic profile for a user has been built up, the tracking profile can be used for a variety of ad-targeting strategies. One very visible strategy is "remarketing." If you've ever visited a product page on an e-commerce site, only to be followed around the Internet by advertising for that product, you've been the target of cookie-based remarketing.

Ad targeting is generally tolerated because it personalizes the user's experience of the web. Men, for the most part, prefer not to be targeted with ads for women’s products. An ad for a local merchant in New Jersey is wasted on a user in California. Prices in pounds sterling don't make sense to users in Nevada. Most advertisers and advertising networks take care not to base their ad targeting on sensitive demographic attributes such as race, religion, or sexual orientation, or at least they try not to be too noticeable when they do it.

The advertising network ecosystem is a huge benefit to content publishers. A high traffic website has no need of a sales staff—all they need to do is be accepted by the ad networks and draw users who either have favorable demographics or who click on a lot of ads. The advertisers often don't care about what websites their advertising dollars support. Advertisers also don't really care about the identity of the users, as long as they can target ads to them. The ad networks don't want information that can be traced to a particular user, such as email address, name or home address. This type of information is often subject to legal regulations that would prevent exchange or retention of the information they gather, and the terms of use and so-called privacy policies of the tracking companies are careful to specify that they do not capture personally identifiable information. Nonetheless, in the hands of law enforcement, an espionage agency, or a criminal enterprise, the barrier against linking a tracking ID to the real-world identity of a user is almost non-existent.

The amount of information exposed to advertising networks by tracking bugs is staggering. When a user activates a web tracker, the full URL of the referring page is typically revealed. The user's IP address, operating system, and browser type are sent along with a simple tracker; the JavaScript trackers that place ads typically send more detailed information. It should be noted that any advertising enterprise requires a significant amount of user information collection; ad networks must guard against click-jacking, artificial users, botnet activity and other types of fraud. [Samuel Scott, “The Alleged $7.5 Billion Fraud in Online Advertising,” Moz, (June 22, 2015).]

Breitbart.com is a good example of a content site supported by advertising placed through advertising networks. A recent visit to the Breitbart home page turned up 19 advertising trackers, as characterized by Ghostery: [Ghostery is a browser plugin that can identify and block the trackers on a webpage.]
  • 33Across
  • [x+1]
  • AddThis
  • adsnative
  • Amazon Associates
  • DoubleClick
  • eXelate
  • Facebook Custom Audience
  • Google Adsense
  • Google Publisher Tags
  • LiveRamp
  • Lotame
  • Perfect Market
  • PulsePoint
  • Quantcast
  • Rocket Fuel
  • ScoreCard Research Beacon
  • Taboola
  • Tynt

While some of these will be familiar to library professionals, most of them are probably completely unknown, or at least their role in the advertising industry may be unknown. Amazon, Facebook and Google are the recognizable names on this list; each of them gathers demographic and transactional data about users of libraries and publishers. AddThis, for example, is a widget provider often found on library and publishing sites. They don't place ads themselves, but rather, they help to collect demographic data about users. When a library or publisher places the AddThis widget on their website, they allow AddThis to collect demographic information that benefits the entire advertising ecosystem. For example, a visitor to a medical journal might be marked as a target for particularly lucrative pharmaceutical advertising.

Another tracker found on Breitbart is Taboola. Taboola is responsible for the "sponsored content" links found even on reputable websites like Slate or 538.com. Taboola links go to content that is charitably described as clickbait and is often disparaged as "fake news." The reason for this is that these sites, having paid for advertising, have to sell even more click-driven advertising. Because of its links to the Trump Administration, Breitbart has been the subject of attempts to pressure advertisers to stop putting advertising on the site.  A Twitter account for "Sleeping Giants" has been encouraging activists to ask businesses to block Breitbart from placing their ads. [Osita Nwanevu, “‘Sleeping Giants’ Is Borrowing Gamergate’s Tactics to Attack Breitbart,” Slate, December 14, 2016.] While several companies have blocked Breitbart in response to this pressure, most companies remain unaware of how their advertising gets placed, or that they can block such advertising. [Pagan Kennedy, “How to Destroy the Business Model of Breitbart and Fake News,” The New York Times (January 7, 2017).] 

I'm particularly concerned about the medical journals that participate in advertising networks. Imagine that someone is researching clinical trials for a deadly disease. A smart insurance company could target such users with ads that mark them for higher premiums. A pharmaceutical company could use advertising targeting researchers at competing companies to find clues about their research directions. Most journal users (and probably most journal publishers) don't realize how easily online ads can be used to gather intelligence as well as to sell products.

It's important to note that reputable advertising networks take user privacy very seriously, as their businesses depend on user acquiescence. Google offers users a variety of tools to "personalize their ad experience." [If you’re logged into Google, the advertising settings applied when you browse can be viewed and modified.] Many of the advertising networks pledge to adhere to the guidance of the "Network Advertising Initiative" [“The NAI Code and Enforcement Program: An Overview”], an industry group. However, the competition in the web-advertising ecosystem is intense, and there is little transparency about enforcement of the guidance. Advertising networks have been shown to spread security vulnerabilities and other types of malware when they allow JavaScript in advertising payloads. [Randy Westergren, “Widespread XSS Vulnerabilities in Ad Network Code Affecting Top Tier Publishers, Retailers,” (March 2, 2016).]

Given the current environment, it's incumbent on libraries and the publishing industry to understand and evaluate their participation in the advertising network ecosystem. In the following sections, I discuss the extent of current participation in the advertising ecosystem by libraries, publishers, and aggregators serving the library industry.

Publishers

Advertising is a significant income stream for many publishers providing content to libraries. For example, the Massachusetts Medical Society, publisher of the New England Journal of Medicine, takes in about $25 million per year in advertising revenue. Outside of medical and pharmaceutical publishing, advertising is much less common. However, advertising networks are pervasive in research journals.

In 2015, I surveyed the websites of twenty of the top research journals and found that sixteen of the top twenty journals placed ad network trackers on their websites. [“16 of the Top 20 Research Journals Let Ad Networks Spy on Their Readers,” Go To Hellman (March 12, 2015). ]
Recently, I revisited the twenty journals to see if there had been any improvement. Most of the journals I examined had added tracking on their websites. The New England Journal of Medicine, which employed the most intense reader tracking of the twenty, is now even more intense, with nineteen trackers on a web page that had "only" fourteen trackers two years ago. A page from Elsevier's Cell went from nine to sixteen trackers. [“Reader Privacy for Research Journals is Getting Worse,” Go To Hellman (March 22, 2017). ] Intense tracking is not confined to subscription-based health science journals; I have found trackers on open access journals, economics journals, even on journals covering library science and literary studies.
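
A rough first pass at this kind of audit can be automated. The sketch below is not the method used for the survey (a tool like Ghostery relies on a database of known trackers); it only lists hosts other than the page's own that a page's static HTML loads scripts, images, or frames from. The class and function names and the sample page are hypothetical, and trackers injected later by JavaScript would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceCollector(HTMLParser):
    """Collect the hostnames that a page loads scripts, images, or frames from."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "img", "iframe"):
            src = dict(attrs).get("src")
            if src:
                host = urlparse(src).hostname  # None for relative URLs
                if host:
                    self.hosts.add(host)


def third_party_hosts(html, page_host):
    """Return the sorted hosts that are neither the page's host nor a subdomain of it."""
    collector = ResourceCollector()
    collector.feed(html)
    return sorted(
        host for host in collector.hosts
        if host != page_host and not host.endswith("." + page_host)
    )


SAMPLE_PAGE = """
<html><body>
<img src="/logo.png">
<script src="https://static.journal.example/app.js"></script>
<script src="https://tracker-one.example/t.js"></script>
<img src="//pixel.tracker-two.example/bug.gif" width="1" height="1">
</body></html>
"""

print(third_party_hosts(SAMPLE_PAGE, "journal.example"))
```

Not every third-party host is an advertising tracker, so a list like this is a starting point for questions rather than a verdict.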

It's not entirely clear why some of these publishers allow advertising trackers on their websites, because in many cases, there is no advertising. Perhaps they don’t realize the impact of tracking on reader privacy. Certainly, publishers that rely on advertising revenue need to carefully audit their advertising networks and the sorts of advertising that comes through them. The privacy commitments these partners make need to be consistent with the privacy assurances made by the publishers themselves. For publishers who value reader privacy and don't earn significant amounts from advertising, there's simply no good reason for them to continue to allow tracking by ad networks.

Vendors

The library automation industry has slowly become aware of how the systems it provides can be misused to compromise library patron privacy. For example, I have pointed out that cover images presented by catalog systems were leaking search data to Amazon, which has resulted in software changes by at least one systems vendor. [“How to Check if Your Library is Leaking Catalog Searches to Amazon,” Go To Hellman (December 22, 2016).] These systems are technically complex, and systems managers in libraries are rarely trained in web privacy assessment. Development processes need to include privacy assessments at both component and system levels.

Libraries

There is a mismatch between what libraries want to do to protect patron privacy and what they are able to do. Even when large amounts of money are at stake, there is often little leverage for a library to change the way a publisher delivers advertising-bearing content. Nonetheless, together with cooperating IT and legal services, libraries have many privacy-protecting options at their disposal.
  1. Use aggregators for journal content rather than the publisher sites. Many journals are available on multiple platforms, and platforms marketed to libraries often strip advertising and advertising trackers from the journal content. Reader privacy should be an important consideration in selecting platforms and platform content.
  2. Promote the use of privacy technologies. Privacy Badger is an open-source browser plugin that detects trackers and blocks them from following users. Similar tools include uBlock Origin and the aforementioned Ghostery.
  3. Use proxy servers. Re-writing proxy servers such as EZProxy are typically deployed to serve content to remote users, but they can also be configured to remove trackers, or to forcibly expire tracking cookies. This is rarely done, as far as I am aware.
  4. Strip advertising and trackers at the network level. A more aggressive approach is to enforce privacy by blocking tracker websites at the network level. Because this can be intrusive (it affects subscribed content and unsubscribed content equally) it's appropriate mostly for corporate environments where competitive-intelligence espionage is a concern.
  5. Ask for disclosure and notification. During licensing negotiations, ask the vendor or publisher to provide a list of all third parties who might have access to patron clickstream data. Ask to be notified if the list changes. Put these requests into requests for proposals. Sunlight is a good disinfectant.
  6. Join together with others in the library and publishing industry to set out best practices for advertising in web resources.
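
The network-level blocking in option 4 comes down to a hostname test that a filtering proxy or DNS resolver applies to every request. A minimal sketch of that test follows. The blocklist entries and the function name are hypothetical, and a real deployment would use a maintained blocklist and the filtering features of its own network equipment.

```python
# Hypothetical blocklist; a real one would come from a maintained source.
BLOCKED_DOMAINS = {"tracker-one.example", "ads.example"}


def is_blocked(hostname, blocked=BLOCKED_DOMAINS):
    """Return True if the hostname is a blocked domain or one of its subdomains."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check the full name, then each parent domain in turn.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


for host in ["pixel.tracker-one.example", "journal.example", "nottracker-one.example"]:
    print(host, is_blocked(host))
```

Matching on whole domain labels, rather than on substrings, avoids accidentally blocking an unrelated site whose name merely ends with the same characters.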

Conclusion

The widespread infusion of the digital advertising ecosystem into library environments presents a new set of challenges to the values that have been at the core of the library profession. Advertising trackers introduce privacy breaches into the library environment and help to sustain an information-delivery channel that operates without the values grounding that has earned libraries and librarians a deep reserve of trust from users. The infusion has come about through a combination of commercial interest in user demographics, consumer apathy about privacy, and general lack of understanding of a complex technology environment. The entire information industry needs to develop understanding of that environment so that it can grow and evolve to serve users first, not the advertisers.

Tuesday, May 30, 2017

Readium's New Licensed Content Protection May Result in Better Reader Privacy

CC BY
Libraries offering ebook lending are between a rock and a hard place. They know in their heart of hearts that digital rights management (DRM) software is evil, but not allowing users to borrow the ebooks they want to read is not exactly the height of virtue. Saintly companies like Amazon will be happy to fill the gaps if libraries can't lend ebooks. The fundamental problem is that "borrowing" is a fiction, a conceptual construct, when applied to the ones and zeroes of a digital book. An ebook loan is really a short-term license. Under today's copyright law, a reader must have a license to read an ebook, and ebook rights-holders don't trust users to adhere to short-term licenses without some sort of software to enforce the license.

Unless the rock becomes a marshmallow, libraries that want to improve the ebook lending experience are hoping to make the hard place a bit softer. The most common DRM system used in libraries is run by Adobe. Adobe Content Server (ACS) is used by Overdrive, Proquest, EBSCO and Bibliotheca's Cloud Library. Adobe Content Server is a hard place for libraries in two ways. First, a payment must be made to Adobe for every lending transaction processed through ACS. Second, use of ACS affects reader privacy. When ACS first came out, Adobe got to know the identity of every borrower. Adobe says this about these records:
"Adobe keeps internet protocol (IP) address logs related to Adobe ID sign-ins for 90 days"
I wish they also said they destroyed these logs. Their privacy policy says:
"Your personal information and files are stored on Adobe’s servers and the servers of companies we hire to provide services to us. Your personal information may be transferred across national borders because we have servers located worldwide and the companies we hire to help us run our business are located in different countries around the world."
... and generally says that readers should trust Adobe not to betray them.

Thanks in part to demand from libraries and the companies that serve them, Adobe changed ACS so that borrower identities could be de-identified by intermediaries such as Overdrive. So instead of relying on Adobe's sometimes lax privacy protections, libraries could rely on vendors more responsive to library concerns. But still, the underlying DRM technology was designed to trust Adobe, and to distrust readers. Its centralized architecture requires everyone to trust participants closer to the center. A reader's privacy requires trust of the library or bookstore, which in turn have to trust a vendor, who in turn have to trust Adobe.

This state of affairs has been the motivation for the Readium Foundation's new DRM technology, called Readium Licensed Content Protection (LCP). LCP's developers claim that it offers libraries a low-cost way to improve the library ebook lending experience while providing readers with the privacy assurances they expect from libraries. In addition, Readium describes LCP as Open Source... except for a few lines of code. To understand LCP, and to see if it delivers on the developers' claims, I took a close look at the recently released spec. The short description of what I found is that it can do what it claims to do... but everything depends on the implementation. Also, DRM may be a Hofstadter-Moebius loop.

Now for the longer description:

Every DRM system uses encryption and secrets. Centralized DRM systems such as ACS keep a centralized secret, and use that secret to generate, distribute and control keys that lock and unlock content. LCP takes a somewhat different approach. It uses two secrets to lock and unlock content, a user secret and an ecosystem secret. An "ecosystem" is all the libraries, booksellers, and reading system vendors who agree to interoperate. Any software that knows the ecosystem secret can combine it with a user's secret to unlock content that has been locked for a user. This way multiple content providers in an ecosystem can independently lock content for a user; there's no requirement for a central key server.
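
The two-secret idea can be illustrated with a short sketch. To be clear, this is not the algorithm in the LCP spec: the construction below (hash the user's passphrase, then key an HMAC with the ecosystem secret) is an assumption chosen only to show how independent parties can arrive at the same key without a central key server. The function name and both secret values are made up.

```python
import hashlib
import hmac


def derive_unlock_key(user_passphrase: str, ecosystem_secret: bytes) -> bytes:
    """Combine a user secret and an ecosystem secret into one 32-byte key.

    Illustrative construction only; NOT the transformation defined by LCP.
    """
    user_secret = hashlib.sha256(user_passphrase.encode("utf-8")).digest()
    return hmac.new(ecosystem_secret, user_secret, hashlib.sha256).digest()


ECOSYSTEM_SECRET = b"hypothetical-ecosystem-secret"  # made-up value

# A library and a bookseller each derive the key on their own systems.
library_key = derive_unlock_key("reader's passphrase", ECOSYSTEM_SECRET)
bookseller_key = derive_unlock_key("reader's passphrase", ECOSYSTEM_SECRET)

# Software outside the ecosystem does not know the ecosystem secret.
outsider_key = derive_unlock_key("reader's passphrase", b"wrong-guess")

print(library_key == bookseller_key)
print(library_key == outsider_key)
```

Both providers get the same key from the same passphrase, while software lacking the ecosystem secret derives a useless one even when it knows the passphrase.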

The LCP DRM system has some interesting usability and privacy features. If you want to read on several devices, you just need to remember your encryption secret, and you can move files from one device to another. If you want to share an ebook with a family member or close friend, that's ok too, as long as you're comfortable sharing your encryption secret. If you want to read anonymously, you can have a trusted friend borrow the book on your behalf. But to get publisher buy-in for these usability features, the system has to have a way for content providers to limit oversharing. Content providers don't want you to just post the file and the password on a pirate file-sharing service. So ecosystem software applications are required to "phone home" with a device identifier and license identifier when they are connected to the internet.

As you might imagine, the LCP phone-home information could have an impact on reader privacy, depending on the implementation. So for example, if you borrow a book from the library, and your reader app contacts the library to say you've opened the book, your privacy is minimally impacted since the library already knows you borrowed the book. But if the phone-home transaction is unencrypted, or if it contains too much information, then your employer might be able to find out about the union-organizer book you're reading. If the libraries or booksellers can aggregate all their phone-home logs, then your detailed reading profile could be compiled and exploited. Or if users are not permitted to select their own encryption secret, it might be much harder to read a book anonymously. (Note: my suggested changes for improving these parts of the spec were accepted by the spec's authors.) But if everything is implemented with a view to reader privacy, LCP should offer much better reader privacy than possible with existing systems.

There's some bad news, however. Because the ecosystem secret has to be protected, the openness of the reader software is not quite what it seems. The code will need to be obfuscated before distribution, and the secret will only be available to developers and to distribution channels that are willing and able to "harden" their software. If you want to fork the software to add a feature, your build will not be able to unlock ecosystem content until the ecosystem overlords deign to approve your changes. So don't expect reader software with lots of plugins and options. Don't expect a javascript web-reader.

The code obfuscation raises another issue: it will be difficult to audit reader software to make sure it doesn't harbor spyware, even if the source code is open (except for the ecosystem secret). You still have to trust the app provider, your library, and the people who sell you books. But it's hard to get far without trusting somebody, so this isn't a new problem, and when was the last time anyone audited library software? And because the ecosystem overlords distribute the ecosystem secrets to trusted developers, the topology of trust and accountability is very different from Adobe's centralized system.

If you didn't like that bad news, that cloud may have a silver lining, or maybe a lead lining, depending on your perspective. If LCP becomes widely used, the ecosystem secret will inevitably leak, and an anti-ecosystem could form. There will be a Calibre plugin to strip encryption. There will be grayware that does everything that the ecosystem software isn't permitted to do. And it might even be sort-of legal to use. Library ebook lending might flourish. Or collapse. Because in the end, ebook lending requires trust to flow in both directions; while it's not perfect, LCP is a baby step in the direction of mutual trust between readers and content providers.

In Stanley Kubrick's 2001: A Space Odyssey, the computer HAL 9000 goes insane. The reason:
HAL's crisis was caused by a programming contradiction: he was constructed for "the accurate processing of information without distortion or concealment", yet his orders, directly from Dr. Heywood Floyd at the National Council on Astronautics, required him to keep the discovery of the Monolith TMA-1 a secret for reasons of national security. This contradiction created a "Hofstadter-Moebius loop", reducing HAL to paranoia. 
Readium LCP software is sort of like HAL 9000. It's charged with opening up information to readers, with expanding minds everywhere, transporting them to worlds of new knowledge and imagination, yet it must work to keep a secret and prevent users from doing things that copyright owners don't want them to do. Let's hope that the P in LCP doesn't stand for "Paranoia".

Sunday, April 2, 2017

Copyrighted Clickstream Poetry to Stop ISP Click-Selling

Congress won't let the Federal Communications Commission (FCC) protect users from Internet Service Provider (ISP) snooping-for-cash. My ISP could decide to sell a list of all the websites I visit to advertisers, and the FCC can't stop them. I wondered if there was some way I could use copyright law to prevent my ISP from selling copies of my clickstream.

So I invented "clickstream poetry". Here is my first clickstream poem, entitled My clicks are mine:
{
    "content":       
        [
        "https://roses.com",
        "http://are.com",
        "https://reddit.com",
        "http://theultraviolets.net",
        "http://are.com",
        "https://moo.com",
        "http://this.is",
        "http://work.org",
        "http://is.com",
        "https://copyright.com",
        "https://ted.com",
        "https://www.so.ch",
        "http://verizon.com",
        "http://www.faa.gov",
        "https://kyu.com",
        "https://copyright.com",
        "http://2o17.com",
        "http://eric.org",
        "http://hellman.net",
        "https://creativecommons.org/licenses/by-nc/4.0/legalcode"
        ],
    "copyright": "2017 Eric Hellman",
    "license": "https://creativecommons.org/licenses/by-nc/4.0/legalcode",
    "title": "My clicks are mine"
}

I wrote a python script that "performs" the poem for the benefit of anyone listening to my clickstream. The script requests the websites in the poem in a random order; the listener will see the website names requested, and this dataset comprises the "poem". I used a Creative Commons license that doesn't let anyone distribute copies of my poem for commercial purposes. If my ISP tries to sell a copy of my clickstream, they would be violating the license, and thus infringing my copyright to the poem. If you run the script to perform the poem (for non-commercial purposes, of course), your ISP would similarly be infringing my copyright if they try to sell your clickstream.
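
A minimal sketch of what such a performing script might look like is shown below. It is an illustration under stated assumptions, not the original script: the function names `fetch` and `perform` are hypothetical, and the poem is passed in as a dictionary with the same "content" key as the JSON above.

```python
import http.client
import random
import urllib.request


def fetch(url, timeout=10):
    """Request a URL so that the request appears in the clickstream."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)
    except (OSError, http.client.HTTPException):
        # Network and HTTP errors are ignored here: the attempt itself
        # is the part of the performance a listener can observe.
        pass


def perform(poem, request=fetch, rng=random):
    """Request every URL in the poem's "content" list in a random order.

    Returns the order in which the URLs were requested.
    """
    urls = list(poem["content"])  # copy, so the poem itself is not reordered
    rng.shuffle(urls)
    for url in urls:
        request(url)
    return urls
```

To perform the poem, the JSON above could be saved to a file, loaded with `json.load`, and passed to `perform`. The `request` and `rng` parameters exist so the script can be tried with a stand-in function and a seeded random generator before making any real requests.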

If I tried to sue an ISP for copyright infringement, they would likely argue that though my creation is original and used in its entirety, selling my clickstream is a "fair use". They would assert that advertising optimization (or whatever) is a "transformative use" and that it didn't affect the market for my poem. Who would pay anything for a stupid clickstream poem? How would a non-existent, hypothetical market for clickstream poetry be harmed by use in their big data algorithms?

That's why I'm offering commercial licenses to the clickstream poem My clicks are mine. This will demonstrate that a commercial market for clickstream poetry licenses exists. For only $10, you can use a copy of my poem for any purpose whatsoever, for a period of 24 hours. If an ad network wants to use my clickstream to optimize the ads they show me, more power to them, as long as they pay for a license. I imagine that, over the lifetime of my poem's copyright protection (into the 22nd century), clickstream poetry will become increasingly valuable because of uses that haven't been invented yet.

To acquire a commercial license to my poem, support my work at the Free Ebook Foundation, a 501(c)(3) not-for-profit corporation, by making a donation. Or don't. I have no idea if a court would take my side against a big company (and against Congress). I'm told that judges are generally skeptical of clever "legal hacks" unless they are crafted by lawyers instead of engineers.

ISPs would probably figure out a legal or technical subterfuge around the copyright of my clickstream poem; but if they have to worry even a little, this effort will have been worth my time.

Update: I have now paid $35 to register my copyright to My clicks are mine.

Wednesday, March 22, 2017

Reader Privacy for Research Journals is Getting Worse

Ever hear of Grapeshot, Eloqua, Moat, Hubspot, Krux, or Sizmek? Probably not. Maybe you've heard of Doubleclick, AppNexus, Adsense or Addthis? Certainly you've heard of Google, which owns Doubleclick and Adsense. If you read scientific journal articles on publisher websites, these companies that you've never heard of will track and log your reading habits and try to figure out how to get you to click on ads, not just at the publisher websites but also at websites like Breitbart.com and the Huffington Post.

Two years ago I surveyed the websites of 20 of the top research journals and found that 16 of them placed trackers from ad networks on their websites. Only the journals from the American Physical Society (2 of the 20) supported secure (HTTPS) connections, and even now APS does not default to being secure.

I'm working on an article about advertising in online library content, so I decided to revisit the 20 journals to see if there had been any improvement. Over half the traffic on the internet now uses secure connections, so I expected to see some movement. One of the 20 journals, Quarterly Journal of Economics, now defaults to a secure connection, significantly improving privacy for its readers. Let's have a big round of applause for Oxford University Press! Yay.

So that's the good news. The bad news is that reader privacy at most of the journals I looked at got worse. Science, which could be loaded securely 2 years ago, has reverted to insecure connections. The two Annual Reviews journals I looked at, which were among the few that did not expose users to advertising network tracking, now have trackers for AddThis and Doubleclick. The New England Journal of Medicine, which deployed the most intense reader tracking of the 20, is now even more intense, with 19 trackers on a web page that had "only" 14 trackers two years ago. A page from Elsevier's Cell went from 9 to 16 trackers.

Despite the backwardness of most journal websites, there are a few signs of hope. Some of the big journal platforms have begun to implement HTTPS. Springer Link defaults to HTTPS, and Elsevier's Science Direct is delivering some of its content with secure connections. Both of them place trackers for advertising networks, so if you want to read a journal article securely and privately, your best bet is still to use Tor.

Thursday, February 2, 2017

How to enable/disable privacy protection in Google Analytics (it's easy to get wrong!)

In my survey last year of ARL library web services, I found that 72% of them used Google Analytics. So it's not surprising that a common response to my article about leaking catalog searches to Amazon was to wonder whether the same thing is happening with Google Analytics.

The short answer is "It Depends". It might be OK to use Google Analytics on a library search facility, if the following things are true:
  1. The library trusts Google on user privacy. (Many do.)
  2. Google is acting in good faith to protect user privacy and is not acting under legal compulsion to act otherwise. (We don't really know.)
  3. Google Analytics is correctly doing what their documentation says they are doing and not being circumvented by the rest of Google. (They don't always.)
  4. The library has implemented Google Analytics correctly to enable user privacy.
There's an entire blog post to write about each of the first three conditions, but I have only so many hours in a day. Given that many libraries have decided that the benefits of using Google Analytics outweigh the privacy risks, the rest of this post concerns only this last condition. Of the 72% of ARL libraries that use Google Analytics, I find that only 19% have implemented Google Analytics with privacy-protection features enabled.

So, if you care about library privacy but can't do without Google Analytics, read on!

Google Analytics has a lot of configuration options, which is why webmasters love it. For the purposes of user privacy, however, there are just two configuration options to pay attention to, the "IP Anonymization" option and the "Display Features" option.

IP Anonymization says to Google Analytics "please don't remember the exact IP address of my users". According to Google, enabling this mode masks the least significant bits of the user's IP address before the IP address is used or saved. Since many users can be identified by their IP address, this prevents anyone from discovering the search history for a given IP address. But remember, Google is still sent the IP address, and we have to trust that Google will obscure the IP address as advertised, and not save it in some log somewhere. Even with the masked IP address, it may still be possible to identify a user, particularly if a library serves a small number of geographically dispersed users.
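
Here's a minimal sketch of what that masking amounts to for an IPv4 address, assuming (as Google's documentation describes it) that the last octet is set to zero. The function name maskIPv4 is mine, not Google's, and this is an illustration rather than Google's actual code.

```javascript
// Illustration only: zero the last octet of an IPv4 address.
// maskIPv4 is a hypothetical helper, not part of Google Analytics.
function maskIPv4(ip) {
  const octets = ip.split('.');
  const valid =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!valid) {
    throw new Error('not an IPv4 address: ' + ip);
  }
  octets[3] = '0'; // drop the least significant 8 bits
  return octets.join('.');
}

console.log(maskIPv4('192.0.2.147')); // 192.0.2.0
console.log(maskIPv4('192.0.2.9'));   // 192.0.2.0
```

Both addresses collapse to the same masked value, which is the point. But if only one of your users ever connects from that block of 256 addresses, the masked address still identifies them.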

"Display Features" tells Google that you don't care about user privacy, and that it's OK to track your users all to hell so that you can get access to "demographic" information. To understand what's happening, it's important to understand the difference between "first-party" and "third-party" cookies, and how they implicate privacy differently.

Out of the box, Google Analytics uses "first party" cookies to track users. So if you deploy Google Analytics on your "library.example.edu" server, the tracking cookie will be attached to the library.example.edu hostname. Google Analytics will have considerable difficulty connecting user number 1234 on the library.example.edu domain with user number 5678 on the "sci-hub.info" domain, because the user ids are chosen randomly for each hostname. But if you turn on Display Features, Google will connect the two user ids via a third party tracking cookie from its Doubleclick advertising service. This enables both you and Google to know more about your users. Anyone with access to Google's data will be able to connect the catalog searches saved for user number 1234 to that user's searches on any website that uses Google advertising or any site that has Display Features turned on.
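
To make the linking concrete, here's a toy sketch. The records and the field names (host, clientId, adCookie) are invented for illustration and don't reflect how Google actually stores anything. It only shows why a shared third-party cookie value is enough to join ids that were chosen independently for each hostname.

```javascript
// Hypothetical hit records; the field names are made up for this example.
const hits = [
  { host: 'library.example.edu', clientId: '1234', adCookie: 'abc' },
  { host: 'sci-hub.info', clientId: '5678', adCookie: 'abc' },
  { host: 'library.example.edu', clientId: '9012', adCookie: 'xyz' },
];

// Group the per-host first-party ids by the shared third-party cookie value.
function linkByAdCookie(records) {
  const linked = new Map();
  for (const { host, clientId, adCookie } of records) {
    if (!linked.has(adCookie)) linked.set(adCookie, []);
    linked.get(adCookie).push(host + '#' + clientId);
  }
  return linked;
}

console.log(linkByAdCookie(hits).get('abc'));
// [ 'library.example.edu#1234', 'sci-hub.info#5678' ]
```

Without the adCookie column there is nothing to join on, which is roughly the situation when Display Features is off.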

IP Anonymization and Display Features can be configured in three ways, depending on how Google Analytics is deployed. The instructions here apply to the "Universal Analytics" script. You can tell that a site uses Universal Analytics because its pages execute a script named "analytics.js". An older "classic" version of Google Analytics uses a script named "ga.js"; its configuration is similar to that of Universal. More complex websites may use Google Tag Manager to deploy and configure Google Analytics.

Google Analytics is usually deployed on a web page by inserting a script element that looks like this:
<script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('send', 'pageview');
</script>
IP Anonymization and Display Features are turned on with extra lines in the script:
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('require', 'displayfeatures');  // starts tracking users across sites
    ga('set', 'anonymizeIp', true); // makes it harder to identify the user from logs
    ga('send', 'pageview');
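
If you'd rather not read page source by eye, the two lines above can be spotted with a rough text check. This is only a sketch: auditSnippet is a hypothetical helper of mine, it only recognizes the inline ga(...) form shown here, and it will miss settings made through Google Tag Manager or the Admin toggles.

```javascript
// Rough check of a page's inline Google Analytics script (analytics.js style).
// auditSnippet is a hypothetical helper, not part of any Google library.
function auditSnippet(source) {
  // true if the page asks for the Display Features plugin
  const displayFeatures =
    /ga\(\s*['"]require['"]\s*,\s*['"]displayfeatures['"]/.test(source);
  // true if the page explicitly sets anonymizeIp to true
  const anonymizeIp =
    /ga\(\s*['"]set['"]\s*,\s*['"]anonymizeIp['"]\s*,\s*true\s*\)/.test(source);
  return { displayFeatures, anonymizeIp };
}

const page = `
  ga('create', 'UA-XXXXX-Y', 'auto');
  ga('set', 'anonymizeIp', true);
  ga('send', 'pageview');
`;
console.log(auditSnippet(page)); // { displayFeatures: false, anonymizeIp: true }
```

For a library that cares about privacy, the result you want is the one shown: anonymizeIp on, displayFeatures off.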
The Google Analytics Admin allows you to turn on cross-site user tracking, though the privacy impact of what you're doing is not made clear. In the "Data Collection" item of the Tracking Info pane, look at the toggle switches for "Remarketing" and "Advertising Reporting Features". If these are switched to "ON", then you've enabled cross-site tracking and your users can expect no privacy.

Turning on IP anonymization is not quite as easy as turning on cross-site tracking. You have to add it explicitly in your script or turn it on in Google Tag Manager (where you won't find it unless you know what to look for!).

To check if cross-site tracking has been turned on in your institution's Google Analytics, use the procedures I described in my article on How to check if your library is leaking catalog searches to Amazon. First, clear the cookies for your website, then load your site and look at the "Sources" tab in Chrome developer tools. If there's a resource from "stats.g.doubleclick.net", then your website is asking Google to track your users across sites. If your institution is a library, you should not be telling Google to track your users across sites.
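
If you have a list of the resource URLs a page loaded (copied out of the developer tools, say), the same check can be sketched in a few lines. The function name findCrossSiteTracking and the example URLs are my own assumptions. The only thing taken from the procedure above is the hostname to look for.

```javascript
// Hostname named in the manual check above.
const TRACKING_HOST = 'stats.g.doubleclick.net';

// Return the loaded URLs whose hostname is the Doubleclick stats host.
// findCrossSiteTracking is a hypothetical helper for illustration.
function findCrossSiteTracking(urls) {
  return urls.filter((u) => {
    try {
      return new URL(u).hostname === TRACKING_HOST;
    } catch (e) {
      return false; // skip anything that isn't an absolute URL
    }
  });
}

// Example list; these paths are made up.
const loaded = [
  'https://library.example.edu/catalog/search?q=privacy',
  'https://www.google-analytics.com/analytics.js',
  'https://stats.g.doubleclick.net/r/collect?v=1',
];
console.log(findCrossSiteTracking(loaded).length > 0); // true
```

An empty result doesn't prove your users are safe from every tracker. It only means this particular signal of cross-site tracking wasn't in the list you checked.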

Bottom line: if you use Google Analytics, always remember that Google is fundamentally an advertising company and it will seldom guide you towards protecting your users' privacy.