Is Web Scraping Legal?
I had a friend get in touch with me a while back about the legalities of web scraping. He found, and I’m finding too, a tremendous lack of information on the subject. I think this is because the answer turns on so many variables in the facts of each situation. I got interested in the legal issues involved in web scraping, and so I put together a hypothetical to test some of them out.
I am going to reiterate the disclaimer in the legal notice of this website: this is not legal advice. The situation I describe here is incredibly specific and is the product of my imagination. There is almost no chance this situation is going to be the same as yours. In fact, the situation here isn’t even a complete (or real) one. I’m not going to spend the time to come up with a technologically-savvy hypothetical. This will have to do. Your situation is going to contain facts, details, and nuances that differ from the ones here. If you’re reading this for educational purposes, great: this should be a wonderful starting point to better inform yourself. Talking to a lawyer about your specific situation should be the next step.
The Hypothetical Situation:
My home ski area publishes the status of their lifts online. I develop a program that visits the site, downloads the page to memory, scans that page for the lift status, uploads that status to a database on my server, and then discards the rest of the page. With my iPhone I can then open my new app, which connects to the database on the server and grabs the data to display. Now I can see what the wait time is for a lift on the other side of the mountain while I’m skiing, or I can decide if I want to stay home for the day if I’m looking at the app off-mountain. What exactly are the consequences of doing this? Can I get in trouble for web scraping?
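To make the "scans that page for the lift status" step concrete, here is a minimal sketch in Python. The page markup, the lift names, and the names `SAMPLE_PAGE`, `LiftStatusParser`, and `extract_lift_status` are all hypothetical, since the hypothetical never says how the ski area's page is built. The sketch parses an in-memory string rather than touching any real site, and keeps only the raw facts (name, status, wait time).

```python
from html.parser import HTMLParser

# Hypothetical markup: the real ski area's page structure is unknown.
SAMPLE_PAGE = """
<html><body>
<h1>Lift Status</h1>
<ul>
  <li class="lift" data-name="Summit Express" data-status="open" data-wait="5"></li>
  <li class="lift" data-name="Back Bowl Chair" data-status="hold" data-wait="20"></li>
  <li class="lift" data-name="Beginner Tow" data-status="closed" data-wait="0"></li>
</ul>
</body></html>
"""


class LiftStatusParser(HTMLParser):
    """Collects only the lift facts and ignores everything else on the page."""

    def __init__(self):
        super().__init__()
        self.lifts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Assumed convention: each lift is an <li class="lift"> element.
        if tag == "li" and attr.get("class") == "lift":
            self.lifts.append({
                "name": attr["data-name"],
                "status": attr["data-status"],
                "wait_minutes": int(attr["data-wait"]),
            })


def extract_lift_status(page_html):
    """Return a list of lift records found in the given page text."""
    parser = LiftStatusParser()
    parser.feed(page_html)
    parser.close()
    return parser.lifts


lifts = extract_lift_status(SAMPLE_PAGE)
for lift in lifts:
    print(lift)
```

In the hypothetical, these records are what would be uploaded to the database, and the rest of the page would be discarded.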
Web scraping brings many possible areas of liability into focus. It can potentially implicate contracts, copyright, trademark, patent, internet law, various federal statutes, and some other areas, too. Let’s hit them one by one.
Copyright can be troublesome for this app. Information on a website can be protected by copyright. Copyright protection exists in creative material fixed in some semi-permanent medium. However, copyright protection does not extend to facts because they aren’t considered creative. This creative threshold is quite low, but facts don’t pass it; creative arrangements of facts, however, can qualify for protection. My home ski area’s website has a good deal of information on it. The lift status is displayed in a few ways: with a green/yellow/red icon next to the lift’s name, with wait time next to the lift’s name, or with a green/yellow/red line superimposed over the lift’s route on the trail map. However, that is merely the display. The data scraped is just that: raw data. Raw facts. Most likely, copyright protection does not extend to this, so I should be in the clear for the data itself.
However, I’m clear only if my app scrapes just the data. If it loads the entire page, culls the code for the data it needs, and discards the rest, a temporary copy has been made of the page. The page is almost certainly protected by copyright, and courts have found that even a temporary copy stored in RAM is a sufficiently permanent copy such that it can lead to infringement. So, the app may be infringing the ski area’s protection of the webpage that contains the lift data, even though I’m just trying to grab the data itself.
Trademark law shouldn’t be much of a problem. Trademark law protects the public from becoming confused about the source of a product. My app will obviously display the name of the ski area so the user can look up a resort by name and find its lift wait times. The display of that name must not create the appearance that the mountain is sponsoring the app, and that shouldn’t be too hard to avoid. It is necessary to use the name in the app, but it can be done carefully: stating something like “lift times at X Ski Area” should be enough to avoid a likelihood that a consumer would be confused by the use of the name. Something like “lift times at X Ski Area, provided by MyiPhoneAppName” would be even clearer. A disclaimer somewhere would be an additional safeguard against consumer confusion.
Patent infringement can be a tricky area. Patent owners can exclude anyone from making, using, or selling their technology. By accessing the site and interacting with the data, the app would arguably be using the technology. It is really difficult to analyze whether this app would infringe any patents without knowing exactly what the ski resort has patented (or licensed). Typically, software isn’t much of a patent-heavy industry, because it changes so rapidly that the time and money necessary to file and procure a patent just isn’t worth it. Further, a lot of ski resorts (and probably other places with online wait times) have this similar feature, which means either everyone is licensing it (doubtful), websites are stealing it (also doubtful), or the technology is in the public domain (most likely). I would venture to guess that my ski area doesn’t have a patent on the technology involved in the lift status display, but you never know.
Now, there may be other apps out there that use similar technology. My app could possibly be stepping on their patents if they have any. But do they have any? Again, hard to say. Only a really thorough freedom-to-operate opinion could tell me whether anyone has a patent on this technology and, if so, whether my app infringes it. Most likely though, because of the short-lived effective life of software and iPhone apps, there probably isn’t a patent on this sort of technology. If the technology has been around for a while, the chance that it is patented is even smaller.
5. Trespass to Chattels
Trespass to chattels is a physical-world legal wrong that has been adapted to the internet. In the tangible world, trespass to chattels is interference with someone’s personal property – trespassing on their stuff. The theory has been successfully applied to spammers, with ISPs claiming that the volume of spam ate up their bandwidth, reduced the quality of their service, and ultimately risked their business. The law has also been applied against bots that crawl sites looking for information, where those bots occupied only a small percentage of the site’s bandwidth but the risk of increased usage was feared. However, in California, where most of this law arises, the theory has been trimmed significantly, and actual damage or impairment is now required.
Central to the question of whether my app risks trespass to chattels is the coding. If the app has to jump onto the ski resort’s site every time to download information, then I risk having thousands of iPhones querying the site every day during the winter. The aggregated traffic from all these apps could cause some degradation of the site. However, if the app communicates instead with a central database, as described in the hypothetical, then the load on the site is reduced. Instead of having thousands of queries from thousands of iPhones, the site is touched only by one database several times a day, and the iPhones get all the information they need from the database without burdening the ski resort’s site.
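The difference between the two designs can be sketched as a small time-based cache in Python. This is an illustration under stated assumptions, not a description of any real app: `CachedLiftStatus`, `fake_fetch`, the 15-minute refresh interval, and the simulated clock are all hypothetical, and no real site is contacted.

```python
import time


class CachedLiftStatus:
    """Serves many readers from one stored copy, refreshing it only when stale."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that would hit the resort's site
        self._ttl = ttl_seconds      # how long a stored copy is considered fresh
        self._clock = clock          # injectable so the example can simulate time
        self._value = None
        self._fetched_at = None
        self.upstream_requests = 0   # how many times the site was actually touched

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if stale:
            self._value = self._fetch()
            self._fetched_at = now
            self.upstream_requests += 1
        return self._value


def fake_fetch():
    # Stand-in for the scraper; returns made-up data.
    return {"Summit Express": "open"}


fake_now = [0.0]  # simulated clock, in seconds
cache = CachedLiftStatus(fake_fetch, ttl_seconds=900, clock=lambda: fake_now[0])

# A thousand phones ask for the status within the same 15-minute window.
for _ in range(1000):
    cache.get()
requests_first_window = cache.upstream_requests

# Fifteen minutes later, the next request triggers exactly one refresh.
fake_now[0] = 900.0
cache.get()

print(requests_first_window, cache.upstream_requests)
```

Without the cache, each of those thousand requests would have been a separate hit on the ski resort's site. With it, the load on the site depends on the refresh interval rather than on the number of phones.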
6. Computer Fraud and Abuse Act
The Computer Fraud and Abuse Act (“CFAA”) is a federal statute that imposes criminal liability, and in some cases civil liability, where someone or something accesses a computer without authorization, or accesses a computer in a manner that exceeds the authorization that it did have. For example, if you hack into a database on a server that you were never given access to, you can be liable. If you had access to the server, but not the database, you’ve exceeded your authorized access, and can still be liable. For a civil claim, there generally must also be resulting loss or damage of at least $5,000, which can come in the form of lost revenue, repair costs, damage assessments, impairment to data, or costs of responding to the unauthorized access. The breadth of the types of damages, and the relative ease with which they can be shown (hire an IT guy to mull over your system, hire an attorney to respond to the hacker, etc.), make this element an easy one to satisfy.
7. Digital Millennium Copyright Act
The Digital Millennium Copyright Act (“DMCA”) is a controversial law that many see as an unnecessary clamp-down on fair use rights. The DMCA is designed to give copyright owners greater protection of their digital content. The DMCA creates liability for working around technological measures that protect copyrighted works (or trafficking in products that do so). For example, if you crack an RSA key to access someone’s computer and copy documents on it, you’ve not only committed copyright infringement, but you’ve also violated the DMCA for circumventing the protection that was blocking your access to the copyrighted work. A recent case has changed the law slightly, noting that the DMCA only prevents you from circumventing technological measures protecting copyrighted work and copying that work; if you circumvent the technology but only access the work, there is no DMCA liability. The case is incredibly new and there will probably be some fallout from it across the country. After all, merely “accessing” work online still necessarily requires a RAM copy to be made, and other courts have found that a RAM copy is sufficient to find copyright infringement.
My iPhone app probably doesn’t run afoul of the DMCA, though. The app doesn’t work around any technological measures protecting the lift status or the webpage. The web page source code can be viewed and scraped without bypassing any security measures. Therefore, the DMCA is probably not a problem. If, however, the lift status were hidden behind a CAPTCHA and the app worked around it, that could bring the activity under the DMCA.
So, it looks like my app has a couple of problems. Of course, there are some factors that balance in my favor. Does the ski area want to sue me, a skier, a customer, and a developer of a helpful iPhone app? If they do, they’ll have to pay some pricey legal fees, and they also risk the public getting upset about a ski area suing its customers. That generally doesn’t fit with a ski area’s image.
Great Post, it really shed some light on the legal issues regarding webscraping.
I still have some questions :
Are you aware of the situation in Europe regarding web scraping laws?
How is it decided whether trespass to chattels is happening? Even a single visit to the site will increase the load on the site, wouldn’t it?
Thanks for the information. This is useful.
I have a comment/question about this “After all, merely ‘accessing’ work online still necessarily requires a RAM copy to be made, and other courts have found that a RAM copy is sufficient to find copyright infringement.”
If accessing a site is a copyright violation, it would seem that anyone using a web browser is violating copyright law, regardless of whether they’re extracting data from the sites they visit. All browsers must necessarily access website data to display it. All browsers store the entire webpage in RAM and most also cache its creative content to disk. A web scraping program is just a customized web browser that automates the process of extracting publicly available information displayed on that site, typically discarding the bulk of its creative content. How can it be a violation to access a public website with a browser that stores less data than a standard browser?
Thanks for your comments. You are correct, and this is something the courts are struggling with. Scraping programs are, of course, different in a number of ways from human browsing: they can access pages and store information much more quickly than a human can, and can have a much greater aggregate effect. But you are correct: for any site to be displayed, or program to be run, it passes through RAM, and at least some brief copy is made.