NSA Data Mining: How It Works

PRISM, XKeyscore, and plenty more classified information about the National Security Agency's vast surveillance program have been dragged into the light since Edward Snowden began his leaks in May. How much data is there? How does the government sort through it? What are they learning about you? Here's our guide to the NSA's data mining.

By Joe Pappalardo September 11, 2013 6:30 AM

Most people were introduced to the arcane world of data mining when National Security Agency contractor Edward Snowden allegedly leaked classified documents that detail how the U.S. government uses the technique to track terrorists. The security breach revealed that the government gathers billions of pieces of data (phone calls, emails, photos, and videos) from Google, Facebook, Microsoft, and other communications giants, then combs through the information for leads on national security threats. The disclosure caused a global uproar over the sanctity of privacy, the need for security, and the perils of government secrecy. People have rightly been concerned about where the government gets the data (from all of us), but equal attention has not been paid to what it actually does with it. Here's a guide to big-data mining, NSA-style.
The Information Landscape: Just how much data do we produce? A recent study by IBM estimates that humanity creates 2.5 quintillion bytes of data every day. (If these data bytes were pennies laid out flat, they would blanket the earth five times.) That total includes stored information (photos, videos, social-media posts, word-processing files, phone-call records, financial records, and results from science experiments) and data that normally exists for mere moments, such as phone-call content and Skype chats.
Veins of Useful Information: The concept behind the NSA's data-mining operation is that this digital information can be analyzed to establish connections between people, and these links can generate investigative leads. But before data can be examined, it has to be collected, and from everyone. As the data-mining saying goes: To find a needle in a haystack, you first need to build a haystack.
Data Has to Be Tagged Before It's Bagged: Data mining relies on metadata tags that enable algorithms to identify connections. Metadata is data about data: for example, the names and sizes of files on your computer. In the digital world, the label placed on data is called a tag. Tagging data is a necessary first step to data mining because it enables analysts (or the software they use) to classify and organize the information so it can be searched and processed. Tagging also enables analysts to parse the information without examining the contents. This is an important legal point in NSA data mining because the communications of U.S. citizens and lawful permanent resident aliens cannot be examined without a warrant. Metadata on a tag has no such protection, so analysts can use it to identify suspicious behavior without fear of breaking the law.
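The idea of searching by tag without opening the contents can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` class, the tag field names, and the sample addresses are hypothetical and are not drawn from any NSA system.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    content: bytes                             # opaque payload; the search below never reads it
    tags: dict = field(default_factory=dict)   # metadata describing the payload


def tag_record(content, kind, sender, recipient, timestamp):
    """Wrap a payload with metadata tags (all field names are hypothetical)."""
    return Record(
        content=content,
        tags={
            "kind": kind,
            "sender": sender,
            "recipient": recipient,
            "timestamp": timestamp,
            "size_bytes": len(content),  # size only, not the bytes themselves
        },
    )


def search_by_tags(records, **criteria):
    """Return records whose tags match every criterion; content is never inspected."""
    return [r for r in records if all(r.tags.get(k) == v for k, v in criteria.items())]


records = [
    tag_record(b"<opaque>", "email", "alice@example.com", "bob@example.com", "2013-06-01T10:00"),
    tag_record(b"<opaque>", "chat", "alice@example.com", "carol@example.com", "2013-06-01T10:05"),
    tag_record(b"<opaque>", "email", "dave@example.com", "bob@example.com", "2013-06-02T09:30"),
]

emails_to_bob = search_by_tags(records, kind="email", recipient="bob@example.com")
print([r.tags["sender"] for r in emails_to_bob])  # ['alice@example.com', 'dave@example.com']
```

The search function touches only the `tags` dictionary, which is the distinction the legal argument rests on: the payload stays unread while the labels are classified and queried.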
Finding Patterns in the Noise: The data-analysis firm IDC estimates that only 3 percent of the information in the digital universe is tagged when it's created, so the NSA has a sophisticated software program that puts billions of metadata markers on the info it collects. These tags are the backbone of any system that makes links among different kinds of data, such as video, documents, and phone records. For example, data mining could call attention to a suspect on a watch list who downloads terrorist propaganda, visits bomb-making websites, and buys a pressure cooker. (This pattern matches behavior of the Tsarnaev brothers, who are accused of planting bombs at the Boston Marathon.) This tactic assumes terrorists have well-defined data profiles, something many security experts doubt.
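A rough Python sketch shows how tags from different sources might be linked through a shared identifier and checked against a fixed profile. The event records, the tag names, and the `PROFILE` set are all hypothetical; this is an illustration of rule-based matching in general, not a description of any real system.

```python
from collections import defaultdict

# Hypothetical tagged events from different data sources, linked by a shared "subject" tag.
events = [
    {"subject": "S1", "source": "watch_list", "tag": "listed"},
    {"subject": "S1", "source": "web_logs", "tag": "flagged_download"},
    {"subject": "S1", "source": "web_logs", "tag": "flagged_site_visit"},
    {"subject": "S1", "source": "purchases", "tag": "flagged_purchase"},
    {"subject": "S2", "source": "purchases", "tag": "flagged_purchase"},
    {"subject": "S3", "source": "watch_list", "tag": "listed"},
    {"subject": "S3", "source": "web_logs", "tag": "flagged_site_visit"},
]

# The profile is the full set of tags a subject must carry to be surfaced (an assumption).
PROFILE = {"listed", "flagged_download", "flagged_site_visit", "flagged_purchase"}


def subjects_matching(events, profile):
    """Group tags by subject, then keep subjects whose tags cover the whole profile."""
    tags_by_subject = defaultdict(set)
    for event in events:
        tags_by_subject[event["subject"]].add(event["tag"])
    return sorted(s for s, tags in tags_by_subject.items() if profile <= tags)


print(subjects_matching(events, PROFILE))  # ['S1']
```

The sketch also shows why the approach is brittle: a subject who deviates from the profile in even one respect, like S3 here, is never surfaced, which is the weakness the skeptical experts point to.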
Open Source and Top Secret: The NSA has been a big promoter of software that can manage vast databases. One of these programs is called Accumulo, and while there is no direct evidence that it is being used in the effort to monitor global communications, it was designed precisely for tagging billions of pieces of unorganized, disparate data. The secretive agency's custom tool, which is based on Google's BigTable design, is actually open-source. This year a company called Sqrrl commercialized it and hopes the healthcare and finance industries will use it to manage their own big-data sets.
The Miners: Who Does What

The NSA, home to the federal government's code makers and code breakers, is authorized to snoop on foreign communications and also collects a vast amount of data: trillions of pieces of communication generated by people across the globe. The NSA does not chase the crooks, terrorists, and spies it identifies; it sifts information on behalf of other government players such as the Pentagon, CIA, and FBI. Here are the basic steps: To start, one of 11 judges on the secret Foreign Intelligence Surveillance Act (FISA) Court accepts an application from a government agency to authorize a search of data collected by the NSA. Once authorized (and most applications are), data-mining requests first go to the FBI's Electronic Communications Surveillance Unit (ECSU), according to PowerPoint slides taken by Snowden. This is a legal safeguard: FBI agents review the request to ensure no U.S. citizens are targets. The ECSU passes appropriate requests to the FBI Data Intercept Technology Unit, which obtains the information from Internet company servers and then passes it to the NSA to be examined with data-mining programs. (Many communications companies have denied they open their servers to the NSA; federal officials claim they cooperate. As of press time, it's not clear who is correct.) The NSA then passes relevant information to the government agency that requested it.
What the NSA Is Up To

Phone-Metadata Mining Dragged Into the Light: The NSA controversy began when Snowden revealed that the U.S. government was collecting the phone-metadata records of every Verizon customer, including millions of Americans. At the request of the FBI, FISA Court judge Roger Vinson issued an order compelling the company to hand over its phone records. The content of the calls was not collected, but national security officials call the metadata program "an early warning system" for detecting terror plots (see "Connecting the Dots: Phone-Metadata Tracking").
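How call records alone can connect people is easiest to see as a graph problem. The Python sketch below builds a graph from hypothetical (caller, callee, duration) tuples and walks outward from one number by a fixed count of hops. The record layout, the function names, and the hop limit are assumptions for illustration; the source does not describe the actual procedure.

```python
from collections import defaultdict, deque


def build_call_graph(call_records):
    """Link every pair of numbers that appear together in a call record."""
    graph = defaultdict(set)
    for caller, callee, _duration in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def contacts_within(graph, seed, max_hops):
    """Breadth-first search: map each reachable number to its hop distance from seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        number = queue.popleft()
        if hops[number] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(number, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[number] + 1
                queue.append(neighbor)
    del hops[seed]
    return hops


# Hypothetical metadata: who called whom and for how long (seconds). No call content.
call_records = [("A", "B", 60), ("B", "C", 30), ("C", "D", 10), ("E", "F", 5)]
graph = build_call_graph(call_records)
print(contacts_within(graph, "A", max_hops=2))  # {'B': 1, 'C': 2}
```

Each extra hop can multiply the number of people pulled in, so the choice of hop limit largely determines how many unrelated callers end up in the result.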
PRISM Goes Public: On the heels of the metadata-mining leak, Snowden exposed another NSA surveillance effort, called US-984XN. Every collection platform or source of raw intelligence is given a name, called a Signals Intelligence Activity Designator (SIGAD), and a code name. SIGAD US-984XN is better known by its code name: PRISM. PRISM involves the collection of digital photos, stored data, file transfers, emails, chats, videos, and video conferencing from nine Internet companies. U.S. officials say this tactic helped snare Khalid Ouazzani, a naturalized U.S. citizen who the FBI claimed was plotting to blow up the New York Stock Exchange. Ouazzani was in contact with a known extremist in Yemen, which brought him to the attention of the NSA. It identified Ouazzani as a possible conspirator and gave the information to the FBI, which "went up on the electronic surveillance and identified his coconspirators," according to congressional testimony by FBI deputy director Sean Joyce. (Details of how the agency identified the others have not been disclosed.) The NYSE plot fizzled long before the FBI intervened, but Ouazzani and two others pleaded guilty to laundering money to support al-Qaida. They were never charged with anything related to the bomb plot.
Mining Data as It's Created: Slides disclosed by Snowden indicate the NSA also operates real-time surveillance tools. NSA analysts can receive "real-time notification of an email event such as a login or sent message" and "real-time notification of a chat login," the slides say. That's a pretty straightforward use, but whether real-time information can stop unprecedented attacks is subject to debate. Alerting a credit-card holder of sketchy purchases in real time is easy; building a reliable model of an impending attack in real time is far harder.
What is XKeyscore? In late July Snowden released a 32-page, top-secret PowerPoint presentation that describes software that can search hundreds of databases for leads. Snowden claims this program enables low-level analysts to access communications without oversight, circumventing the checks and balances of the FISA court. The NSA and White House vehemently deny this, and the documents don't indicate any misuse. The slides do describe a powerful tool that NSA analysts can use to find hidden links inside troves of information. "My target speaks German but is in Pakistan—how can I find him?" one slide reads. Another asks: "My target uses Google Maps to scope target locations—can I use this information to determine his email address?" This program enables analysts to submit one query to search 700 servers around the world at once, combing disparate sources to find the answers to these questions.
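The "one query, many servers" pattern the slides describe resembles what database engineers call a federated search. The Python sketch below fans a single filter out to several in-memory stand-ins for servers and merges the results. The `SERVERS` dictionary, the row fields, and both function names are hypothetical; nothing here reflects how XKeyscore is actually built.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for remote databases at different sites.
SERVERS = {
    "site-a": [
        {"id": 1, "language": "de", "country": "PK"},
        {"id": 2, "language": "en", "country": "PK"},
    ],
    "site-b": [
        {"id": 3, "language": "de", "country": "DE"},
    ],
    "site-c": [
        {"id": 4, "language": "de", "country": "PK"},
    ],
}


def query_server(name, predicate):
    """Run the filter against one server and label each hit with its source."""
    return [{**row, "source": name} for row in SERVERS[name] if predicate(row)]


def federated_search(predicate):
    """Send the same query to every server in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        futures = [pool.submit(query_server, name, predicate) for name in SERVERS]
        merged = []
        for future in futures:  # collect in submission order for stable output
            merged.extend(future.result())
    return merged


# A filter loosely modeled on the slide's question: German speaker located in Pakistan.
hits = federated_search(lambda row: row["language"] == "de" and row["country"] == "PK")
print([(h["source"], h["id"]) for h in hits])  # [('site-a', 1), ('site-c', 4)]
```

Tagging each hit with its source lets the analyst see which database produced a lead, which matters when the sources differ in reliability or legal status.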
How Far Can the Data Stretch?

Oops, False Positives: Bomb-sniffing dogs sometimes bark at explosives that are not there. This kind of mistake is called a false positive. In data mining, the equivalent is a computer program sniffing around a data set and coming up with the wrong conclusion. This is when having a massive data set may be a liability. When a program examines trillions of connections between potential targets, even a very small false-positive rate equals tens of thousands of dead-end leads that agents must chase down, not to mention the unneeded incursions into innocent people's lives.
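The arithmetic behind that warning is simple enough to work through in a few lines of Python. All four input figures below are invented for illustration (a trillion connections examined, 100 real threats, a false-positive rate of one in 20 million, a 90 percent detection rate); none comes from the source.

```python
def screening_outcome(items_examined, real_threats, false_positive_rate, detection_rate):
    """Return (false alarms, true hits, share of alarms that are real)."""
    innocent = items_examined - real_threats
    false_alarms = innocent * false_positive_rate   # innocent items wrongly flagged
    true_hits = real_threats * detection_rate       # real threats correctly flagged
    precision = true_hits / (true_hits + false_alarms)
    return false_alarms, true_hits, precision


# Hypothetical inputs chosen only to illustrate the scale effect.
false_alarms, true_hits, precision = screening_outcome(
    items_examined=10**12,
    real_threats=100,
    false_positive_rate=5e-8,   # one false alarm per 20 million innocent items
    detection_rate=0.9,
)
print(f"{false_alarms:,.0f} false alarms, {true_hits:.0f} true hits, precision {precision:.2%}")
```

Under these assumptions, roughly 50,000 false alarms sit alongside 90 genuine hits, so fewer than one flagged lead in 500 is real even though the program errs only once in 20 million checks.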
Analytics to See the Future: Ever wonder where those Netflix recommendations in your email inbox or suggested reading lists on Amazon come from? Your previous interests directed an algorithm to pitch those products to you. Big companies believe more of this kind of targeted marketing will boost sales and reduce costs. For example, this year Walmart bought a predictive analytics startup called Inkiru. The company makes software that crunches data to help retailers develop marketing campaigns that target shoppers when they are most likely to buy certain products.
Pattern Recognition or Prophecy? In 2011 British researchers created a game that simulated a van-bomb plot, and 60 percent of the "terrorist" players were spotted by a program called DScent, based on their "purchases" and "visits" to the target site. The ability of a computer to automatically match security-camera footage with records of purchases may seem like a dream to law-enforcement agents trying to save lives, but it's the kind of ubiquitous tracking that alarms civil libertarians. Although neither the NSA nor any other agency has been accused of misusing the data it collects, the public's fear over its collection remains. The question becomes, how much do you trust the people sitting at the keyboards to use this information responsibly? Your answer largely determines how you feel about NSA data mining.