Video: Scraped and leaked: 48 million users' social profiles
A little-known data firm was able to build 48 million personal profiles, combining data from sites and social networks like Facebook, LinkedIn, Twitter, and Zillow, among others -- without the users' knowledge or consent.
Localblox, a Bellevue, Wash.-based firm, says it "automatically crawls, discovers, extracts, indexes, maps and augments data in a variety of formats from the web and from exchange networks." Since its founding in 2010, the company has focused its collection on publicly accessible data sources, such as the social networks Facebook, Twitter, and LinkedIn and the real estate site Zillow, to produce profiles.
But earlier this year, the company left a massive store of profile data on a public but unlisted Amazon S3 storage bucket without a password, allowing anyone to download its contents.
The bucket, labeled "lbdumps," contained a compressed file that unpacked to a single file of more than 1.2 terabytes. The file listed 48 million individual records, scraped from public profiles, consolidated, then stitched together.
The data was subsequently found by Chris Vickery, director of cyber risk research at security firm UpGuard. Vickery, a well-known ethical data breach hunter, disclosed the leak to Localblox's chief technology officer Ashfaq Rahman in late February. The bucket was secured hours later.
The discovery is the latest twist among recent scandals involving tech companies and their data collection practices.
Just last month, Facebook was embroiled in a privacy row after London-based data firm Cambridge Analytica obtained data on as many as 87 million users from an academic app that collected data on its users and their friends. The social networking giant called that figure a "conservative estimate." The data was used to build profiles on millions of Americans to predict how people would vote at the ballot box, including in the 2016 presidential election.
The controversy sparked uproar, triggered congressional and parliamentary inquiries and investigations across the world, and forced Facebook to introduce stronger privacy practices.
But the data collection by Localblox can be just as invasive, and can include highly sensitive and personally identifiable information on a person -- without that person's consent.
Vickery showed ZDNet the data first-hand in New York last week.
The data was found in a human-readable, newline-delimited JSON file. It includes names, physical addresses, employment information, job histories, and more, scraped from Facebook, LinkedIn, and Twitter profiles.
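For readers unfamiliar with the format, a newline-delimited JSON file holds one complete JSON object per line, so it can be read record by record without loading the whole file into memory. The following is a minimal sketch in Python of how such a file might be inspected. The field names and sample records are invented for illustration and are not taken from the exposed file, and the helper names `iter_ndjson` and `field_coverage` are hypothetical.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def field_coverage(records: Iterable[dict]) -> Counter:
    """Count how many records carry a non-empty value for each field."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            if value not in (None, "", [], {}):
                counts[key] += 1
    return counts


# Invented sample data standing in for a file on disk; a real file would be
# opened with open(path, encoding="utf-8") and passed in the same way.
sample = io.StringIO(
    '{"name": "A. Example", "employer": "Example Corp", "address": ""}\n'
    '{"name": "B. Example", "address": "1 Example St"}\n'
)
coverage = field_coverage(iter_ndjson(sample))
```

Because each line is parsed on its own, a tally like this could in principle be run over a file far larger than available memory, and it would show which fields are filled in for only some records.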
UpGuard's own report, published Wednesday, contained search queries that Localblox would use to feed email addresses it had already collected into Facebook's search engine, in order to retrieve users' photos, current job titles, employer information, and additional family information.
It's also believed that the company supplements its collected data with data from non-public sources, like purchased marketing data. The data is then compiled, organized and blended into existing individual profiles.
The report described the collection operation as an effort to "build a three-dimensional picture on every individual affected" to use for advertising or political campaigning.
Vickery said that some records are more complete than others.
Localblox has long boasted about the amount of data it can collect.
A sample consumer profile (since removed) on the company's website purports to additionally include a person's location, email addresses, IP addresses (which can in some cases identify a person's location), phone numbers, postal addresses, salary, employer and job title, and other precise markers. (Editor's note: we have also removed a link to the sample profile once we learned that the profile contained real information on an individual.)
The data can also include, though not always, whether a person is a credit card user, their "Do Not Call" preferences, marital status, and net worth.
Localblox claims it has more than 650 million records in its device ID database, and 180 million records in its mobile phone database, which includes mobile phone numbers and carriers.
The company also says it has a US voter database with 180 million citizens. It's not known how old that database is, but a voter records leak (coincidentally also found by Vickery) suggests Localblox's database isn't far behind an exposed mid-2017 database that contains 197 million voter records.
"Concentrating millions of people's details can become by its very nature a weaponized thing, and something that can lead to a lot of harm," said Vickery.
ZDNet contacted Localblox before publication with several questions.
In a phone call, Ashfaq Rahman claimed Vickery "hacked in" to the publicly accessible S3 bucket. (Vickery has long said he works strictly ethically and within the law to responsibly disclose exposed data.) Rahman would not say why he restricted the bucket's permissions hours later.
Rahman also disputed the 48 million figure, saying that "most" of the data was fabricated and for internal tests, but would not give a percentage. When asked about more personal data, such as geolocation and IP address data, he said they "do not link to the actual owners."
In a later email exchange, Rahman said "no other individual is believed to have accessed this file from the S3 bucket."
He reiterated that the company "joins bits and pieces to generate transformative intelligence."
According to a 2013 article, Localblox's president Sabira Arefin said it's "up to the individual sites and system to determine the terms and conditions and then enforce any security mechanism in place if they want to prevent scraping."
Arefin did not respond to our emailed questions.
ZDNet also reached out to the companies whose data Localblox scraped.
Facebook said that scraping data from its service is prohibited. In a statement, the spokesperson said: "We are currently investigating all apps that had access to large amounts of information before we changed our platform to dramatically reduce data access in 2014. We will conduct a full audit of any app with suspicious activity. And if we find developers that misused personally identifiable information, we will ban them and inform everyone affected."
LinkedIn has been battling website scraping in the courts. A spokesperson said: "Any scraping of data from our platform is a clear violation of LinkedIn's Terms of Service. Our members control the information that they make publicly available on LinkedIn and we protect that control by taking aggressive action to stop any illicit scraping when it is discovered."
Twitter, which has user profiles and tweets open and public by default, said that automated scraping of data from the site "without our prior consent is expressly prohibited."
Data scraping companies are not new, but they are becoming more powerful -- and controversial in the wake of the Cambridge Analytica scandal.
But supporters of the industry say the data is fair game -- if it's already publicly available.
Nielsen, a media research firm, used to scrape data from the web but stopped doing so unless it had obtained permission. But a company spokesperson once said that, "if someone decides to share personally identifiable information, it could be included," according to a 2010 report in The Wall Street Journal.
But internet users have little to no recourse if their already-public data is scraped. No laws exist to require data companies to let people change or remove their data, unlike in Europe where data protection and privacy rules are stricter.
Though these data scraping companies are hoarding massive amounts of organized data, Vickery said it's worth remembering where they got the data from in the first place.
"I think these companies need to take a little more responsibility over what's being done with this data, and reflect on the role they're playing in this day and age," he said.