It also hopes to develop the technique to uncover trends in customer geolocation.
The independent network provider, which licenses the O2 brand, is developing Word2vec to overcome the problem of messy, unreliable data resulting from SIM cards connecting to network base transceiver stations, says Jan Romportl, O2 Czech Republic chief data scientist.
"Anybody who talks to me from outside the industry thinks we've got great geolocation data about all our customers. When people learn the truth, they get very disappointed," he tells ZDNet.
The problem is that network base stations were never designed to provide meaningful location data. Their connections to individual devices can appear quite random, and many handovers between cells are not recorded.
A known route, such as a journey by train, appears in the recorded data to jump unpredictably between base stations, making it very difficult to pinpoint a device's location from this source alone. GPS data, meanwhile, is only available to phone operating-system providers and apps with which customers have agreed to share the data.
The O2 Czech Republic data-science team wanted to use records of contact between SIM cards and base stations to segment its customers based on their patterns of movement, but it also wanted to use the data to improve network performance.
Having grappled unsuccessfully with these problems, the team turned to Word2vec, developed by researchers led by Tomáš Mikolov at Google, to find out if it could reveal the locations of those base stations from raw network data without additional tagging or interpretation.
Word2vec is a group of machine-learning models that express words as vectors, typically in 100 or more dimensions, based on analysis of a corpus of data, such as the text from Wikipedia.
The process produces word embeddings, which data scientists can manipulate to create linguistically meaningful abstractions. For example, the vector of 'Queen' is almost equal to 'King + Woman – Man'.
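The 'King + Woman – Man' arithmetic can be illustrated with a minimal sketch. The three-dimensional vectors below are invented purely for illustration and are not taken from any trained model; real Word2vec embeddings are learned from a corpus and typically have 100 or more dimensions. The helper names `cosine` and `analogy` are likewise hypothetical.

```python
import math

# Toy 3-dimensional embeddings, made up for illustration only.
# Real Word2vec vectors are learned from data, not written by hand.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, positive, negative):
    """Return the word whose vector is closest to sum(positive) - sum(negative).

    The input words themselves are excluded from the candidates.
    """
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

With learned embeddings the match is only approximate, which is why the nearest vector is found by similarity rather than by exact equality.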
The technique is not normally used outside natural-language processing. But O2 Czech Republic's data-science team thought it might help interpret the corpus of data it collects from SIM cards connecting to base stations.
"We used absolutely no other information; just plain text of the cell ID tokens," Romportl says.
The team trained Word2vec on these cell ID tokens, creating a 100-dimensional vector for each of the 50,000 cell IDs. The next problem was to reduce the number of dimensions to produce a meaningful interpretation of the data.
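The article does not describe how the team prepared its data, so the following is only a sketch under one plausible assumption: that the ordered list of cells a SIM connects to is treated like a sentence, with each cell ID as a word. It shows how the skip-gram variant of Word2vec would derive (center, context) training pairs from such sequences. The cell IDs, the `traces` data and the `skipgram_pairs` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical traces: each list is the ordered sequence of cell IDs
# that one SIM card connected to. These IDs are invented for illustration.
traces = [
    ["cell_A", "cell_B", "cell_C", "cell_D"],
    ["cell_B", "cell_C", "cell_D", "cell_E"],
    ["cell_A", "cell_B", "cell_C"],
]

def skipgram_pairs(sequence, window=1):
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, sequence[j]

# Count how often each pair of cell IDs appears together across all traces.
pair_counts = Counter(
    pair for trace in traces for pair in skipgram_pairs(trace, window=1)
)

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Cells that devices move between often appear together in many pairs, so a model trained on those pairs would tend to place their vectors close together. That is one way location-like structure could emerge from nothing but a stream of tokens.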
Having read research published in 2018, one data scientist on the team suggested a new algorithm called Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP).
"We had no idea how it worked. We just took the default parameters we needed to reduce 100-dimensional space to a 2D space and just did the scatter plot," Romportl says.
"It was the best things I've seen in my data-science career. If you flip from the scatter plot to look at the map of the Czech Republic, you can see the reduction was able to create the longitude and latitude coordinates of each tower," he says.
"That data was not in the original state. It was just a stream of tokens. The neural network is a universal algorithm for dimensionality reduction. It compressed all invisible patterns into 100D space, all the patterns that relate to the location of the base stations. It was a eureka moment for us."
O2 Czech Republic already knew the location of its base stations, but the findings, presented at Teradata Universe EMEA Conference 2019 in Madrid, demonstrate that Word2vec can be developed to reveal other hidden characteristics of the network and so help improve its performance and customer experience, he says.
The team is also planning to use a related technique, Doc2Vec, to group customers into segments based on their journey patterns, helping outside partners in marketing and public-sector planning, for example.
Although Word2vec has been used outside language processing, O2 Czech Republic's approach to geospatial data is probably a first, says James Kobielus, lead analyst for data science at research company Wikibon.
"These methods have been kicking around for a while, but what the O2 people are doing sounds very interesting. It's not anything I've seen done elsewhere and as far as I can tell it is an innovation in the application of Word2vec," he says.
O2 Czech Republic's work with Word2vec shows why data scientists should be allowed to experiment, says Torsten Volk, industry analyst at Enterprise Management Associates.
"Data scientists are rare and cost a lot of money to hire. Businesses think they had better produce something that works, so they tend to use established techniques that produce results. But they are generally not exploring and finding new things."
Organizations hoping to find value in the increasing volumes of data they collect could benefit from a more open-ended approach to data science, exploring new applications of machine-learning techniques, as O2 Czech Republic has done, he says.
Or they could wait for the competition to do it first.