What is left to say about Big Data? Even if you believe only half of the hype surrounding it, it remains an exciting new area of Information Technology for business, the social sciences, government, science, and many other fields. This article will mention some successes that Bank of America, a company I used to work for, had with Big Data; discuss some issues and problems of the "next big thing", including less than successful uses of it; and provide some guidelines for achieving success and avoiding problems in the world of Big Data.
What is it?
Big Data is a catchy but deceptive name, and there are really two sides to it. The first refers to huge quantities of varied data, defined below, and the technological advances that enable ingesting data at high speed and storing, managing, and processing it within a reasonable cost and time.
Various groups have contributed to this definition: Big Data is high-volume, high-velocity, and high-variety (Gartner 2012) data, with high variability and complexity (SAS) and varying veracity, i.e. quality (Wikipedia). The technologies are now well known and use methods of spreading the processing over many processors, as well as methods for breaking the data into blocks for secure and efficient storage and access.
The second side of Big Data is that these information assets demand cost-effective, innovative forms of information processing for enhanced insight and decision making. This is accomplished through Big Data analytical tools that enable us to derive information from often heterogeneous data sources, information which enables business, sociological, and scientific discovery. Many of these methods and tools existed before the technologies that enable processing Big Data, and were already used in Business Intelligence.
The Gartner Hype Cycle
Gartner, one of the leading consulting groups in Information Technology, long ago defined the Hype Cycle, shown here, to describe the life cycle of new technologies and the products that accompany them. Only hindsight can provide a truly accurate analysis of where we are now, but judging from the publicity surrounding Big Data, it may be safe to assume that we are still close to the point called the "Peak of Inflated Expectations". At that point it is difficult to discern the value of a product and to answer questions such as whether it is optional or absolutely mandatory for success in your business, or whether it even fits your business. It is important to understand the cycle in order to make decisions which fit your company profile to the product life cycle. That cycle begins with many vendors of products and services entering the market, followed by a shakeout phase in which the number of products and vendors diminishes as some leave the market and others are consolidated or bought up by bigger fish. In the last part of the cycle, products may become commodities. Where and whether your company acquires Big Data capacity depends on the nature and character of your company and the business sector it is in.
Some Examples: Big Data
I was asked to provide examples from my years at Bank of America, but that was too long ago to provide relevant current examples. It was easy, however, to find some on the Internet; these are drawn from the White Paper "Big Data in Banking for Marketers" by the Evry Company of Norway. Bank of America has had Big Data for a long time: it was the originator of banking for the masses, and many years ago, when it was still only a California bank, it was the largest retail bank in the United States. In the 1980s, with over 12 million customers, it took three days to process a "Household Matching" algorithm which showed the use of its various products by household. Today the bank operates in all 50 states as well as 35 countries, covering approximately 80% of the U.S. population, including 46 million retail customers, 3 million small business customers, and 21.6 million active users of its mobile banking application.
In a Bank of America customer segmentation case, data was taken from customers in the target demographic who had home equity loans. From this data, researchers found a large cluster of customers who had both home equity loans and their own businesses. It took further research to identify that this pattern was caused by customers using home equity to build their businesses. In this example the K-means clustering algorithm was used, which is likely the most popular algorithm for this type of problem.
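To make the idea concrete, here is a minimal sketch of K-means in Python. The data, feature names, and cluster positions are purely illustrative inventions, not Bank of America's; the algorithm itself is the standard assign-then-recompute loop.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical customer features: [home equity balance, business income]
rng = np.random.default_rng(1)
cluster_a = rng.normal([20, 5], 3, size=(50, 2))   # equity, no business
cluster_b = rng.normal([80, 60], 3, size=(50, 2))  # equity plus a business
data = np.vstack([cluster_a, cluster_b])
labels, centroids = kmeans(data, k=2)
```

With well-separated groups like these, the two centroids settle on the two natural clusters; the "home equity plus own business" segment from the case above would surface as one such cluster.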
Targeted Marketing Programs
By segmenting, Bank of America was also able to discard its assumptions about its customers. This led to a change in its marketing message from "use the value of your home to send your kids to school" to "use the value of your home to do what you always wanted to do", and increased conversion rates tenfold. Bank of America also employs targeted marketing programs to increase the card usage of its customers through its BankAmeriDeals loyalty program, which includes tailored customer-centric rewards and charity-of-choice donations. Gordon S. Linoff and Michael J. A. Berry wrote a book on data mining techniques based on their experience with Bank of America (then NationsBank) and others, and Tom Groenfeldt recounted for Forbes magazine his experience with banks, including Bank of America, using Big Data to understand customers across channels.
There are many analysis methods that can be used for Big Data; below are some of the more popular ones. Many of them predate the advent of Big Data, and many depend on sampling, i.e. the assumption that analysis based on a subset of a large body of data can accurately provide meaningful information about the whole. The above-mentioned White Paper also discusses how they were used with banking customers, including those of Bank of America.
Decision Trees & Random Forests
Decision trees are one of the most powerful data mining techniques, as they can handle a diverse array of problems and almost any data type, and they are relatively easy to understand and interpret. They work by splitting data up along its dimensions into smaller data cells, with the aim of discovering which fields are the most important for a particular data set. They help determine worst, best, and expected values for different scenarios. Random forests are a technique used to boost the accuracy of decision tree models by creating an ensemble of slightly varying decision trees that model the same target, thus acting as a safety net against the possible errors and noise of an individual decision tree. Although decision trees are a "white box" rather than a "black box" model, they can be biased in favour of attributes with more levels, and the calculations can get very complex.
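As an illustration of the ensemble idea, here is a toy random forest in Python built from depth-one trees ("stumps"), each trained on a bootstrap sample of the data and combined by majority vote. The data and labels are invented for the example; a real random forest would also sample attributes and grow much deeper trees.

```python
import random
from collections import Counter

def fit_stump(data):
    """Depth-1 'decision tree': find the threshold split that best separates labels."""
    best = None
    for t in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        if not left or not right:
            continue
        lmaj = Counter(left).most_common(1)[0][0]
        rmaj = Counter(right).most_common(1)[0][0]
        errors = sum(y != lmaj for y in left) + sum(y != rmaj for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, lmaj, rmaj)
    _, t, lmaj, rmaj = best
    return lambda x: lmaj if x <= t else rmaj

def random_forest(data, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample, predict by majority vote."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]
    return lambda x: Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Hypothetical data: (home equity balance in $1000s, owns a business?)
train = [(10, "no"), (15, "no"), (20, "no"), (25, "no"),
         (60, "yes"), (70, "yes"), (80, "yes"), (90, "yes")]
forest = random_forest(train)
```

The vote across many slightly different trees is what smooths out the noise of any single tree.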
Cluster Detection
Cluster detection is the automation of finding meaningful patterns within a data set. Often the problem in data mining is not a lack of patterns but identifying which patterns are useful among the noise of possibly hundreds of competing patterns within the data. Cluster detection helps penetrate this noise by finding clusters of data that form natural groupings within the data assets.
Text Analytics
It is not possible for humans, or even machines, to read all the text being generated inside and outside an organization. Text analytic algorithms assist by automating the reading process, deriving patterns within the data, and providing a brief summary compiled from many documents. The algorithms rely heavily on probability theory and on the rarity and occurrence of certain words, which can be used to predict the meanings and themes of the text.
Text analytics is a form of classification, so a target field for the algorithm must be clearly defined. Naïve Bayes formulas can be used to determine the probability that certain words belong to documents of particular classes; for example, Naïve Bayes formulas applied to words associated with spam are widely used to identify spam emails.
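A minimal sketch of the Naïve Bayes idea in Python, with invented toy emails (a real spam filter would train on far more data and use more careful feature extraction):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label). Returns per-label word counts, label counts, and vocabulary."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter(label for _, label in docs)
    for text, label in docs:
        counts[label].update(text.lower().split())
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def classify(text, counts, totals, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word | label),
    with add-one (Laplace) smoothing for unseen words."""
    scores = {}
    for label in counts:
        total_words = sum(counts[label].values())
        score = math.log(totals[label] / sum(totals.values()))
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

train = [("win free money now", "spam"),
         ("free prize claim now", "spam"),
         ("meeting agenda for monday", "ham"),
         ("lunch on monday with the team", "ham")]
model = train_nb(train)
```

Words like "free" and "claim" pull a message toward the spam class because they appear far more often in the spam training documents.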
Neural Networks
Neural networks are used heavily in data mining. They are classification algorithms that can be used for a wide range of analyses, and are modelled after neurons in the brain, which are activated by a signal and in turn transmit a response signal to activate other nodes. In the model, each node comprises a combination function, which receives various incoming signals and calculates a total based on a set of weights, and a transfer function, which sends an outgoing signal based on that total. The target field would typically be a scoring function, such as a customer's propensity for a particular product. The algorithms are black boxes in that they can answer what the most likely action B is given person/event A, but not why.
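The combination and transfer functions can be sketched in a few lines of Python. The weights below are arbitrary numbers chosen purely to show the signal flow; in practice they would be learned from data by a training algorithm.

```python
import math

def node(inputs, weights, bias):
    """Combination function: weighted sum of incoming signals, plus a bias."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Transfer function: squash the total into a signal between 0 and 1 (sigmoid)
    return 1 / (1 + math.exp(-total))

def score(inputs):
    """A tiny two-layer network producing a hypothetical propensity score."""
    h1 = node(inputs, [0.8, -0.4], 0.1)      # hidden node 1 (illustrative weights)
    h2 = node(inputs, [-0.3, 0.9], 0.0)      # hidden node 2
    return node([h1, h2], [1.2, 1.1], -1.0)  # output node: score in (0, 1)
```

Notice that the score is just nested arithmetic: nothing in the numbers explains *why* a given input produces a high score, which is exactly the black-box property described above.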
Link Analysis
Link analysis, part of a branch of mathematics called graph theory, is a data-analysis technique used to evaluate relationships (connections) between nodes. Relationships may be identified among various types of nodes (objects), including organizations, people, and transactions. Link analysis has been used in investigations of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, medical research, and art.
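One classic link-analysis measure is node importance. Here is a small, self-contained PageRank-style sketch in Python over a hypothetical transaction network; the node names and links are invented, and a real investigation would use a graph library and much richer edge data.

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict mapping each node to the list of nodes it links to.
    Returns a score per node; nodes with many incoming links score higher."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        # each node keeps a base share, plus a damped share of its linkers' rank
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            for t in targets:
                new[t] += damping * rank[n] / len(targets)
        rank = new
    return rank

# Hypothetical transaction network: who sends funds to whom
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
ranks = pagerank(graph)
```

In a fraud-detection setting, an account like "B" that receives flows from many others would stand out for closer inspection.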
Survival Analysis
Survival analysis is a time-to-event analysis that tells you when you should start worrying about an event. It can be used to answer questions like:
- When is a customer likely to leave?
- When will the customer move to a new customer segment?
- When will the customer next narrow or widen their interaction with the organization?
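A standard tool for such questions is the Kaplan-Meier survival curve; here is a minimal Python sketch with invented churn data (a production analysis would use a dedicated library and handle confidence intervals properly). Customers who are still with the bank when observation ends are "censored": we know they survived at least that long.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival curve.
    durations: time until the event (e.g. months until churn) or until observation ended.
    observed: True if the event happened, False if the customer was censored."""
    survival, s = {}, 1.0
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        n_at_risk = sum(1 for d in durations if d >= t)
        # at each event time, multiply in the fraction that survived it
        s *= 1 - events / n_at_risk
        survival[t] = s
    return survival

# Hypothetical churn data: months with the bank, and whether the customer left
months = [3, 5, 5, 8, 12, 12, 14, 20]
left = [True, True, False, True, True, False, True, False]
curve = kaplan_meier(months, left)
```

Reading the curve answers the "when should we worry" question: the month at which the survival probability drops steeply is when retention offers should go out.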
The Big Picture and a Big Data Failure
“All models are wrong, but some are useful” George Box, Professor of Statistics (Wisconsin), 1987
“All models are wrong, and increasingly you can succeed without them” Peter Norvig, Google, 2011
Any discussion of Big Data also needs to point out the problems and provide some guidance toward successful outcomes. There has been considerable discussion of the risks in the use of Big Data analysis, but no analysis of how often it succeeds or fails is possible: for obvious reasons, companies rarely supply data on failed projects. There are, however, many examples of mistaken interpretations, and there is one case of an enormous failure, akin to the proverbial “Elephant in the Living Room”, which we all try with great difficulty to ignore.
Failure of Polls
The enormous failures were the polls for the 2016 U.S. elections and those for the British general election in 2015. In previous years polls had often been off the mark, but in the U.S. in 2016 they were a complete disaster, and yet that was the year in which Big Data techniques were extensively used. According to the New York Times, “Virtually all the major vote forecasters, including Nate Silver’s FiveThirtyEight site, The New York Times Upshot and the Princeton Election Consortium, put Mrs. Clinton’s chances of winning in the 70 to 99 percent range.”
Aaron Timms, the Director of Content at Predata, a New York-based predictive analytics firm, offered the following explanation of the disastrous polls in an article in Fortune, 14 Nov 2016:
All data sets and data-driven forecasting models — even those that claim to run off artificial intelligence — are, to some extent, a reflection of their creator’s own biases. There is a subjectivity embedded in every curatorial choice that goes into the creation of a poll, or a set of signals to monitor debate online, or a prediction model. The interpretation of data, too, is necessarily subjective. But one mistake does not mean we should forfeit the game. Gather data, crunch data, interpret data: there is nothing fundamentally unsound or stupid about this basic exercise. It’s still worth doing. But we need to get better at understanding what the data can tell us — its potential and limitations — and how it fits into a broader analytical picture.
This may not be a sufficient explanation of what happened but the outcome should alert everyone to the risks involved in Big Data Analytics.
Failure of Economic Models
The failure of models is not new with Big Data analytics. For years economics has been based on models which sadly failed to foresee or even to understand the major crises of recent times. The U.S. government, and in particular the Federal Reserve banks, along with the country's universities and corporations, have cadres of economists, data “scientists”, and financial analysts with truly huge amounts of data, both structured and unstructured, including frequent direct reporting from the U.S. banks.
James K. Galbraith, one of the most respected economic thinkers and writers, as well as the son of John Kenneth Galbraith, discusses this in The End of Normal (Simon & Schuster, 2014). This is not the place to present all the arguments of that complex discussion. Essentially, the economic models, the same ones that anyone who took an economics course had to memorize, simply did not reflect reality. Economists and financial analysts had all the Big Data tools but still lacked the appropriate analytical models to come to the correct conclusions. Even after it was quite obvious that the models were inappropriate, many economists continued to defend them.
Common Business Strategy failures
On a much smaller scale, there are many other risk factors in Big Data analytics, from the development of a business-oriented strategy to the selection of analysis methods appropriate both to the data available and to fulfilling that strategy. These problems are much smaller than those discussed above. For example, one case describes an analysis that indicated which households were about to churn their banking products; in the end, a hands-on analysis revealed that these were cases of pending divorce in which assets were being moved out of joint custody.
A recent Gartner symposium discussed some strategic failure risks for Big Data projects.
- Organizational inertia. This is the failure to align corporate goals with those of the Big Data project, or vice versa. The ideal situation is to get “buy in” on the project from all the critical players, which is not an easy proposition. You may get the business managers to buy in but not the IT staff; that is not an unusual situation, as IT is primarily concerned with the smooth running of existing systems, and any new technology threatens to be disruptive. Sometimes you can make this work by stealth, for example by sneaking in a prototype which can become something management cannot live without. Sometimes the project may have originated in the IT department without a clear business directive. One of the proposals for Big Data projects is a new type of project life cycle in which the business goals are only loosely defined; the idea is that we won’t know what can be derived from the data until we start to analyze it. Theoretically, this avoids the old project method of having a clear goal that remains valid only for a short time and then requires a redevelopment of the project. This was also true of Business Intelligence projects. But from personal experience, it is preferable to have some specific, realizable goals for a project; if it is successful, the nature of the project means other goals can easily be pursued without redeveloping it. These choices will depend on the profile of the company and on whether Big Data analytics is managed by a “C”-level executive, who may be the CEO or CTO. Some believe that there should be a Chief Analytics Officer (CAO) to ensure the correct organizational support.
- Selecting the wrong use cases. In an entry-level physics class in a “one million dollar” classroom, the professor had an elaborate experiment to demonstrate waves. At the end of it he said, “Of course, it is not possible to show an example from real life, as real life is too complicated.” Defining a model for resolving problems is extremely difficult, and there is some amount of art in determining where to place the parameters which enable the analysis without overcomplicating it to the point of inexplicability. A Gartner example case was the relationship between good and bad habits and the propensity for buying life insurance. At first “good vs. bad” was too general, so the case focused on smokers versus non-smokers, which turned out to be too limited. The recommendation was to prioritize the elements of the use case and gradually increase the complexity of the problems to be solved.
- The feasibility of the project at a production level. Big Data and Big Data analytics require a multitude of skills and cross-functional IT support to get off the ground. Issues like networks, security, and simply the facility can prevent your project from taking off. I once ordered a half-million-dollar server with all the necessary approvals. It arrived at the remote site, but no one would install it, simply because I hadn’t gone there personally to meet with the responsible staff. Literally, a two-hour flight and an informal meeting were all it took.
- Lack of the correct Big Data analytics skills. These skills are really the heart of the project, but with the hype surrounding Big Data there is an acute shortage of skilled and experienced staff. A lack of the correct skills may lead the team to fake it.
- Not questioning the data or the results, and failing to ask the right questions. These are two sides of the same coin. The first requires a data review by people close enough to the data to have some feeling for what it should look like; the second requires understanding from those same users what questions they need answered.
- Ensure that the modelling is pertinent. The example given was one of churning, in which the customers were about to leave a company not because of dissatisfaction with service, but because they were divorcing.
The symposium presentation stated that understanding the right models to use, the right level of data abstraction and the model’s nuances “is very challenging. This is one of the keys of big data analytics.”
The Communication Gaps
Ideally, these projects need strong management support. The data management side is relatively easy to explain; the dilemma is in trying to explain to business types the concepts entailed in Big Data analysis. This sales pitch is helped at present by the very strong hype found in just about every discussion of Big Data, but it remains difficult to tie that hype to concrete and realizable business propositions. Big Data is also the latest in a succession of “next big things”, from Data Warehouses to Business Intelligence to Business Analytics, and many of those projects either did not deliver or did not meet expectations. In the case of Big Data, the lack of a specific business goal is viewed as an advantage, giving the project team more flexibility in creating and modifying the system; in many cases companies are advised to make a go of it and worry later about what they will gain from it. These choices will depend on the profile of the company.
“We (business in general) need to create a new department with a new Analytics head . . . to serve as an internal consulting groups to the rest of the business, assigned to clients, to sit with them and ensure analytical integration” Mikael Weigelt, PhD., Director – CRM Customer Intelligence T-Mobile
The “Geek Divide”
The essential problem has always been communication, and the problem is shared by both the business and technical sides. Since the beginnings of technology there has always been a “geek divide”; years ago, IT staff were weirdos hidden away from other staff. There is still a cultural divide that separates the geeks from everyone else, one only widened by now referring to the geeks as “data scientists”. And since the beginnings of business there has been a double communication difficulty for the business side: on one hand, the need to express business needs in a way that can be met by a technological solution; on the other, the need for those who understand the business to know the potential of technology to meet those needs.
When these two come together, the result can easily be greater than the sum of the parts, because it enables the visionary to understand what technology could do for the business. This is where the best competitive advantage occurs. This is how we can move from an attitude which implied that there was no competitive advantage in technology (Nicholas G. Carr, “IT Doesn’t Matter”, Harvard Business Review, May 2003) to understanding that competitive advantage comes from enabling new functionality by exploiting the potential of technology for new and innovative business strategies.
All too often we have meetings where someone who may be a “data scientist” speaks using terminology that many don’t understand. He or she soon loses the listeners, whose eyes glaze over under a barrage of the latest obscure buzzwords. At the end, none of the listeners, especially those who didn’t follow the thread, ask for further explanation, perhaps assuming that someone else in the group understood everything. Instead, we should have meetings where Big Data analysts understand the processes and needs of the business, and where business types are IT-literate enough to explain the business and understand how technology can help it. In such meetings the business staff insist on explanations which truly explain the technology proposed, and the technology types reach across the divide and explain technology on a business level. Unfortunately, the fault is on both sides. Academic institutions are not producing technical types who also have the real-world skills to understand business and business communications, and they are not producing enough MBA types who understand technology, which today is literally the lifeblood of business. But in the cases where these two types were present, meetings were held in which each side stimulated the other to imagine new and innovative business analyses and enhanced services and products for customers.
By J.M., retired.