Patrick Schwerdtfeger is a motivational speaker who can cover the topic of ‘Big Data’ and structured data versus unstructured data at your next business and/or technology event. The full transcript of the video is included below.
Full Video Transcript:
Hi and welcome to another edition of Strategic Business Insights. Today we’re going to talk about big data and in particular the opportunity with unstructured data within the big data trend. So there’s basically the structured data and there’s unstructured data. So let’s just start with that distinction.
Most people, when they think about data, think of something like an Excel spreadsheet, where you have rows and columns and very neat, tidy information that all lines up evenly, and it’s just a question of crunching that data. Okay, that’s structured data. What’s unstructured data? Well, unstructured data is stuff that’s not in that clean format. So, for example, when you do anything on a computer, let’s say you go to a particular website, you’ve just done one thing: you’ve typed the URL into the browser and gone to that website. But there are logs, machine logs, that probably have 10 or even 15 little data points about what actually took place. Your computer reaches out to your ISP, and your ISP figures out what the IP address of the website is. The logs record what browser you’re using, where that website is located, how long it took for the information to get back to your browser, and how long you spent there. There are all these machine logs, and if you look at them they look garbled, like tons of stuff, almost like computer code.
And the interesting thing is that the machine log of what took place is always different, depending on which website you go to, where you’re located (so the routing of the information over the Internet between the website’s servers and your computer), what kind of computer you’re using, and what kind of browser you’re using. It’s different in almost every case. So you really need to do a lot of programming to identify where the important pieces of information are: the IP address of the website, the IP address where you’re located, perhaps what browser or machine you’re using or the speed at which… Whatever the useful information is, it’s going to be in a different place every time. So it’s very tricky to get the value out of the unstructured data.
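To make that concrete, here is a minimal Python sketch of pulling the same fields out of log lines that arrive in different layouts. The two log formats, the sample lines, and the function name `extract_fields` are all made up for illustration; real machine logs vary far more than this.

```python
import re

# Two hypothetical log layouts that record a similar visit in different ways.
LOG_LINES = [
    '203.0.113.7 - - [12/Mar/2014:10:15:32] "GET /index.html" 200 "Firefox/27.0"',
    'ts=2014-03-12T10:15:40 browser=Chrome/33.0 status=200 client=198.51.100.23 path=/about',
]

# Patterns that look for a field anywhere in the line, not at a fixed position.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
BROWSER_PATTERN = re.compile(r"\b(Firefox|Chrome|Safari|MSIE)[/ ]([\d.]+)")


def extract_fields(line):
    """Search the whole line for each field instead of relying on its position."""
    ip = IP_PATTERN.search(line)
    browser = BROWSER_PATTERN.search(line)
    return {
        "ip": ip.group(0) if ip else None,
        "browser": browser.group(1) if browser else None,
    }


records = [extract_fields(line) for line in LOG_LINES]
for record in records:
    print(record)
```

Once every line has been reduced to the same set of named fields, the messy logs can be handled like rows in a spreadsheet. Writing and maintaining patterns like these for every log format you run into is the "lot of programming" part.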
So most people, when it comes to doing data analytics (that’s the big phrase these days: analytics, predictive analytics, business intelligence and things like that), immediately go to the structured data because it’s the easiest to mine. But the reality is that the biggest opportunity is in the unstructured data. It’s messy and it’s harder to get at, but the insights are extraordinary. And I’ll give you a couple of examples in just a second, but first it’s important to know that, in the world in general, 95% of data is unstructured; only 5% is structured. In a given business there’s slightly more structured data, because businesses are in the business of trying to keep all their activity organized, but it’s still roughly 80% unstructured and 20% structured. So the vast majority of data that’s out there is unstructured; there’s a ton more of it than there is of the structured stuff.
Now, let me make another quick distinction, because there’s the distinction of pre-big data and what is now taking place in big data. In the old world, it was always about sampling because we didn’t have the capacity to get all of the information in one place. Number one, it wasn’t economical, and number two, there was no way to even store it or process it at that time, so it was always a question of sampling.
So think election polling: when an election is coming up, they call something like a thousand people, ask them questions, and then extrapolate from those answers to draw inferences about the entire population. They didn’t ask the entire population; they just asked a thousand people, or maybe a hundred or 2000 or whatever that number is. So they always used the notation “N equals whatever,” where N is the number of samples that you took. If the sample size is a thousand, then N = 1000.
But today, in the big data world, “N equals all.” N equals everything. We have the capacity to harvest the data, to store the data, and to process the data. So N equals all.
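As a rough illustration of what the sample size N buys you in a poll, the sketch below uses the standard normal-approximation formula for the margin of error of a proportion. The formula, the 95% confidence level, and the function name `margin_of_error` are assumptions added here, not something stated in the talk, and the formula only covers simple random samples.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n in (100, 1000, 2000):
    points = margin_of_error(n) * 100
    print(f"N = {n}: about +/- {points:.1f} percentage points")
```

With N = 1000 the estimate carries an uncertainty of roughly three percentage points either way. When N equals all, this particular source of error goes away, because nothing is being extrapolated from a sample.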
Now, there’s a very important distinction here. It’s a fundamental shift, because when you’re sampling, inevitably what you’re doing is trying to identify the primary trend, and by virtue of doing that you basically weed out the outliers. When N equals all, when you’re taking in all of the data, you are including all of the outliers and in fact embracing them, because the outliers are what give true depth and meaning to the overall body of data, and that’s what the examples I’m going to get to will show. That distinction between sampling, which limits outliers, and big data, which embraces outliers, is a fundamental shift in how people are approaching data analytics and business intelligence.
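One way to see why sampling tends to lose the outliers is to work out how likely a sample is to contain none of them. The population size, the number of rare cases, and the function name `chance_sample_misses_all` below are hypothetical numbers and names chosen only for illustration.

```python
def chance_sample_misses_all(population, rare_cases, sample_size):
    """Probability that a random sample drawn without replacement has no rare case."""
    probability = 1.0
    for i in range(sample_size):
        # Chance that the next draw is an ordinary case, given none were rare so far.
        probability *= (population - rare_cases - i) / (population - i)
    return probability


p_miss = chance_sample_misses_all(population=1_000_000, rare_cases=10, sample_size=1000)
print(f"Chance a sample of 1000 contains none of the 10 rare cases: {p_miss:.3f}")
```

Under these made-up numbers, about 99 samples out of 100 would not contain a single one of the rare cases, so any analysis built on the sample would never see them. With N equals all, they are in the data by definition.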
So, a couple of examples. First, Google Translate. Google Translate is by no means the first attempt at making an automated translation algorithm. This has been done a number of times in the past, but it was always based on something like a thesaurus of sorts, a translation dictionary where every word had an equivalent word in the other language, and the program just went through word by word: taking this word and translating it, then this word, then this word. What you ended up with, quite often, was that the context was lost and the translation was really poor, because different languages operate differently: the phrases are different, the context of every single sentence is unique, and even within a given language you might be able to say the same thing in different ways.
Well, what Google did (and it was the first company that took translation to this level) was have the algorithm feed off of the entire Internet. There are all sorts of places where companies have translated their websites, and the companies have hired people to do that translation. Some are good, some are not so good. There are also countries that have multiple languages: in Switzerland, for example, they speak Italian and French and German; in Canada, they speak English and French. In countries like these, the parliaments and government organizations have to translate everything, by mandate, into the other languages. So there’s all this content out there, and Google’s algorithms actually harvest all of it. What that allows them to do is calculate probabilities for the most likely contextual equivalent of what’s being said.
And so the quality of translation went way, way up because they included the messy data. They included a vast amount of messy data, outliers and all. Some of the translations were good, some were not good. They included it all because it allowed the algorithm to calculate probabilities based on the entire volume of data that’s available, and immediately the quality went way up.
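The idea of drawing probabilities from a pile of existing translations, good and bad, can be sketched with a toy example. This is not Google's actual system; the phrase pairs, the counting approach, and the function name `most_likely_translation` are all assumptions made up to illustrate the principle.

```python
from collections import Counter, defaultdict

# Hypothetical phrase pairs, standing in for translated pages found on the web.
# Some are deliberately poor translations, since messy data is the point.
PARALLEL_PHRASES = [
    ("it is raining cats and dogs", "il pleut des cordes"),
    ("it is raining cats and dogs", "il pleut des cordes"),
    ("it is raining cats and dogs", "il pleut à verse"),
    ("it is raining cats and dogs", "il pleut des chats et des chiens"),  # literal, poor
    ("thank you very much", "merci beaucoup"),
    ("thank you very much", "merci beaucoup"),
    ("thank you very much", "merci bien"),
]

# For every source phrase, count how often each translation was observed.
counts = defaultdict(Counter)
for source, target in PARALLEL_PHRASES:
    counts[source][target] += 1


def most_likely_translation(phrase):
    """Return the most frequent translation seen and its share of all observations."""
    options = counts.get(phrase)
    if not options:
        return None, 0.0
    best, seen = options.most_common(1)[0]
    return best, seen / sum(options.values())


best, probability = most_likely_translation("it is raining cats and dogs")
print(best, probability)
```

The poor word-for-word translation stays in the data, but it gets outvoted. A word-by-word dictionary would have produced exactly that poor version, which is the contrast the example is meant to show.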
The second example is chess algorithms. For a long time, they programmed in every possible move on a chessboard, and the computer would calculate something like 15 or 20 moves into the future and pick the best move it could make given all those possible options. But even then, the chess masters around the world were still able to beat the chess algorithms on a fairly regular basis. That changed when they did one thing: they included tens of thousands of past chess games by chess masters, some of which the chess master lost and others of which the chess master won. Again, it was imperfect data. It was big, messy data, and it was orders of magnitude more data than just the rules for how each individual piece can move on a chessboard and which piece captures which. An enormous amount of past experience was added on top of those rules, and so all of a sudden the computer could contextualize its knowledge of the moves and make plays. It could learn from all those past experiences, both the winners and the losers.
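Learning from past games, wins and losses alike, can be sketched in a few lines. This is a toy illustration, not how any real chess engine works; the game records, the position labels, and the function name `best_move_from_history` are hypothetical.

```python
from collections import defaultdict

# Hypothetical game records: (position label, move played, did that player go on to win?)
PAST_GAMES = [
    ("opening", "e4", True),
    ("opening", "e4", True),
    ("opening", "e4", False),
    ("opening", "d4", True),
    ("opening", "d4", False),
    ("opening", "a4", False),
]

# (position, move) -> [wins, games played]; lost games are counted too.
stats = defaultdict(lambda: [0, 0])
for position, move, won in PAST_GAMES:
    stats[(position, move)][1] += 1
    if won:
        stats[(position, move)][0] += 1


def best_move_from_history(position):
    """Pick the move with the highest historical win rate in this position."""
    candidates = {
        move: wins / games
        for (pos, move), (wins, games) in stats.items()
        if pos == position
    }
    return max(candidates, key=candidates.get) if candidates else None


print(best_move_from_history("opening"))
```

The lost games matter as much as the won ones here: without them, every move that was ever played would look like a winner.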
So again, it was by adding all the messy data (N equals all), by including the whole thing, outliers and all, that the quality went up. So this is really at the core of what’s happening in big data today: People are looking to accumulate the maximum amount of data possible.
So, recently Facebook purchased WhatsApp for 19 billion dollars (that’s a huge amount of money), and all of the talk was about why they paid that much for WhatsApp. Now, there are a number of reasons. One is that WhatsApp was catering to an audience that Facebook was actually fairly weak in, so they got an extra audience that way, plus the user base in general. But what they wanted more than anything was data. It’s a data play. They’re trying to accumulate the maximum amount of data. And do they know the value of that data today? No, we’re still at the very beginning of big data. People don’t necessarily know what they’re going to be able to do with all of that data. But Facebook is harvesting and storing colossal amounts of data because they know that eventually people are going to go through all those crazy unstructured logs and find insights that they can monetize.
Everyone who’s in the big data space today is after data. They’re looking to accumulate data that has value. That’s why many mobile apps ask for your location even though they don’t need it for what they’re doing. If you look into what you have to agree to, you almost always have to agree to share location data. So what are they doing? They’re accumulating the locations, so they know not only how you’re using their app but where you are when you’re using it. That makes their data way more valuable. And they may not know today exactly where they’re going to sell it, but they know that eventually people are going to say, “You know what? We want to know this information,” and they’re going to be able to sell that data.
So that’s what’s going on today: people are accumulating as much data as they can because it’s cheap and we can store it, and there’s going to be value down the road. We don’t necessarily know what that value is going to be yet, but that value is going to materialize. So in your own company, look at your own structured data, look at the data that you have, look for insights in the unstructured data, and perhaps more importantly, look for opportunities to accumulate more data that provides some value. Take any data you can get your hands on and start accumulating it, because that’s a growing asset.
And the beauty of big data is you can sell it multiple times. You can keep selling the same asset over and over again for different applications, for different uses, for different opportunities. So look at the unstructured data and get as much of it as you can.
Thanks so much for watching this video. My name is Patrick, reminding you as always to think bigger about your business, think bigger about your life.
Patrick Schwerdtfeger is a keynote speaker who has spoken at business conferences in North America, South America, Europe, Africa, the Middle East and Asia.