Whenever I am presented with a new data set, there are a handful of activities I go through first to “get the lay of the [data] land” (a rough first pass in code follows the list):
- What type of data set is this? Point-in-time? Sequential?
- Where is my unique key? Is there a unique key?
- Can I generate a unique key?
- What are my counts (rows, number of unique keys, etc.)?
- Anything that is an obvious sub-context? (e.g. regions, countries, etc.)
- Is there geo-data lurking around? Does this look better in a map?
- Anything that represents a ranking or an ability to generate a ranking?
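To make that concrete, here is what that first pass often looks like for me in pandas. The file name and the column names (“user_id”, “region”, “event_date”) are placeholders rather than anything from a real data set, so treat this as a sketch of the questions above, not a recipe.

```python
import pandas as pd

# A rough first pass over a new data set. Column names here are
# placeholders; swap in whatever the file actually contains.
df = pd.read_csv("new_data.csv")

# How big is this thing, and what am I looking at?
print(df.shape)          # (rows, columns)
print(df.dtypes)         # column types: numbers, text, dates?

# Is there a unique key? Compare distinct values to the row count.
print(df["user_id"].nunique(), len(df))

# If no single column is unique, can I generate a key from a combination?
df["generated_key"] = df["user_id"].astype(str) + "|" + df["event_date"].astype(str)
print(df["generated_key"].is_unique)

# Obvious sub-contexts: how does the data split by region/country?
print(df["region"].value_counts(dropna=False))
```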
More often than not, I get data that is output from a SQL database. If I’m lucky, it has already been deconstructed for me. What do I mean by that?
Deconstruction 101
When you ask for a “logical” view of information from a SQL database, the query behind that view has at least a FROM clause that pulls from multiple tables, or a JOIN. In non-technical language, that means you are looking at two different things to come up with an understanding. For example:
- Table 1: Book Titles (BOOK ID, BOOK TITLE, AUTHOR ID, PUBLICATION DATE)
- Table 2: Authors (AUTHOR ID, AUTHOR FIRST NAME, AUTHOR LAST NAME)
- Result View: Book Titles + Authors (BOOK TITLE, AUTHOR NAME, PUBLICATION DATE)
Therefore, when I say deconstructed, I mean that instead of receiving just the Result View, I got the Result View, Table 1, and Table 2.
More often than not, I’m unlucky and I get just the Result View, which is fine, but it means that I need to back into Table 1 and/or Table 2 if I want to generate things like (see the sketch after this list):
- How many books did one author publish?
- How many book titles are non-unique?
- How many authors published more than one book in a single month, quarter, or year?
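Here is a minimal pandas sketch of that example, with a few made-up rows. It builds the two tables and the Result View I usually receive, then answers the questions above; none of the data is real, it is just there to show why having the underlying tables makes these questions a simple groupby.

```python
import pandas as pd

# The deconstructed tables (made-up rows for illustration).
books = pd.DataFrame({
    "book_id": [1, 2, 3, 4],
    "book_title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "author_id": [10, 10, 20, 20],
    "publication_date": pd.to_datetime(
        ["2015-01-10", "2015-02-01", "2016-05-03", "2016-05-20"]),
})
authors = pd.DataFrame({
    "author_id": [10, 20],
    "author_first_name": ["Ada", "Sam"],
    "author_last_name": ["Lee", "Roe"],
})

# The Result View I usually receive: the two tables already joined.
result_view = books.merge(authors, on="author_id")
print(result_view[["book_title", "author_last_name", "publication_date"]])

# How many books did each author publish?
print(books.groupby("author_id")["book_id"].count())

# How many book titles are non-unique?
title_counts = books["book_title"].value_counts()
print((title_counts > 1).sum())

# How many authors published more than one book in a single month?
per_month = books.groupby(
    ["author_id", books["publication_date"].dt.to_period("M")])["book_id"].count()
print((per_month > 1).groupby("author_id").any().sum())
```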
Now, that is a super simple dataset, but imagine how much more difficult it gets when you are dealing with things like:
- User clicks
- Trade and transaction data
- Restaurant ratings
- Economic statistics
Deconstruction is your friend.
Geo-data is also your friend
I like geospatial data because it is an immediate sub-context to play around with. Why do I call it a sub-context? Because the overall data set you are playing with is the context: I consider “context” to be a domain or a bound. Therefore, within broader data sets, I am often looking for ways to break them down into meaningful, smaller chunks, and geo data is one of the easiest to work with.
Why?
Because geo data can provide a ‘soft’ representation of cultural or social norms. The emphasis here is on soft: humans of different genders, statuses, races, ages, et al. do have different behaviors, and every individual, despite shared triggers and norms, has unique behaviors of their own. Still, the proven idea that ‘birds of a feather do flock together’, together with increased mobility within the United States and around the world, speaks to the power of geospatial data to support some group-level insight.
Alternatively, if you want to look at that in the context of the 2016 US Presidential Election:
- Red states are Red (red = Republican)
- Blue states are Blue (blue = Democratic)
- Within Red states you will find some Blue domains (see = Austin, TX)
- Within Blue states you will find some Red domains (see = Staten Island, NY)
- Swing states are where the sentiment balance of Red versus Blue tends to hover around that 50% mark (see = Florida)
- Swing sub-contexts (towns, villages, etc.) exist in every single state (see = everywhere)
Geospatial data helps you sort that out. One day I will post some of the work I did on restaurant ratings in NYC based on information from NYC Open Data, but I can tell you that the zip code data made working with that data set super easy, and visualizing it in Tableau even easier.
[Note: Also, the overwhelming number of “A” ratings (definitely not a normal distribution) made me wonder just how poorly you had to do to get anything less than an A. Really!]
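If you want a feel for how little work the zip code sub-context takes, here is a minimal sketch, assuming a CSV export of the inspection results with columns named “zipcode” and “grade” (the real NYC Open Data export may use different column names, so treat these as stand-ins).

```python
import pandas as pd

# Hypothetical export of NYC restaurant inspection results.
ratings = pd.read_csv("nyc_restaurant_inspections.csv")

# How are grades distributed overall? (In my experience: mostly "A".)
print(ratings["grade"].value_counts(normalize=True, dropna=False))

# Zip code as the sub-context: share of "A" grades per zip,
# which drops straight into a Tableau map.
share_a_by_zip = (
    ratings.assign(is_a=ratings["grade"].eq("A"))
           .groupby("zipcode")["is_a"]
           .mean()
           .sort_values()
)
print(share_a_by_zip.head(10))   # the zips where an "A" is least common
```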
Just the basics, please
Once you have identified the unique identifiers (user IDs, process IDs, other property/entity IDs), and perhaps some of the sub-contexts (regions, zip codes, ranks, other groups), it’s time to throw a lot of brute force and basic checks at it (a sketch of these follows the list):
- What are my counts for the overall and the sub-contexts? If my subs don’t add up to my overall, what did I miss?
- Averages, medians, mins, and maxes – what are my bounds that I am playing within?
- Basic standard deviations – do we have meaningful movements?
- Tops (10, 50, 100) and Bottoms (Lowest 10, 50, 100) – looking for concentrations
- Nonsensical data: text where numbers should be and vice versa, garbage data, etc.
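In pandas, that checklist is a handful of one-liners. The sketch below assumes a frame with a numeric column “amount” and a sub-context column “region”; both, along with the file name, are placeholders for whatever the real data set contains.

```python
import pandas as pd

df = pd.read_csv("new_data.csv")

# Counts: do the sub-contexts add back up to the overall?
overall = len(df)
by_region = df.groupby("region").size()   # rows with a missing region are dropped here
print(overall, by_region.sum())           # a gap means rows with no region value
print(by_region)

# Bounds and movement: mean, std, min, max, quartiles, plus the median.
print(df["amount"].describe())
print(df["amount"].median())

# Tops and bottoms: where are the concentrations?
print(df.nlargest(10, "amount")[["region", "amount"]])
print(df.nsmallest(10, "amount")[["region", "amount"]])

# Nonsensical data: text sitting where numbers should be.
as_number = pd.to_numeric(df["amount"], errors="coerce")
print(df.loc[as_number.isna() & df["amount"].notna(), "amount"].head())
```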
Now, this is not a perfect way to analyze a complete data set, but it is a way to get about 80% of the work out of the way.
If you take any good cooking class, the instructor will tell you that one of the first things any chef does to an object they are about to cut on a board is create a flat surface. From onions to carrots, from tomatoes to mangos: a flat surface allows you to put something down with balance and not slice your finger off.
Data is similar in that respect. Complete data sets are jagged and irregular in form because the data you are analyzing is often a quantification of some people-driven behavior (even books and authors, and what is published and when). However, the analysis of a data set is a form of slicing and dicing, so—
Give yourself a flat surface to work with:
- Find your unique IDs.
- Find your sub-contexts.
- Identify your pivot points.
- Throw a little brute force at it (counts, sums, averages, etc.).
So that you know what you are working with. If done right, you will likely see where you should be digging in: Why do clicks drop off from this page to the next? Why do voting participation rates drop so steeply between two counties that sit right next to each other? These are questions that matter, and they become obvious if you have done the clean-up work upfront.
And, really, answering those questions? Well, that is where the beautiful thing hides. That is where you can turn mere data into beautiful information.