# StevenClark.com.au

## Question Underlying Data & its Assumptions

When I wrote Never Quote Percentages Without Context, a little voice in my head said “Now you’re going to have to talk a little about statistics needing context, too”. That little voice was on the money.

### Don’t be Afraid of Statistical Representation

Statistics have a bad name, and I’m personally someone who struggles with them. Out of all the units in the MBA program I put the most work into statistics and received my second-lowest result (68%, a credit). However, understanding at least the basic premise of statistical representation goes a long way towards helping you assess data and information as a manager, a journalist or a researcher.

### Be Cautious of all Underlying Data – Question It Closely

Programmers will probably get this faster than anybody else… the quality and cleanliness of your data defines the value you can derive from the information it purports to provide.

That means going back to the data and critically examining the who, what, when, where, how and why to determine its value. How was it collected? By whom? How large were the sample groups the data were taken from? What do the values mean?

If a column is headed Education and ranges from 1 through 14… what does that mean? Does it measure years in school, achievement in school or meaningful study towards a specific career? Where would a high school graduate sit? And where would an undergraduate degree sit?

Similarly, a column headed Marriage could be binary, but what about same-sex partners, de facto couples and people unwilling to be represented by the religious constraints of a formal marriage?

Or consider Urban versus Rural. Unless this column is derived from postcode, you might discover that self-identification warps the underlying data. What if a person lives 100 metres from the town limits, or in a leafy suburb right on the city’s fringe? Urban or rural?
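For the programmers, here is a minimal sketch of how you might check whether self-identification disagrees with a postcode-based label. Everything in it is an assumption made up for illustration: the records, the postcodes, the `POSTCODE_CLASS` lookup and the `find_disagreements` helper do not come from any real dataset.

```python
# Hypothetical survey records: each has a postcode and a self-reported label.
records = [
    {"id": 1, "postcode": "1111", "self_reported": "Urban"},
    {"id": 2, "postcode": "1111", "self_reported": "Rural"},
    {"id": 3, "postcode": "2222", "self_reported": "Rural"},
    {"id": 4, "postcode": "2222", "self_reported": "Urban"},
    {"id": 5, "postcode": "1111", "self_reported": "Urban"},
]

# Assumed lookup table; a real one would come from an official classification.
POSTCODE_CLASS = {"1111": "Urban", "2222": "Rural"}


def find_disagreements(records, postcode_class):
    """Return records whose self-reported label differs from the postcode label.

    A postcode missing from the lookup is also flagged, because it cannot
    be confirmed either way.
    """
    return [
        r for r in records
        if postcode_class.get(r["postcode"]) != r["self_reported"]
    ]


disagreements = find_disagreements(records, POSTCODE_CLASS)
share = len(disagreements) / len(records)
print(f"{len(disagreements)} of {len(records)} records disagree ({share:.0%})")
```

A high share of disagreement doesn't tell you which label is right. It tells you the column needs a definition before you lean on it.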

### Take a Gander at the Assumptions of that Data

The next most important thing to consider is the set of assumptions made in collecting the data and allocating values in the dataset. These might include an assumption that all Education values begin at 0 for pre-year 10, then 1 means a year 10 graduate and 2 means a year 11 graduate, through to a PhD. Does an increment equal a year? Or the next qualification?

Another assumption might be that a male and a female living in a house together, even though not identified as a couple, are recorded as Married for the purposes of the dataset. But what if this is a neighbourhood with a high incidence of grown children staying home to nurse widowed mothers?

An assumption of the data might be that Education followed a linear path. How is Trade School or Technical College represented? Are apprenticeships represented? You need to understand what that data really means before you can hope to make useful sense of it.
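To see how much hangs on that one assumption, here is a rough sketch that decodes the same Education values under two different codebooks. Both codebooks are hypothetical: the "one increment equals one year" rule, the `QUALIFICATION_LEVELS` table and the function names are my own inventions for the example, not taken from any real survey.

```python
# Two hypothetical codebooks for the same Education column.
# Neither comes from a real dataset; they only illustrate the ambiguity.


def decode_as_years(code):
    """Assume code 1 is year 10 and each increment adds one year of schooling."""
    if code == 0:
        return "fewer than 10 years of schooling"
    return f"{9 + code} years of schooling"


# Assume instead that each increment is the next qualification.
QUALIFICATION_LEVELS = {
    0: "did not complete year 10",
    1: "year 10",
    2: "year 11",
    3: "year 12",
    4: "trade certificate",
    5: "diploma",
    6: "bachelor degree",
    7: "postgraduate degree",
}


def decode_as_qualification(code):
    """Look the code up; anything not in the table has no defined meaning."""
    return QUALIFICATION_LEVELS.get(code, "undefined in this codebook")


for code in (1, 3, 14):
    print(code, "|", decode_as_years(code), "|", decode_as_qualification(code))
```

The low values agree well enough, but a 14 is either a very long stretch of schooling or a value the second codebook never defined. Same number, two different stories.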

### You Get Further with Questions than with Blind Acceptance

It’s easy (and lazy) to read research reports and mass journalism that fudge their way through information. You should continuously question the underlying data and assumptions. Your role is to dig and think and pull that underlying dataset apart with your hands and teeth. You might feel the author of the paper had a bias toward certain representations of the data… or misinterpreted it… or collected it from a small sample… or that other data within the set were more representative of the greater truth.

Accepting data blindly makes you a lazy journalist, an incompetent manager and an ignorant citizen (subject to all the bells and whistles of manipulation and propaganda).

### Data Plus Context Equals Information

Finally, once you have questioned the data, you can consider its ability to reveal information. Data points on their own mean little or nothing. It means nearly nothing to me that 334 people identified as having an Education of 14…

… but if I know the context in which that data was collected, if I understand why it was put together and have some idea about the inherent biases and underlying assumptions, providing a contextual meaning to that Education of 14, then I can begin to match it to other socio-economic factors within the dataset. Then I can start to work things out like whether males or females from certain geographic or ethnic origins are dropping out of the education system. It’s a very basic example but you should see the point. It’s not until that process of questioning occurs that real statistical analysis makes sense.
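As a very basic illustration of that matching step, the sketch below counts early school leavers by sex and region. The rows are invented, and the rule that an Education code below 3 means leaving before year 12 is an assumption standing in for whatever the real codebook says.

```python
from collections import Counter

# Hypothetical rows: (sex, region, education_code).
rows = [
    ("F", "Urban", 3), ("F", "Urban", 6), ("F", "Rural", 1),
    ("M", "Urban", 3), ("M", "Rural", 1), ("M", "Rural", 0),
    ("F", "Rural", 3), ("M", "Urban", 6),
]

# Assumption: under the codebook, a code below 3 means leaving before year 12.
YEAR_12 = 3

totals = Counter()
early_leavers = Counter()
for sex, region, code in rows:
    group = (sex, region)
    totals[group] += 1
    if code < YEAR_12:
        early_leavers[group] += 1

# Share of each group that left before year 12.
rates = {group: early_leavers[group] / totals[group] for group in totals}
for group, rate in sorted(rates.items()):
    print(group, f"{rate:.0%}", f"(n={totals[group]})")
```

Note the `n` printed beside each rate. With two people in a group, the percentage is close to meaningless, which is exactly the sample size question from earlier.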

Data plus context equals information. Your job, even if you’re just reading a report or a piece of journalism, is to question. And don’t just accept that some research is correct because you read it in the newspaper. It might be fact. But it may be total garbage.