Data Profiling

15 Feb

Data profiling is one of the most underrated, yet extremely important, tasks a data scientist needs to come to grips with. It needs to be done before any serious investigation can be considered. Much maligned as irrelevant or superfluous by many a data-jockey, data profiling is in fact what gives the serious data scientist the ability to really understand the nuances of what they are attempting to achieve.

To me, data profiling is not merely taking a look at the data set you are working with and understanding the column names and data types. It is so much more than that.

Essentially, you are trying to work out the quality of the data, the extent to which it is complete, and how accurately it mirrors what it is attempting to represent in the real world.

From the outcome of these investigations, you can come to an understanding of any limitations that may lurk within the data, what you can and cannot expect to use it for, and therefore the accuracy of any analytics you may attempt to run over it.

Overview of Aims

To profile your data properly, you really need to follow through on a number of tasks:

Construct a data dictionary to understand the elements of the data at the atomic level
Generate a statistical profile of each data attribute and look for anomalies, outliers, attribute completeness and commonly expected values
Generate a non-statistical profile of the data set as a whole, as well as some field properties it would be advantageous to know
Visualise the frequency and distributions of interesting looking attributes, or at least those you want to leverage in your analytics

Creating a Data Dictionary

This involves:

Researching each attribute as it applies to the business context
Discussing the data with subject matter experts
Articulating the findings thoroughly but succinctly, in business-friendly language
Generating an accessible data dictionary asset that is promoted to become a business-wide go-to resource
Possible automation, where new attributes are periodically added, and defunct attributes are flagged for removal

Often, if the data is being generated from an application, there will already be a data dictionary that the vendor has made available. Even so, this document will still need to be verified and socialised. This may involve validation of the dictionary against the actual output, and possible republishing in an online or otherwise accessible format within the business. Sometimes it may be necessary to contact the software vendor for further clarification of terms and how they were designed to be understood.

The components of a data dictionary asset may vary depending upon the business. Essential components would, however, include the field names for each table or data asset, the data type, and a description of what each field is for and where it is used in the business. Also useful might be permitted data formats and terms that the field may alternatively be known by.
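As a rough sketch of how those components, and the validation of a dictionary against actual output, might look in code: the class name, field names and sample entries below are all hypothetical, and a real dictionary asset would likely live in a shared tool rather than a script.

```python
from dataclasses import dataclass


@dataclass
class DictionaryEntry:
    """One documented field in a hypothetical data dictionary."""
    name: str
    data_type: str
    description: str
    permitted_format: str = ""  # optional, e.g. "YYYY-MM-DD"
    aliases: tuple = ()         # other names the field is known by


def compare_with_output(entries, actual_columns):
    """Return (undocumented, defunct) field names, each sorted alphabetically."""
    documented = {entry.name for entry in entries}
    actual = set(actual_columns)
    undocumented = sorted(actual - documented)  # in the data, missing from the dictionary
    defunct = sorted(documented - actual)       # in the dictionary, absent from the data
    return undocumented, defunct


# Illustrative entries and column names only
dictionary = [
    DictionaryEntry("deal_amount", "decimal", "Value of the deal in local currency"),
    DictionaryEntry("sale_date", "date", "Date the sale was closed", "YYYY-MM-DD"),
    DictionaryEntry("fax_number", "string", "Purchaser fax number"),
]
undocumented, defunct = compare_with_output(
    dictionary, ["deal_amount", "sale_date", "purchaser_id"]
)
print(undocumented)  # ['purchaser_id']
print(defunct)       # ['fax_number']
```

Run periodically, a comparison like this is one simple way to spot new attributes that need documenting and defunct ones that can be flagged for removal.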

Generating a Statistical Profile

A statistical profile is data-dependent, so a prescriptive list of which statistics to run cannot really be given in advance.

A rule of thumb would be to at least describe the data attributes that will most matter to the outcome of any analyses you intend to run. For example, if your sales data is well structured, and you are after insights into probable sales for the next quarter, it behooves you to profile the data quality of the sales-related attributes such as amount of each deal, date of sale, and purchaser identifiers. The number of records with missing data in these fields, and the number with data in a non-standard format, will have a direct bearing on how useful the data is for answering your questions.
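To make the sales example concrete, here is a minimal sketch that counts missing and non-standard values in one field. The record layout, the field name sale_date and the choice of YYYY-MM-DD as the "standard" format are assumptions made purely for illustration.

```python
from datetime import datetime


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def is_valid_date(value, fmt="%Y-%m-%d"):
    """True if the value parses under the assumed standard date format."""
    try:
        datetime.strptime(str(value).strip(), fmt)
        return True
    except ValueError:
        return False


def quality_counts(records, field_name, is_valid):
    """Count missing and non-standard values for one field."""
    values = [record.get(field_name) for record in records]
    missing = sum(1 for v in values if is_missing(v))
    non_standard = sum(1 for v in values if not is_missing(v) and not is_valid(v))
    return {"total": len(values), "missing": missing, "non_standard": non_standard}


# Made-up records for illustration
sales = [
    {"sale_date": "2023-01-15", "amount": "1200.50"},
    {"sale_date": "15/01/2023", "amount": "980"},
    {"sale_date": "", "amount": "450.00"},
    {"sale_date": None, "amount": "n/a"},
]
date_report = quality_counts(sales, "sale_date", is_valid_date)
print(date_report)  # {'total': 4, 'missing': 2, 'non_standard': 1}
```

The same function can be pointed at other fields by passing a different validity check, for instance one that tests whether an amount parses as a number.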

Other quick statistical insights can be gained from determining the most and least prevalent values in each of the columns to give you a quick idea of skew or outliers, as well as the mean, median, largest and smallest values for each attribute, to get some idea of numerical magnitudes in these fields.
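These quick statistics need nothing beyond the Python standard library. The sketch below uses made-up values, and the function names are my own, not part of any profiling tool.

```python
from collections import Counter
from statistics import mean, median


def value_frequencies(values, n=3):
    """Return the n most and n least prevalent values with their counts."""
    ranked = Counter(values).most_common()  # sorted by descending count
    return ranked[:n], ranked[-n:][::-1]


def numeric_summary(values):
    """Mean, median, smallest and largest value of a numeric attribute."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


regions = ["north", "north", "north", "south", "south", "east"]
most, least = value_frequencies(regions, n=1)
print(most)   # [('north', 3)]
print(least)  # [('east', 1)]

amounts = [100, 120, 110, 130, 5000]
summary = numeric_summary(amounts)
print(summary["median"], summary["max"])  # 120 5000
```

In the amounts example the mean (1092) sits far above the median (120), which is the kind of gap that hints at an outlier worth investigating.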

Generating a Non-Statistical Profile

Some non-statistical characteristics to note are whether the data is time-bound (and if so, the time range it covers), the scale and level of precision used for particular attributes, the level of aggregation at which the attributes are recorded, and the units of measurement used throughout.
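Two of these characteristics, the time range and the level of precision, are easy to check programmatically. The following is a sketch only: it assumes dates arrive as ISO-formatted strings and that numeric values arrive as plain decimal strings, neither of which may hold for your data.

```python
from datetime import date
from decimal import Decimal


def time_range(iso_dates):
    """Return the earliest and latest dates from ISO-formatted strings."""
    parsed = [date.fromisoformat(d) for d in iso_dates]
    return min(parsed), max(parsed)


def decimal_places(values):
    """Count how many values are recorded at each number of decimal places."""
    counts = {}
    for raw in values:
        exponent = Decimal(raw).as_tuple().exponent  # e.g. -2 for "12.50"
        places = max(0, -exponent)
        counts[places] = counts.get(places, 0) + 1
    return counts


# Made-up values for illustration
start, end = time_range(["2022-03-01", "2021-11-15", "2022-07-30"])
print(start.isoformat(), end.isoformat())  # 2021-11-15 2022-07-30

precision = decimal_places(["12.50", "7", "3.141", "8.25"])
print(precision)  # {2: 2, 0: 1, 3: 1}
```

A mix of precisions in one attribute, as in this example, is often a sign that the data was merged from sources that record values differently.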

For unstructured or textual data, word completeness, the existence of non-printing or non-ASCII characters (such as control characters or emojis), and the number of distinct thematic categories or subjects may be of interest.
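One way to flag characters that may need attention in text fields is to look at their Unicode categories. This sketch separates characters with no visible glyph (control and format characters) from visible non-ASCII ones such as emojis. The function name and the sample string are illustrative assumptions.

```python
import unicodedata


def unusual_characters(text):
    """Split unusual characters into non-printing and non-ASCII printable groups."""
    non_printing, non_ascii = [], []
    for ch in text:
        if ch in "\n\t ":
            continue  # ordinary whitespace is not reported
        category = unicodedata.category(ch)
        if category.startswith("C"):  # control, format, unassigned, private use
            non_printing.append(ch)
        elif ord(ch) > 127:           # visible, but outside plain ASCII
            non_ascii.append(ch)
    return non_printing, non_ascii


# Contains a zero-width space, an emoji and a bell control character
sample = "Great service\u200b \U0001F600\x07"
hidden, visible = unusual_characters(sample)
print([hex(ord(ch)) for ch in hidden])   # ['0x200b', '0x7']
print([hex(ord(ch)) for ch in visible])  # ['0x1f600']
```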

Visualisations

As part of the data profiling process you should be thinking how to visualise the attributes of key fields in the data set. Commonly, frequency charts are generated to quickly see the 10-15 most and least frequent values in particular fields. This allows the analyst to understand commonly expected as well as unusual values. It assists with picking up unusual behaviour in the data, as well as confirming behaviour that is expected.
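In practice you would probably reach for a plotting library here, but the idea behind a frequency chart can be sketched with text bars alone. The function below is a stand-in I have made up for illustration, not a replacement for a proper chart.

```python
from collections import Counter


def frequency_chart(values, top=10, width=40):
    """Return one text bar per value for the `top` most frequent values."""
    ranked = Counter(values).most_common(top)
    if not ranked:
        return []
    longest = max(len(str(value)) for value, _ in ranked)
    peak = ranked[0][1]  # highest count sets the full bar width
    lines = []
    for value, count in ranked:
        bar = "#" * max(1, round(count / peak * width))
        lines.append(f"{str(value):<{longest}} {bar} {count}")
    return lines


# Made-up deal statuses, including one likely typo ("wno")
statuses = ["won"] * 8 + ["lost"] * 4 + ["pending"] * 2 + ["wno"]
chart = frequency_chart(statuses, top=10, width=8)
for line in chart:
    print(line)
```

The single "wno" entry at the bottom of the chart is the sort of unusual value this view is meant to surface. Reversing the ranking would show the least frequent values first.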

Other visualisations may include time-series charts, to discern periods of differing activity such as spikes or plateaus; histograms, to get an idea of the distribution that the values of particular attributes follow; and heat maps, to get a quick overview of the intensity of values across a distribution.

It really all depends on what information is stored in the data. It often helps to not just generate these charts for self-understanding, but also to use them as a tool to help stakeholders understand the limitations of the data. Inclusion of visuals in a presentation always trumps tables and text.

Conclusion

Overall, I would say that the benefits of completing a full data profiling exercise are:

You become much more familiar with the data and the subject area
You become much more aware of how the business works
You establish essential contacts with the business amongst stakeholders and subject matter experts
You can readily tell clients or stakeholders which of the analyses they request are possible
You become cognisant of possible improvements to future data collection requirements

I hope that I've laid it out succinctly enough, and that this turns out to be useful for someone out there contemplating doing a data profiling exercise.

All the best!

Daniel

Daniel Karp