Skip to content

James Allman | JA Technology Solutions LLC

Data Profiling: Know Your Data Before You Move It

What profiling reveals about a dataset, the surprises it catches, and why it is the first step in any migration, integration, or reporting project.

Most data work starts the same way: someone hands you a file, an export, or a table and asks you to move it, connect it, or report on it. Before any of that, there is a question worth answering honestly, which is what is actually in this data? Not what the field names suggest, not what the documentation claims, but what the values really are.

Data profiling is how you answer it. It is the unglamorous first step that separates projects that go smoothly from projects that discover in production that a "numeric" column was full of text and a "unique" identifier was not unique. This guide covers what profiling reveals, the surprises it routinely catches, and how to do it on your own data. The profiling described here runs entirely in your browser through the free Data Profiler; the files you load never leave your machine.

What Profiling Actually Tells You

A profile turns an unknown file into a documented one: detected type, null rate, and cardinality for every column.

For each column in a dataset, profiling reports a consistent set of facts. The detected data type tells you whether the values actually parse as numbers, dates, booleans, or plain text, which is often different from what the column name implies. The null and empty counts tell you how complete the column is. The unique-value count (the cardinality) tells you how much variety the column holds, from a handful of distinct statuses to a unique value on every row.

On top of that, profiling reports the minimum and maximum values, the string-length range, and for numeric columns the mean, median, and standard deviation. It also surfaces the most frequent values and a few samples, so you can see the actual shape of the data rather than guess at it. None of these facts is complicated on its own. Together, they turn a file you have never seen into one you understand well enough to make decisions about.

The Surprises Profiling Catches

The reason to profile is not curiosity. It is that real data is almost never as clean as the schema or the spreadsheet header suggests, and the gaps are exactly the things that derail a project later. A few minutes of profiling surfaces them while they are still cheap to handle.

The numeric column that is not numeric. A quantity or amount field that holds "N/A", "TBD", a stray comma, or a blank in even a few rows will not load into a numeric target column. Profiling flags it as text, or shows a type you did not expect, before the import fails.

Inconsistent date formats. One source writes 2026-06-08, another writes 06/08/2026, a third writes 8-Jun-26. Profiling shows the spread so you can normalize it in the mapping instead of discovering it when half the dates land in the wrong century.

The key that is not unique. A column you planned to use as the primary or join key turns out to have duplicates. Profiling shows the unique count against the row count, so a near-key (or a column that needs to be combined with another to be unique) is obvious up front.

More nulls than anyone expected. A field everyone assumed was always populated is empty on twelve percent of rows. That is a business question (what does a missing value mean here?) far better asked before the data moves than after a report quietly under-counts.

Hidden formatting. Trailing spaces, mixed case, leading zeros stripped by a spreadsheet, encoding artifacts from a legacy export. These are invisible to the eye and obvious to a profile.

Is This Column a Key?

Completeness against uniqueness decides key suitability: only customer_id is both complete and unique enough to stand alone.

One of the most useful things a profile answers is whether a column can serve as a key, because so much downstream work depends on it. A join between two datasets, a deduplication pass, a record-level comparison, an upsert into a target table: all of them need a column (or a combination of columns) that is both complete and unique.

Completeness and uniqueness together give you the answer. A column that is fully complete and has a unique value on every row is a clean key. A column that is complete but only partly unique is a category or a near-key, useful only when combined with another column. A column that is highly unique but missing on some rows cannot be trusted as a key until the gaps are resolved. Seeing these two numbers side by side, before you design the join, saves the rework of discovering mid-project that the join produces duplicates or drops records.

Profiling Before a Migration

Migration is where skipping profiling hurts most. The pattern is familiar: the mapping is designed from the schema and the field names, the conversion is built, and then testing reveals that the real data does not match the assumptions. Columns assumed numeric contain text, dates use formats nobody documented, identifiers collide, and nullable columns are far emptier than expected. Each of these is a change to the mapping and transformation logic, and changes found in testing (or in production) cost far more attention than the same issues found in an afternoon of profiling up front.

Profiling the source before designing the migration flips that order. You see the data quality issues first, build the transformation to handle them, and validate against a known baseline. After the conversion runs, the Data Diff & Compare tool confirms the result holds the records you expected. This is the same discipline I describe in Data Migration Testing: understand the data, then prove the move preserved it. For the broader picture of how data moves between systems, see ETL: The Invisible Backbone of Enterprise Data.

Profiling for Integration and Reporting

Profiling is not only a migration tool. Any time data crosses a boundary, profiling the inbound or outbound side reduces surprises. When a trading partner or a new system starts sending you a file, profiling the first few deliveries tells you what the fields really contain, which ones are reliably populated, and where you will need validation before the data flows into your systems. That feeds directly into integration work, where the cost of a bad assumption is a broken nightly feed.

Reporting has the same dependency in reverse. A report is only as trustworthy as the data underneath it, and a column that is twelve percent null or that mixes two date formats will quietly produce wrong totals. Profiling the source data behind a report, before building dashboards on top of it, is the difference between numbers a team trusts and numbers they learn to second-guess. That is part of how I approach custom reporting: confirm the inputs before formatting the outputs.

How to Profile Your Data

The Data Profiler on this site does all of the above on CSV and TSV data, in the browser, with nothing uploaded to a server. Paste your data or drop a file and it presents three views. The Summary view lists every column with its detected type, null and unique counts, min and max, and (for numeric columns) mean, median, and standard deviation. The Column Detail view drills into a single column with a frequency distribution of its values and completeness and uniqueness metrics. The Quality Matrix scores every column on completeness, uniqueness, and consistency at once, so problem columns stand out at a glance. You can export the full profile to CSV, Excel, or JSON to document the dataset or hand it to a teammate.

If your data does not start as clean CSV, a couple of converters get it into shape first. Spreadsheet data should go through the Excel Converter (with Force Text on the columns that hold leading zeros, so SKUs and ZIP codes survive), and positional or mainframe data through the Fixed-Width ↔ CSV converter. Once profiled, the Data Diff & Compare tool is the natural next step for validating a transformation, and the CSV to SQL Generator turns a clean, understood file into CREATE TABLE and INSERT statements.

When Profiling Becomes a Pipeline

Profiling one file by hand is the right tool for a quick assessment, a vendor file, or a single table you are about to migrate. It does not scale on its own to a project with hundreds of tables across IBM i, mainframe, or legacy database systems, where the assessment itself becomes a deliverable.

For that kind of work I run comprehensive data assessments that catalog every table and column, document the type mappings, flag the quality issues and the transformation requirements, and produce a migration plan grounded in what the data actually is rather than what the schema claims. That is the same discipline as the browser tool, applied at scale and across platforms. If you are sizing up a migration, an integration, or a reporting project and want to start from a clear picture of the data, Ask James. I read every conversation that comes through the chat assistant on this site, and I'll follow up directly.

Have thoughts on this article? Share them.