Data Quality

Published in Data Engineering

Sanda Trip

It contains various daily stories of Sanda.

June 15, 2024

The Secret of Data Quality Chapter 1: Why You Should Focus on Data Quality

An introduction to the secrets of data quality.

Preview: Why You Should Focus on Data Quality

💡 “Data Downtime”

  • After finishing extensive query work or building a data pipeline, have you ever heard from another department that the data they needed was missing?

  • Have you received complaints about duplicate data?

  • Have you gotten urgent messages from IT, or notes from executives, about incorrect or inaccurate numbers?


40% of Total Work Time

  • Data teams spend over 40% of their total work time addressing data quality issues.

  • After a long day, you return to your desk exhausted, only to find a post-it note on your monitor that reads ‘Data is incorrect.’




1.1 What is Data Quality?

“If you cannot measure it, you cannot manage it. If you cannot manage it, you cannot improve it.”

Historically

  • Reliability of data

  • Completeness of data

  • Accuracy of data

cf. Informatica Data Quality Elements: Link (a minimal measurement sketch follows this list)

  • Accuracy:

    • Data accurately reflects the real-world entities and/or events it is intended to model.

    • Accuracy is measured by how well data conforms to known correct values.


  • Completeness:

    • Data includes all required records and values.

    • No records or fields should be missing.


  • Validity:

    • Data adheres to defined business rules and falls within allowed parameters when those rules are applied.


  • Timeliness:

    • Data is updated as frequently as necessary to meet user requirements for accuracy, accessibility, and availability.

    • Connected to reliability


  • Consistency:

    • Data values stored in different locations do not conflict with one another, whether within a single record or message or across datasets.

    • Consistent data is not necessarily accurate or complete.


  • Uniqueness:

    • No duplicate records exist within the dataset.

    • All records can be uniquely identified and accessed across the dataset and applications.
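
To make a few of these dimensions concrete, here is a minimal sketch of how completeness, validity, and uniqueness can be measured with pandas. The orders table, its column names, and the business rules are hypothetical examples, not taken from this post; accuracy, timeliness, and consistency usually require reference data or load metadata, so they appear only as comments.

```python
# Minimal sketch: measuring completeness, validity, and uniqueness
# on a hypothetical orders table with pandas.
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 4],                # one duplicate key
        "amount": [120.0, -5.0, 35.5, None],     # one negative, one missing
        "status": ["PAID", "PAID", "UNKNOWN", "SHIPPED"],
    }
)

# Completeness: share of records with no missing required values.
completeness = 1 - orders["amount"].isna().mean()

# Validity: share of records satisfying the business rules
# (hypothetical rules: amount >= 0, status drawn from an allowed set).
valid_status = {"PAID", "SHIPPED", "CANCELLED"}
validity = ((orders["amount"].fillna(-1) >= 0)
            & orders["status"].isin(valid_status)).mean()

# Uniqueness: share of records whose key is not duplicated.
uniqueness = 1 - orders["order_id"].duplicated().mean()

# Accuracy needs known correct values to compare against, and
# timeliness/consistency need load timestamps or a second copy of
# the data, so they are omitted from this sketch.
print(f"completeness={completeness:.2f}, "
      f"validity={validity:.2f}, uniqueness={uniqueness:.2f}")
```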

cf. Users might say, “No data is better than poor-quality data.”

  • While the sentiment is understandable, simply doing without data is not practical.

  • Instead, it is necessary to consider how to systematically improve data quality.


1.2 The Current State of Data Quality

→ With the spread of DataOps and data engineering, the importance of data quality continues to grow.


1.2.1 Increasing Data Downtime

“There is more data to handle and the complexity of the pipelines to manage it has increased.”

  • Cloud migration

  • More data elements

  • Increased complexity of data pipelines

  • Enhanced specialization of data teams

  • Distributed data organizations
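
As one concrete illustration of the idea, data downtime is often caught with a simple freshness check: if a table has not been updated within its expected interval, raise an alert before a stakeholder leaves a post-it note. Everything below (the table name, the interval, the get_last_updated() helper) is a hypothetical stand-in for a real warehouse metadata query.

```python
# Minimal freshness check: flag "data downtime" when a table has not
# been updated within its expected interval.
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(hours=1)  # how often new data should land

def get_last_updated(table: str) -> datetime:
    """Hypothetical stand-in for a metadata query,
    e.g. SELECT MAX(loaded_at) FROM <table>."""
    return datetime(2024, 6, 15, 8, 0, tzinfo=timezone.utc)

def check_freshness(table: str) -> None:
    lag = datetime.now(timezone.utc) - get_last_updated(table)
    if lag > EXPECTED_INTERVAL:
        # In practice this would page on-call or post to a channel.
        print(f"[DOWNTIME] {table} is stale by {lag - EXPECTED_INTERVAL}")
    else:
        print(f"[OK] {table} updated {lag} ago")

check_freshness("analytics.orders_daily")  # hypothetical table name
```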


1.2.2 Data Industry Trends

  • Data Mesh: The data platform version of microservices architecture

    • Oracle Data Mesh: Link

  • Streaming Data

  • Data Lakehouse

    • Combination of Data Warehouse and Data Lake

    • Adding SQL query and schema capabilities to Data Lake (a query sketch follows this list)

      • Amazon Athena, a query service

    • Adding lake features to Data Warehouse

      • Amazon Redshift Spectrum

      • Databricks’ Lakehouse
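
To show what "adding SQL query and schema capabilities to a Data Lake" looks like in practice, here is a minimal sketch that runs an Athena query over files already sitting in S3, using boto3. The database name, table, region, and results bucket are hypothetical; the point is that Athena applies a schema at read time rather than at load time.

```python
# Minimal sketch: querying a data lake with Amazon Athena via boto3.
# Database, table, region, and output bucket are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT order_date, COUNT(*) AS cnt "
                "FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = (athena.get_query_execution(QueryExecutionId=query_id)
             ["QueryExecution"]["Status"]["State"])
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    for row in result["ResultSet"]["Rows"]:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```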
