Published in Data Engineering
Sanda Trip
A collection of Sanda's daily stories.
June 15, 2024
The Secret of Data Quality Chapter 1: Why You Should Focus on Data Quality
An introduction to the secrets of data quality.
Preview: Why You Should Focus on Data Quality
💡 “Data Downtime”
Have you ever received feedback from a department that necessary data was missing after completing extensive query work or building a data pipeline?
Have you received feedback about duplicate data?
Have you gotten urgent messages from the IT department, or a note from an executive, about numbers that are simply wrong?
40% of Total Work Time
Data teams spend over 40% of their total work time addressing data quality issues.
After a long day, you return to your desk exhausted, only to find a post-it note on your monitor that reads ‘Data is incorrect.’
News
1.1 What is Data Quality?
“If you cannot measure it, you cannot manage it. If you cannot manage it, you cannot improve it.”
Historically, data quality was discussed mainly in terms of:
Reliability of data
Completeness of data
Accuracy of data
cf. Informatica Data Quality Elements: Link
Accuracy:
Data accurately reflects the real-world entities and/or events it is intended to model.
Accuracy is measured by how well data conforms to known correct values.
Completeness:
Data includes all required records and values.
No records or fields should be missing.
Validity:
Data adheres to defined business rules and falls within allowed parameters when those rules are applied.
Timeliness:
Data is updated as frequently as necessary to meet user requirements for accuracy, accessibility, and availability.
Closely connected to data reliability.
Consistency:
Data values across different locations do not conflict with each other within a record, message, or attribute.
Consistent data is not necessarily accurate or complete.
Uniqueness:
No duplicate records exist within the dataset.
All records can be uniquely identified and accessed across the dataset and applications.
cf. Users might say, “No data is better than poor-quality data.”
In practice, however, simply discarding data is rarely an option.
Instead, teams need to think about how to systematically improve data quality.
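The dimensions above can be turned into automated checks. Below is a minimal sketch in pure Python of checks for three of them (completeness, validity, uniqueness); the field names, schema, and business rules here are invented for illustration, not taken from any particular system.

```python
# Hypothetical schema and business rules for the example.
REQUIRED_FIELDS = {"order_id", "amount", "country"}
ALLOWED_COUNTRIES = {"KR", "US", "JP"}

def check_completeness(record):
    """Completeness: every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def check_validity(record):
    """Validity: values satisfy business rules (positive amount, known country)."""
    return record["amount"] > 0 and record["country"] in ALLOWED_COUNTRIES

def check_uniqueness(records, key="order_id"):
    """Uniqueness: no two records share the same key."""
    keys = [r[key] for r in records]
    return len(keys) == len(set(keys))

records = [
    {"order_id": 1, "amount": 120.0, "country": "KR"},
    {"order_id": 2, "amount": -5.0,  "country": "KR"},  # fails validity
    {"order_id": 2, "amount": 80.0,  "country": "US"},  # duplicate key
]

print([check_completeness(r) for r in records])  # [True, True, True]
print([check_validity(r) for r in records])      # [True, False, True]
print(check_uniqueness(records))                 # False
```

In a real pipeline, checks like these would run as assertions after each transformation step, so that a failed dimension surfaces as an alert instead of a post-it note on your monitor.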
1.2 The Current State of Data Quality
→ With the rise of DataOps and data engineering, the importance of data quality keeps growing.
Hyundai Motor Company DataOps Recruitment
Data engineer + modeling
Naver Webtoon DataOps Recruitment
1.2.1 Increasing Data Downtime
“There is more data to handle and the complexity of the pipelines to manage it has increased.”
Cloud migration
More data elements
Increased complexity of data pipelines
Enhanced specialization of data teams
Distributed data organizations
1.2.2 Data Industry Trends
Data Mesh: The data platform version of microservices architecture
Oracle Data Mesh: Link
Streaming Data
Data Lakehouse
Combination of Data Warehouse and Data Lake
Adding SQL query and schema capabilities to Data Lake
Amazon Athena, a query service
Adding lake features to Data Warehouse
Amazon Redshift Spectrum
Databricks’ Lakehouse
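The core lakehouse idea above — imposing a schema on raw files in a lake and querying them with SQL at read time — can be illustrated with a toy example. This is not how Athena or Spectrum are implemented; it just uses an in-memory SQLite database to show schema-on-read over flat CSV data, with file contents and column names invented for the example.

```python
import csv
import io
import sqlite3

# Stand-in for a raw file sitting in a data lake: just flat text.
raw_csv = io.StringIO(
    "user_id,event,ts\n"
    "1,click,2024-06-15\n"
    "1,view,2024-06-15\n"
    "2,click,2024-06-16\n"
)

# Schema-on-read: structure is declared at query time, not stored in the file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (:user_id, :event, :ts)",
    csv.DictReader(raw_csv),
)

# Now ordinary SQL works over what started as raw text.
for row in conn.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
):
    print(row)  # ('click', 2) then ('view', 1)
```

Services like Athena do this at scale directly over object storage (e.g., S3), which is what makes the “SQL and schema on top of a data lake” half of the lakehouse combination possible.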