Why is Data Validation Important in Maintaining Accuracy

Ever wondered why data validation is important?  This page might help.  The Alberta Air Quality Monitoring Directive (AMD) offers step-by-step guidance on the procedures involved.

Validation and verification are crucial for data accuracy, consistency, and reliability, for these reasons generally:

  • By comparing the data to a trusted source, validation confirms its accuracy,
  • These processes can prevent faulty data from propagating through systems by catching errors early in the data entry or processing stages,
  • Data handling regulations are strict in many industries. By validating and verifying, organizations can avoid legal issues and fines,
  • Making informed business decisions requires reliable data. Misguided strategies and inefficiencies can result from poor data quality,
  • Validated data integrates cleanly across systems, which supports seamless data flow and enhances overall operational efficiency, and
  • Consistently validated and verified data builds trust among stakeholders and helps maintain a company's reputation

This page also outlines data validation best practices: how to use validation codes, keep logs, and check for completeness, all to improve the accuracy and reliability of your data.

You'll find out about software systems that support graphical and tabular displays.  You might also make sure the tools and systems you use align with these recommendations.

Learn how to keep a transparent and defensible audit trail for data validation, including logs and data completeness criteria.  You'll also see how to identify and fix quality issues such as rapid changes in data, zero drift in your instruments, and inconsistent relationships between parameters.  Use this to troubleshoot problems.

Read the entire Chapter 6 of the AMD here.

4.0 Verification and validation of data

Why is data validation important?  It helps us make sure data are accurate and consistent. Validation qualifies the accuracy of data collected, while verification evaluates instrument or system performance.

Software systems should support both graphical and tabular displays for effective data review and validation. While tabular displays help identify specific issues with data, like start and end times, graphical displays reveal relationships between parameters, outliers, and subtle changes you might miss in tables.

There are several levels to data verification and validation:
- Preliminary Verification - Level 0
- Primary Validation - Level 1
- Final Validation - Level 2
- Independent Data Review - Level 3
- Validation after the final review

Each level adds a further check on quality and accuracy, which is a good example of why data validation is important.

4.1 Records of the validation process
Validation involves deciding whether data are valid: checking them against acceptance criteria, making adjustments where needed, or marking them invalid if they don't meet the criteria.

4.1.1 Validation codes
Validation codes tell you why data points are invalid, missing, or qualified. All data submitted to central data repositories must use the same codes. Validation codes can be flags or qualifiers. A flag explains why a data point is missing, like "C" for calibration or "P" for power outage. A qualifier describes the quality or characteristics of a data point, like ">" for over the range or "L" for local interference.

The person who's responsible for continuous ambient air data should:
- (a) Assign a data validation code to flag missing data, qualify data outside the instrument's normal range, and qualify anomalous data.
- (b) Leave blank data validation codes for valid data points not covered in (a).
- (c) Put internal data validation codes in a Quality Assurance Plan (QAP) or Standard Operating Procedure (SOP).
- (d) Transform internal codes into Alberta's Ambient Air Quality Data Warehouse codes.

Data points that are valid don't have validation codes, which are used when data is missing or needs qualification. A validation code will also be assigned to data points that initially seem anomalous but are later confirmed to be valid. AMD's website has a list of current validation codes.
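As a rough illustration of the flag and qualifier idea, here is a minimal Python sketch. The four codes ("C", "P", ">", "L") are the examples quoted above. The internal codes ("CAL", "PWR", "OVR", "LOC"), the `DataPoint` class, and the `to_submission` function are hypothetical names made up for this example, and the real code list should come from the AMD website.

```python
from dataclasses import dataclass
from typing import Optional

# Flags explain missing data; qualifiers describe data that is present.
# These four codes are the examples quoted in the text above.
FLAGS = {"C": "calibration", "P": "power outage"}
QUALIFIERS = {">": "over the range", "L": "local interference"}

# Hypothetical internal codes mapped to submission codes (illustration only).
INTERNAL_TO_SUBMISSION = {"CAL": "C", "PWR": "P", "OVR": ">", "LOC": "L"}


@dataclass
class DataPoint:
    value: Optional[float]      # None when the data point is missing
    code: Optional[str] = None  # None (blank) for valid, unqualified data


def to_submission(point: DataPoint) -> DataPoint:
    """Translate an internal code to a submission code."""
    if point.code is None:
        if point.value is None:
            raise ValueError("missing data needs a flag")
        return point  # valid data keeps a blank code
    code = INTERNAL_TO_SUBMISSION[point.code]
    if point.value is None and code not in FLAGS:
        raise ValueError("missing data needs a flag, not a qualifier")
    return DataPoint(point.value, code)
```

For example, `to_submission(DataPoint(None, "PWR"))` would return a missing data point flagged "P", while a valid reading with no code passes through unchanged.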

4.1.2 Logs for data validation
Data validation logs keep track of the validation process, including validation codes. Logs summarize and justify decisions about flagged, edited, or modified data. This makes all such data auditable, keeps the process transparent, and makes it easier to answer future questions about specific data.

For each continuous ambient air monitoring station, the person responsible should keep a validation log with the following info:
- The person who validated the data.
- The date the validation was done.
- Affected parameter(s).
- Any data adjustments or invalidations, identified and justified.
- Any corrective actions taken to fix data issues.
- Analysis of anomalous data and justification of its validity.
- Post-validation changes, if any.

These logs make sure there's a transparent and defensible audit trail for data validation. To keep this audit trail, don't delete invalid or suspect data.
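One way to picture a validation log entry is as a small record with a field for each item in the list above. The sketch below is only an assumption about how such a log could be structured; the class name, field names, and the `invalidate` helper are hypothetical and not taken from the AMD. Note that the helper marks a reading invalid rather than deleting it, which keeps the audit trail intact.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ValidationLogEntry:
    validator: str               # who validated the data
    validated_on: date           # when validation was done
    parameters: List[str]        # affected parameter(s)
    adjustments: List[str] = field(default_factory=list)         # with justification
    corrective_actions: List[str] = field(default_factory=list)
    anomaly_notes: List[str] = field(default_factory=list)
    post_validation_changes: List[str] = field(default_factory=list)


def invalidate(entry, readings, index, reason):
    """Mark a reading invalid without deleting it, and log the reason."""
    readings[index]["valid"] = False
    entry.adjustments.append(f"reading {index} invalidated: {reason}")


# Example use with made-up values.
entry = ValidationLogEntry("A. Validator", date(2024, 2, 5), ["O3"])
readings = [{"value": 41.0, "valid": True}, {"value": 950.0, "valid": True}]
invalidate(entry, readings, 1, "instrument fault noted in station log")
```

After this runs, the second reading still holds its original value but is marked invalid, and the log entry records why.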

4.1.3 Criteria for data completeness
The following criteria are used to determine data completeness:
- Ensure each continuous ambient monitoring instrument and its data recording system are operational at least 90% of the time each month.
- You should exclude data collected during quality assurance and quality control (QA/QC) activities, zeros, spans, calibrations, audit checks, or equipment start-up/stabilization when calculating data completeness.
- When a data acquisition system goes down, some form of backup should be in place (electronic, chart recorder, or internal analyzer memory).

If continuous ambient monitoring lasts less than three months:
- (a) For the overall monitoring period, keep each instrument running at least 90% of the time.
- (b) Compensate for lost hours during the monitoring period to reach 90% operational time.
- (c) A new 30-day monitoring period should be started if an instrument operates less than 75% of the time in a month.

A full month's worth of data should be collected at the beginning of a calendar month.
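The completeness arithmetic can be sketched in a few lines of Python. This is a sketch under a stated assumption: hours spent on QA/QC, zeros, spans, calibrations, audit checks, and start-up/stabilization are removed from the hours being counted, so they neither help nor hurt the percentage. The AMD itself should be checked for the exact formula. The function names are hypothetical.

```python
def operational_percent(total_hours, downtime_hours, excluded_hours=0.0):
    """Percent of countable hours that the instrument was operational.

    Assumption: excluded hours (QA/QC, zero, span, calibration, audit,
    start-up/stabilization) are removed from the hours being counted.
    """
    countable = total_hours - excluded_hours
    if countable <= 0:
        raise ValueError("no countable hours in the period")
    return 100.0 * (countable - downtime_hours) / countable


def assess_month(total_hours, downtime_hours, excluded_hours=0.0):
    """Compare one month against the 90% and 75% thresholds in the text."""
    pct = operational_percent(total_hours, downtime_hours, excluded_hours)
    return {
        "percent": pct,
        "meets_90": pct >= 90.0,
        # Only relevant for monitoring that lasts less than three months.
        "restart_period": pct < 75.0,
    }


# A 30-day month (720 hours) with 20 excluded hours and 50 hours of downtime.
result = assess_month(720, 50, 20)
```

In this made-up example, 650 of 700 countable hours are operational, which is about 92.9% and meets the 90% criterion.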

4.2 Level 0 - Preliminary Verification
Data at Level 0 is raw data from the data acquisition system or directly from the instrument. During preliminary verification, these data are screened and flagged manually or automatically.

Checks include:
- Identifying missing data periods, checking time stamps, ensuring instrument diagnostics and datalogger flags are normal, and checking data against upper and lower limits.
- Rate-of-change flagging can catch data that change too rapidly or not at all. Zero, span, and multipoint performance checks are also reviewed.

Data quality issues can be quickly identified and mitigated with comprehensive instrument diagnostics. Data graphs should be reviewed by experienced personnel, considering field observations.
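The automatic screening checks described above could look something like the following Python sketch. It covers missing data, upper and lower limits, rapid changes, and values that don't change at all. The function name, the flag labels, and all of the thresholds are hypothetical; real limits would be set per parameter in a QAP or SOP.

```python
def screen(values, lower, upper, max_step, flat_run):
    """Return one set of screening flags per value (Level 0 style checks).

    values   : readings in time order, None where data is missing
    lower    : lowest plausible value
    upper    : highest plausible value
    max_step : largest plausible change between consecutive readings
    flat_run : number of identical consecutive readings treated as "stuck"
    """
    flags = [set() for _ in values]
    run = 1
    for i, v in enumerate(values):
        if v is None:
            flags[i].add("missing")
            run = 1
            continue
        if not lower <= v <= upper:
            flags[i].add("range")
        prev = values[i - 1] if i > 0 else None
        if prev is not None:
            if abs(v - prev) > max_step:
                flags[i].add("rapid")
            run = run + 1 if v == prev else 1
        else:
            run = 1
        # Flagging starts once the run of identical values is long enough.
        if run >= flat_run:
            flags[i].add("flat")
    return flags


# Made-up readings for illustration.
readings = [10.0, 10.5, 50.0, None, 3.0, 3.0, 3.0, -2.0]
flags = screen(readings, lower=0.0, upper=40.0, max_step=20.0, flat_run=3)
```

Flags like these only mark data as suspect. As the text says, experienced personnel should still review the graphs and field observations before any data are invalidated.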

Regular data reviews (e.g., daily) and prompt troubleshooting are recommended. If you're not sure how good the ambient data is:

- (a) During preliminary verification, document suspect data.
- (b) Investigate and document invalid data's root cause.
- (c) Start corrective action once the root cause is found.
- (d) Document the corrective action.
- (e) Verify the effectiveness of the corrective action.

Corrective action may involve adjusting the instrument, troubleshooting it on site, or repairing it. Log notes should document all data issues and corrective actions.

Disclaimers should be included with publicly available real-time data (Level 0). Feedback from the public may help identify problems.

4.3 Level 1 - Primary Validation

Data validation builds on data verification by evaluating flagged issues and applying validation codes. Validation should be done weekly or monthly. Validation actions include:
- Verifying all screening flags assigned during preliminary verification.
- Examining all the documentation and information about the site.
- Analyzing operational acceptance limits for each parameter.
- Reviewing daily and monthly calibration results for gaseous parameters.
- Adjusting data, like baselines.

4.3.1 Review of supporting documents
Primary validation involves reviewing all instrument status information, including diagnostics and datalogger flags used during screening. This thorough review determines data validity. Further investigation can confirm that data initially considered suspect are in fact valid.

In addition, any documentation and instrument diagnostics that weren't present during data collection, like station log notes, calibration records, and audit records, should be assessed.

4.3.2 Criteria for operational acceptance
It's important to consider any instrument-specific limitations during data validation. Quality Assurance Plans (QAPs) or Standard Operating Procedures (SOPs) should document these operational acceptance limits. Data that violates these limits is usually invalid, unless other quality control information shows otherwise. Examples include temperature tolerances, converter efficiency for NO2 measurements, and acceptable flow rate ranges for particulate measurement instruments with size-cut inlets, like PM2.5 and PM10.

4.3.3 Criteria for calibration acceptance
Calibration checks, described in Chapter 7 (Calibration) of the AMD, are performed to ensure measurement uncertainty remains within specified limits.

4.3.4 Values that go over the range
Monitoring instruments have specific operating ranges outlined in Chapter 4 (Monitoring). If over-range values happen a lot, the instrument's operating range should be adjusted. Invalidate over-range values that fall outside the required operating ranges. Nevertheless, over-range values that indicate anomalously high events, like wildfires nearby, can still be valid. The data validation logs should note the exceptional event and the potential underestimation of the actual concentration based on the over-range value.

4.3.5 Adjustments to the baseline

It's possible for continuous air quality instruments to experience zero drift, which changes baseline concentrations over time. Performance checks can confirm this drift. Data affected by analyzer drift can be corrected by subtracting a verified drift value from the data. You can make adjustments automatically or manually.

Maintaining calibration specifications will minimize the need for zero adjustments. A review process should be in place when applying data adjustments based on zero check data. Adjustments between multipoint calibrations shouldn't be based on span check results. Upscale adjustments should be based only on full calibrations against reference standards. Zero baseline adjustments shouldn't bias the reported ambient concentration values. Automated zero adjustments carry a risk of erroneous corrections.
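To make the baseline idea concrete, here is a small Python sketch of a manual zero-drift correction. It assumes the zero reading drifted in a straight line between two verified zero checks, one at the start and one at the end of the period. That linear ramp is an assumption made for this example and is not prescribed by the text above, which only says a verified drift value is subtracted. The function name is hypothetical.

```python
def correct_zero_drift(values, zero_start, zero_end):
    """Subtract an estimated zero offset from each reading.

    Assumption: the zero offset moves linearly from zero_start (verified at
    the first reading) to zero_end (verified at the last reading).
    Missing readings (None) stay missing. Negative results are not clamped
    here; below-zero handling is a separate, later step.
    """
    n = len(values)
    corrected = []
    for i, v in enumerate(values):
        if v is None:
            corrected.append(None)
            continue
        fraction = i / (n - 1) if n > 1 else 0.0
        offset = zero_start + (zero_end - zero_start) * fraction
        corrected.append(v - offset)
    return corrected


# Made-up readings with a zero offset that grew from 0.0 to 1.0.
adjusted = correct_zero_drift([5.0, 6.0, 7.0], zero_start=0.0, zero_end=1.0)
```

Whatever method is used, each adjustment should be reviewed and written up in the validation log with its justification.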

4.3.6 Relationships between parameters
To maintain the relationship between measured and derived parameters, adjustments should be applied uniformly to all related continuous ambient air parameters during validation procedures. For example, if you adjust NO, you should also adjust NOX. It's especially relevant for parameters like NO/NO2/NOX. Commercial NOX analyzers measure NO and NOX, and NO2 is derived as NOX minus NO. The precision of relationships between parameters may be affected by rounding or signal noise when they're reported directly from an analyzer. Table 1 shows specific examples of parameter relationships and the validation considerations for each.

Table 1 - Validation considerations for specific parameter relationships:
- NO/NO2/NOX: NO and NO2 should add up to NOX. Data adjustments must preserve the relationship between these parameters.
- CH4/NMHC/THC: Methane (CH4) and non-methane hydrocarbons (NMHC) should add up to total hydrocarbons (THC). Data adjustments must preserve the relationship between these parameters.
- PM2.5/PM10: If PM10 flow is split for PM2.5 measurement, PM10 is calculated as the sum of PM2.5 and coarse particles.
- VWS/VWD/SDWD: Vector wind speed (VWS), vector wind direction (VWD), and standard deviation of wind direction (SDWD) should be invalidated together for the same time period. They should also be invalidated if scalar parameters like scalar wind speed (SWS) or scalar wind direction (SWD) are invalidated.
- VWS/SWS: SWS can be equal to or greater than VWS.
- Delta temperature: The delta temperature is the difference between temperatures at two levels. It is invalid if either level is invalid.
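A few of these relationships are easy to express as checks. The Python sketch below shows a NO + NO2 = NOX consistency test, an adjustment that keeps that relationship intact, and a delta temperature that becomes invalid when either level is invalid. The function names and the tolerance (which allows for rounding and signal noise) are hypothetical and chosen only for illustration.

```python
def nox_consistent(no, no2, nox, tolerance=1.0):
    """True when NO + NO2 matches NOX within a tolerance for rounding/noise."""
    return abs((no + no2) - nox) <= tolerance


def adjust_no(no, no2, nox, delta):
    """Apply the same adjustment to NO and NOX so NO2 = NOX - NO is unchanged."""
    return no + delta, no2, nox + delta


def delta_temperature(upper, lower):
    """Difference between two temperature levels; None if either is invalid."""
    if upper is None or lower is None:
        return None
    return upper - lower


# Made-up values: lowering NO by 2 also lowers NOX by 2.
no, no2, nox = adjust_no(10.0, 5.0, 15.0, delta=-2.0)
```

After the adjustment, NO is 8.0, NO2 is still 5.0, and NOX is 13.0, so the three values remain consistent.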

4.3.7 Adjustments below zero
The person responsible must adjust continuous ambient hourly averages for all valid negative gas and particulate concentrations to zero before reporting them (DQ 4-J). A valid -1 ppb ozone reading should be reported as 0 ppb ozone. Instrument precision and zero noise limitations require this adjustment.

DQ 4-K: Don't apply zero adjustments to sub-hourly intervals before aggregating them into 1-hour averages. Negative values should be adjusted to zero after baseline adjustments. Table 2 defines the lower acceptable limits for 1-hour PM2.5 data, taken from the Canada-wide National Air Pollution Surveillance (NAPS) program.

Table 2 - Zero adjustment criteria for different aggregation levels:
- Intervals of less than one hour (all parameters): Before aggregation into hourly averages, all valid negative values should remain negative.
- 1-hour averages (PM2.5): Values between -3 and 0 should be adjusted to 0. Values less than -3 are invalid.
- 1-hour averages (all gases): Values below zero should be adjusted to zero.

By addressing negative values before aggregation and reporting, these criteria help maintain data quality. PM2.5 criteria are from National Air Pollution Surveillance (NAPS).
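The order of operations matters here, so a short Python sketch may help. Sub-hourly values are averaged first with valid negatives left alone, and only the resulting 1-hour average is adjusted. The function names are hypothetical, and the -3 limit for PM2.5 is the one given in Table 2.

```python
def hourly_average(sub_hourly):
    """Average valid sub-hourly values, leaving negatives untouched."""
    return sum(sub_hourly) / len(sub_hourly)


def report_gas(hourly):
    """Valid negative 1-hour gas averages are reported as zero."""
    return max(hourly, 0.0)


def report_pm25(hourly):
    """Apply the Table 2 criteria to a 1-hour PM2.5 average.

    Returns None when the value is invalid (less than -3).
    """
    if hourly < -3:
        return None
    return max(hourly, 0.0)


# Made-up sub-hourly values that include one valid negative reading.
samples = [-2.0, 1.0, 4.0, 1.0]
average = hourly_average(samples)                                 # negatives kept
clamped_first = hourly_average([max(s, 0.0) for s in samples])    # wrong order
```

Here the correct hourly average is 1.0, while zeroing the negative sample first would give 1.5, which shows why adjusting before aggregation biases the result upward.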

4.4 Level 2 - Final Validation
Level 2 - Final Validation involves a thorough review of data to identify any anomalies. Data validity is determined by checking for relationships between dependent and independent data. Here's what you need to know:
- Analyze and investigate anomalous data or outliers collected from validated continuous ambient air data.
- Logs should be kept for decisions regarding anomalous data or outliers, along with justifications.
- Unless there's compelling evidence to the contrary, inconsistent data should be considered valid.
- An investigation of suspect data might reveal instrument malfunctions or other issues.
- Logs should be used to record and justify data validity decisions.

Plotting time series, especially when multiple parameters are displayed together, can reveal relationships that are hard to see in tables. Check these examples:
- Dependent data relationships include anti-correlation between O3 and NO, O3 levels increasing with UV and temperature, and pollutant events affecting multiple parameters.
- By showing consistency across large geographic areas, independent data sets can be used to validate meteorological data.

By evaluating both dependent and independent data relationships and conducting thorough investigations when anomalies or outliers are detected, Level 2 validation supports the quality and consistency of data.
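Alongside time series plots, a simple correlation can put a number on a dependent relationship such as the anti-correlation between O3 and NO. The Python sketch below computes a Pearson correlation coefficient by hand. The readings are made up for illustration, and a weak or positive result would only be a prompt to investigate, not proof that data are invalid.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return None  # a constant series has no defined correlation
    return cov / (spread_x * spread_y)


# Made-up hourly values: O3 falls as NO rises.
o3 = [30.0, 25.0, 10.0, 5.0, 20.0]
no = [2.0, 5.0, 20.0, 25.0, 8.0]
r = pearson(o3, no)
```

For these made-up values the coefficient is strongly negative, which is the pattern the text describes for O3 and NO.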

4.5 Level 3 - Independent Assessment
Independent Assessment is a final review of validated data by someone who isn't involved in field operations or primary data validation. This review is to make sure data gets an independent quality assurance review before submission. The key points are:
- Submitted data should be reviewed by an independent person who isn't involved in field operations or primary data validation.
- As per the AMD Reporting Chapter, the person responsible should record and report certification statements from the independent reviewer.
- To evaluate data based on expected and historical behavior, the independent reviewer should have some knowledge of pollutant and meteorological behavior.
- Time series plots should be used for manual review of validated data reduced to hourly averages.
- Any suspect data can be communicated to the data validator for investigation, data validation modifications, or justification.

All reports (monthly/annual) and data must be certified, as described in the AMD's Reporting Chapter. Validation at Level 3 gives the data a final, independent quality check before submission.

4.6 Procedures for Post-Final Validation - Annual Validation
Annual Validation - This is a step to re-evaluate data after the initial validation to catch any errors or omissions. Here's what you need to know:
- Data validation may contain errors or omissions despite our best efforts.
- Annual data reviews help identify issues or patterns that may not have been obvious on a monthly basis.
- Before submitting an annual report to the regulator, the person responsible should review all validated data for the previous calendar year.

Review includes annual charts and basic statistics, including comparisons to historical mean, maximum, and minimum values. Independent review certification (DQ 4-P) should indicate who conducted this annual validation review.
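The basic statistics part of the annual review can be sketched in Python as follows. The function names and the historical ranges are hypothetical. A statistic that falls outside its historical range is only something to look at more closely, since it may reflect a real event rather than a data problem.

```python
def annual_summary(values):
    """Mean, maximum, and minimum of the valid (non-missing) values."""
    valid = [v for v in values if v is not None]
    return {
        "mean": sum(valid) / len(valid),
        "max": max(valid),
        "min": min(valid),
    }


def outside_history(summary, historical):
    """List the statistics that fall outside their historical (low, high) range."""
    return [
        name
        for name, (low, high) in historical.items()
        if not low <= summary[name] <= high
    ]


# Made-up validated data and made-up historical ranges.
summary = annual_summary([1.0, None, 3.0, 8.0])
history = {"mean": (2.0, 5.0), "max": (4.0, 7.0), "min": (0.0, 2.0)}
to_review = outside_history(summary, history)
```

In this example only the maximum falls outside its historical range, so that is the value a reviewer would look into first.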

As outlined in the Reporting Chapter of the AMD, annual reports will include a report certification form to confirm this review has been completed. Before submitting annual reports to the regulator, the validation review serves as an additional quality assurance step.

5.0 References

"NAPS 2002," dated January 2002. The memo seems to be for several people, including NAPS Network Managers, Workshop Participants, and NAPS Data Analysts. This document is described as a "Personal Communication," which indicates it's not publicly available or published, but rather a direct communication or memo shared with specific people in the NAPS (National Air Pollution Surveillance) network.

This memo would be tailored to the recipients and the context of the NAPS program, which likely deals with air quality monitoring and data analysis. It may contain information, guidelines, or instructions about the activities or responsibilities of the recipients.

We know how important accurate and reliable data is to your operations, especially when it comes to air quality.  We wanted to let you know about the data validation services Calvin Consulting Group Ltd. offers, designed to make your data management more effective.

Why is Data Validation Important?

Validating and verifying data is key to accuracy.  Providing Quality Assurance Plans (QAP) and audit consulting services tailored to the specific needs of air quality monitoring operations is what Calvin Consulting Group Ltd. does.  

Here's what sets us apart:

We implement step-by-step procedures that align with the Alberta Air Quality Monitoring Directive.  We ensure a transparent and defensible audit trail for your data by adhering to best practices, using validation codes, and keeping meticulous logs.

Here's what we do:

We help you develop and implement Quality Assurance Plans (QAPs) so your data validation processes are in line with industry standards.  Consulting: Our team conducts thorough audits of your data validation processes, identifying areas for improvement and ensuring compliance.  We understand that every monitoring station has its own needs.

Feel free to contact us:

Improve the quality and accuracy of your air quality data.  Get in touch with Calvin Consulting Group Ltd. by replying to this email...

Let's explore your needs and come up with a plan that works for you.  You can also visit Chapter 6 of the Alberta Air Quality Monitoring Directive to learn more about data validation and why it is important, and to see whether our services might help your organization maintain the highest standards of data accuracy.

This guide explains why data validation is important.

From the initial verification stages to the final independent assessment, it walks you through the meticulous process of maintaining data accuracy and reliability.  The guide discusses validation codes, logs, and criteria, emphasizing the importance of each validation level in ensuring completeness and consistency.




Thank you to my research and writing assistants, ChatGPT and WordTune, as well as Wombo and others for the images.

GPT-4, OpenAI's large-scale language generation model (and others provided by Google and Meta), helped generate this text.  As soon as draft language is generated, the author reviews, edits, and revises it to their own liking and is responsible for the content.