ISO/IEC 17043: Choosing the Right Statistical Methods for Proficiency Testing
Last Updated on September 25, 2025 by Melissa Lazaro
If you’re running proficiency testing (PT) schemes under ISO/IEC 17043, then you already know the technical side can get… overwhelming. Especially when it comes to statistics.
I’ve worked with dozens of PT providers, and the number one question I get when they’re setting up or revising their PT scheme is:
“What statistical method should we use to evaluate participant results?”
It’s a good question—and an important one. Because if you choose the wrong method or can’t justify it during your accreditation assessment, it can lead to findings or even loss of credibility with participants.
So in this article, I’ll walk you through the most common and accepted statistical approaches in ISO/IEC 17043. We’ll talk about how to choose, how to justify your choice, and how to avoid common mistakes.
No complicated math, no dry theory—just real-world guidance that works.
Why Statistics Matter in Proficiency Testing
It’s not just about numbers—it’s about fairness
Statistical evaluation is at the heart of any PT scheme. It’s what turns raw participant data into meaningful insight. Without it, you can’t:
- Score participant performance
- Identify outliers
- Compare results objectively
- Improve future schemes
In other words, the statistics you choose determine whether your PT scheme is truly fit for purpose.
What ISO/IEC 17043:2023 says
The standard requires you to apply appropriate statistical methods for:
- Assigning values
- Evaluating performance
- Reporting results
But it doesn’t dictate exactly which method to use. That’s up to you—as long as it makes sense, matches your scheme’s purpose, and is justifiable.
Before You Pick a Method: Ask These Questions
1. What kind of data are you evaluating?
Is it quantitative (numerical, like mg/L or °C) or qualitative (pass/fail, present/absent, correct/incorrect)?
Your method depends heavily on this distinction.
2. How many participants are in the scheme?
Some methods (like calculating standard deviation) only work well when you have a larger number of results. If you’re running a small round with 5–8 participants, robust or expert-derived statistics may be more appropriate.
3. What’s the nature of the test item?
Is it prone to degradation or instability? Were the homogeneity and stability tests clean? The more confident you are in your test item, the more confidently you can apply tight evaluation criteria.
4. Does your data appear normally distributed?
This is critical. A lot of PT providers use z-scores assuming their data fits a normal (bell-shaped) distribution—but don’t actually test that assumption. If the data isn’t normal, you’ll need robust or non-parametric alternatives.
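To make that check concrete, here is a minimal Python sketch using SciPy's Shapiro-Wilk test. The participant results and the 0.05 cut-off are illustrative assumptions, not values from any standard:

```python
# Minimal sketch: checking normality before committing to z-scores.
# Requires SciPy. The results list is illustrative dummy data.
from scipy import stats

results = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.7, 5.3, 5.1, 4.9, 5.0, 5.2]

stat, p_value = stats.shapiro(results)
if p_value < 0.05:
    print(f"p = {p_value:.3f}: normality is doubtful; consider robust statistics.")
else:
    print(f"p = {p_value:.3f}: no evidence against normality; z-scores are reasonable.")
```

A histogram or a normal probability plot alongside the test is often more persuasive to assessors than a p-value alone.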
Commonly Used Statistical Methods (That Actually Work)
Let’s break down the main ones you’ll come across—and when to use them.
For Quantitative PT Schemes
Z-scores
Probably the most commonly used performance metric. It compares a participant’s result to the assigned value, scaled by the standard deviation for proficiency assessment (SDPA).
- Z = (Result – Assigned Value) / SDPA
Use it when:
- You have 12+ participants
- Data is roughly normally distributed
- You have clear, validated SDPA values
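As a minimal sketch of that calculation (the assigned value, SDPA, and lab results below are invented for illustration):

```python
# Minimal sketch: z-scores per the formula above.
# assigned_value and sdpa would come from your scheme design
# (e.g., ISO 13528 methods); the numbers here are illustrative.
assigned_value = 5.0   # assigned value for the test item
sdpa = 0.2             # standard deviation for proficiency assessment

participant_results = {"Lab A": 5.1, "Lab B": 4.5, "Lab C": 5.7}

for lab, result in participant_results.items():
    z = (result - assigned_value) / sdpa
    print(f"{lab}: z = {z:+.2f}")
```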
En-scores (En numbers)
Often used when measurement uncertainty is involved. It’s popular in calibration or metrology-based PT schemes.
- En = (Result – Assigned Value) / √(U_lab² + U_assigned²)
Use it when:
- Measurement uncertainty is critical
- Your scheme involves metrology, chemistry, or high-precision work
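Here is a minimal sketch of the same calculation. The uncertainties are illustrative expanded uncertainties, and the conventional acceptance rule is |En| ≤ 1:

```python
import math

# Minimal sketch: En numbers per the formula above.
# u_lab and u_assigned are expanded uncertainties; values are illustrative.
assigned_value = 10.00
u_assigned = 0.05      # expanded uncertainty of the assigned value

result = 10.08
u_lab = 0.06           # participant's reported expanded uncertainty

en = (result - assigned_value) / math.sqrt(u_lab**2 + u_assigned**2)
print(f"En = {en:+.2f}  ({'satisfactory' if abs(en) <= 1 else 'unsatisfactory'})")
```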
Robust statistics
These include:
- Robust means
- Median absolute deviation (MAD)
- Huber estimators
Use them when:
- You have outliers
- Your data doesn’t follow a normal distribution
- Your participant group is small or highly variable
Example:
An environmental testing provider I worked with had 9 participants in one PT round. Two of them reported unusually high results. They used robust mean and MAD instead of traditional mean and SD, which prevented the outliers from skewing the evaluation.
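A minimal sketch of that approach, using the median and scaled MAD in place of the mean and SD (the data is invented to loosely mirror that nine-participant round):

```python
import statistics

# Minimal sketch: robust z-scores using the median and scaled MAD,
# so a couple of extreme results don't dominate the evaluation.
results = [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.1, 18.5, 19.2]

center = statistics.median(results)
mad = statistics.median(abs(x - center) for x in results)
robust_sd = 1.4826 * mad   # scale factor makes MAD comparable to SD for normal data

for x in results:
    z = (x - center) / robust_sd
    print(f"result {x:5.1f}: robust z = {z:+.2f}")
```

Notice that the two high results get large robust z-scores while the scoring of the other seven stays sensible, which is exactly the point.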
For Qualitative PT Schemes
Consensus-based evaluation
This is common when participants are asked to identify a contaminant, observe a color change, or give a categorical answer.
- If 18 out of 20 participants say “positive,” the consensus is “positive.”
Use it when:
- There’s no quantitative measurement
- You want to see if participants reach the same conclusion
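A minimal sketch of consensus scoring follows; the 80% agreement threshold is an illustrative assumption that your scheme plan would need to define and justify:

```python
from collections import Counter

# Minimal sketch: deriving a consensus value from categorical answers.
# The answers and the 80% threshold are illustrative.
answers = ["positive"] * 18 + ["negative"] * 2

counts = Counter(answers)
mode, n = counts.most_common(1)[0]
if n / len(answers) >= 0.80:
    print(f"Consensus value: {mode} ({n}/{len(answers)} agree)")
else:
    print("No consensus reached; consider expert adjudication.")
```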
Scoring systems or matrices
Some PT providers use custom scoring systems—e.g., 2 points for correct ID, 1 point for partial, 0 for incorrect.
Use it when:
- You’re dealing with multiple-choice, descriptive, or observational results
- A simple “correct/incorrect” system doesn’t reflect the nuance of the task
Example:
A food testing PT provider evaluating allergen detection gave 2 points for “correct allergen named,” 1 point for “general category correct,” and 0 for incorrect. This helped give a more balanced view of performance.
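A minimal sketch of such a matrix; the categories and point values mirror the allergen example above but are otherwise illustrative:

```python
# Minimal sketch: a simple scoring matrix for categorical PT results.
# Categories and point values are illustrative; yours belong in the scheme plan.
SCORES = {"correct_allergen": 2, "general_category": 1, "incorrect": 0}

submissions = {"Lab A": "correct_allergen", "Lab B": "general_category", "Lab C": "incorrect"}

for lab, outcome in submissions.items():
    print(f"{lab}: {SCORES[outcome]} point(s) for '{outcome}'")
```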
How to Justify the Statistical Method You Use
This is where most providers go wrong—not because they picked the wrong method, but because they couldn’t explain why they picked it.
What assessors look for:
- That your method fits the type and size of your data
- That it’s consistent across PT rounds
- That you’ve documented your logic somewhere—ideally in a PT Scheme Plan or SOP
Practical ways to justify your method:
- Use historical data to support chosen SDPA or scoring thresholds
- Show that the data supports a normal distribution (or explain why you used robust stats instead)
- Reference accepted guidelines (e.g., ISO 13528 for statistical design)
Example:
A PT provider I supported used a robust mean and MAD because their participant group was highly variable. They included a paragraph in their scheme documentation explaining that decision. The assessor had no issue.
How to Report Results Clearly
Tell the story—not just the stats
Participants aren’t statisticians. Even experienced labs appreciate a clear, plain-language explanation of how you calculated results and what the scores mean.
Good reports include:
- The assigned value and how it was determined
- The performance metric used (z, En, etc.)
- What the scores mean (e.g., |z| ≥ 3 = action required)
- A visual summary, like a histogram or table with color coding
- Guidance on what participants should do if their result was “questionable” or “unsatisfactory”
Example:
One PT provider added a short interpretation guide at the end of their report:
“A z-score between −2 and +2 means satisfactory. Between −3 and −2, or between +2 and +3, means questionable. At or beyond ±3, the result is unsatisfactory: please review your method.”
Participants appreciated the clarity, and follow-up questions dropped significantly.
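If you want to automate that interpretation, here is a minimal sketch that maps z-scores to the conventional ISO 13528 performance bands quoted above:

```python
# Minimal sketch: turning z-scores into plain-language ratings,
# following the conventional ISO 13528 bands.
def rate(z: float) -> str:
    if abs(z) <= 2:
        return "satisfactory"
    if abs(z) < 3:
        return "questionable"
    return "unsatisfactory - please review your method"

for z in (0.4, -2.6, 3.1):
    print(f"z = {z:+.1f}: {rate(z)}")
```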
Pro Tips for Smarter Statistical Planning
- Always check your data distribution. Don’t assume your values are normally distributed: run a histogram, use a Shapiro-Wilk test, or just visualize the data.
- If you have fewer than 10 participants, avoid the traditional mean and SD. Go robust: use the median, MAD, or even expert-derived values.
- Maintain a Statistical Methods SOP. It’s one of the first things an assessor will ask for. Keep it updated and include the rationale for each method.
- Keep a log of the statistical methods used in each PT round. This shows consistency and helps you defend your decisions during audits.
- When in doubt, reference ISO 13528. It’s the supporting statistical guidance for ISO/IEC 17043 and offers accepted techniques.
Common Mistakes to Avoid
- Using z-scores without checking for outliers. One extreme result can distort your assigned value and SDPA if you’re not using robust stats.
- Not justifying how you chose your SDPA. Saying “we used 0.2” isn’t enough. Was it based on historical data? Consensus? Expert input?
- Applying complex stats that participants don’t understand. Your goal is not to show off; it’s to give clear, fair feedback that participants can use.
- Mixing methods without reason. If you switch from z-scores to En scores, or from the mean to the median, document why and when you did it.
FAQs
Q: Can I use the mean as the assigned value if the data isn’t normally distributed?
Only if you’ve tested for normality and confirmed it’s safe. Otherwise, use a robust mean or expert value.
Q: What if I only have 5–6 participants?
That’s tricky. Use robust statistics or expert-derived values—and widen your performance criteria slightly. Make sure it’s documented.
Q: Do I need a statistician on staff?
Not necessarily. But someone on your team should be able to explain and defend your statistical methods—clearly and confidently.
Final Thoughts
Choosing the right statistical method for PT evaluation isn’t about finding the “perfect” formula. It’s about choosing a method that fits your data, your participants, and your test item—and then documenting your rationale clearly.
I’ve seen labs struggle for months over stats… and I’ve seen others simplify their process, justify their methods well, and breeze through audits. The difference? Clarity and consistency.
If you’re still unsure which method to use—or want to review your current process—I’ve created a PT Statistics Method Summary Sheet that outlines common methods, when to use them, and how to justify them for audit.
Want a copy? Let me know and I’ll send it your way.
Or if you’d prefer a full review of your PT scheme design, I’m happy to help.
Let’s make ISO/IEC 17043 statistics simple—and something you can actually feel confident about.
Melissa Lazaro is a seasoned ISO consultant and an enthusiastic advocate for quality management standards. With rich experience in conducting audits and providing consultancy services, Melissa specializes in helping organizations implement and adapt to ISO standards. Her passion for quality management is evident in her hands-on approach and deep understanding of the regulatory frameworks. Melissa’s expertise and energetic commitment make her a sought-after consultant, dedicated to elevating organizational compliance and performance through practical, insightful guidance.