
P-Value Calculator

Calculate the p-value from a z-score or t-score for one-tailed or two-tailed hypothesis tests. The calculator reports whether the result is significant at the α = 0.05 and α = 0.01 thresholds.
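The calculation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the calculator's actual implementation. The function names (z_p_value, t_p_value, significance) are hypothetical, and the t-distribution tail is approximated by Simpson's rule rather than by the closed-form incomplete beta function a statistics library would use. The one-tailed result assumes the observed effect lies in the hypothesised direction.

```python
import math


def z_p_value(z, tails=2):
    """P-value for a z-score under the standard normal distribution."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # Upper-tail area beyond |z|: P(Z > |z|) = erfc(|z| / sqrt(2)) / 2
    upper = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return upper * tails


def t_p_value(t, df, tails=2, steps=2000):
    """Approximate p-value for a t-score with df degrees of freedom.

    Integrates the Student's t density over [0, |t|] with Simpson's rule
    (steps must be even). Precision degrades for very large |t|, where
    the tail area is tiny compared with rounding error in 0.5 - area.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    if df <= 0 or steps % 2:
        raise ValueError("df must be positive and steps must be even")
    # Log of the density's normalising constant, via log-gamma for stability
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3            # area between 0 and |t|
    upper = max(0.0, 0.5 - central)    # area beyond |t| in one tail
    return upper * tails


def significance(p):
    """Flag significance at the two thresholds the calculator reports."""
    return {"alpha_0.05": p < 0.05, "alpha_0.01": p < 0.01}


if __name__ == "__main__":
    p = z_p_value(1.96)                # two-tailed, close to 0.05
    print(f"{p:.4f}", significance(p))
    print(f"{t_p_value(2.5, df=15, tails=1):.4f}")
```

For a two-tailed test the one-tail area is simply doubled, which is valid because both the normal and t distributions are symmetric about zero.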

Quick Answer
p-value = probability of results at least this extreme if H₀ is true. p < 0.05 → reject H₀ at the 5% significance level. p < 0.01 → reject H₀ at the 1% level (often called "highly significant"). The p-value is not the probability that your hypothesis is true.
Field / Country | Standard α | One- or Two-Tailed? | Governing Body
🇺🇸 US Psychology (APA) | 0.05 | Two-tailed | APA Publication Manual
🇬🇧 UK Medical Research | 0.05 | Two-tailed | NIHR / BMJ / Lancet
🇪🇺 EU Drug Approvals (EMA) | 0.05 (one-sided 0.025) | Two-tailed | ICH E9 Guideline
🇺🇸 US Drug Approvals (FDA) | 0.05 | Two-tailed | FDA / ICH E9
🔭 Physics (particle physics) | 0.0000003 (5-sigma) | N/A | CERN / PDG
🧬 Genetics (GWAS) | 5×10⁻⁸ | Two-tailed | Multiple testing correction

Frequently Asked Questions

What is the reproducibility crisis and how does it relate to p-values?

The reproducibility crisis (or replication crisis) refers to the finding that many published scientific results fail to replicate when the studies are repeated. A 2015 study in Science attempted to replicate 100 psychology studies, and only about 36% of the replications produced statistically significant results. Key contributors are p-hacking (running many tests and reporting only those with p < 0.05), HARKing (hypothesising after the results are known), and publication bias (journals favouring significant results). Major efforts to address the problem include pre-registration of studies (AsPredicted.org, OSF) and the use of Bayesian methods. The UK Medical Research Council and the US NIH both require registration of the clinical trials they fund.
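To see why p-hacking inflates false positives, it helps to compute how often at least one of several tests comes out "significant" when every null hypothesis is true. The sketch below is illustrative only: the function names are hypothetical, the tests are assumed to be independent, and the simulation relies on the fact that p-values are uniformly distributed under a true H₀ for a continuous test statistic.

```python
import random


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_p_hacking(alpha=0.05, n_tests=20, trials=10_000, seed=0):
    """Monte Carlo estimate of the same quantity, with H0 true every time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true H0 each p-value is a uniform draw on [0, 1)
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"{familywise_error_rate(0.05, 20):.3f}")  # closed form, about 0.64
    print(f"{simulate_p_hacking():.3f}")             # simulated estimate
```

With 20 independent tests at α = 0.05, a researcher who reports only the "winners" has roughly a 64% chance of finding something to report even when no real effect exists.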

What is the difference between p-value significance thresholds in physics vs biology?

Particle physics (for example at CERN) requires 5-sigma (p < 0.0000003) for a "discovery"; the Higgs boson announcement used this threshold. The reasons are that (1) results can be checked precisely against theory, (2) experiments run for years and produce massive datasets, and (3) a wrong announcement would be a major setback. Biology and psychology conventionally use the 0.05 threshold, partly because effect sizes are smaller and the data are noisier. Genomics uses the genome-wide significance threshold p < 5×10⁻⁸ to correct for testing about 1 million genetic variants simultaneously (a Bonferroni correction: 0.05 / 1,000,000 = 5×10⁻⁸).
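Both thresholds quoted above can be reproduced with a short calculation. The following sketch uses hypothetical helper names and assumes the particle-physics convention of a one-sided tail when converting sigma to a p-value.

```python
import math


def sigma_to_p(n_sigma, tails=1):
    """Tail probability of a standard normal beyond n_sigma."""
    return tails * 0.5 * math.erfc(n_sigma / math.sqrt(2))


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the overall error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(f"{sigma_to_p(5):.2e}")                       # 2.87e-07
    print(f"{bonferroni_alpha(0.05, 1_000_000):.1e}")   # 5.0e-08
```

The 5-sigma tail area of about 2.87×10⁻⁷ is what the text rounds to 0.0000003, and dividing 0.05 by one million tests gives the genome-wide threshold of 5×10⁻⁸.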