Tuesday, May 8, 2012

Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022




http://ow.ly/aLTMu

An article by Ralph Losey, Esq. posted on his blog e-Discovery Team®.

This article discusses sampling, and provides detailed analysis and future projections based on mathematical formulas provided by the author.  Based on the author's projections, an expected 270,000 to 330,000 attorneys will be using random samples in attorney review by 2022, with some form of technology-assisted review applied after the samples have been tagged.

The author states, "Random sampling is still a rare exception in U.S. legal culture. And therein lies the problem, at least in so far as e-discovery quality control is concerned. Sampling now has a very low prevalence rate.

But those of us in the world of e-discovery are used to that. There are still very few full-time specialists in e-discovery. This is changing fast. It has to in order for the profession to cope with the exploding volume and complexity of written evidence, meaning of course, evidence stored electronically. We e-discovery professionals are also used to the scarcity of valuable evidence in any large e-discovery search. Relevant evidence, especially evidence that is actually used at trial, is a very small percentage of the total data stored electronically. DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader: only .0074% of e-docs discovered ever make it onto a trial exhibit list). Again, this is a question of low prevalence. So yes, we are used to that. See Good, Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is Irrelevant)."  Links to the other informative articles referenced by Mr. Losey are provided in his article.

The article goes on to state, "Assuming that by the year 2022 there are 1.5 Million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000."
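The arithmetic behind the quoted range can be checked with a short sketch. The function name and the use of whole percentage points are illustrative assumptions, not anything taken from Mr. Losey's article; the inputs (1.5 million lawyers, a 20% point estimate, a 2% interval) come from the quotation above.

```python
def prediction_range(population, estimate_pct, interval_pct):
    """Return (low, point, high) head counts for a percentage estimate.

    Percentages are whole percentage points (20 means 20%), which keeps
    the arithmetic in integers. Hypothetical helper for illustration only.
    """
    low = population * (estimate_pct - interval_pct) // 100
    point = population * estimate_pct // 100
    high = population * (estimate_pct + interval_pct) // 100
    return low, point, high


# Figures quoted in the article: 1.5 million lawyers, 20% +/- 2%.
low, point, high = prediction_range(1_500_000, 20, 2)
print(low, point, high)
```

With those inputs, 18% and 22% of 1.5 million work out to the 270,000 and 330,000 figures in the quotation.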

Mr. Losey provides some intriguing mathematical formulas to support his projections.  In addition, the author explains why the legal profession needs to further embrace random sampling during the discovery phase.  The article states, "For my purposes as an e-discovery lawyer concerned with quality control of document reviews, this explanation of near certainty is the essence of random probability theory. This kind of probabilistic knowledge, and use of random samples to gain an accurate picture of a larger group, has been used successfully for decades by science, technology, and manufacturing. It is key to both quality control and understanding large sets of data. The legal profession must now also adopt random sampling techniques to accomplish the same goals in large-scale document reviews."

The article goes on to discuss issues such as "prevalence", which is the percentage of relevant information within a larger corpus.  The article also provides links to RaoSoft's "calculator", which determines the number of documents that must be reviewed for an adequate sample size, based on the desired confidence level and margin of error.

Mr. Losey states, "Here is one way of expressing the basic formula behind most standard random sample size calculators:
n = Z² x p(1-p) ÷ I²

Description of the symbols in the formula:

n = required sample size

Z = confidence level (The value of Z in statistics is called the “Standard Score,” wherein a 90% confidence level=1.645, 95%=1.96, and 99%=2.577)

p = estimated prevalence of target data (richness)

I = confidence interval or margin of error

Putting the formula into words – the required sample size is equal to the confidence level squared, times (the estimated prevalence times one minus the estimated prevalence), then divided by the square of the confidence interval."
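As a rough illustration of the quoted formula, here is a minimal Python sketch. The function name, the lookup table, and the rounding step are assumptions made for this example; the Z values are the ones quoted above (the 99% score is more commonly rounded to 2.576). The formula as quoted has no correction for small document collections, so the sketch assumes a large population.

```python
import math

# Standard scores as quoted in the article, keyed by confidence level.
# (The 99% value is more commonly rounded to 2.576.)
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.577}


def sample_size(confidence, prevalence, interval):
    """n = Z^2 * p(1-p) / I^2, rounded up to a whole document.

    confidence: 0.90, 0.95 or 0.99 (looked up in Z_SCORES)
    prevalence: estimated proportion of target data, e.g. 0.5
    interval:   confidence interval / margin of error, e.g. 0.02
    Hypothetical helper; assumes a large population.
    """
    z = Z_SCORES[confidence]
    n = z ** 2 * prevalence * (1 - prevalence) / interval ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n, 6))


# 95% confidence, 50% prevalence (the most conservative guess), 2% interval.
print(sample_size(0.95, 0.5, 0.02))
```

A prevalence of 50% maximizes p(1-p), so it gives the largest sample size for a given confidence level and interval; a lower estimated prevalence yields a smaller required sample.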

P.S.  The conclusion section of the article also provides a nice recap of the formulas relied upon by the author, and methods used to provide the anticipated results for the year 2022.  
