Source: UNIV OF WISCONSIN submitted to
DETECTING PROBLEMS IN SURVEY DATA USING BENFORD'S LAW
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
0206933
Grant No.
(N/A)
Project No.
WIS01038
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Sep 1, 2006
Project End Date
Aug 31, 2009
Grant Year
(N/A)
Project Director
Schechter, L.
Recipient Organization
UNIV OF WISCONSIN
21 N PARK ST STE 6401
MADISON,WI 53715-1218
Performing Department
AGRI & APPLIED ECONOMICS
Non Technical Summary
Economists and other scientists depend on survey data for their analysis, but this data may include many biases. This project proposes to develop a tool for detecting biases in survey data and to use it to address three questions. First, are certain types of questions more susceptible to biases? Second, how does the quality of data collected by government organizations compare to that of data collected by academic researchers? Lastly, is it possible to identify enumerators who are not doing a good job early in the collection process?
Animal Health Component
(N/A)
Research Effort Categories
Basic
10%
Applied
30%
Developmental
60%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
609                   2410                             2090                     10%
609                   3910                             2090                     10%
609                   5299                             2090                     10%
609                   7299                             2090                     50%
609                   7310                             2090                     20%
Goals / Objectives
This research will develop tools to detect respondent bias and fraudulent data in farm survey data. This is of great importance for the USDA, which both collects farm household survey data and buys data sets from outside sources. Benford's law describes the distribution of significant digits observed in naturally occurring data. Numbers with a first digit of one are observed more often than those starting with two, which in turn are observed more often than those starting with three, and so on (a first digit of one occurs about 30% of the time, a first digit of nine under 5%). I will use Benford's law to test for biases in two types of farm surveys: a) data collected by academic researchers, and b) data collected by national statistical bureaus in developing countries. Given the current trend for academics to collect their own data both in the US and in other countries, it will be of use to know how the quality compares between the different types of surveys. Using these data sets there are many questions I can answer. First, are certain types of questions more susceptible to biases across a variety of settings? Second, how does the quality of data collected by government organizations compare to that of data collected by academic researchers? Third, can enumerators themselves identify biased data while collecting it? Lastly, is it possible to identify enumerators who are not doing a good job early in the collection process? This alone would be quite useful, but sometimes researchers have only limited information about the data. Maximum entropy can be used to recover the unknown probability distribution for underdetermined problems. In cases with only limited information I will use maximum entropy to recover the underlying distribution of digits.
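As a point of reference for the description above, Benford's law assigns probability log10(1 + 1/d) to a first significant digit d. The short Python sketch below computes these probabilities and extracts first digits from raw values. The helper names (benford_first_digit_probs, first_digit) are illustrative assumptions and are not taken from the project's own software.

```python
import math


def benford_first_digit_probs():
    """Return {d: P(first digit = d)} for d = 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """First significant digit of a nonzero number (hypothetical helper)."""
    x = abs(float(x))
    if x == 0:
        raise ValueError("zero has no first significant digit")
    # Scientific notation places the first significant digit in position 0.
    # 15 decimals avoids rounding values such as 9.9999999 up to 1.0e+01.
    return int(f"{x:.15e}"[0])


probs = benford_first_digit_probs()
for d, p in probs.items():
    print(d, round(p, 3))
```

The probabilities sum to one because the terms log10((d + 1)/d) telescope to log10(10).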
Project Methods
A first goal of this research project will be to examine whether farm household survey data conforms to Benford's law, and to determine which questions are most susceptible to biases. In this proposal I perform such a statistical exercise on a few variables in a data set collected by academic researchers in rural Paraguay. First I examine whether an aggregated variable such as total income conforms to Benford's law. This variable was not directly recorded by enumerators, but comes from manipulations of variables recorded by enumerators. The results suggest we cannot reject the hypothesis that the data on total income follows the Benford distribution at the 10% significance level. In addition to testing the aggregated data, we can test the data for which the enumerator wrote down the answer directly as stated by the farmer. The farmer was asked how much of each crop he harvested in the past year. I pool the quantities produced of all possible crops. I find that more quantities produced begin with a five than would be suggested by Benford's law, while fewer begin with a four, six, or seven. I can reject the hypothesis that this data comes from the Benford distribution. Because there were three enumerators, we might worry that one of the enumerators was not doing his job properly. I find that the results for enumerator 3 are slightly worse than those for the other two enumerators, but all three enumerators are reporting more first digits of five than would be implied by Benford's law. This suggests that the enumerators were not falsifying data, and that this is instead a case of respondent bias. It probably arises because farmers are not always sure exactly how much of a crop they have harvested and so tend to choose 'nice' round numbers. A farmer is more likely to claim to have harvested 500 kilos of corn than 422 kilos.
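The kind of check described above can be sketched as a goodness-of-fit test on first digits. The Python example below uses a Pearson chi-square statistic with 8 degrees of freedom; this choice of test is an assumption made for illustration, since the report does not state which test statistic was used. All function names are hypothetical, and the sample records are made up to mimic heaping on 500; they are not the Paraguay data.

```python
import math
from collections import Counter, defaultdict

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Standard chi-square critical values for 8 degrees of freedom.
CHI2_CRIT_8DF = {0.10: 13.362, 0.05: 15.507}


def first_digit(x):
    """First significant digit of a nonzero number."""
    return int(f"{abs(float(x)):.15e}"[0])


def benford_chi2(values):
    """Pearson chi-square statistic of first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in zip(range(1, 10), BENFORD)
    )
    return stat, n


def rejects_benford(values, alpha=0.10):
    """True if conformity with Benford's law is rejected at level alpha."""
    stat, _ = benford_chi2(values)
    return stat > CHI2_CRIT_8DF[alpha]


def chi2_by_enumerator(records):
    """records: iterable of (enumerator_id, value); one statistic per id."""
    groups = defaultdict(list)
    for enumerator_id, value in records:
        groups[enumerator_id].append(value)
    return {k: benford_chi2(v)[0] for k, v in groups.items()}


# Made-up harvest quantities (kilos) with heavy heaping on 500.
heaped = [500] * 60 + [100] * 20 + [200] * 10 + [300] * 10
print(rejects_benford(heaped))

# Made-up records for three enumerators, each showing the same heaping.
records = [(e, q) for e in (1, 2, 3) for q in heaped]
print(chi2_by_enumerator(records))
```

Running the statistic separately for each enumerator, as in chi2_by_enumerator, mirrors the comparison in the text: similar deviations across all enumerators point toward respondent behavior rather than one enumerator's work.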
These are just a few examples of how Benford's law can be used to detect biases in data and to determine which types of questions are prone to these biases. In this research project I will examine multiple farm household surveys in multiple countries more systematically. Benford's law can be of use to practitioners in detecting respondent bias or falsified data even when they do not have access to an entire data set, but instead know only limited summary statistics about the data. The physicist Jaynes (1957) developed classical maximum entropy, a technique that can be used to estimate unknown probabilities on the basis of partial knowledge. Although this technique was first introduced by physical scientists, econometricians have extended the original techniques to recover information about economic systems. In cases with only limited information I will use maximum entropy to recover the underlying distribution of digits and then compare that distribution with the distribution implied by Benford's law. By developing these tools using an actual data set I can see which pieces of limited information are the most useful in determining the underlying distribution and its concordance with Benford's law.
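To illustrate the maximum entropy idea, the Python sketch below recovers a first-digit distribution when the only available summary statistic is the mean first digit. That single-constraint setup is an assumption chosen for illustration; the report does not say which summary statistics the project uses. Under a mean constraint the maximum entropy solution has the exponential form p_d proportional to exp(-lam * d), and lam is found here by bisection. The function name maxent_digit_dist is hypothetical.

```python
import math

DIGITS = range(1, 10)
BENFORD = [math.log10(1 + 1 / d) for d in DIGITS]


def maxent_digit_dist(mean_digit, tol=1e-12):
    """Maximum-entropy distribution on digits 1..9 with a given mean.

    The solution has the form p_d proportional to exp(-lam * d);
    lam is found by bisection on the implied mean.
    """
    if not 1 < mean_digit < 9:
        raise ValueError("mean first digit must lie strictly between 1 and 9")

    def mean_for(lam):
        w = [math.exp(-lam * d) for d in DIGITS]
        return sum(d * wd for d, wd in zip(DIGITS, w)) / sum(w)

    lo, hi = -50.0, 50.0  # the implied mean is decreasing in lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) > mean_digit:
            lo = mid  # implied mean too high, so lam must increase
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(-lam * d) for d in DIGITS]
    z = sum(w)
    return [wd / z for wd in w]


# Mean first digit implied by Benford's law, used here as the known summary.
benford_mean = sum(d * p for d, p in zip(DIGITS, BENFORD))
recovered = maxent_digit_dist(benford_mean)

# Total absolute gap between the recovered distribution and Benford's law.
gap = sum(abs(r - b) for r, b in zip(recovered, BENFORD))
print(round(benford_mean, 4), round(gap, 4))
```

The recovered distribution matches the supplied mean but need not coincide with Benford's law, which is why the text proposes comparing the two and asking which pieces of limited information are most informative.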

Progress 09/01/06 to 08/31/09

Outputs
OUTPUTS: I presented the results from this research at the BREAD conference, Purdue's Agricultural Economics department seminar, and the UW Madison Agricultural Economics department seminar. In addition, we created a website, www.checkyourdata.com, at which anyone can check their data quality using the statistical techniques we developed. PARTICIPANTS: Laura Schechter, PI, led the research on this topic. John Morrow, Research Assistant, conducted data analysis and created the website www.checkyourdata.com. Alex Yuskavage, Research Assistant, conducted data analysis and statistical development of the tests used. George Judge, collaborator. Marian Grendar, collaborator. TARGET AUDIENCES: The creation of the website www.checkyourdata.com was the main effort made to reach out to the target audience. The target audience is anybody who collects or uses survey data. The website is available to anyone who would like to test the quality of his or her data. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
The main outcome of this research was that we developed new statistical techniques for measuring data quality. These techniques can be carried out while collecting data to improve data quality.

Publications

  • "Detecting Problems in Survey Data using Benford's Law," G. Judge and L. Schechter, 2009. Journal of Human Resources, 44(1): 1-24.
  • "An Empirical Non-Parametric Likelihood Family of Data-Based Benford-Like Distributions," M. Grendar, G. Judge and L. Schechter, 2007. Physica A: Statistical Mechanics and its Applications, 380: 429-438.


Progress 01/01/08 to 12/31/08

Outputs
OUTPUTS: I have presented this research in multiple seminars, including at Purdue's Ag Econ department, the University of Wisconsin's Ag Econ department, and the BREAD conference at Yale University. We have also created a website, www.checkyourdata.com, at which individuals can plug in their own data sets and test their conformity with Benford's law. PARTICIPANTS: Not relevant to this project. TARGET AUDIENCES: The target audience for this project is any researcher who uses survey data in their analysis. The main outreach effort is the website, www.checkyourdata.com, which makes these data quality tests easily available to the general population of researchers (or anyone else). PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
There are not yet any outcomes/impacts.

Publications

  • "Detecting Problems in Survey Data using Benford's Law," G. Judge and L. Schechter, 2009. Journal of Human Resources, forthcoming.


Progress 01/01/07 to 12/31/07

Outputs
OUTPUTS: This research was presented at an internal seminar in the Agricultural and Applied Economics department at UW Madison, as well as at the Ninth BREAD Conference on Development Economics at Yale University. In addition, a website has been created (www.checkyourdata.com) which allows users to upload their data and measure data quality. TARGET AUDIENCES: The target audience for this project is any researcher who uses survey data in their analysis. The main outreach effort is the website, www.checkyourdata.com, which makes these data quality tests easily available to the general population of researchers (or anyone else).

Impacts
There are not yet any outcomes/impacts.

Publications

  • "An Empirical Non-Parametric Likelihood Family of Data-Based Benford-Like Distributions," M. Grendar, G. Judge and L. Schechter, 2007. Physica A: Statistical Mechanics and its Applications, 380: 429-438.


Progress 09/01/06 to 12/31/06

Outputs
We are in the process of examining seven different agricultural household survey data sets from around the world. We are comparing the quality of data on crop and animal production. We have been finding that small data sets collected by academics tend to be of better quality than larger data sets collected by government institutes. The data is also of worse quality for more diversified farmers. There do not seem to be large quality differentials between results when both the husband and the wife report.

Impacts
The expected impact is to give practitioners an easy-to-use tool with which they can judge the quality of data sets, enumerators, and specific survey questions.

Publications

  • No publications reported this period