Kathryn Newcomer, Professor and Associate Director
Trachtenberg School of Public Policy and Public Administration
George Washington University
Over the last decade, social scientists in the U.S. and Western Europe have been promoting evidence-based policymaking. They argue that randomized controlled trials (RCTs), or true experiments with random assignment and control groups, are the “gold standard” for evaluating programs. Over this same period, legislative as well as executive initiatives have required managers in federal agencies to measure and evaluate the results of their programs. As the Government Accountability Office (GAO) has documented in a series of reports, federal agencies have made much progress in measuring and reporting on performance over these years. (See most recently GAO-08-1026T.)
Under the George W. Bush Administration, Office of Management and Budget (OMB) examiners have been assessing program effectiveness with the Program Assessment Rating Tool (PART), and have assigned scores to the programs based on the evidence brought to bear by the agency staffs. Thus many staff and managers in both the agencies and OMB have gained experience in assessing the quality of evidence on program results. So what have we learned about collecting and assessing evidence about programs?
• In this era of “evidence-based policymaking,” deciphering what constitutes valid, relevant and reliable evidence is neither easy nor straightforward.
• Attributing measurable outcomes, or impact, to programs is virtually impossible much of the time.
• Advocating a “one size fits all” approach to evaluation (assessment) of public programs is neither effective nor desirable.
In many cases, federal agencies measuring performance on an ongoing basis must rely on a large number of state, local and nonprofit agencies to collect and report data, which raises concerns about the comparability and consistency of collection criteria. And as pressure has risen to measure the outcomes, or results, of federal programs rather than simply workload measures, or outputs, managers often confront the reality that the time required to validly capture outcomes exceeds the length of the reporting period.
Measuring the net impact of a program requires ruling out, through research design and/or careful analysis of data, the many external factors that also affect the ultimate behaviors of the citizens or neighborhoods served. That is extremely difficult to accomplish, even with attempts to constitute comparison groups of citizens or neighborhoods that were not served, and in many cases, such as programs to improve the quality of air and water, it is not possible. It is more realistic for analysts to use both data and logic to make the case for plausible attribution of outcomes to programs, without claiming that they can demonstrate causal links.
When OMB examiners were required to assess evidence on program effectiveness as part of the PART process, they were advised to look and ask for evidence from RCTs; agency staff, in turn, were told to use RCTs to the greatest extent possible in evaluating their programs. Guidance provided to OMB examiners, and to agency staff, largely praised and described RCTs – not the variety of evaluation strategies appropriate for evaluating sometimes rather complex systems.
In March 2008, a task force of experts from the American Evaluation Association reviewed the OMB guidance on evaluation and concluded that “a more balanced and considered presentation of the role of RCTs in assessing the effectiveness of federal programs was needed,” and recommended that a “balanced presentation of the spectrum of appropriate, rigorous, evaluation methods will improve the likelihood of selecting appropriate measures and methods to assess and improve program performance.” It should be noted that within the evaluation profession, arguments over the superiority of RCTs continue to rage, contributing to the evidence wars.
What’s Next?
Let’s hope that the next President’s approach to assessing programmatic performance builds on what has been learned from the experience with GPRA and PART. Federal agencies are measuring too much in response to legislative and executive requirements. We now need to focus on the more relevant and valid measures rather than measure everything that can be measured. OMB should not be tasked with assessing the effectiveness of all federal programs, nor should it provide one-sided guidance on evaluation methods to the agencies. We have learned from the PART experience that the responsibility for program evaluation should be placed on the top leadership in the agencies, not on the shoulders of OMB examiners. Obtaining useful evidence on program effectiveness is neither easy nor cheap.
Kathryn Newcomer is a Professor and Associate Director of the Trachtenberg School of Public Policy and Public Administration at the George Washington University where she also is Co-Director of the Midge Smith Center for Evaluation Effectiveness. She received the Elmer B. Staats Award for Efforts to Improve Accountability in Government, conferred by the National Capital Area Chapter of the American Society for Public Administration in May 2008.