
REPRODUCIBLE RESEARCH IN REAL ESTATE

Abstract. The practice of reproducible research, a central component of the burgeoning ‘‘open science’’ movement, has been thrust into the public spotlight over the past few years. In this paper, I offer an overview of reproducibility in science, review specific concerns for the real estate field, and survey the current policy regarding reproducibility among top real estate journals. Performing research reproducibly requires a change from the status quo and represents an educational issue. Toward that end, I demonstrate reproducible research via a fully documented and freely-available example of a reproducible hedonic price analysis complete with all data, code, and results hosted online.

The broader scientific community and the public at large both benefit in numerous ways from the open and transparent exchange of all data and methods used to generate any individual scientific finding (Fomel and Claerbout, 2009). The greatly enhanced ability of subsequent researchers to verify existing work through reproduction and/or replication is one such benefit resulting from increased transparency. The verification of existing work via reproduction has been deemed a ‘‘cornerstone'' (McCullough, 2009), a ‘‘fundamental tenet'' (Crick, Hall, and Ishtiaq, 2014) and an outright ‘‘requirement'' (Feigenbaum and Levy, 1993a) of legitimate scientific practice. In short, good science is reproducible. Yet, despite these strong calls for reproducibility—studies conducted such that others can recreate the results—and its obvious tie to scientific rigor, empirical studies examining the occurrence of reproducibility in scientific research, regardless of discipline, find that it exists as an exception and not the rule (Anderson, Greene, McCullough, and Vinod, 2008).

Discussions of the crisis of (ir)reproducibility have expanded to the popular media as well. Recent articles in The Economist (Economist, 2013), The New York Times (Carey, 2015), the Boston Globe (Johnson, 2015), and The Atlantic (Yong, 2015) have all highlighted the difficulty that highly regarded studies in a variety of scientific fields have had in being reproduced by scientists other than the original author(s). In other words, concerns over reproducibility are no longer confined to research conference presentations and editorial meetings, but, rather, are discussed openly in some of the most widely distributed print and online media outlets in the world. This ‘‘crisis'' damages the public's trust in research findings from all scientific disciplines and research outlets (Yamada and Hall, 2015).

In a number of disciplines, particularly the natural and life sciences, the past decade has seen a considerable push towards more transparency and, in many cases, strict requirements regarding reproducibility (LeVeque, Mitchell, and Stodden, 2012). In these and other computationally intense fields, mandatory reproducibility has been lauded as almost ‘‘a movement’’ by cautious proponents (Donoho, 2010). Observers in the social sciences (including economics and real estate), however, have highlighted the struggle to increase the frequency of reproducible work in their own fields (Freese, 2007; Hamermesh, 2007; Camfield and Palmer-Jones, 2013).

In this paper, I approach reproducibility and the lack thereof in the context of the real estate discipline. I begin by discussing the place of reproducibility in the larger open science movement, as well as clarifying the differences between reproduction, replication, and re-analysis. The practice of reproducible research, although appealing on many levels, is not without hurdles, some of which are unique to the field of real estate. A discussion of the rationale for reproducible research, its challenges, and the possible solutions follows. I then focus on specific issues within the real estate discipline, including a survey of the current practice and regulations regarding reproducibility in the top real estate journals. I conclude with an example of a fully reproducible hedonic price study examining the impact of view ratings on single-family home prices in King County, Washington. In the spirit of reproducible and open science, access to all data, data provenance, analytic code, process documentation, and an interactive tool for examining the study results are openly hosted online.

Background

The idea that any research finding should be verifiable through reproduction by other researchers has been a core principle of the scientific method for centuries (Rey, 2014). Under this framework, all results, data, and methods of published studies must be accessible to any and all interested parties in such a manner that the study could be recreated with identical results. Over the past century or so, science has, to varying degrees, morphed into a less ‘‘open'' model, one that has been described as ‘‘proprietary'' (Aksulu and Wade, 2010), ‘‘captured'' (Rey, 2014) or ‘‘closed'' (Sui, 2014). As a result, the promotion of, and return to, science as an ‘‘open'' endeavor has gained cross-disciplinary momentum (Royal Society, 2011).

The open science model has many dimensions, one of which is the notion of reproducibility or replicability as a requirement for scientific acceptance (i.e., publication). The terms ‘‘reproducibility'' and ‘‘replicability'' are often used somewhat interchangeably across various scientific disciplines. Within the context of this discussion, I will utilize the definitions that are most common in the social, economic, and computer science fields (Peng, 2011; Boylan, Goodwin, Mohammadipour, and Syntetos, 2015). Reproduction is defined as producing identical quantitative results through the use of identical data and methods. Replication is defined as producing qualitatively similar results through the use of identical methods but different data.

In this context, qualitatively similar results are taken to mean results that are similar in the overall conclusion but may differ slightly in the actual numerical findings.1 Boylan, Goodwin, Mohammadipour, and Syntetos (2015) also note that a third method exists, re-analysis, wherein a researcher(s) uses identical data from a previous study, but employs a different method. There is no a priori expectation that results of a re-analysis will match those of the previous study, although widely divergent conclusions suggest spurious results in either or both studies.

Within the discussion of reproducibility there are two related terms that require definition: data provenance and analytical workflow (Goodman et al., 2014). Data provenance is defined as the ‘‘sum of all processes, people and documents that were involved in the research outcome.'' Analytical workflow is defined as the ‘‘order in which the data provenance occurred.''

Depending on the discipline, other terms are used to describe the same or closely related phenomena. Gentleman and Lang (2007), applied statisticians, have proposed the notion of a research ‘‘compendium''; a document(s) encompassing all of the analytical workflow. In the computer science field, Lynch (2009) suggests the term ‘‘data curation'' to refer to the specific tasks applied directly to the data, excluding the actual empirical analysis. Curation, as so defined, is the second phase of Gray's (2009) ‘‘research lifecycle,'' composed of data capture, data curation, analysis, and visualization (publication).

Regardless of the specific nomenclature, the idea is the same: good scientific research explains all decisions made, documents all steps taken, and shares all data used to create the final results. In short, effective scientific communication under the open model requires complete transparency (Koenker and Zeileis, 2009). Reproducible research can be considered both a process and an outcome. As a process it entails full documentation of all steps in the analytical workflow combined with the open and transparent sharing of this information. Research that follows these guidelines and can be recreated with identical results by another researcher can then be considered reproducible (an outcome).

In this study, I discuss reproducibility as both a process and an outcome. I focus on reproducibility because it is the first step in developing a more robust discipline. Holding research to the reproducible standard helps eliminate errant or poorly executed research from the literature. Verification by replication is the step following reproduction. Replication aids in eliminating spurious claims and poorly designed research (Peng, 2011). In terms of verifying existing work, reproduce first, then replicate.

 

WHY REPRODUCIBLE RESEARCH?

Other than satisfying a lofty ideal, the benefits of producing reproducible research accrue to the scientific community, the general public, and to the researchers themselves. Simply put, the authors profit while creating positive externalities for their community. Donoho (2010) notes that after he introduced colleagues to the notion, they in turn spread it through their own networks. In a sense, the practice of reproducible research is contagious.

The first, and most immediate, benefit is that when preparing documentation of the data provenance authors often find errors in their own work and are able to correct them prior to publication (King, 1995; Camfield and Palmer-Jones, 2013). Developing a habit of workflow documentation can also lead to improved work habits and inter-team dynamics due to a decrease in wasted time re-doing various tasks (Donoho, 2010). Researchers also often experience an overall decrease in work effort due to efficiencies gained through reproducible practices, such as coding all data transformations as opposed to spreadsheet-based point-and-click methods (Sandve, Nekrutenko, Taylor, and Hovig, 2013). Additionally, reproducible research has been shown to garner more citations than nonreproducible work (King, 1995; Stodden, 2009). In short, reproducible practices can create more, faster, better, and higher impact research (Anderson, Greene, McCullough, and Vinod, 2008).

Much like open source software, open and reproducible research has greater potential for expansion in both continued use by others and in citations, as well as a more immediate feedback loop to enhance a researcher's own development (Aksulu and Wade, 2010). Relatedly, ‘‘open'' scales more quickly than ‘‘closed'' ever will, whether it be citations, reuse of a method or proliferation of a software product (Wilbanks, 2009).

Positive externalities from reproducible research benefit the entire community and the public at large. More accurate research, as evidenced by a lower rate of errors found in reproducible studies (Ferguson et al., 2014), enhances public trust in reported results and provides ‘‘sturdier’’ shoulders upon which future researchers can stand. Data shared through reproducible research also provides improved continuity for related research and represents an act of better stewardship towards public resources (research funding), as well as expanding access to the generation of knowledge (Donoho, 2010). Finally, openly accessible data from reproducible studies can be combined to form larger and better datasets (Ferguson et al., 2014) and/ or can be used to create large, confirmatory meta-analyses (Rey, 2014), a key method for turning individual empirical findings into acknowledged scientific theories.

REPRODUCIBILITY'S CHALLENGES AND SOLUTIONS

A recent review of reproducibility across a range of disciplines found that, depending on the field, anywhere from 2% to 20% of studies published in top journals were reproducible (Boylan, Goodwin, Mohammadipour, and Syntetos, 2015). Focusing on the economics literature, a field closely related to real estate in many ways, past research has shown similar figures in terms of reproducibility with estimates as low as 4% (Dewald, Thursby, and Anderson, 1986) and 9% (McCullough, McGeary, and Harrison, 2006) to as high as 30% (Vinod, 2001). Given the internal and external benefits resulting from reproducible research, why then the low rates of adherence to its principles?

Like all changes to the status quo, significant challenges exist. In the case of reproducible research, these challenges are both technical and institutional. From the institutional perspective, challenges can be seen as deriving from the supply side (the researchers themselves), as well as from demand factors (editors, journals, and potential reproducers) (Ahlqvist et al., 2013). Tractable solutions exist or have been proposed for most of the key challenges often levied against more stringent requirements for reproducibility across scientific disciplines.

Technical Challenges. Decades ago, prior to personal computers and high-speed telecommunications, truly reproducible research would have likely faced a number of technical difficulties. Today, as data storage and transmission are practically free and nearly instantaneous, with the exception of perhaps the very largest of datasets, many technological challenges to reproducible research have been minimized, although some do remain.

A remaining technical issue is that of changing software environments, especially within open source software. Analytical code that functions in one working environment (i.e., version, additional packages, etc.) may not work years or decades later in the same software. Nearly all open source software projects maintain archives of past releases (versions), as well as package updates. The critical component to solving this challenge is to have researchers report not only the software used, but also the complete environment in which their work was performed—version number, package names and release dates, etc.—within their data provenance (Koenker and Zeileis, 2009). With this information, many issues revolving around backwards compatibility and changes to software algorithms and functions can be avoided.
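
As a concrete illustration, the short R sketch below shows one way a researcher might capture and save this environment information; it is not taken from the example analysis later in this paper, the output file name is illustrative, and renv is named only as one of several possible package-management tools.

```r
# Record the full computing environment alongside the analysis so that it can
# be reported in the data provenance and rebuilt later.
si <- sessionInfo()   # R version, operating system, attached packages and versions
print(si)
writeLines(capture.output(si), "session_info.txt")  # illustrative file name

# Optionally, a package manager such as renv can snapshot exact package
# versions for later restoration (assuming renv is installed):
# renv::snapshot()
```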

In the case of data involving confidential subject information, the transmission of raw data may not be legally allowed due to ethics requirements (Firebaugh, 2007; Nosek, Spies, and Motyl, 2012). This situation can create significant hurdles for data sharing. However, in many instances key identifying information can be removed (King, 1995), or researchers aiming to reproduce a given study can satisfy the necessary requirements to obtain data access, the same as the original researchers did initially (Freese, 2007). Nonetheless, it is acknowledged that in some situations data sharing may simply not be legally (technically) possible and verification of results cannot proceed through the traditional method of reproduction (King, 1995).

Institutional Challenges. From an economic perspective, the lack of reproducibility in contemporary research is exacerbated by negative incentives on both the supply and demand side. Thus, the technological aspects of reproducible research may be the easier factor to overcome; it is the perverse incentive structure that is more difficult to fix (Koenker and Zeileis, 2009).

From a supply perspective, there are a number of factors that discourage or disincentivize individual researchers from practicing reproducible research. First, cleaning code and properly documenting data provenance and analytical workflow takes time, a resource that is often in short supply (LeVeque, Mitchell, and Stodden, 2012). Relatedly, for research done in spreadsheets or other ‘‘point-and-click'' software, documenting data provenance and workflow can be a painstaking process and one that cannot be done retrospectively (Asuncion, 2013). Issues with spreadsheet-based software were recently highlighted when graduate students discovered a Microsoft Excel error in the highly influential Reinhart and Rogoff (2010) paper years after its publication (Herndon, Ash, and Pollin, 2014). Sandve, Nekrutenko, Taylor, and Hovig (2013) present an elegant, if abrupt, solution: ‘‘Avoid manual data manipulation steps.'' As noted above, cleaning and preparing code and data for sharing often leads to the discovery of errors, as well as improves researcher skills that will likely save time in the future. Once understood by researchers, these two incentives should be enough to tip the scales in favor of reproducible workflow practices.
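
As a small illustration of that advice, the R sketch below scripts a simple cleaning step rather than performing it by hand in a spreadsheet; the file and column names are hypothetical and are not drawn from the example analysis later in the paper.

```r
# Scripted data preparation: re-running this file recreates the prepared data
# exactly, which a manual spreadsheet edit cannot guarantee.
raw <- read.csv("raw_sales.csv", stringsAsFactors = FALSE)   # hypothetical file

clean <- raw[!is.na(raw$sale_price) & raw$sale_price > 0, ]  # drop missing or invalid prices
clean$price_per_sqft <- clean$sale_price / clean$sqft        # derived field

write.csv(clean, "clean_sales.csv", row.names = FALSE)       # prepared data for analysis
```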

Researchers often spend considerable energy in gathering, cleaning, and preparing data for an individual study. Some estimates put the time spent on the data itself at near 80% of the total research effort (Dasu and Johnson, 2003). Due to the heavy workload ascribed to data collection and cleaning, researchers would like ‘‘patent-like'' protections to use the developed data for multiple publications (McCullough, 2009) and/or to receive other forms of recognition or rewards to make the time spent gathering data worthwhile (Freese, 2007). Data sharing as part of reproducible research generally occurs at the time of publication (or just before it), so even if data are openly shared with peer reviewers, it is unlikely that another party could beat the original author(s) to publication. To allay fears that remain, rights of first publication could be ascribed to the originating authors (King, 1995). Additionally, the growing trend of data citations—citing data as one would prior publications—offers another method for data collectors to receive proper recognition for their work (Altman and King, 2007; Mooney and Newton, 2012; Piwowar and Vision, 2013).

It is not just time-constrained authors that present challenges to the spread of reproducible research practices. The scholarly institutions themselves—journals, editors, promotion committees, and funding bodies—contribute as well. While a number of journals have instituted reproducibility requirements such as mandatory data and code sharing, the vast majority do not. In economics, 87% of journals have no such requirements (Andreoli-Versbach and Mueller-Langer, 2014). A lack of strict rules may be a conscious choice, as editors are wary of scaring off potential high-impact papers with the threat of additional administrative requirements (Anderson, Greene, McCullough, and Vinod, 2008). Editors may also wish to avoid the possible controversy that could result from having one of their published studies refuted through an unsuccessful future reproduction or replication attempt (Camfield and Palmer-Jones, 2013). Reproduction of others' work is often regarded as a ‘‘second class'' research output (Goodman et al., 2014). In short, there simply may not be much demand from journals and their editors for reproducible research and the resulting reproduction studies.

Solutions to these issues include the development and promotion of data-only journals (Andreoli-Versbach and Mueller-Langer, 2014) and journals that focus on reproduction or replication of studies2 in a given field (Anderson, Greene, McCullough, and Vinod, 2008). Research outputs such as these would offer a centralized repository for researchers looking to borrow data from others and/or to check on the verification (via reproduction or replication) of important work in their discipline.

Reproduction or replication studies could also be published alongside original contributions, thereby offering immediate verification that is obvious to the reader (Feigenbaum and Levy, 1993a). Another proposed solution is to encourage graduate students and early career researchers to focus on reproduction at the onset of their careers. This would allow inexperienced researchers to learn from existing research while performing a function necessary to the scientific community: the verification of empirical publications (Feigenbaum and Levy, 1993b; King, 1995).

As publications are the currency of academic tenure, promotion committees also play a large role in setting the reward structures, and ultimately the behaviors, of researchers. Fierce competition among junior researchers may lead to a reduction in the willingness to share data, as evidenced by research showing that tenured researchers are more likely to share data than those who are nontenured (Andreoli-Versbach and Mueller-Langer, 2014). As noted above, the demand by journals for reproduction studies is often low, and what does not get published will not help build a case for tenure. A solution would be for tenure committees to expand the set of research outputs that qualify toward promotion. In the case of data and code, this may mean lowering the threshold of the minimum publishable unit (Hannay, 2009) or, at the very least, developing a mechanism to better capture the researcher's overall contribution to knowledge in their field, not just publication counts. Full adoption of comprehensive data citations would also be helpful in this regard.

Finally, funding sources should require that all data, code, and results from publicly funded research be shared and presented in a reproducible manner. Many of the largest agencies, such as the National Science Foundation (U.S.), the European Science Foundation, and the Australian Research Council, have already adopted stringent policies on this matter. Smaller government, private, and academic agencies should be encouraged to do the same.

While there are many challenges to increasing the prevalence of reproducible research in the applied sciences, most are not without logical and, generally, low-cost solutions. Given current technology, few technical issues in promoting reproducible research cannot be overcome. Rather, it is the institutional challenges that loom larger. Changing academic culture is difficult, and if internal movements by a concerned minority are not enough to bring about wholesale change, government and larger institutions may need to lead by supplying education on the benefits of reproducibility and continuing to provide ‘‘reproducible-only'' funding incentives (Barnes, 2010).

Real Estate-Specific Issues

Most empirical studies in real estate are highly quantitative, contain lengthy analytical workflows, and depend heavily on well-documented data provenance. As a result, nearly all of the challenges and solutions mentioned above are relevant to some degree in the field of real estate. In this section, I begin by noting a number of additions or expansions on the previous ideas as they apply specifically to real estate. This is followed by an examination of the current state of practice in top empirical real estate journals with regard to reproducible research.

 

ADDITIONAL CHALLENGES

A major challenge to reproducibility in real estate research is proprietary data. Real estate is an observational discipline; the data ‘‘simply exist'' (Azzalini and Scarpa, 2012). More observations cannot be created through another laboratory trial. As a result, the field relies heavily on industry professionals for access to data. Within the real estate industry, information is power, and data collectors are often unwilling to part with data unless it is specifically licensed and, often, duly paid for. Under such a regime, researchers may not be able to share data given the existing licensing agreements. Without data sharing, few if any studies can conceivably be reproduced.

The existence of proprietary data does not, however, invalidate all attempts at practicing reproducible research in the real estate discipline. Even if data cannot legally be shared, researchers themselves own the code, data provenance, and methods used to transform the raw data into final results; the entirety of which can and should be shared either in the appendices or, preferably, in an online repository. If everything but the data were available, the simplest solution to reproducibility would be to allow potential reproducers to enter into similar licensing agreements with data providers. Understandably, the costs of this action will likely decrease the attractiveness of reproducing existing work. Another option may be for the original authors to negotiate licensing agreements such that limited data sharing with publication editors and/or reviewers is allowable. While this would not create truly open research, it would, at the very least, allow reviewers to verify existing results. Adding hurdles to data access could prove troublesome, and this limited data-sharing condition would need to be encouraged or required by the journals themselves. When available and appropriate, using open data published by government entities can side-step the issue of proprietary data; however, not all real estate data products are monitored and recorded by public agencies, although in some areas, such as the open release of property transactions and characteristics by county assessors, this is a growing trend. Overall, the lack of outright data ownership remains a persistent issue in real estate research, especially with regard to reproducibility, but it does not mean that good scientific practice can be ignored.

Within real estate research, not all data require the same amount of cleaning, compilation, and manual additions prior to final analysis. High-quality, secondary data from a reputable tax assessment office or a large data vendor may necessitate very little extra work prior to empirical analysis. It is nearly ‘‘plug and play.'' Similarly, the research question(s) may be such that additional fields and/or data transformations are not needed and the raw data are appropriate for the task. Prepared data, which are the data used in the final empirical analysis, fitting these criteria can be considered low value-added (LVA) data. Conversely, when secondary data are dirty, when multiple data sources must be combined through a time-consuming process, and/or when significant primary data collection is required to address the research question, the final data product may be considered high value-added (HVA) data.

Naturally, a researcher or team of researchers may be much more likely to share a set of LVA data than a corresponding HVA dataset that required hundreds of hours of cleaning, compilation, and additional primary data collection. Studies examining proximity effects due to amenities or disamenities, as well as research into the marginal impacts of property characteristics not commonly collected by traditional data sources, such as view-sheds or a building's sustainability rating, represent examples of HVA data. There is little discussion of this variation in data-preparation effort in the existing reproducibility and replicability literature surveyed above. While the slow growth of data citations offers one method for researchers to reap the rewards of producing HVA data, it may not be enough in cases where significant effort was put forth in the data preparation phase. The current data citation process also has insufficient mechanisms to separate HVA from LVA data.

Because real estate is a highly applied field in which many researchers have one foot in academia and one in industry, industry-based customs can dominate research training. One such custom is the preference for storing, managing, and analyzing data in spreadsheet form. As noted above, tracking data provenance and analytical workflow done in spreadsheets or other point-and-click software can create notable difficulties when practicing reproducible research. Educational practices at the Ph.D. and master's level should emphasize the importance of coding and documentation in high-quality research with the hope that, in the long run, these practices will spill over into industry.

A final challenge not entirely unique to real estate, but certainly inherent in it, is the issue of heterogeneity over space and time in both data standards and the observations themselves. As real estate data, especially property level data (PLD), such as transactional information and structural characteristics of individual buildings, are often collected locally, few standards exist across space and time. Additionally, physical real estate exists in a perpetual state of change through new construction, remodeling, and demolition.

Keeping up with spatial and temporal differences in data, and with changes in the observations themselves, requires complex data management techniques, such as the use of geographic information systems (GIS) and relational database management systems (RDBMS). Highly technical solutions like GIS and RDBMS make data and code sharing more difficult, as these software packages are often extremely expensive and heavily dependent on the local computing environment (computation power, path structure, operating systems, and software versions). While certainly not insurmountable, these issues do create additional hurdles for reproducible research in real estate.
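
One partial remedy, sketched below in R under hypothetical file and field names, is to script GIS operations in open-source tools such as the sf package, so that the spatial portion of the workflow can be shared and re-run without a desktop GIS license.

```r
# Scripted spatial processing in place of a manual desktop-GIS step.
library(sf)

parcels  <- st_read("parcels.shp")          # parcel polygons with sale data (hypothetical)
viewshed <- st_read("water_viewshed.shp")   # polygons of water-view areas (hypothetical)

# Flag parcels that intersect the water view-shed
parcels$water_view <- lengths(st_intersects(parcels, viewshed)) > 0

st_write(parcels, "parcels_with_views.gpkg")  # share the derived layer, not the clicks
```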

JOURNAL POLICIES

As a number of commentators have mentioned, the existing policies of journals in a field can shape researcher behavior (Hamermesh, 2007; Andreoli-Versbach and Mueller-Langer, 2014). While reproducible research entails a variety of practices, the act of data sharing is one of the most critically important. Data sharing is also one area where official journal policy can have a large impact on reproducibility across the discipline. To gauge the current state of journal policies with regard to data sharing and reproduction research, I collected official submission policies from 16 top empirically-focused journals in the real estate field (Exhibit 1).

For each journal, I gathered the complete list of instructions to authors and publication guidelines stated on the official website. Guidelines were carefully analyzed to determine which policies, if any, regarding data sharing are explicitly stated or generally implied. Of the 16 journals, only two, Land Economics and The Journal of Real Estate Finance and Economics, have a formal policy regarding data sharing.3 In both cases, authors of published articles in these journals must agree to share data with other parties upon request. Mandatory public data hosting and workflow documentation at the onset of publication are not required. Additionally, the International Real Estate Review notes that ‘‘empirical papers that cannot be replicated are discouraged''; however, no strict rule forbids publication of non-replicable work, nor does a data-sharing policy exist.

As journal websites may be out-of-date and editors may follow a set of unpublished working rules, I have also attempted to contact the editors of each of these journals to determine: (1) whether the current policy on the website is up-to-date; and (2) the volume of reproduction or replication studies submitted for publication each year. Of the nine (of 16) editors who responded, all noted that the current policies are up-to-date, with one journal considering adding a data-sharing policy in the future. Editors also reported that rarely, if ever, are reproduction or replication studies submitted to their journals.

A Reproducible Example

To provide an example of how reproducible real estate research can work, I have created an analysis of the marginal impact of different view types on single-family home prices in King County, Washington. Every step of the analysis, from raw data to final regression results, as well as all data used to reach these results (the full data provenance), is hosted and freely available online. Below, I explain where the necessary data and code are hosted, describe the broad steps taken to get from raw data to finished analysis, and provide a short summary of the empirical results. Detailed explanations of the data, code, and analytical process are found in the documentation files hosted along with the analytical code.

All steps in the analytical workflow are coded in R and available for download from a GitHub repository at: https://github.com/andykrause/ReproducibleRealEstate. A full description of the steps necessary to download and execute the code to reproduce this analysis in full is available in the ReadMe.md file on the home page of the repository. A conceptual overview of the analytical workflow necessary to reproduce the results is shown in Exhibit 2.

Raw data for the example analysis come from the King County Assessor's office and King County GIS Department. Due to their large size, the necessary files to reproduce this analysis, along with the available metadata, are hosted on Harvard's Dataverse Network. The permanent link to the data can be found at: http://dx.doi.org/10.7910/DVN/RHJCNC. New users to Dataverse may need to register for a free account prior to accessing the data.
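
As one possible way to script this download step, the sketch below uses the dataverse R package; the file name shown is hypothetical, and the actual file list should be taken from the Dataverse entry itself.

```r
# Programmatic access to the hosted data (a sketch; the file name is hypothetical).
library(dataverse)
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

ds <- get_dataset("doi:10.7910/DVN/RHJCNC")   # dataset metadata, including its files
ds$files                                      # inspect the available files

raw <- get_file("example_file.csv", dataset = "doi:10.7910/DVN/RHJCNC")
writeBin(raw, "example_file.csv")             # write the downloaded bytes to disk
```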

Additionally, an interactive web application has been developed to allow readers and interested researchers to assess the validity of the results under different data and model assumptions. Hosted using RStudio's Shiny Apps service, this application is freely available to all. In this manner, the results of this work are available not only to seasoned researchers with a moderate level of knowledge of the R computing language but to any potentially interested reader, be it a graduate student, politician or homeowner in the area. The interactive application can be found at: http://reproduciblerealestate.shinyapps.io/reproducibleRealEstate.
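
To give a sense of how such a tool is structured (this is not the code of the hosted application), a minimal Shiny sketch follows; it assumes a prepared data frame named sales with sale_price and sqft columns and simply re-estimates a toy hedonic model on a user-filtered sample.

```r
# A toy interactive model explorer: a slider filters the sample by sale price
# and an OLS hedonic model is re-estimated on the filtered data.
library(shiny)

# `sales` is assumed to exist: a prepared data frame with sale_price and sqft.
ui <- fluidPage(
  sliderInput("max_price", "Maximum sale price:",
              min = 100000, max = 2000000, value = 1000000, step = 50000),
  verbatimTextOutput("model_summary")
)

server <- function(input, output) {
  output$model_summary <- renderPrint({
    sub <- sales[sales$sale_price <= input$max_price, ]
    summary(lm(log(sale_price) ~ log(sqft), data = sub))
  })
}

shinyApp(ui, server)
```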

Exhibit 2. Conceptual Overview of the Analytical Workflow

EMPIRICAL EXAMPLE

In the example empirical study, a hedonic price model is estimated to gauge the impact of a variety of different view types and levels on single-family home prices in King County, Washington (home to Seattle and Bellevue). The econometric model is purposefully kept simple as the intent of this analysis is to highlight the process of reproducible research, not to actually produce the most appropriate model(s) for addressing the issue of price premiums related to views in the region. Nonetheless, an attempt was made to provide a realistic example and one whose interactive results, as hosted on the Shiny Apps page, are meaningful in some context.

The processed data used in the analysis represent 22,000 single-family home sales (after data cleaning) in King County that occurred during 2014. I begin by estimating a standard ordinary least squares (OLS) regression model with the three view types—water, mountain, and other—as the variables of interest. The residuals of the OLS model show very high levels of spatial autocorrelation, as evidenced by a positive and highly significant Moran's I value. Spatial autocorrelation in the residuals is a common finding in housing price models (Koschinsky, Lozano-Gracia, and Piras, 2012). Conducting a robust Lagrange multiplier test shows the dominant form of spatial dependence to be dependence in the error term. I then estimate a spatial error model to remedy the error dependence. Results from this spatial error model are shown in Exhibit 3.
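
The estimation sequence just described can be sketched in R as follows; the variable and column names are illustrative rather than those used in the hosted repository, and k = 10 nearest neighbors is an arbitrary choice for constructing the spatial weights.

```r
# Sketch of the modeling sequence: OLS, spatial diagnostics, spatial error model.
library(spdep)       # neighbors, weights, Moran's I, LM tests
library(spatialreg)  # spatial error model estimation

# 1. Baseline OLS hedonic model (illustrative specification)
ols <- lm(log(sale_price) ~ log(sqft) + view_water + view_mtn + view_other,
          data = sales)

# 2. Spatial weights from property coordinates (k-nearest neighbors)
coords <- cbind(sales$longitude, sales$latitude)
nb <- knn2nb(knearneigh(coords, k = 10))
lw <- nb2listw(nb, style = "W")

# 3. Diagnostics: Moran's I on the OLS residuals, then robust LM tests to
#    identify the dominant form of spatial dependence
lm.morantest(ols, lw)
lm.LMtests(ols, lw, test = c("RLMerr", "RLMlag"))

# 4. Spatial error model if error dependence dominates
sem <- errorsarlm(formula(ols), data = sales, listw = lw)
summary(sem)
```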

Additionally, an expanded model is estimated using the view quality levels (1–4) instead of simple binary variables for each view type. Similarly, this OLS model showed high levels of spatial autocorrelation caused by dependence in the error term, which is addressed with a spatial error model. Overall, the results of this simple example suggest that water views are much more highly prized than mountain and other view types (such as city skyline) within King County, WA.4

Conclusion

Reproducible research and the larger open science initiative have a strong following in many of the natural and life science disciplines. Indeed, in fields like bioscience, they have become the status quo. While there have certainly been growing pains, the conversion to an open model of science holds promise for elevating not only the quality and speed of scientific research, but also its transparency and its standing in the eyes of the public. Progress has been made; however, many fields in the social sciences remain wedded to traditional models of ‘‘closed'' (Sui, 2014) or ‘‘captured'' (Rey, 2014) science.

As this research highlights, a truly open model of reproducible research in the real estate discipline is not without challenges, some of them significant. However, as funding agencies and the general public continue to demand more transparency from research activities, and as data-driven scientific endeavors become more commonplace (Gray, 2009), it is likely not a matter of ‘‘if'' but one of ‘‘when'' some measure of reproducibility will be a requirement in our field. Although the practices of most (13 out of 16) of the top real estate journals do not currently address issues of data sharing and reproducibility, at least one editor has mentioned that changes are coming. As leading voices in the field and educators of the next generation of industry analysts, post-doctoral researchers, and future academics, we must recognize that the change toward more reproducible research in real estate starts not only with our own work but also in the classroom and in the supervisory meetings we have with students.

More broadly, we as researchers have been conditioned to think of publication as the end of the research process. Rather, as Jon Claerbout has proclaimed (Stodden, 2009; Mesirov, 2010), the publication is merely an advertisement; the true research is the entire analytical workflow and data provenance, which lives on long after the ink on publications has dried. In our world, however, it is the advertisement that propels careers, garners funding, and drives salaries. Structural changes to our academic institutions will be needed to correct this systemic flaw.

Finally, this work is, to my knowledge, the first to demonstrate and document a fully reproducible empirical analysis in the real estate literature. It also demonstrates a set of freely available resources for hosting data and code and for providing interactive applications for working with model results. I show here that, using only freely available data and websites, the entire analytical workflow for a moderately complex empirical analysis can easily be hosted and made available for any interested reader to reproduce, replicate or re-analyze.

Endnotes

  1. Note that reproduction is occasionally referred to as ‘‘pure replication,’’ with replication termed ‘‘statistical replication’’ (Hamermesh, 2007; Camfield and Palmer-Jones, 2013).
  2. Millo (2014) offers an example of a recent replication study in the real estate literature.
  3. As a helpful reviewer pointed out, these requirements can be waived in the case of proprietary data.
  4. Readers are invited to examine the sensitivity of the results to changes in the data inclusion parameters and the model specification via the interactive web application described above. Note that the development of alternative spatial weights matrices and estimation of the spatial error model can be time consuming when examining large datasets (greater than a few thousand observations).

References

Ahlqvist, O., F. Harvey, H. Ban, W. Chen, S. Fontanella, M. Guo, and N. Singh. Making Journal Articles ‘‘live’’: Turning Academic Writing into Scientific Dialog. GeoJournal, 2013, 78:1, 61–8.

Aksulu, A. and M. Wade. A Comprehensive Review and Synthesis of Open Source Research. Journal of the Association for Information Systems, 2010, 11:11, 576–656.

Altman, M. and G. King. A Proposed Standard for the Scholarly Citation of Quantitative Data. D-lib Magazine, 2007, 13:3/4, 1082–9873.

Anderson, R.G., W.H. Greene, B.D. McCullough, and H.D. Vinod. The Role of Data/ Code Archives in the Future of Economic Research. Journal of Economic Methodology, 2008, 15:1, 99–119.

Andreoli-Versbach, P. and F. Mueller-Langer. Open Access to Data: An Ideal Professed but not Practised. Research Policy, 2014, 43:9, 1621–33.

Asuncion, H.U. Automated Data Provenance Capture in Spreadsheets, with Case Studies. Future Generation Computer Systems, 2013, 29:8, 2169–81.

Azzalini, A. and B. Scarpa. Data Analysis and Data Mining: An Introduction. Oxford University Press, 2012.

Barnes, N. Publish your Computer Code: It Is Good Enough. Nature, 2010, 467:7317, 753.

Boylan, J.E., P. Goodwin, M. Mohammadipour, and A.A. Syntetos. Reproducibility in Forecasting Research. International Journal of Forecasting, 2015, 31:1, 79–90.

Camfield, L. and R. Palmer-Jones. Three ‘‘Rs’’ of Econometrics: Repetition, Reproduction and Replication. Journal of Development Studies, 2013, 49:12, 1607–14.

Carey, B. Many Psychology Findings Not as Strong as Claimed, Study Says. The New York Times. Retrieved from: http://www.nytimes.com/2015/08/28/science/many-social-science-findings-not-as-strong-as-claimed-study-says.html?r=0. 2015.

Crick, T., B.A. Hall, and S. Ishtiaq. Can I Implement Your Algorithm? A Model for Reproducible Research Software. arXiv preprint arXiv:1407.5981, July 22, 2014.

Dasu, T. and T. Johnson. Exploratory Data Mining and Data Cleaning. Volume 479. John Wiley & Sons, 2003.

Dewald, W.G., J.G. Thursby, and R.G. Anderson. Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. The American Economic Review, 1986, 587–603.

Donoho, D.L. An Invitation to Reproducible Computational Research. Biostatistics, 2010, 11:3, 385–88.

Feigenbaum, S. and D.M. Levy. The Market for (Ir)Reproducible Econometrics. Social Epistemology, 1993a, 7:3, 215–32.

——. Protocol for Student Replication of Published Research. Accountability in Research, 1993b, 3:1, 19–24.

Ferguson, A.R., J.L. Nielson, M.H. Cragin, A.E. Bandrowski, and M.E. Martone. Big Data from Small Data: Data-sharing in the ‘‘Long Tail’’ of Neuroscience. Nature Neuroscience, 2014, 17:11, 1442–47.

Firebaugh, G. Replication Data Sets and Favored-Hypothesis Bias: Comment on Jeremy Freese (2007) and Gary King (2007). Sociological Methods & Research, 2007, 36:2, 200–09.

Fomel, S. and J.F. Claerbout. Reproducible Research. Computing in Science & Engineering, 2009, 11:1, 5-7.

Freese, J. Overcoming Objections to Open-source Social Science. Sociological Methods & Research, 2007, 36:2, 220–26.

Gentleman, R. and D.T. Lang. Statistical Analyses and Reproducible Research. Journal of Computational and Graphical Statistics, 2007, 16:1, 1–23.

Goodman, A., A. Pepe, A.W. Blocker, C.L. Borgman, K. Cranmer, M. Crosas, R. Di Stefano, Y. Gil, P. Groth, M. Hedstrom, and others. Ten Simple Rules for the Care and Feeding of Scientific Data. PLOS Computational Biology, 2014, 10:4, e1003542.

Gray, J. eScience: A Transformed Scientific Method. In A.J.G. Hey, S. Tansley, and K.M. Tolle (eds.). The Fourth Paradigm. Preface. Microsoft Research: Redmond, WA, 2009.

Hamermesh, D.S. Viewpoint: Replication in Economics. Canadian Journal of Economics, 2007, 40:3, 715–33.

Hannay, T. From Web 2.0 to the Global Database. In A.J.G. Hey, S. Tansley, and K.M. Tolle (eds.). The Fourth Paradigm. Microsoft Research: Redmond, WA, 2009.

Herndon, T., M. Ash, and R. Pollin. Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. Cambridge Journal of Economics, 2014, 38: 2, 257–79.

Johnson, C. In Science, Irreproducible Research is a Quiet Crisis. Boston Globe. Retrieved from: https://www.bostonglobe.com/ideas/2015/03/19/science-irreproducible-research-quiet-crisis/xunxnfuzwdwYSpVjkx2iQN/story.html. 2015.

King, G. Replication, Replication. PS: Political Science & Politics, 1995, 28:3, 444–52.

Koenker, R. and A. Zeileis. On Reproducible Econometric Research. Journal of Applied Econometrics, 2009, 24:5, 833–47.

Koschinsky, J., N. Lozano-Gracia, and G. Piras. The Welfare Benefit of a Home’s Location: An Empirical Comparison of Spatial and Non-spatial Model Estimates. Journal of Geographical Systems, 2012, 14:3, 319–56.

LeVeque, R.J., I.M. Mitchell, and V. Stodden. Reproducible Research for Scientific Computing: Tools and Strategies for Changing the Culture. Computing in Science and Engineering, 2012, 14:4, 13.

Lynch, C. Jim Gray's Fourth Paradigm and the Construction of the Scientific Record. In A.J.G. Hey, S. Tansley, and K.M. Tolle (eds.). The Fourth Paradigm. Microsoft Research: Redmond, WA, 2009.

McCullough, B.D. Open Access Economics Journals and the Market for Reproducible Economic Research. Economic Analysis and Policy, 2009, 39:1, 117–26.

McCullough, B.D., K.A. McGeary, and T.D. Harrison. Lessons from the JMCB Archive. Journal of Money, Credit, and Banking, 2006, 38:4, 1093–1107.

Mesirov, J.P. Accessible Reproducible Research. Science, 2010, 327:5964, 415–16.

Millo, G. Narrow Replication of ‘‘A Spatio-Temporal Model of House Prices in the USA’’ using R. Journal of Applied Econometrics, 2014, 30:4, 703–04.

Mooney, H. and M. Newton. The Anatomy of a Data Citation: Discovery, Reuse, and Credit. Journal of Librarianship and Scholarly Communication, 2012, 1:1, eP1035.

Nosek, B.A., J.R. Spies, and M. Motyl. Scientific Utopia II. Restructuring Incentives and Practices to Promote Truth over Publishability. Perspectives on Psychological Science, 2012, 7:6, 615–31.

Peng, R.D. Reproducible Research in Computational Science. Science, 2011, 334:6060, 1226.

Piwowar, H.A. and T.J. Vision. Data Reuse and the Open Data Citation Advantage. PeerJ, 2013, 1:e175.

Reinhart, C.M. and K.S. Rogoff. Growth in a Time of Debt. American Economic Review, 2010, 100:2, 573–78.

Rey, S.J. Open Regional Science. The Annals of Regional Science, 2014, 52:3, 825–37.

Sandve, G.K., A. Nekrutenko, J. Taylor, and E. Hovig. Ten Simple Rules for Reproducible Computational Research. PLOS Computational Biology, 2013, 9:10, e1003285.

Royal Society. Science as a Public Enterprise: Opening Up Scientific Information. Technical Report. Royal Society: London, 2011.

Stodden, V. The Legal Framework for Reproducible Scientific Research. IEEE Computing in Science and Engineering, 2009, 11:1, 35–40.

Sui, D. Opportunities and Impediments for Open GIS. Transactions in GIS, 2014, 18:1, 1–24.

The Economist. Trouble at the Lab. Retrieved from: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble. 2013.

Vinod, H.D. Care and Feeding of Reproducible Econometrics. Journal of Econometrics, 2001, 100:1, 87–8.

Wilbanks, J. I Have Seen the Paradigm Shift, and It Is Us. In T. Hey, S. Tansley, and K.M. Tolle (eds.). The Fourth Paradigm. Microsoft Research: Redmond, WA, 2009.

Yamada, K.M. and A. Hall. Reproducibility and Cell Biology. The Journal of Cell Biology, 2015, 209:2, 191–93.

Yong, E. How Reliable are Psychology Studies? The Atlantic. Retrieved from: http://www.theatlantic.com/science/archive/2015/08/psychology-studies-reliability-reproducability-nosek/402466/. 2015.
