Real estate data are often dirty. This ‘‘dirt’’ manifests in various ways. Some of the most obvious cases are attributable to simple keypunch errors, such as when a real estate agent enters incorrect information into a Multiple Listing Service (MLS) form. Other dirt arises from non-arm’s length transactions: real estate transactions whose agreed-upon sales price may be artificially higher or lower due to some kindred connection between the buyer and seller, concessions that are atypical for the market, etc. In real estate classroom discussions, very little if any time is spent on studying the techniques used to prepare, combine, manage, and clean real estate data for quantitative analysis. The real estate discipline is not alone here. Although data preparation is estimated to consume anywhere from 30% to 80% of the overall research effort on any given project, few resources devoted to the topic exist in any field (Dasu and Johnson, 2003, p. 99).
In this paper we provide guidance on data preparation for real estate data, with a particular focus on property-level data used in quantitative analyses. Examples of the data preparation techniques covered below include issues of standardization, data errors, missing data, mislabeled data, and outliers. Techniques range from the relatively simple (filtering out an observation of an ordinary home sale with a price of $100,000,000) to the complex (imputing missing values with logistic regression modeling).
We first present a vocabulary for describing data preparation. We then discuss the general nature of real estate data, specifically property-level data, and the unique characteristics of real estate data that create difficulties both in performing research and in documenting the analytical workflow. Next we cover issues related to managing and combining data, collectively referred to as data integration, followed by a discussion of data cleaning. We then analyze the extent to which analytical workflows and data preparation processes are reported in a sample of published studies from the Journal of Real Estate Research (JRER), Real Estate Economics (REE), and the Journal of Real Estate Finance and Economics (JREFE) over the past three decades. We conclude with a summary of our findings and suggestions for moving the real estate discipline forward in terms of reporting and teaching the practice of data preparation and the recording of data provenance.
A Data Vocabulary
The acts of cleaning, combining, and managing data form a critical component of empirical research projects. Despite being a key step, these processes remain under-discussed in both the literature and the classroom. This shortcoming may be due in part to the fact that there is not a common language or ‘‘philosophy of data’’ (Wickham, 2014, p. 2) to assist in communicating the data preparation process to readers or students. As a result, it is helpful to first present a data lexicon. We have gathered and synthesized the following terms from a range of associated disciplines including computer science, statistics, and biology.
‘‘Analytical workflow’’ is defined as all the steps that gather and transform raw data into final results (Wolkovich, Regetz, and O’Connor, 2012). In essence, this refers to every step of the research process from the development of the research question through to the publication of the results, and possibly beyond. Increasing transparency in the reporting of analytical workflows in published research is a central tenet of the burgeoning ‘‘open science’’ movement (Sandve, Nekrutenko, Taylor, and Hovig, 2013). Within this definition of analytical workflow, the specific operations that were required to acquire, integrate, and clean the data itself are referred to as the ‘‘data provenance’’ (Royal Society, 2011; Asuncion, 2013; Goodman et al., 2014). As such, data provenance is a thing, not an action, and represents the list of previous actions applied to a given set of data. In other words, it is an instruction manual on getting from raw data to final results.
Throughout this paper we refer to the entire research process as the ‘‘analytical workflow,’’ the active process of gathering, combining, and cleaning data as the data preparation process (DPP) (verb) and the recorded past operations on the data as the data provenance (noun). Additionally, we approach the data preparation process from the standpoint of developing data for use in price or rent modeling analyses, although most, if not all, of the topics we cover would apply to the wider variety of other uses to which data are often put in the course of real estate research.
Within the analytical workflow there are two broad processes: data integration and data cleaning. Data integration refers to the acquisition, storage, combination, and standardization of data such that it can be accessed uniformly (Halevy, 2001). Data cleaning, on the other hand, involves reducing the data down to the proper set of observations via the removal of errors and inconsistent data points (Rahm and Do, 2001). Generally speaking, data integration occurs prior to data cleaning; however, in practice the entire data preparation process is highly iterative and insights from data cleaning may necessitate additional data integration exercises.
The Nature of Property-Level Real Estate Data
In the public media, real estate data are often cast as macro-statistics, such as metropolitan-level median sales prices or rental indices. These macro-scale data are generated by researchers and analysts from underlying micro-scale data (e.g., individual residential or commercial sales transactions). The focus of the discussion in this paper is on micro-level data and, specifically, on property-level data (PLD) in which the observations are recorded and analyzed at the property or the transaction (of a property) level. PLD are often collected by various local and regional firms or government agencies, and some of these data are then assembled by ‘‘data aggregators’’ (e.g., CoreLogic, RealtyTrac, CoStar, etc.). These data usually include basic information on the transaction and the physical property itself, as well as complementary spatial and location data such as census statistics for the area in which a property is located, road network data, and other points of interest.
Various entities make property-level real estate data available. Most areas in the U.S. are covered by one or more Multiple Listing Service (MLS) providers, which are repositories of properties that are currently listed for sale or were listed at some point in the past. International equivalents exist across many countries. MLS data providers often charge a fee to real estate agents, as well as appraisers and researchers, who choose to subscribe to the MLS. In addition to MLS providers, local governments often maintain real estate data. Qpublic.net is a popular and free website that provides a geographic information system (GIS) interface and searchable database for local tax assessor data in several U.S. states. Within Australia, the Valuer-General’s office in each state holds property-level databases for its jurisdiction. Access to government-owned PLD can vary widely from completely free and open to very expensive and difficult to procure.
Additionally, there are third-party vendors (e.g., Gold Imaging, CoreLogic, Australian Property Monitors) that collect government-issued real estate PLD, standardize them, and resell them for a fee that depends on the amount of data purchased. Finally, the last few years have seen the rise of consumer-facing real estate information websites such as Zillow.com in the U.S., Domain.com.au in Australia, Rightmove.co.uk in the U.K., and Funda.nl in the Netherlands, from which researchers may gather PLD freely, though often slowly.
Property-level real estate data have a number of interesting features that can make the data preparation process (DPP) and the overall empirical analysis especially difficult. First, real estate PLD are highly dynamic. The physical properties themselves are constantly changing as structures are built, renovated, and demolished. Building permits may provide a good source of information on these changes, though not all activities may be reported. Lot boundaries likewise shift as lots are subdivided (plattage) or merged (plottage). The frequency of these changes renders the DPP difficult at times. Additionally, sales transactions occur continuously and, therefore, datasets are never truly complete or completely up-to-date. And, in the real estate profession, the most current data are generally the most useful.
Real estate PLD are also, by their very nature, spatial. Properly managing spatial data requires the use of GIS and/or advanced database systems, and with these come a host of potential hurdles, such as gaining access to GIS shapefiles (electronic representations of space), dealing with map projections, and properly tracking spatial changes to lot boundaries over time.
Another idiosyncrasy of transactional PLD is that most real estate transactions are imbued with contextual factors that may have influenced the final transaction price, but are difficult to properly identify and measure in the data. Examples include seller (lessor) or buyer (lessee) concessions, personal property conveyed in the sales transaction, and/or other contractual contingencies. Non-arm’s length transactions represent another contextual challenge in managing transactional PLD.
Finally, in the real estate industry information is power and, therefore, data can be closely guarded and difficult and/or highly expensive to access. When proprietary data are granted to researchers, they are often licensed such that the permitted uses and permitted users are very limited. In such cases, additional collaborations and participation in full and open sharing of the entire analytical workflow, including the data (as prescribed by ‘‘open science’’), can prove difficult or impossible.
Constructing operable PLD often involves combining a wide variety of fields or variables from different sources. The assemblage process is a key component of a research project’s data management scheme. A number of common issues present themselves during data assembly and integration including, but not limited to: (1) inconsistent levels of observation; (2) non-matching unique identifiers; (3) temporal mismatches between datasets; (4) lack of field standardization; and/or (5) conflicting observations from different sources.
INCONSISTENT LEVELS OF OBSERVATION
For research focused on analyzing a measure of price or rent, such as hedonic modeling, the basic level of observation for the core data is the transaction. In other words, each observation in the data represents a single transaction—a sale, a listing or lease—with the price or rent as the dependent variable. Depending on the data source, this initial transaction database may include only information on the transaction itself, such as the date of the transaction, the type of legal instrument used in the transaction, and the parties involved. Other common information used in pricing analyses such as structural, locational (accessibility), and neighborhood characteristics (Munneke, 1996; Case, Pollakowski, and Wachter, 1997; Gordon, Winkler, Barrett, and Zumpano, 2013) are often gathered from other data sources and created or assembled by the researcher.
Structural information on the property itself is generally available at the same level of observation (the property level) as the initial sales transaction data.1 Locational data such as latitude/longitude and ‘‘proximity to’’ variables such as distances to the central business district (CBD) or major employment center are also commonly available or created at the property level. Neighborhood data, on the other hand, have a coarser spatial scale. For example, measures of the neighborhood such as demographics and socioeconomic variables from a government entity, like the U.S. Census Bureau or Australian Bureau of Statistics (ABS), cover broader areas, usually block groups or tracts. These data have a defined spatial extent much larger than a single transaction. School district test scores and municipal tax rate information are other commonly used variables with large spatial extents.
Properly integrating complex datasets with different levels of observation is best accomplished through a relational database management system (RDBMS). Microsoft Access offers a standard and easy-to-use RDBMS that operates well at a small- to medium-sized scale, one that likely suffices for most price modeling research. Users dealing with much larger datasets or those demanding extra security may consider one of the more functional products built around structured query language (SQL) such as Microsoft SQL Server, MySQL, PostgreSQL, or an Oracle product.2 Use of an RDBMS product can greatly simplify the data integration process, especially in instances where neighborhood or other data at inconsistent levels of observation will be updated or changed over time.
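As a minimal sketch of this kind of integration, the following Python example uses the standard-library sqlite3 module as a stand-in for any of the RDBMS products named above. All table names, column names, and values are hypothetical and chosen only for illustration. Transactions, properties, and census tracts are stored at their own levels of observation and combined only at query time.

```python
import sqlite3

# In-memory database; every table, column, and value here is hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tract (tract_id TEXT PRIMARY KEY, median_income REAL);
CREATE TABLE property (pin TEXT PRIMARY KEY,
                       tract_id TEXT REFERENCES tract(tract_id),
                       gla_sqft REAL);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                   pin TEXT REFERENCES property(pin),
                   sale_date TEXT, price REAL);
INSERT INTO tract VALUES ('T1', 62000), ('T2', 48000);
INSERT INTO property VALUES ('P-001', 'T1', 1800), ('P-002', 'T1', 2400),
                            ('P-003', 'T2', 1500);
INSERT INTO sale VALUES (1, 'P-001', '2013-04-02', 310000),
                        (2, 'P-002', '2014-07-15', 405000),
                        (3, 'P-003', '2014-09-30', 198000),
                        (4, 'P-001', '2015-01-20', 335000);
""")

JOIN_SQL = """
FROM sale AS s
JOIN property AS p ON p.pin = s.pin
JOIN tract AS t ON t.tract_id = p.tract_id
"""

# One row per transaction, with property- and tract-level fields attached.
rows = con.execute(
    "SELECT s.sale_id, s.price, p.gla_sqft, t.median_income"
    + JOIN_SQL + "ORDER BY s.sale_id"
).fetchall()

# Updating the neighborhood table once changes the value seen by every linked sale.
con.execute("UPDATE tract SET median_income = 65000 WHERE tract_id = 'T1'")
updated = con.execute(
    "SELECT COUNT(*)" + JOIN_SQL + "WHERE t.median_income = 65000"
).fetchone()[0]
```

Because the tract-level value lives in one place, a later revision to the neighborhood data does not require touching the transaction records themselves.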
NON-MATCHING UNIQUE IDENTIFIERS
Combining data from disparate providers requires matching observations based on a unique identifier, known as a key. Depending on the data sources gathered, the unique identifiers at the observation or individual sales transaction level may not match. For instance, structural information from a government entity, such as a county tax assessor, will likely use a parcel identification number (PIN) as a unique identifier for each property. Data from a private party or MLS tend to favor the use of physical addresses or their own internal unique identifiers. Matching observations, a process known as record linkage (Fellegi and Sunter, 1969; Brizan and Tansel, 2006), may prove difficult if no common field or key can be found between the various source datasets.
Even when both source datasets contain address information, the addresses may not be standardized and finding a direct match via an automated process may only be partially successful. Breaking addresses into component parts and matching on the parts can suffice in some instances (Randall, Ferrante, Boyd, and Semmens, 2013). A number of statistical software packages like Excel (Microsoft 2015), Stata (Blasnik, 2010), SAS (SAS Institute), and R (van der Loo, 2014) offer options to use fuzzy matching techniques to overcome non-standardized addresses or other identifying fields. GIS may also be used when no linking fields are available and matches are attempted via location alone. While geocoding procedures—the assignment of direct latitude and longitude values to addresses—have become quite advanced and standardized, caution should be taken as many algorithms have unknown or unreported error rates (Bichler and Balchak, 2007).
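The address-matching ideas above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not the method of any package cited above: the abbreviation table, the similarity threshold of 0.9, and the sample addresses are all hypothetical, and difflib's SequenceMatcher is used as one simple similarity measure among many.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real project would need a fuller list.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd",
                 "north": "n", "south": "s"}

def standardize(address):
    """Lower-case, strip punctuation, and abbreviate common address words."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

def best_match(address, candidates, threshold=0.9):
    """Return the most similar candidate, or None if none reaches the threshold."""
    target = standardize(address)
    scored = [(SequenceMatcher(None, target, standardize(c)).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# Hypothetical assessor addresses to be linked against MLS-style input.
assessor = ["123 N Main St", "45 Oak Ave", "9 Elm Rd"]
exact_after_cleaning = best_match("123 North Main Street.", assessor)
no_link = best_match("77 Pine Ct", assessor)
```

Records that return None would be set aside for manual review or a location-based match, and the threshold used should be recorded in the data provenance.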
TEMPORAL MISMATCHES BETWEEN DATASETS
One challenge when building a database for a longitudinal study is that the temporal extent of the various datasets may not align. As an example, in a study of sales prices over a five-year period, the available property characteristics data (if from a different source than the sales transactions) may not be from the same period as the individual sales transactions. In this case, the characteristics data may not represent the true condition or land use of the property at the time of sale.3 Additionally, neighborhood data such as demographic or socioeconomic data are usually updated infrequently and will likely be years removed from some or all of the sales transactions in the database.
In the best-case scenario, the researcher would have a ‘‘snapshot’’ of property characteristics data from the specific day on which the property transaction(s) occurred. As property characteristics data are often not updated daily, a more reasonable level of temporal accuracy would be to match characteristics data from the year in which transactions are observed. For example, sales from 2013 would be joined to the property characteristics data from 2013, sales from 2014 to those from 2014, etc. Archival data work or planned data capturing in the case of a long-term project can provide the necessary snapshot data to complete the research. In the absence of temporally consistent data, data cleaning techniques may be used to identify and remove properties that have undergone significant structural or land use changes. Even with solid data cleaning procedures, mismatched temporal data will likely result in a number of errant data observations, thereby increasing the error rates in the specified models unless some process is built in to identify properties with high error rates (e.g., cross-validation) or atypical property characteristics (e.g., when the number of bathrooms in a home far exceeds what is common for the neighborhood).
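The year-matched join described above might look like the following Python sketch. The field names, PINs, and values are hypothetical, and the sketch assumes that yearly snapshots of property characteristics are available and that sale dates are stored as ISO-formatted strings.

```python
# Hypothetical yearly snapshots of property characteristics, keyed by (PIN, year).
snapshots = {
    ("P-001", 2013): {"baths": 2, "gla_sqft": 1800},
    ("P-001", 2014): {"baths": 3, "gla_sqft": 2300},  # addition completed in 2014
    ("P-002", 2013): {"baths": 1, "gla_sqft": 1100},
}

sales = [
    {"pin": "P-001", "sale_date": "2013-05-10", "price": 300000},
    {"pin": "P-001", "sale_date": "2014-11-02", "price": 390000},
    {"pin": "P-002", "sale_date": "2014-03-18", "price": 180000},
]

def join_by_year(sales, snapshots):
    """Attach same-year characteristics; set aside sales with no same-year snapshot."""
    matched, unmatched = [], []
    for sale in sales:
        year = int(sale["sale_date"][:4])
        chars = snapshots.get((sale["pin"], year))
        if chars is None:
            unmatched.append(sale)
        else:
            matched.append({**sale, **chars})
    return matched, unmatched

matched, unmatched = join_by_year(sales, snapshots)
```

Here the 2014 sale of P-001 picks up the post-addition characteristics, while the 2014 sale of P-002 is flagged as lacking a same-year snapshot instead of being silently joined to 2013 data.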
LACK OF FIELD STANDARDIZATION
Field standardization may also present a problem when building a PLD database for sales and rent modeling research. Two types of standardization issues exist: within-source and between-source. A common within-source standardization problem can be found in MLS or industry-gathered data. In cases where many parties (agents) are responsible for personally measuring, determining, and entering data, variations in the measurement process can occur unless very stringent procedures are followed.4 For example, overall gross living area (GLA) can vary widely between data collectors based on the procedures for counting foyers and other open spaces. Also, the determination of finished basement square footage and whether or not it counts toward overall square footage can vary, as can what constitutes a bedroom, what constitutes a bathroom (fractions of a bathroom and how they are recorded), and other less-than-objective counts. Finally, selling agents may be incentivized to bias GLA and other figures upwards in hopes of attaining a higher sales price for their client.5
Within-source issues can also arise in government-collected data, such as tax assessor information. Although data collection methods may be more stringent in the tax assessment industry, assessment value challenges and non-reporting of additions and improvements by owners may work to bias property characteristic values downward from their true value. While there is no easy solution to within-source standardization issues, the researcher should acknowledge that they exist.
Between-source standardization problems present another challenge to working with PLD. Between-source issues arise when different data collection methods are employed by different operating bodies (Exhibit 1). One example would be a multiple-county dataset or a study that crosses multiple MLS areas in which garages are measured in raw size in one data source and in vehicle capacity in another. Differences in units, such as acres versus hectares or feet versus meters, may also arise during PLD integration exercises. Measures of condition, quality, views, and other subjective or quasi-subjective attributes are especially prone to between-source standardization issues. In objective cases, such as the garage example, conversion rules such as allotting a certain number of square feet per vehicle capacity measurement, or vice versa, can be applied and should be recorded in the data provenance. For subjective measurements, similar conversion techniques may be applied, but generally not without consulting each data source provider to gain a solid understanding of the procedures used to generate the subjective ranking. In cases where measurement scale is the only difference between two fields (e.g., bathrooms counted in full numbers vs. fractions), simple reversion to the coarsest scale, full bathrooms in this case, may be the only option.
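The objective conversion rules discussed above can be written down explicitly, which also makes them easy to record in the data provenance. The Python sketch below is illustrative only: the figure of 240 square feet per vehicle is an assumed rule, not one taken from any data provider, and the field names are hypothetical.

```python
import math

SQFT_PER_CAR = 240.0       # assumed conversion rule; record it in the provenance
SQFT_PER_ACRE = 43560.0    # exact by definition
SQFT_PER_HECTARE = 107639.1  # approximate

def garage_capacity(record):
    """Express garage size as vehicle capacity, whichever way the source recorded it."""
    if "garage_cars" in record:
        return record["garage_cars"]
    return round(record["garage_sqft"] / SQFT_PER_CAR)

def lot_sqft(record):
    """Express lot size in square feet from either acres or hectares."""
    if "lot_acres" in record:
        return record["lot_acres"] * SQFT_PER_ACRE
    return record["lot_hectares"] * SQFT_PER_HECTARE

def full_baths(baths):
    """Revert fractional bathroom counts to the coarsest scale (full bathrooms)."""
    return math.floor(baths)

# Hypothetical records from two counties with different conventions.
county_a = {"garage_sqft": 480, "lot_acres": 0.5, "baths": 2.5}
county_b = {"garage_cars": 2, "lot_hectares": 0.2, "baths": 2}
```

With these rules, both hypothetical records report a two-car garage and two full bathrooms, so the fields can be pooled in a single model.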
CONFLICTING OBSERVATIONS FROM DIFFERENT SOURCES
An issue related to field standardization is that of merging duplicate observations generated by different sources (Exhibit 2) when the data values themselves are conflicting. In determining the proper reconciliation approach for this situation, a number of questions should be asked. First, is the aim of the research to predict the value or rent of a number of non-observed (non-transacted) properties? If so, data values from whichever source the non-observed properties originate should be maintained in the observed dataset when discrepancies arise. This will help to minimize prediction bias. For example, assume that a researcher collects 100 sales from the county assessor and information on the same 100 sales from the local MLS with the intent of using these sales to predict the values of 20 non-observed (e.g., refinanced) homes in the same neighborhood. If the data for the 20 non-observed (to-be-predicted) properties are from the county assessor, then county assessor data values should be maintained when conflicts occur. Using the MLS observations to estimate a predictive model in this situation could result in biased predictions.
Next, a researcher should determine which dataset is the superset. If a researcher has 100 sales from the county assessor and 80 of those are also represented in a dataset from the local MLS, the county assessor data represent the superset since they fully encompass the MLS dataset. As such, it would be prudent to use the values from the assessor in instances of discrepancies.
Finally, if neither dataset is a perfect superset of the other, some form of reconciliation must take place. A common and simple approach is to average the two values, that is, to take their midpoint. If more than two datasets are being reconciled, the mode or median may be a more appropriate measure of central tendency than the mean. Another consideration is to determine if the deviations in values between the multiple sources are systematic or random. Random deviations likely represent measurement errors and a measure of central tendency may be the most appropriate reconciliation. Systematic deviations, such as one dataset having GLA values that are consistently 10% higher than the other, likely represent a difference in the data collection methods themselves and the researcher should determine which of the two systems is preferred given the study design and research question(s). If the two values are not standardized by measurement type (e.g., a subjective ranking system), then methods from the between-source standardization discussion above should be applied. Regardless of the method chosen, maintaining consistent practice throughout the reconciliation process is critical.
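A minimal Python sketch of these reconciliation rules follows. The function names and the GLA values are hypothetical. The median ratio between paired values is offered only as one simple diagnostic for a systematic difference between sources; it is an assumption of this sketch, not a test prescribed by the literature cited here.

```python
from statistics import mean, median

def reconcile(values):
    """Midpoint for two sources; median when three or more sources report a value."""
    values = [v for v in values if v is not None]
    if not values:
        return None
    return mean(values) if len(values) <= 2 else median(values)

def median_ratio(source_a, source_b):
    """Median of a/b over paired observations; far from 1.0 hints at a systematic gap."""
    return median(a / b for a, b in zip(source_a, source_b))

# Hypothetical GLA values (square feet) for the same four homes in two sources.
mls_gla = [1980, 2200, 1650, 2750]
assessor_gla = [1800, 2000, 1500, 2500]

ratio = median_ratio(mls_gla, assessor_gla)  # about 1.10: MLS runs roughly 10% higher
two_source = reconcile([1800, 1900])
three_source = reconcile([1800, 1900, 2600])
```

In this constructed example the ratio is close to 1.10 for every pair, which points toward a difference in measurement systems, so choosing one source would be more defensible than averaging.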
Across all disciplines, the process of data cleaning is acknowledged as critical (Chapman, 2005), especially as we move toward a greater emphasis on data-driven methods of scientific inquiry (Gray, 2009; Nelson, 2009). The cleaning of data prior to analysis has been identified as an iterative (Mennis and Guo, 2009), complex (Raman and Hellerstein, 2001), idiosyncratic (Thakare et al., 2010), and relatively undocumented (Randall, Ferrante, Boyd, and Semmens, 2013) process. Putting it more succinctly, Varian (2014, p. 5) considers data cleaning to be ‘‘somewhat of an art, which can be learned only by practice.’’ Given the importance of the process yet the relative ambiguity of the method, we attempt to document some of the ‘‘art’’ of cleaning PLD in an applied, empirical research framework, such as a price or rent modeling exercise.
Labeling is critical to understanding, identifying, and classifying observations in large datasets. Property level real estate data are no exception. Depending on the aims of the study, foreclosed, distressed or short-sale transactions may or may not be acceptable observations. The same can be said of multiple parcel sales and sales including personal property or business concerns. For lease observations, the terms of lease may invalidate the observations given the particular study design. For these reasons, the labeling of transaction conditions is incredibly important.
In some cases, non-arm’s length transactions are pre-screened out of the database. For instance, transaction data collected from MLS are unlikely to include quitclaim property transfers (non-arm’s length), whereas data from county tax assessors’ or recorders’ offices may include all transactions, arm’s length or not. Conversely, MLS data are unlikely to contain for-sale-by-owner (FSBO) sales, which may be an important component of the study design and a critical data point to collect. As a result, the first step of developing a data labeling scheme is to identify the source of the data to determine what, if any, initial restrictions have been placed on the sample.
When non-arm’s length transactions are included in a researcher’s dataset, tax assessors or recorders often provide a field or a number of fields noting the quality of the sale information. In most cases, this labeling serves an internal purpose as it provides a filtering mechanism for the assessor’s own mass valuation models used to determine taxable value. These labels can prove to be very useful for the real estate researcher; however, labels should not be taken at face value but tested for accuracy during the initial data cleaning phase. In the absence of labels, comparison of buyer and seller names can be used to identify some intra-family transfers. Outlying transaction prices or rents can also be an identifier of non-arm’s length sales; however, use of the dependent variables (price or rent) to limit observations needs to be approached with caution as it may also eliminate valid, but outlying, properties from the dataset. A particularly difficult situation arises in sales-based studies of new homes where land sales and improved sales are present in the same dataset. Identifying land sales in this situation can often be done by finding sale/resale observations in short succession where the first sale is much lower than the second sale and where the home’s year built field indicates a new dwelling. Similar processes can be used, though with more complexity, to identify a ‘‘teardown and rebuild’’ scenario common in many desirable urban areas.
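Two of the screening heuristics above, the name comparison and the sale/resale check for land sales, can be sketched in Python as follows. The thresholds (a two-year window and a first price at most half the second) and the assumption that names are stored as ‘‘First Last’’ are hypothetical choices for illustration; flagged records are candidates for review, not confirmed non-arm’s length or land sales.

```python
from datetime import date

def shared_surname(buyer, seller):
    """Flag a possible family transfer when the last name tokens match."""
    return buyer.strip().split()[-1].lower() == seller.strip().split()[-1].lower()

def likely_land_sale(first, second, year_built, max_days=730, max_ratio=0.5):
    """Flag the first of a sale/resale pair as a probable land sale.

    Hypothetical rule: the resale follows within max_days, the first price is
    at most max_ratio of the second, and the dwelling was built no earlier
    than the year of the first sale.
    """
    days = (second["date"] - first["date"]).days
    return (0 < days <= max_days
            and first["price"] <= max_ratio * second["price"]
            and year_built >= first["date"].year)

# Hypothetical sale/resale pair on the same parcel.
lot_sale = {"date": date(2013, 3, 1), "price": 90000}
home_sale = {"date": date(2014, 2, 1), "price": 420000}

family_flag = shared_surname("Ann Smith", "John SMITH")
land_flag = likely_land_sale(lot_sale, home_sale, year_built=2013)
```

Because common surnames will produce false positives, flags of this kind are best used to prioritize records for manual checking against the assessor's own sale-quality labels.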
Sales or leases involving multiple parcels or properties can also complicate data cleaning. In the case of large datasets of residential home sales, multiple parcel sales are often removed so long as their removal does not bias the overall sample. With smaller datasets, or non-residential land uses where multiple parcel or property sales are more common, the researcher should consider some method of reconciling the sale or, potentially, altering the study design to focus on single property sales if reconciliation proves difficult. Personal property or business concerns transacted along with the real property and furnished apartment rentals can also cause issues if not properly labeled. Efforts should be made here to reconcile or eliminate the influence of the additional value from these elements, or at the very least ensure that they are controlled for in the statistical analysis that follows.
All data contain some errors, the severity and occurrence of which are usually not known a priori. A major component of the data cleaning process is to identify and either correct the error or remove the observation from the data. Identification of errors is not an easy process, nor necessarily a straightforward one. While some measure of idiosyncrasy will always exist in any research project or dataset, a number of issues commonly arise.
As an example, a value that is three times larger than any other in its field may indeed be a data error—the result of a wrong figure keyed in (fat finger error) or a data measurement issue—or it might also be a valid observation whose value is atypical and does not match well with the remainder of the sample (Grubbs, 1969). Determining the difference between errors and outliers is an essential component of the data cleaning process, although one with few objective guidelines. Making this determination within the real estate transaction sphere is made more difficult by the fact that non-arm’s length sales and other invalid or ‘‘domain violation’’6 data points may be interspersed with valid data observations.
Errors in the data may best be identified by a common sense approach, which represents one practice of Varian’s (2014) ‘‘art’’ in data cleaning. In many cases, a researcher with domain-specific knowledge of local conditions can easily pick out data errors from a set of observations. For example, single-family detached homes with more than five stories, less than 100 square feet of living space, or 200 bathrooms clearly do not represent plausible observations. Errors resulting in plausible values, such as mistakenly inputting an extra bedroom or an extra 100 square feet on a home, are much more difficult to identify; however, the impacts of these errors on the final estimated results are also likely to be minimal. Similar procedures relying on the localized or tacit knowledge of the researcher may also be used in identifying domain violation data points that should be removed from the dataset.
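Common sense limits of this kind can be encoded as simple range checks. In the Python sketch below, the lower bound on living area and the cap on stories follow the example in the text, while the remaining bounds and the field names are hypothetical and would need to reflect local conditions.

```python
# Plausibility limits for single-family detached homes. The 5-story cap and the
# 100 sq ft floor come from the example in the text; other bounds are hypothetical.
LIMITS = {
    "stories": (1, 5),
    "gla_sqft": (100, 20000),
    "baths": (0, 15),
}

def domain_errors(record, limits=LIMITS):
    """Return the names of fields whose values fall outside the plausible range."""
    return [field for field, (low, high) in limits.items()
            if field in record and not low <= record[field] <= high]

# Hypothetical observations.
typo_record = {"stories": 2, "gla_sqft": 1800, "baths": 200}
odd_record = {"stories": 7, "gla_sqft": 90, "baths": 2}
clean_record = {"stories": 1, "gla_sqft": 1450, "baths": 2}

flags = [domain_errors(r) for r in (typo_record, odd_record, clean_record)]
```

Returning the offending field names, instead of silently dropping the record, keeps the decision to correct or remove each observation with the researcher.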
Dealing with outliers, on the other hand, is not as clear or easy as isolating obvious data errors. An outlier can be defined as ‘‘any observation that appears to be inconsistent with the remainder of that set of data’’ (Barnett and Lewis, 1984). With this knowledge, the researcher is then faced with two decisions: (1) how best to identify observations that are inconsistent; and (2) what to do with them once they are identified (Barbato, Barini, Genta, and Levi, 2011). The treatment of outliers is well established to be a non-trivial exercise that can exert considerable influence on statistical results and analytical conclusions.
The first step, outlier identification, can be accomplished via visual methods (Tukey, 1977; Schwertman, Owens, and Adnan, 2004), statistical methods (Grubbs, 1950; Weber, 2010), or even fuzzy clustering methods (Van Cutsem and Gath, 1993). There is a long and varied history of debate on outlier detection methods in the statistical literature; one well summed up by Barnett and Lewis (1984). With many methods, even the statistical ones, the researcher is still left making a moderately subjective judgment of the criteria used to define the outliers. For example, in a simple case all observations of greater than k standard deviations from the mean are labeled as outliers. The value of k still must be determined and the choice of k may influence the end result. In the more complex cases, outliers are detected in multivariate space instead of along a single dimension.
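The influence of the choice of k can be seen in a small Python sketch. The prices below are hypothetical (in thousands of dollars), and the two rules shown, a k standard deviation rule and a Tukey-style fence based on the interquartile range, are simple univariate illustrations and not a recommendation of either method.

```python
from statistics import mean, stdev, quantiles

def sd_outliers(values, k=3.0):
    """Values more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

def tukey_outliers(values, k=1.5):
    """Values beyond k interquartile ranges outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Hypothetical sale prices in thousands of dollars.
prices = [210, 225, 240, 250, 255, 260, 275, 290, 310, 1200]

flag_k3 = sd_outliers(prices, k=3)   # nothing flagged: the extreme value inflates the SD
flag_k2 = sd_outliers(prices, k=2)   # the 1200 sale is flagged
flag_iqr = tukey_outliers(prices)    # the 1200 sale is flagged
```

The same extreme sale is retained under k = 3 but flagged under k = 2 and under the quartile-based fence, which is why the rule and its parameters belong in the data provenance.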
Once outliers and/or errors are identified, the researcher needs to decide how to proceed. Discussing experimental data, Grubbs (1974) offers three initial options: (1) correct the observation (if desirable); (2) reject the observation outright; or (3) reject and take additional observations. The third of these has little relevance in an observational field (as opposed to an experimental field) such as real estate, as we cannot simply create more observations. One option not mentioned by Grubbs is that of leaving the observation in and allowing robust analytical methods (e.g., robust regression techniques) to dampen the influence of the outlier(s) (Rousseeuw and Leroy, 1987). All things considered, three options are available to the observational real estate researcher: remove, fix, or defer to the analysis stage. Regardless of the choice of method or the choice of identifying parameters (k or otherwise), sensitivity analyses of the results to those choices should be described and reported if noticeable differences in results occur.
Missing data are commonly found in large datasets, especially those that represent the combination of multiple smaller datasets or in those in which humans have entered data—both situations that are highly common in real estate PLD. The first step is to identify whether the data are actually missing or if an empty field in a given observation simply represents the lack of a positive value. For instance, in a field that indicates the presence of a pool or fireplace, a missing or null value may signify no pool or fireplace as opposed to a missing or unknown value. While this situation certainly represents poor data management practice, it is a common occurrence in real estate data gathered from disparate and non-standardized sources. Conversely, if a missing or null value is found in a field denoting gross living area (GLA) in a residential sales dataset, it can be safely assumed that the data are missing. Other examples may not be so straightforward and the researcher must use caution in interpreting missing data.
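The distinction between a truly missing value and an empty field that means ‘‘none’’ can be made explicit in code. In the Python sketch below, the list of fields treated as absence indicators is a hypothetical rule that a researcher would need to confirm with the data provider before applying.

```python
# Hypothetical rule set: fields where an empty value is assumed to mean
# "feature absent". This assumption should be verified with the data source.
ABSENCE_FIELDS = ("pool", "fireplace")

def interpret_nulls(record, absence_fields=ABSENCE_FIELDS):
    """Recode empty indicator fields to 0; leave all other empty fields as missing."""
    cleaned = dict(record)
    for field in absence_fields:
        if cleaned.get(field) is None:
            cleaned[field] = 0
    return cleaned

# Hypothetical observation with three empty fields.
raw = {"pool": None, "fireplace": 1, "gla_sqft": None, "year_built": None}
cleaned = interpret_nulls(raw)
```

After recoding, the pool field reads as ‘‘no pool’’, while gross living area and year built remain flagged as genuinely missing.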
If data are found to indeed be missing, the researcher should begin by determining the form of “missing-ness.” There are two major categories⁷ of missing data (Van Buuren et al., 1999; King, Honaker, Joseph, and Scheve, 2001). The first is missing at random (MAR). An example of MAR is a computer memory error that randomly affected X number of cells in a dataset. Another would be a human randomly entering data over a long time period, provided the data fields were all similar. This human or set of humans would be likely to randomly miss certain cells and produce data that are MAR, or very near it. The second type of missing data is not missing at random (NMAR), or non-ignorable (King, Honaker, Joseph, and Scheve, 2001). NMAR situations arise when the likelihood of data being missing is influenced by the unobserved data point itself. For example, if data on the year a structure was built were more likely to be missing on old homes than on new homes, then this information would be NMAR.
After identifying the type of missing data, two options are available: (1) impute the missing data or (2) remove those observations from the subsequent analyses. Imputation of missing data, especially that of NMAR data (Pigott, 2001), is a complex technique and beyond the scope of this paper.⁸ If imputation is determined to be infeasible and/or outside the scope of the project, the researcher may perform complete case (CC) analysis, in which only the observations with complete data are used. In other words, any observation without complete data is removed from the final analysis. In situations where missing data are MAR and few in number, the CC method can provide unbiased results (Little, 1992). For NMAR data and instances of significant missing data, CC analysis should be avoided and imputation or additional data collection activities undertaken. At a minimum, potential biases from NMAR data or CC analysis should be noted.
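A complete case analysis amounts to a simple filter, but the number of observations it removes is worth recording alongside it. The following is a minimal sketch with hypothetical field names and records; it does not address whether CC analysis is appropriate for a given missingness pattern.

```python
def complete_cases(records, required_fields):
    """Split records into complete observations and those missing any required field."""
    complete, incomplete = [], []
    for rec in records:
        if all(rec.get(field) is not None for field in required_fields):
            complete.append(rec)
        else:
            incomplete.append(rec)
    return complete, incomplete

# Hypothetical sales records; None marks a truly missing value.
sales = [
    {"price": 250_000, "gla_sqft": 1_800, "year_built": 1995},
    {"price": 410_000, "gla_sqft": None,  "year_built": 2008},
    {"price": 190_000, "gla_sqft": 1_250, "year_built": None},
    {"price": 305_000, "gla_sqft": 2_100, "year_built": 1978},
]

kept, dropped = complete_cases(sales, ["price", "gla_sqft", "year_built"])
share_dropped = len(dropped) / len(sales)
print(f"CC analysis keeps {len(kept)} of {len(sales)} observations "
      f"({share_dropped:.0%} dropped)")
```

Retaining the dropped records, rather than discarding them, lets the researcher compare them with the complete cases and note any apparent bias.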
The data preparation process (DPP) discussed above can be highly complex and requires numerous decisions to be made by the researcher, many of which are subjective or quasi-subjective at best. As a result, it is important that these steps are documented. By documenting the steps of their analytical workflows, researchers ensure that the provenance of their data remains intact and, if need be, can be reproduced in the future (Goodman et al., 2014). A more selfish rationale for documenting one’s analytical workflow is that research projects are often set aside and returned to many months or years later. A clear, well-documented DPP will make the process of returning to a research project much easier than finding two spreadsheets in a folder, one labeled “raw data” and the other “clean data.”
Documentation of the DPP requires discipline and time to become a habit. If data cleaning is done in a spreadsheet or tabular form, keeping formulas within the spreadsheet and maintaining a narrative of parameters and decisions made may be the best practice. Another option is to use a statistical language such as Stata, SAS, R, Python, LIMDEP, or MATLAB to perform data cleaning. When the entire process is coded, the code itself manages the workflow, ensures that no steps are forgotten, and allows changes to be made if mistakes are discovered. As Sandve, Nekrutenko, Taylor, and Hovig (2013) unambiguously recommend: “Avoid manual data manipulation.” The time spent learning a coding language is quickly repaid in time saved on data cleaning.
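One way a coded workflow can document itself is sketched below in Python. Each cleaning rule is a named step, and the script logs how many observations each step removes. The records, thresholds, and step names are hypothetical and chosen only for illustration.

```python
def run_cleaning_steps(records, steps):
    """Apply named filters in order and log how many observations each removes."""
    log = []
    for name, keep in steps:
        before = len(records)
        records = [r for r in records if keep(r)]
        log.append({"step": name,
                    "removed": before - len(records),
                    "remaining": len(records)})
    return records, log

# Hypothetical raw data with three kinds of problems.
raw = [
    {"price": 250_000, "gla_sqft": 1_800},
    {"price": -5_000, "gla_sqft": 1_500},         # impossible negative price
    {"price": 100_000_000, "gla_sqft": 2_000},    # implausible for an ordinary home
    {"price": 310_000, "gla_sqft": None},         # missing GLA
]

# Each rule is a (description, keep-this-record test) pair; thresholds are assumptions.
steps = [
    ("drop non-positive prices", lambda r: r["price"] > 0),
    ("drop prices above 10,000,000", lambda r: r["price"] <= 10_000_000),
    ("drop missing GLA", lambda r: r["gla_sqft"] is not None),
]

clean, cleaning_log = run_cleaning_steps(raw, steps)
for entry in cleaning_log:
    print(entry)
```

Because the rules and their order live in the script, rerunning it on the raw file reproduces the clean file, and the log supplies the observation counts that a methods section would report.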
CURRENT DATA PRACTICES IN REAL ESTATE RESEARCH
Up to this point, we have focused on providing a data vocabulary and a survey of issues dealing with data integration and cleaning of PLD in the real estate field. As the cited references attest, much of the literature on the DPP originates in other scientific fields; the real estate literature is largely silent on data issues, and especially on those dealing with PLD. Pollakowski’s (1995) survey of data sources and Thrall’s (2001) and Thrall and Thrall’s (2011) compendia of real estate data websites offer notable exceptions in terms of identifying data sources. Issues such as data integration and data cleaning remain undiscussed.
One reason for the dearth of literature on the data cleaning process, as hinted at above, could be that the DPP can be highly idiosyncratic and depends heavily on the particular study design and the quality of the data (often very local). Another reason could be that, with no real standards in place, there is little incentive to fully document the analytical workflow or the data provenance, as doing so may only raise more concerns and issues that lead to lower acceptance rates and longer review times for manuscripts (Anderson, Greene, McCullough, and Vinod, 2008). This potential complication, along with word limits imposed by some publications, may elevate parsimony above explicitness.
The DPP described above (data collection, integration, and cleaning) can be considered a key component of the analytical workflow. As all research projects take on their own idiosyncratic dimensions and all researchers and collaborations of researchers work in different manners, we should not expect there to be a standardized analytical workflow process followed by all. In other words, there is not likely to be, nor perhaps should there be, a definitive checklist of tasks and ordered procedures to follow. Arguably, each data source presents its own challenges to overcome in terms of specific items to be addressed. However, while no two analytical workflows are likely to be identical, an overarching set of potential data integration and cleaning processes can be developed regardless of the quality and specifics of the initial dataset. In any case, the failure to describe and document the specific data provenance and the larger analytical workflow represents suboptimal scientific transparency and general practice.
To test for the incidence of properly documented data provenance and analytical workflow in empirical work, we sampled publications from the Journal of Real Estate Research (JRER), the Journal of Real Estate Finance and Economics (JREFE), and Real Estate Economics (REE). We then analyzed the explanations of the data preparation process included in those publications. As a sampling procedure, we identified and reviewed the first article of each calendar year published in each journal that used PLD to derive at least one statistical price model in the course of the research. If no such studies existed in a given calendar year, none were sampled. The sample was not confined by land use, statistical method, or dependent variable. The tables in the Appendix show the 84 sampled studies, with some basic information on the publication such as author name, issue number, and article title.
Analytical Workflow Questions
To gauge the level of detail provided about the DPP and analytical workflow, we asked seven basic questions. Additional follow-up questions were asked when the initial response was positive. The seven sets of questions (15 in total) are below:
(Q1a) We begin by asking about the data source and cost. Does the publication explicitly list the source of the data such that other researchers would be able to locate the source, and if so, (Q1b) is the cost and/or availability of the data (free, paid, negotiated free) clear to the reader?
(Q2a) Is any data cleaning discussed? In other words, has the author(s) indicated that something has been done to trim the raw data (the data received directly from the source) into a format suitable for analysis? (Q2b) If cleaning is noted, are the data cleaning procedures/steps clearly detailed? (Q2c) Also, if cleaning is noted, is the number of total observations cleaned (those included in the raw data but not the cleaned data) indicated?
(Q3a) Is the presence or absence of outliers discussed? For example, does the author(s) make any mention of the appropriateness, or lack thereof, of any of the observations in the dataset? (Q3b) If so, are the sensitivities of the results to these outliers discussed or analyzed?
(Q4a) Are all independent variables used in the final analysis listed and fully described? (Q4b) If so, are summary statistics regarding the central tendency and range of these variables provided?
(Q5a) Is the final dataset composed of data from more than one source or dataset? An example here would be a set of sales transactions from the local MLS that were combined with property characteristics information from the county tax assessor and spatial neighborhood information from the U.S. Census. (Q5b) And, if such composition occurred, was the process discussed and described?
(Q6a) Does the author use and/or create any independent variables that are not directly received from the data source? For example, if the author is interested in the impact of proximity to a waterbody on home prices, the distance to waterbody variable usually must be created as this is not a standard property characteristic in most datasets. (Q6b) If so, is the process by which this variable(s) is created clearly documented such that the reader could replicate the process?
(Q7a) Is the dataset used in the study representative of only a sample of the entire dataset available for the spatial and temporal extent of the study design? A non-sampled study would look at all home transactions, for example, in a given county over a given time period. A sampled study, on the other hand, may limit the study to a random sample of observations in a given area over a given time period. The key determinant here is whether or not a contiguous temporal and spatial bound forms the inclusion criteria of the dataset. (Q7b) If some measure of sampling is performed, then is the sampling structure described such that a reader could replicate it?
A Data Preparation and Workflow Index
Next, from this set of questions we have developed a Data Preparation and Workflow Index (DPWI), with which we seek to quantify the level of explanation of the DPP and analytical workflow in each of the examined studies. Questions 1–4 are considered mandatory: every publication, regardless of its research design or data source, could reasonably be expected to complete these steps. For each initial question, one point is awarded; for each follow-up question, 0.5 point is awarded. Questions 5–7 can be seen as optional, in that certain study designs or datasets may not require these processes. Therefore, no points are awarded for the initial question, but one point is awarded for the follow-up if the initial question is answered in the affirmative.
To facilitate an even comparison among studies that may differ in the number of optional questions that were asked, each publication is scored as a percentage of the total possible points for that study. That is, the points earned by each study are summed and divided by a denominator that depends on the particular study design and data used in the research, and the result is scaled to 100. Exhibit 3 shows summary statistics of the results of these criteria applied to the sample studies. There exists wide variation in the number of positive responses to the initial and follow-up questions. A negative response to the optional, primary question results in an “NA” for the follow-up questions. For example, the 17% positive response rate for describing data costs only considers those 89% who listed the data source to begin with.
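The scoring rule described above can be sketched in Python. The question labels follow the text, while the function name, the answer dictionary, and the example study are hypothetical. One assumption goes beyond the text: follow-ups to any initial question answered in the negative, whether mandatory or optional, are treated as “NA” and dropped from the denominator, in keeping with the “NA” treatment and the data-cost example given above.

```python
from typing import Dict

# Follow-up questions attached to each initial question (labels as in the text).
FOLLOW_UPS = {
    "Q1a": ["Q1b"],
    "Q2a": ["Q2b", "Q2c"],
    "Q3a": ["Q3b"],
    "Q4a": ["Q4b"],
    "Q5a": ["Q5b"],
    "Q6a": ["Q6b"],
    "Q7a": ["Q7b"],
}
MANDATORY = {"Q1a", "Q2a", "Q3a", "Q4a"}

def dpwi_score(answers: Dict[str, bool]) -> float:
    """Return points earned as a percentage of the points possible for this study."""
    earned = possible = 0.0
    for initial, follow_ups in FOLLOW_UPS.items():
        answered_yes = answers.get(initial, False)
        if initial in MANDATORY:
            possible += 1.0                 # mandatory initial question: 1 point
            if answered_yes:
                earned += 1.0
            follow_up_points = 0.5          # mandatory follow-up: 0.5 point
        else:
            follow_up_points = 1.0          # optional: only the follow-up scores
        if not answered_yes:
            continue                        # follow-ups are NA (assumption, see lead-in)
        for question in follow_ups:
            possible += follow_up_points
            if answers.get(question, False):
                earned += follow_up_points
    return 100.0 * earned / possible

# Hypothetical study: unanswered questions default to "no".
example_study = {
    "Q1a": True, "Q1b": False,
    "Q2a": True, "Q2b": True, "Q2c": False,
    "Q3a": False,
    "Q4a": True, "Q4b": True,
    "Q5a": True, "Q5b": True,
}
example_score = dpwi_score(example_study)
print(round(example_score, 2))
```

In this example the study earns 5 of a possible 7 points. Because the four mandatory initial questions always count, the denominator is never below 4.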
A number of activities are well documented across all studies, including the source of the data and the listing and description of all independent variables used in the analyses. Others, such as a discussion of potential and real outliers and the explicit mention of the cost of data, are rarely done. Within the 44% of cases where the data cleaning process was noted, 59% described at least some of the criteria used to clean the data, but fewer than half of the studies mentioned how many observations were removed via cleaning. The listing and description of independent variables (95%) and the display of summary statistics (80%) (Q4) are also quite common. Interestingly, of the nearly 40% of the studies that only considered some sample of observations (Q7) from the studied geographic area and time period, only 45% explicitly mentioned the sampling criteria.
Next, to see how the documentation of data process and workflow has changed over time, we have calculated a DPWI score for each study and plotted the scores by year of publication. As can be seen in Exhibit 4, the extent of the documentation process in our study sample has risen steadily over the past thirty years. Whereas scores at or below 40 were common in the late 1980s and early 1990s, no study since 2009 scored below 50 and some scored as high as 90.
To examine the details of the steady increase in reporting of data preparation processes, we have broken down the temporal trends by each question set (Exhibits 5 and 6). For the mandatory question sets (1–4), only the reporting of data cleaning activities appears to show any discernible increase in frequency over time. Among the optional question sets (5–7), reporting of data sampling processes shows the most obvious increase over time. Improvements in the transparency of documenting these two procedures, together with smaller improvements in terms of independent variables (Q4), data compilation (Q5), and non-source variables (Q6), have driven the increase in DPWI scores over time.
ADDITIONAL DIMENSIONS TO DPWI SCORES
The documentation of data preparation processes may also be related to other characteristics of the study design or the publication itself. We begin by comparing the total sample size of the empirical study to the DPWI score. If multiple models and datasets were used, we summed the total number of unique observations within the study. Observation counts are prior to any reported data cleaning activities.
Exhibit 7 shows the results of this comparison. The DPWI score does exhibit a positive relationship with sample size, suggesting that authors working with larger databases may engage in more preparation activities and may be more versed in documenting their data provenance. It is also important to keep in mind that, in general, sample sizes have increased in recent years and therefore some measure of this relationship may be attributable to the changes in reporting standards over time (as shown in Exhibit 4).
Next, we examine the relationship between the property type analyzed in the publication and the DPWI score. Not surprisingly, the majority of studies focused on detached single-family residential dwellings. Apartment and condominium studies were the second largest property class, with a collection of other uses such as office, industrial, and hotel property making up the remainder. As shown in Exhibit 8, we find very little discernible difference in DPWI scores by property type. The detached single-family residential studies show a slightly higher DPWI score than the other two use categories; however, two-sample t-tests comparing the means of the three groups do not show statistically significant differences between the groups (all p-values greater than 0.45).
Finally, we examined whether the level of documentation varied by publication over time. This analysis (Exhibit 9) shows that trends across all three journals are similar. Examining the standard errors of the fitted LOESS regression lines shows the differences between the three journals to be statistically indistinguishable from 0 using an alpha value of 0.05. All three publications exhibit a consistent increase in data preparation process reporting over time.
Overall, the sample of real estate research publications we analyzed does well on some measures of documenting the analytical workflow, but leaves something to be desired on others, such as outlier analysis and discussion of data cleaning. Documentation has improved over time, especially in the past five to seven years, which is a promising trend for the discipline.
As industry, the academy, and our broader society become ever more dependent on data, the question of data quality, management, cleaning, and analysis takes on increasing importance. While the process and techniques involved in data analysis are paid considerable attention in academic literature and data management is afforded its rightful place in industry, discussion of data quality, data cleaning, and the entire data preparation process is woefully neglected. Almost no advice is provided in the real estate literature for the applied researcher looking for guidelines on the data preparation process. While the statistics literature can be of assistance here, it often does not address the domain-specific issues, such as the identification of non-arm’s length transactions and address matching difficulties, that are crucial to preparing property-level data (PLD) in the real estate field.
A review of a sample of publications using PLD from three leading real estate research journals—JRER, JREFE, and REE—finds that the discussion of the DPP and analytical workflow has steadily improved over time. However, a number of issues such as the discussion of outliers, sampling and data cleaning criteria, and data collection costs remain neglected. The steady progress towards more explicit data processes is encouraging and will hopefully continue.
In this paper, we outline a vocabulary for discussing data preparation, as well as bring together a list of common issues that arise during the data management and cleaning stages of working with PLD. To do so, we compiled a bricolage of advice from related disciplines as to how to best deal with data issues. This list is not meant to be comprehensive, but rather a starting point for a larger, sustained—and much needed—discussion across the discipline regarding the treatment of data in real estate. The documentation process has steadily improved over time, but still much is left to be desired. Many fields such as biostatistics, computer science, and genomics have much stricter policies on the reporting of data processes and workflow. These fields and many others are leading the way to make science more open and reproducible.
List of Sampled Publications
Exhibit A1 List of Journal of Real Estate Research Sampled Publications
Journal | Year | Volume:Issue | Author(s) | Title (truncated) | DPWI score
JRER | 1986 | 1:1 | Cronan et al. | The Use of Rank Transformation | 18.18
JRER | 1987 | 2:1 | Kang and Reichert | An Evaluation of Alternative E | 57.14
JRER | 1988 | 3:1 | Frew and Jud | The Vacancy Rate and Rent Level | 50.00
JRER | 1989 | 4:2 | Sirmans et al. | Determining Apartment Rent | 41.67
JRER | 1990 | 5:1 | Gilley and Pace | A Hybrid Cost and Market-Based | 41.67
JRER | 1991 | 6:1 | Asabere and Huffman | Historic Districts and Land Vacancy | 62.50
JRER | 1992 | 7:2 | Doiron et al. | Do Market Rents Reflect the Vacancy | 18.18
JRER | 1993 | 8:3 | Fehribach et al. | An Analysis of the Determinant | 75.00
JRER | 1994 | 9:3 | Jud and Seaks | Sample Selection Bias in Estim | 57.14
JRER | 1995 | 10:2 | Rodriguez et al. | Using Geographical Information | 64.29
JRER | 1996 | 12:1 | Benjamin and Sirmans | Mass Transportation, Apartment | 41.67
JRER | 1997 | 13:1 | Buttimer et al. | Industrial Warehouse Rent Dete | 66.67
JRER | 1998 | 15:1 | Pace | Appraisal Using Generalized Ad | 50.00
JRER | 1999 | 17:1/2 | Spahr and Sunderman | Valuation of Property Surround | 45.45
JRER | 2000 | 19:1/2 | Ding et al. | The Effect of Residential Inve | 82.35
JRER | 2001 | 20:1/2 | Harrison et al. | Environmental Determinants of | 64.29
JRER | 2002 | 23:1/2 | Bond et al. | Residential Real Estate Prices | 27.27
JRER | 2003 | 25:1 | Frew and Jud | Estimating the Value of Apartment | 41.67
JRER | 2004 | 26:2 | Palmon et al. | Clustering in Real Estate Price | 75.00
JRER | 2005 | 27:1 | Berg | Price Indexes For Multi-Dwelling | 57.14
JRER | 2007 | 29:2 | Wilson and Frew | Apartment Rents and Locations | 23.08
JRER | 2009 | 31:1 | Shultz and Schmitz | Augmenting housing sales data | 68.75
JRER | 2010 | 32:2 | Bourassa et al. | Predicting House Prices with S | 58.33
JRER | 2011 | 33:1 | Winson-Geideman et al. | The impact of age on the value | 75.00
JRER | 2012 | 34:1 | Aroul and Hansz | The Value of ‘Green’: Evidence | 64.29
JRER | 2013 | 35:3 | Gordon et al. | The Effect of Elevation and Co | 64.29
JRER | 2014 | 36:3 | Wyman et al. | Testing the Waters: A spatial | 75.00
Exhibit A2 List of Journal of Real Estate Finance and Economics Sampled Publications
Journal | Year | Volume:Issue | Author(s) | Title (truncated) | DPWI score
JREFE | 1989 | 2:2 | Speyrer | The Effect of Land-Use Restric | 56.25
JREFE | 1990 | 3:1 | Pace and Gilley | Estimation Employing A Priori | 50.00
JREFE | 1991 | 4:4 | Speyrer and Ragas | Housing Prices and Flood Risk | 41.67
JREFE | 1992 | 5:1 | Coulson | Semiparametric Estimates of the | 30.77
JREFE | 1993 | 6:1 | Lin | The Relationship Between Rents | 40.00
JREFE | 1994 | 8:2 | Shilton and Zaccaria | The Avenue Effect, Landmark Ex | 41.67
JREFE | 1995 | 10:1 | Mok et al. | A Hedonic Price Model for Private | 73.33
JREFE | 1996 | 12:3 | Carroll et al. | Living Next to Godliness: Resi | 64.29
JREFE | 1997 | 14:1/2 | Can and Megbolugbe | Spatial Dependence and House Price | 62.50
JREFE | 1998 | 16:1 | Benson et al. | Pricing Residential Amenities | 71.43
JREFE | 1999 | 18:2 | Colwell and Munneke | Land Prices and Land Assembly | 75.00
JREFE | 2000 | 20:1 | Benjamin et al. | Housing Vouchers, Tenant Quality | 42.86
JREFE | 2001 | 22:1 | Benjamin et al. | The Value of Smoking Prohibiti | 72.73
JREFE | 2003 | 27:3 | Clapp | A Semiparametric Method For Va | 50.00
JREFE | 2004 | 28:1 | Schulz and Werwatz | A State Space Model For Berlin | 38.46
JREFE | 2005 | 30:1 | Lee et al. | Dwelling Age, Redevelopment an | 78.57
JREFE | 2006 | 32:2 | Hodgson et al. | Constructing Commercial Indices | 85.71
JREFE | 2007 | 34:2 | Corgel | Technological Change as Reflec | 57.14
JREFE | 2008 | 36:3 | Enstrom and Netzell | Can Space Syntax Help Us in Un | 45.45
JREFE | 2009 | 38:2 | Mueller et al. | Do Repeated Wildfires Change H | 84.21
JREFE | 2010 | 40:1 | McKenzie and Levendis | Flood Hazards and Urban Housing | 71.43
JREFE | 2011 | 42:1 | Dumm et al. | The Capitalization of Building | 75.00
JREFE | 2012 | 44:1/2 | Ihlanfeldt and Mayock | Information, Search, and House | 64.29
JREFE | 2013 | 46:1 | Carter et al. | Another Look at Effects of | 66.67
JREFE | 2014 | 48:1 | Zahirovic-Herbert and Gibler | Historic District Influence on | 84.62
Exhibit A3 List of Real Estate Economics Sampled Publications
Journal | Year | Volume:Issue | Author(s) | Title (truncated) | DPWI score
REE | 1976 | 4:2 | Morton | Narrow versus Wide Stratification | 33.33
REE | 1977 | 5:4 | Ferri | An Application of Hedonic Inde | 40.00
REE | 1978 | 6:1 | Gau and Kohlhepp | Multicollinearity and Reduced- | 22.22
REE | 1979 | 7:2 | Guntermann | FHA Mortgage Discount Points, | 18.18
REE | 1981 | 9:4 | Nicholas | Housing Costs and Prices Under | 18.18
REE | 1983 | 11:3 | Bajic | Urban Housing Markets Modelling | 50.00
REE | 1984 | 12:1 | Mark and Goldberg | Alternative Housing Price Indices | 40.00
REE | 1985 | 13:1 | Agarwal and Phillips | The Effects of Assumption Fina | 50.00
REE | 1986 | 14:1 | Reichert and Moore | Using Latent Root Regression | 33.33
REE | 1989 | 17:1 | Delaney and Smith | Impact Fees and the Price of | 64.29
REE | 1990 | 18:1 | Glascock et al. | An Analysis of Office Market | 50.00
REE | 1991 | 19:1 | Kang and Reichert | An Empirical Analysis of Hedonic | 58.33
REE | 1993 | 21:2 | Murdoch et al. | The Impact of Natural Hazards | 78.57
REE | 1994 | 22:3 | Thorson | Zoning Policy Changes and the | 42.86
REE | 1995 | 23:2 | Knight et al. | A Varying Parameters Approach | 41.67
REE | 1997 | 25:3 | Asabere and Huffman | Hierarchical Zoning, Incompati | 68.75
REE | 1998 | 26:2 | Benjamin et al. | What Do Rental Contracts Revea | 66.67
REE | 1999 | 27:3 | Tu and Eppli | Valuing New Urbanism: The Case | 86.67
REE | 2000 | 28:2 | Pavlov | Space-Varying Regression Coeff | 58.33
REE | 2001 | 29:1 | Munneke and Slade | Metropolitan Transaction-Based | 50.00
REE | 2002 | 30:4 | Clapp et al. | Predicting Spatial Patterns of | 66.67
REE | 2003 | 31:1 | Thibodeau | Marking Single-Family Property | 60.00
REE | 2004 | 32:1 | Lambson et al. | Do Out-Of-State Buyers Pay More | 92.31
REE | 2005 | 33:2 | Dale-Johnson et al. | From Central Planning to Central | 64.29
REE | 2006 | 34:1 | Ooi et al. | Price Formation Under Small Nu | 45.45
REE | 2008 | 36:1 | Tsoodle and Turner | Property Taxes and Residential | 78.57
REE | 2009 | 37:1 | Clauretie and Daneshvary | Estimating the House Foreclosure | 75.00
REE | 2010 | 38:3 | Leguizamon | The Influence of Reference Group | 50.00
REE | 2011 | 39:1 | Turnbull and Zahirovic-Herbert | Why Do Vacant Houses Sell For | 71.43
REE | 2012 | 40:4 | McMillen | Repeat Sales as a Matching Est | 64.29
REE | 2013 | 41:2 | Carrillo | To Sell or Not To Sell: Measur | 66.67
REE | 2014 | 42:1 | Wentland et al. | Estimating the Effect of Crime | 94.12
1. Multiple parcel sales notwithstanding.
2. Note that Microsoft Access also offers SQL command line operations.
3. Data gathered from MLSs or other industry sources may have property characteristics information that is joined to the transaction information at the time of listing or sale, thereby avoiding the issue of temporal inconsistencies. MLS data, on the other hand, are often entered by agents themselves and, lacking a centralized standard of measurement, may be more variable between properties though temporally correct.
4. Recent interest in international measurement standards for measuring buildings validates this concern. For example, see the International Measurement Standards proffered by the Royal Institution of Chartered Surveyors (RICS).
5. Both authors of this article worked together on a legal case where the GLA used in marketing materials for condominium complexes in California was overstated by as much as 20% as compared to physical measurements done by appraisers.
6. Indicating an observation of a transaction or property of a different type or classification than intended and one that is not properly labeled. Domain violations are generally data points that are only invalid within the context of a given domain or field and not from a pure statistical or mathematical standpoint (Miller, 2010). A negative sales price would also be an example of a domain violation, as negative numbers are mathematically and statistically valid in many fields, but are not possible as a transactional price for a home.
7. A third category, missing completely at random (MCAR), can exist, but is highly theoretical and does not occur very often in practice (King, Honaker, Joseph, and Scheve, 2001).
8. Researchers are directed toward Little and Rubin’s (2014) recent work or Pigott’s (2001) review of methods.
Anderson, R., W. Greene, B. McCullough, and H. Vinod. The Role of Data/code Archives in the Future of Economic Research. Journal of Economic Methodology, 2008, 15:1, 99–119.
Asuncion, H.U. Automated Data Provenance Capture in Spreadsheets, with Case Studies. Future Generation Computer Systems, 2013, 29:8, 2169–81.
Barbato, G., E. Barini, G. Genta, and R. Levi. Features and Performance of Some Outlier Detection Methods. Journal of Applied Statistics, 2011, 38:10, 2133–49.
Barnett, V. and T. Lewis. Outliers in Statistical Data. Chichester: John Wiley & Sons, 1984.
Bichler, G. and S. Balchak. Address Matching Bias: Ignorance Is Not Bliss. Policing: An International Journal of Police Strategies & Management, 2007, 30:1, 32–60.
Blasnik, M. RECLINK: Stata Module to Probabilistically Match Records. Statistical Software Components, 2010.
Brizan, D. and A. Tansel. A Survey of Entity Resolution and Record Linkage Methodologies. Communications of the IIMA, 2006, 6:3, 5.
Case, B., H. Pollakowski, and S. Wachter. Frequency of Transaction and House Price Modeling. Journal of Real Estate Finance and Economics, 1997, 14:1, 173–87.
Chapman, A.D. Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data. Version 1.0. Report for the Global Biodiversity Information Facility.
Dasu, T. and T. Johnson. Exploratory Data Mining and Data Cleaning. Hoboken, NJ: John Wiley & Sons, 2003.
Fellegi, I. and A. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 1969, 64:328, 1183–1210.
Goodman, A., A. Pepe, A.W. Blocker, C.L. Borgman, K. Cranmer, M. Crosas, R. Di Stefano, Y. Gil, P. Groth, M. Hedstrom, D.W. Hogg, V. Kashyap, A. Mahabal, A. Siemiginowska, and A. Slavkovic. Ten Simple Rules for the Care and Feeding of Scientific Data. PLOS Computational Biology, 2014, 10:4.
Gordon, B.L., D.T. Winkler, J.D. Barrett, and L.V. Zumpano. The Effect of Elevation and Corner Location on Oceanfront Condominium Value. Journal of Real Estate Research, 2013, 35:3, 345–63.
Gray, J. eScience: A Transformed Scientific Method. In The Fourth Paradigm, A.J.G. Hey, S. Tansley, and K.M. Tolle (eds.). Redmond, WA: Microsoft Research, 2009.
Grubbs, F. Procedures for Detecting Outlying Observations in Samples. Technometrics, 1969, 11:1, 1–21.
——. Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, 1950, 27–58.
Halevy, A. Answering Queries Using Views: A Survey. The Very Large Database Journal, 2001, 10, 270–94.
King, G., J. Honaker, A. Joseph, and K. Scheve. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review, 2001, 95, 49–69.
Little, R. Regression with Missing X’s: A Review. Journal of the American Statistical Association, 1992, 87:420, 1227–37.
Little, R. and D. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2014.
Miller, H. The Data Avalanche is Here; Shouldn’t We be Digging? Journal of Regional Science, 2010, 50:1, 181–201.
Mennis, J. and D. Guo. Spatial Data Mining and Geographic Knowledge Discovery: An Introduction. Computers, Environment and Urban Systems, 2009, 33:6, 403–08.
Munneke, H. Redevelopment Decisions for Commercial and Industrial Properties. Journal of Urban Economics, 1996, 39:2, 229–53.
Nelson, M. Data-Driven Science: A New Paradigm? Educause Review, 2009, 44:4, 6–12.
Pigott, T. A Review of Methods for Missing Data. Educational Research and Evaluation, 2001, 7:4, 353–83.
Pollakowski, H. Data Sources for Measuring House Price Changes. Journal of Housing Research, 1995, 6:3, 377–89.
Rahm, E. and H. Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, 2001, 23:4, 3–13.
Raman, V. and J.M. Hellerstein. Potter’s Wheel: An Interactive Data Cleaning System. Proceedings of the 27th International Conference on Very Large Data Bases. 2001, 381–90.
Randall, S., A. Ferrante, J. Boyd, and J. Semmens. The Effect of Data Cleaning on Record Linkage Quality. BMC Medical Informatics and Decision Making, 2013, 13:1, 64.
Rousseeuw, P. and A. Leroy. Robust Regression and Outlier Detection. New York, NY: Wiley, 1987.
Sandve, G.K., A. Nekrutenko, J. Taylor, and E. Hovig. Ten Simple Rules for Reproducible Computational Research. PLOS Computational Biology, October 23, 2013.
Schwertman, N.C., M.A. Owens, and R. Adnan. A Simple More General Boxplot Method for Identifying Outliers. Computational Statistics & Data Analysis, 2004, 47, 165–74.
Thakare, S., S. Gawali, and others. An Effective and Complete Preprocessing for Web Usage Mining. International Journal on Computer Science and Engineering, 2010, 2:3, 848–51.
Thrall, G. Data Resources for Real Estate and Business Geography Analysis. Journal of Real Estate Literature, 2001, 9:2, 175–225.
Thrall, G. and S. Thrall. Data Resources for Real Estate and Business Geography Market Analysis: A Comprehensive Structured Annotated Bibliography. Version 2.0. Journal of Real Estate Literature, 2011, 19:2, 415–68.
Tukey, J.W. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
Van Buuren, S., H. Boshuizen, D. Knook, and others. Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis. Statistics in Medicine, 1999, 18:6, 681–94.
Van Cutsem, B. and I. Gath. Detection of Outliers and Robust Estimation Using Fuzzy Clustering. Computational Statistics & Data Analysis, 1993, 15:1, 47–61.
Van der Loo, M. The Stringdist Package for Approximate String Matching. The R Journal, 2014, 6:1, 111–22.
Varian, H. Big Data: New Tricks for Econometrics. The Journal of Economic Perspectives, 2014, 3–27.
Weber, S. bacon: An Effective Way to Detect Outliers in Multivariate Data using Stata (and Mata). The Stata Journal, 2010, 10:3, 331–38.
Wickham, H. Tidy Data. Journal of Statistical Software, 2014, 59:10, 1–23.
Wolkovich, E., J. Regetz, and M. O’Connor. Advances in Global Change Research Require Open Science by Individual Researchers. Global Change Biology, 2012, 18:7, 2102–10.
This paper received the award for the Best Paper on Real Estate Education (sponsored by Dearborn Real Estate Education) presented at the American Real Estate Society 2015 Annual Meeting and the Best Refereed Paper at the Pacific Rim Real Estate Society 2016 Annual Meeting.
Andy Krause, University of Melbourne, Parkville, VIC 3010 or firstname.lastname@example.org.
Clifford A. Lipscomb, Greenfield Advisors LLC, Cartersville, GA or email@example.com.