Different Types of Data and the Validity of Democracy Measures

Different measures of democracy rely on different types of data. Some exclusively rely on observational data, others rely on judgement-based data in the form of in-house coded indicators or expert surveys. A third set of democracy measures combines information from indicators based on different types of data, some of them also data from representative surveys of the mass public. This article discusses the advantages and disadvantages of these different types of data for the measurement of electoral and liberal democracy. The discussion is based on the premise that the main priorities must be to establish a high degree of concept-measure consistency, i


Introduction
The construction and use of measures of democracy in social scientific research have increased considerably in recent decades. This makes good sense; without them, the identification of trends in political rights and liberties must be based on rough impressions not allowing for systematic temporal and cross-country comparisons (Bollen, 1992, p. 189). However, such efforts are only valuable if the quality of the data is high in terms of reliability and validity. When we attempt to measure democracy, the identification of empirical indicators that tap into the different aspects of the overarching concept is one of the most important tasks. One can either use extant indicators, collect new data, or combine new indicators with old ones. The main priority must be to establish a high degree of concept-measure consistency, i.e. the extent to which the indicators capture all of the components of the core concept of interest (and only those), and the extent to which they do so in a precise and unbiased manner (Adcock & Collier, 2001; Goertz, 2006; Munck, 2009). In the words of the Office of the High Commissioner for Human Rights (OHCHR, 2012, p. 50): 'An important statistical consideration in identifying and developing human rights indicators, or any set of indicators for that matter, is to ensure their relevance and effectiveness in measuring what they are supposed to measure. This relates to the notion of indicator validity. It refers to the truthfulness of information provided by the estimate or the value of an indicator in capturing the state or condition of an object, event, activity or an outcome for which it is an indicator. Most other statistical and methodological considerations follow from this requirement.'
Among the supplementary (and related) criteria that scholars take into consideration are whether indicators are produced through transparent and replicable data-generating processes, whether they are made publicly available, and whether they have extensive coverage in terms of units (typically countries) and time (typically years). Researchers face numerous tradeoffs when trying to fulfill these criteria.
One of the most important considerations is what type of data the ever-growing industry of measuring democracy, governance, and human rights should rely on (see Arndt & Oman, 2006; Landman & Carvalho, 2009, Chapter 3; OHCHR, 2012; Schedler, 2012; United Nations Development Programme, 2012).
Different measures of democracy are based on different types of data. Four main data types have been used to construct the major democracy measures: observational data (OD), i.e. data on directly observable facts, such as turnout rates or the presence or absence of formal political institutions; 'in-house' coding (IC) by researchers and/or their assistants based on an assessment of country-specific information found in reports, academic works, newspapers, archival material, etc.; expert surveys (ES), where selected country experts provide an evaluation based on their case-specific knowledge; and representative surveys (RS), where a sample of ordinary citizens provide judgements about particular issues. All of these types of sources have different strengths and shortcomings. Even though this is well-known, contrasting views about what kind of data is better still exist. To illustrate, the Office of the United Nations High Commissioner for Human Rights (OHCHR, 2012), which represents the global commitment to universal ideals of human dignity, takes a clear stand in favor of observable data in its widely cited report on human rights measurement. This preference for fact-based quantitative indicators over judgement-based indicators is motivated by an interest in making assessments less subjective and thus more broadly acceptable. According to Cheibub, Gandhi and Vreeland (2010, p. 77), the data required by judgement-based democracy measures 'are hard, if not impossible, to obtain. Consequently, we suspect that these measures entail coding created on the basis of inferences, extensions, and perhaps even guesses' (see also Merkel et al., 2016; Przeworski, Alvarez, Cheibub, & Limongi, 2000; Vanhanen, 2000).
In contrast, the people behind the Worldwide Governance Indicators state that fact-based indicators are insufficient for capturing the realities of governance outcomes on the ground (Kaufman & Kraay, 2008). They therefore consider judgement-based data as a valuable tool. This position is motivated by the assumption that it is virtually impossible to capture the relevant aspects of governance, including democracy, without relying on the judgement of experts, in-house coders, and/or citizens (see also Bowman, Lehoucq, & Mahoney, 2005; Coppedge, Gerring, Lindberg, Skaaning, & Teorell, 2017a, 2017b; Munck, 2009; Schedler, 2012).
To increase the awareness among producers and users of democracy data, it seems pertinent to critically review and supplement the arguments and suggestions in a single article. More particularly, this article discusses the pros and cons of different data types and suggests how to counter some of the potential problems related to the measurement of electoral democracy (i.e. access to government power is determined by competitive and inclusive elections) and liberal democracy (i.e. electoral democracy combined with respect for civil liberties and the rule of law) (see Møller & Skaaning, 2011). The discussion draws on extant as well as suggested indicators to illustrate the tradeoffs. After presenting an overview of what kind of data extant democracy measures are based on, I discuss, for each of the four types of data in turn, the potential advantages and disadvantages regarding reliability and validity together with suggestions to reduce some of the problems. The basic argument of the article is that no type of data is superior to the others in all respects. Researchers should generally pay more attention to different ways of increasing valid measurement, including the combination of different types of data and data from different sources, whenever they construct their measures. It is not reasonable simply to stick to conformist practices and dogmatic doctrines about the general superiority of one type of data.

Extant Democracy Measures: What Kinds of Data Are Used?
Table 1 makes clear that there is considerable variation regarding how many kinds, and which kinds, of data sources the measures build on. This plethora of approaches indicates that it is not obvious what kind of data (or mix of data) one should prefer when trying to measure democracy. For some indicators, it is not easy to say if they are fact-based or judgement-based (more on this below). But if we take the statements of the different data providers as given, the Democracy-Dictatorship dataset (Cheibub et al., 2010) and Vanhanen's (2000) polyarchy measure only use observational data. The first of these measures uses indicators of legislative and executive elections, status of the legislature, opposition parties, and government turnovers to create a dichotomous distinction between democracies and autocracies. The second only uses share of votes cast for the largest party and electoral turnout rates in national elections to capture the level of democracy.
The underlying data of the Bertelsmann Transformation Index (Bertelsmann Stiftung, 2017), the Freedom in the World survey (Freedom House, 2017), and the Perception of Electoral Integrity index (Norris, Frank, & Martínez i Coma, 2014) are all based on expert assessments. The Polity Measure (Marshall, Gurr, & Jaggers, 2016) and the CIRI Human Rights Dataset (Cingranelli & Richards, 2010) solely rely on in-house coded data. The remaining measures included in the overview presented in Table 1 build on more than one kind of data source. The Democracy Barometer dataset (Merkel et al., 2016), the Unified Democracy Scores (Pemstein, Meserve, & Melton, 2010), and the Worldwide Governance Indicators (Kaufman & Kraay, 2017) do not provide original data collection but use extant indicators based on all four kinds of data sources. The Varieties of Democracy (Coppedge et al., 2017b) dataset relies on all types of data apart from representative surveys, and the Democracy Index by the Economist Intelligence Unit (2007) only excludes in-house coded data. Finally, the Lexical Index of Electoral Democracy (Skaaning, Gerring, & Bartusevičius, 2015) combines two kinds of data sources: in-house coded data and observational data. It varies quite a bit from measure to measure whether the different types of data are used to measure the same subcomponents and components of the overall democracy measures.
Disagreements about best practices regarding what kind of data to employ continue to flourish. The great variation not only reflects differences in resources; it also indicates different weighting of the potential problems related to data types. But what are the more specific pros and cons of different data types? How can the data type choice matter for reliability and validity?

Observational Data
Observational data have a high, often preferred standing among users of social science data. In the words of Cheibub et al. (2010, p. 74), 'The reliability of a measure depends on whether knowledge of the rules and the relevant facts is sufficient to unambiguously lead different people to produce identical readings on specific cases'. On this basis, they prefer democracy measures based on directly observable and verifiable indicators rather than subjective and fuzzy indicators. Among the main assets of a fact-based approach to measurement are transparency and a replicable data-generation process, which is generally less susceptible to biases than judgement-based data types (see below). Moreover, observational data often provide scales of the phenomena in question that are both relatively easy to interpret and comparable across countries and over time (OHCHR, 2012).
However, the assumptions underlying this preference are criticized for being unrealistic. According to Schedler (2012, p. 28), the collection and use of nonjudgmental data in the social sciences rests on two conditions: '(1) transparent empirical phenomena whose observation do not depend on our judgmental faculties and (2) complete public records on those phenomena'. When we want to measure democracy, neither of these conditions is met. Not all aspects of democracy are easily observable and, relatedly, official statistics do not capture many relevant features in meaningful ways. Readily observable empirical information is often incomplete, inconsistent, or insufficient. 'Some empirical phenomena we cannot observe in principle, others we cannot observe in practice' (Schedler, 2012, p. 28).
A particular problem emerges when measuring democracy by examining the official (formal) laws of the land, first and foremost the constitution: There is often a large discrepancy between what appears on the books and what is practiced on the ground. Informal rules and traditions are often more important than formal regulations. To illustrate this point with an extreme example, the Soviet 1936 constitution (also known as the Stalin Constitution) promised free and fair elections and respect for civil liberties on top of social and economic rights. In practice, however, the political regime was totalitarian (see Linz, 2000), including a level of state repression that has hardly been matched by any other political regime in world history.
This problem refers to more than discrepancies between de jure and de facto regulations. For example, OHCHR (2012, p. 97) has suggested using reported cases of killing, disappearances, detention, and torture against journalists to measure freedom of opinion. This could be a relevant indicator but has two significant shortcomings. On the one hand, there is likely to be a reporting bias because reliable information is often not readily available (see Fariss, 2014; McNitt, 1988, pp. 94-99; Weidmann, 2016). The perpetrators normally have a clear interest in keeping the correct number secret and it is often difficult to know why a particular journalist has disappeared or died, or whether they were imprisoned due to a legitimate use of their freedom of expression or some other reasons. On the other hand, anticipated sanctions often lead to self-censorship. Journalists are rarely killed in North Korea (as far as we know), because they know that criticizing the government would have dire consequences. These problems would apply to similar attempts at capturing respect for liberal rights and adherence to the rule of law by the (exclusive) use of observational indicators.
Among the attempts to measure democracy using observational data, we find the democracy-dictatorship dataset (Cheibub et al., 2010; Przeworski et al., 2000). Its reliance on the rule of electoral government turnover to determine whether elections have been free suffers from two problems. The first is the so-called Botswana problem: a government appears to be continuously reelected through free elections, so that Botswana (and other such cases) does not fulfill the turnover criterion, which requires that an alternation in government power has taken place under electoral rules identical to those that brought the incumbent into power. Second, the turnover criterion is implemented in a way that could introduce further problems. The coding rule says that a government turnover implies that a particular regime is coded as democratic all the way back to when the previous government took power, provided that the electoral rules are identical and that the case also fulfilled the other criteria for democracy in the period. However, a judgement call is sometimes needed to determine what counts as electoral rules and what counts as a relevant change to these rules (Knutsen & Wig, 2015, p. 909). The freeness and fairness of elections could also improve significantly (from no uncertainty to significant uncertainty about the outcome) under the same (formal) electoral rules. This applies, among other cases, to the Dominican Republic between 1966 and 2002. In this period, election outcomes varied greatly and, according to comprehensive case studies, did not meet the minimum threshold for electoral democracy before 1978 (Marsteintredet, 2009, Chapter 4). In other cases, government turnover merely signifies that the ruling coalition is split and no longer controls sufficient means to stay in power, a situation that the opposition exploits to gain power through manipulated elections. This problem applies to, for example, the change from conservative to liberal hegemonic rule in Colombia (1930-1931) and President Kurmanbek Bakiyev's rise to power in connection with the Tulip Revolution in Kyrgyzstan. Hence, government turnover often provides a strong and relevant indication of free electoral competition, but it is not unproblematic and indisputable evidence (see Bogaards, 2007; Boix, Miller, & Rosato, 2013; Skaaning et al., 2015).
Another well-known example is Vanhanen's (2000) use of voter turnout and the vote share of the largest party in order to capture different degrees of democracy. These indicators tend to fail to tap all of the relevant aspects of democracy, however, such as the degree of freedom of expression and the power of the parliament, while capturing things that do not directly reflect the level of democracy, such as mandatory voting, dissatisfaction with the government, and the weather on voting day (Bollen, 1990, pp. 8, 15). Both the official statistics on turnout rates and vote share could also be unreliable, either because the data has been manipulated or because the data providers have been unable to collect all of the relevant information and aggregate it correctly. Some governments simply do not have the capacity to collect and handle the relevant information, which leads to missingness or flawed estimates. Other governments and/or their agents have strong incentives (and few constraints) to manipulate official data in order to misinform their own citizens and foreign governments and organizations (Herrera & Kapur, 2007).
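The sensitivity of a two-indicator measure to factors such as mandatory voting can be illustrated with a minimal sketch. It assumes the commonly cited multiplicative form of Vanhanen's index (competition as 100 minus the largest party's vote share, participation as turnout, with the product divided by 100); this formula is not spelled out in the present article, and the function name and input figures are hypothetical.

```python
def index_of_democratization(largest_party_share: float, turnout: float) -> float:
    """Combine two observational indicators into a single score.

    Assumed form (not taken from this article):
      competition   = 100 - largest party's vote share (percent)
      participation = turnout (percent)
      index         = competition * participation / 100
    """
    for name, value in (("largest_party_share", largest_party_share),
                        ("turnout", turnout)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be a percentage between 0 and 100")
    competition = 100.0 - largest_party_share
    return competition * turnout / 100.0


# Two hypothetical countries with identical electoral competition.
# Only the turnout differs, e.g. because the second has mandatory voting.
voluntary = index_of_democratization(largest_party_share=40.0, turnout=55.0)
mandatory = index_of_democratization(largest_party_share=40.0, turnout=90.0)
print(voluntary, mandatory)
```

Under these assumptions the second country scores markedly higher although nothing about the freedom or fairness of its elections differs, which is the kind of validity problem raised above.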
Both of these circumstances can seriously reduce the availability and quality of data that could be relevant for measuring democracy, since such data tend to be politically sensitive. Even in the case of so-called hard economic data (e.g. GDP per capita and trade), where governments and international organizations invest extensive resources in the collection of information and calculate the figures used by countless social scientists, there are remarkable problems regarding reliability and validity (Jerven, 2013; Kerner, 2014). There is therefore good reason to refrain from buying into the claim that fact-based data are always more informative and less biased than judgement-based data (see, e.g., Bollen, 1992; Coppedge et al., 2017a, 2017b; Kaufman & Kraay, 2008; Schedler, 2012). Public statistical information and other types of observable data can be useful for measuring democracy, but directly observable indicators do not capture all aspects of democracy well.

In-House Coded Data
One way of overcoming problems related to the lack of good observable indicators is to base scores on different kinds of relevant information found in diverse sources providing country-specific information, such as newspapers, election observation reports, human rights reports, and academic works. The construction of in-house coded data normally follows a particular procedure: relevant information is gathered, after which a coder evaluates the evidence on one or more particular issues and translates the evaluation into a score based on more or less explicit and precise standards. Note, furthermore, that in the case of in-house coding, the coders are not experts on all of the (many) countries (and maybe also not the substantive areas) they assign scores to.
In-house coded data has three major advantages. First, it can be used to capture important traits that are largely undetectable by observational data (Bollen, 1993, p. 1210; Hadenius & Teorell, 2005, pp. 14-15; Mainwaring, Brinks, & Pérez-Liñán, 2001, p. 61; Munck & Verkuilen, 2002, p. 18). In many cases, bits and pieces of evidence can be put together to create a more general understanding of the actual respect for different democratic rights. On this basis, raters can make an informed estimate of the extent of, say, electoral contestation, freedom of expression, and fair trials, which would otherwise be very difficult to capture in a nuanced manner.
Another positive feature of in-house coded data is that the centralized assignment of scores by one or a few selected coders, ceteris paribus, generally makes for a higher degree of consistency when applying coding criteria. The understandings of concepts and scales will simply be more uniform compared to (more 'decentralized') expert surveys and public opinion surveys. In other words, in-house coding facilitates similar applications of standards across countries, especially if the number of coders is low and they are carefully trained and supervised. The use of multiple coders and inter-coder reliability tests are valuable tools to assess whether the assumptions about consensus among coders are met, i.e. whether there is consistency in the estimate if the data-generating procedure is repeated by the same or different coders (see Gwet, 2014).
The third potential advantage is that in-house coding facilitates standardized and detailed documentation of why particular observations are assigned certain values. Detailed documentation of the motivation behind the particular scores can obviously be very time-consuming, which is probably why it is not provided in connection to any of the democracy measures based on in-house coding.
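The inter-coder reliability tests mentioned above can be illustrated with a minimal sketch of Cohen's kappa, one standard chance-corrected agreement statistic for two coders. The article does not say which statistic any given project uses, so the choice of kappa, the function name, and the scores below are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same cases."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must score the same non-empty set of cases")
    n = len(coder_a)
    # Share of cases on which the two coders assign the same score.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal distribution.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both coders use one identical category;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (0-2 scale) assigned by two coders to eight country-years.
scores_a = [0, 1, 1, 2, 2, 0, 1, 2]
scores_b = [0, 1, 2, 2, 2, 0, 1, 1]
kappa = cohens_kappa(scores_a, scores_b)
print(round(kappa, 3))
```

A value near 1 indicates that the coders apply the coding criteria almost identically, while a value near 0 indicates agreement no better than chance. For more than two coders or ordinal scales, other coefficients discussed in the reliability literature would be more appropriate.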
There are other reasons for hesitating before accepting values derived from in-house coding. The use of in-house coded data (and judgement-based data more generally) is sometimes rejected with reference to its subjective nature. In contrast to genuine subjective measures, however, such as data on public attitudes, 'they are not supposed to be subjective, but intersubjective: grounded in public facts and public reasons, defensible in the face of critique' (Schedler, 2012, p. 24). Despite this well-taken qualification, coder-specific biases can still influence the scores in different stages of the coding process (Bollen & Paxton, 1998, 2000): 'First, differential use of sources of information, combined with the filtering of information across the world, could lead to specific judge-centered method factors. Second, judges can process the information available to them in such a way as to differentially weight relevant events or to include irrelevant factors. Finally, the methods of constructing a measure might introduce method effects' (Bollen & Paxton, 2000, p. 64).
In-house coders do not have expert knowledge of all of the countries they code. They must therefore rely on secondary sources, which obviously differ with respect to availability and relevance. Systematic distortion of information is likely as it makes its way from the actual practices and events to the sources of information used by the coders. Accessible data can be ordered according to its informative value. The best situation would be for all relevant information to be available, but this is unrealistic. The following ordering of information therefore applies: recorded, accessible, locally reported, and internationally/foreign reported (Bollen, 1992, pp. 198-199). Movement from the former to the latter resembles a filtering process where some information passes through and some does not.
This process is likely to introduce biases. Filters often tend to be selective in non-random fashions, meaning that the information is neither complete nor representative (Foweraker & Krznaric, 2000, p. 766; Milner, Poe, & Leblang, 1999, p. 420). This is due to differences in the openness of countries, how much international attention they receive (influenced by size, language, etc.), ideological preferences of the media, specific agendas of scholarly works and reports, and so forth. While most of the providers of original in-house coding (LIED, Polity, V-Dem) use multiple sources (which are generally unspecified), CIRI makes use only of the Country Reports on Human Rights Practices issued by the US State Department. This fact means that the validity to a very high degree depends on the representativeness and impartiality of a single source, which has been accused of being biased, especially in the early releases (see Innes, 1992; Poe, Carey, & Vazquez, 2001; Qian & Yanagizawa, 2009).
In the next step, raters can introduce random and systematic measurement errors by interpreting the sources differently, either because they base their evaluation on different pieces of relevant or irrelevant information, because they weight the same evidence differently, or because they have different understandings of concepts and scales guiding the coding process. According to Raworth (2001, p. 114), 'The identity of the individuals giving the ratings is inevitably open to questioning'.
Differences in the specific coding processes can also influence the scores. Raters can assign scores to many or few countries (and different groups of countries); they can finalize scores immediately or go back and revise some of them; they can code everything between one year and hundreds of consecutive years at a time; and they can work on the coding in a relatively short but intensive period or carry out the task over a longer, less intensive period. All of these factors will tend to influence the implicit reference points in the minds of coders and thus have an impact on the scores. The ability of in-house coded data to capture latent regime features in a consistent way is promising, while biases introduced in the coding process and the lack of comprehensive case knowledge are among the potential downsides of this kind of data.

Expert Survey Data
Expert survey data is generated through assessments of the fulfilment of democratic rights with the help of informed experts, often scholars or other persons working in related fields and intimately acquainted with the subject matter, such as journalists or leading members of NGOs. The main advantage of expert surveys compared to in-house coded data is precisely this case knowledge. The experts presumably know the relevant context and details about the issues in question (Marquardt et al., 2017). If their knowledge is insufficient, they have a superior background for finding relevant information. Experts may even have sufficient contextual knowledge to provide a plausible estimate if there is limited available evidence in terms of written sources directly tapping into a particular phenomenon. Original expert surveys are part of BTI, EIU, FH, PEI, and V-Dem; the first three only use one expert per country, while PEI and V-Dem use multiple experts per country (Coppedge et al., 2017a, p. 8). V-Dem even divides its survey into different categories, and to some degree enlists different experts to fill out different parts of the overall survey for each country (Coppedge et al., 2017b).
The potential problems identified in relation to in-house coded data also apply to expert surveys. The filtering of information might not be as big a problem due to the case expertise. However, the selection and weighting of evidence and the coding process will differ somewhat from expert to expert, partly depending on personal factors, such as updated and relevant familiarity with the cases, political leaning, job situation, and work effort (Bollen & Paxton, 1998). Expert knowledge varies and is sometimes inadequate, and the experts often lack strong incentives to enlist and spend much time doing a serious coding job, including searches for additional information. Furthermore, limited and differentiated knowledge leaves room for the so-called 'halo effect,' which is the tendency for a good (or bad) impression of performance in one area to influence opinion regarding other areas (Sequeira, 2012). These circumstances draw attention to the three-fold challenge related to the recruitment of experts. The experts should preferably be highly knowledgeable, unbiased, and ready to do a careful job. However, the enrolled experts are rarely the best possible according to these criteria.
Experts are also more prone to apply different coding criteria than in-house coders because expert surveys are mostly carried out as decentralized coding without prior training, meaning that the basic understanding of concepts and scales can vary greatly (see Martinez i Coma & Van Ham, 2015; Steenbergen & Marks, 2007). BTI, EIU, and FH combine their expert assessments with review and deliberation across a team of in-house analysts. For good reason, this approach is assumed to increase cross-country consistency. The procedures are not transparent, however, since for none of the cases is it made public which changes are introduced to the original expert-based values and why.
V-Dem has a different approach to increase the comparability and reduce the influence of potential biases. A complex Bayesian IRT measurement model uses information about agreement across coders, self-assigned uncertainty estimates by the experts about their own ratings, personal coder characteristics (extracted from a post-questionnaire survey), links between countries based on experts assessing more than one country (either for all years or one year), and vignettes related to the survey questions in order to align the experts' thresholds (see Pemstein et al., 2017). This procedure also supplements the scores with a systematic assessment of measurement uncertainty. This is also done for PEI but only based on the degree of expert agreement.
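To make the idea of agreement-based uncertainty concrete, a minimal sketch follows. It is neither the V-Dem measurement model nor a documented PEI procedure; it simply assumes that several experts rate the same country-year on a common scale and summarizes their disagreement as a standard error. The function name and ratings are hypothetical.

```python
import math
import statistics


def agreement_based_estimate(ratings):
    """Return the mean rating and its standard error across experts.

    A larger standard error signals more disagreement (or fewer experts)
    and hence more uncertainty about the point estimate.
    """
    if len(ratings) < 2:
        raise ValueError("at least two expert ratings are needed")
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, std_error


# Hypothetical ratings by four experts for one country-year (1-5 scale).
point_estimate, uncertainty = agreement_based_estimate([3, 4, 4, 5])
print(point_estimate, round(uncertainty, 3))
```

Unlike a model that also adjusts for differences in how experts use the scale, this summary treats all experts as interchangeable, so it captures disagreement but cannot correct for systematically strict or lenient raters.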
The documentation of the justifications for the scores is desirable, just as in the case of in-house coding. Even though it is usually impossible for experts 'to relate the numerical conclusions they reach to the precise pieces and bits of information that have gone into them…they should be able to document the big picture [and] describe the range of uncertainty and controversy regarding their judgmental decisions with reference to concrete documentary evidence (or the lack of such evidence)' (Schedler, 2012, p. 32). The extra workload for the experts and coordinators to provide and standardize the information makes this procedure very resource-demanding. Nonetheless, BTI and FH complement their scores with relatively detailed country reports, meaning that one can get an impression of what events and circumstances have influenced the scores for different aspects of democracy (but they do not provide adequate references to the material on which the reports are based).
In sum, the comparative advantage of expert surveys comes to the fore in situations of incomplete or inconsistent information, where contextual knowledge can be used to bridge informational gaps (Schedler, 2012, p. 28). However, the reliance on the personal judgements of a few experts means that the data might lack comparability and might be affected by different kinds of biases.

Representative Survey Data
The final type of data, representative surveys of the general population, brings the knowledge and opinions of ordinary citizens into play. Mayne and Geissel (2016) argue in favor of including a citizen component in the measurement of democratic quality. It should capture the citizens' democratic commitments, political capacities, and political participation. This perspective, however, seems more relevant for the measurement of deliberative and participatory democracy than electoral and liberal democracy. In connection to these more limited understandings of democracy, the suggested additions are better understood as possible causes or consequences of democracy. Pickel, Breustedt and Smolka (2016) also advocate for the inclusion of representative survey data in the measurement of democratic quality. They propose that citizen evaluations of democratic performance should complement other types of data.
For some purposes, representative surveys can provide valuable information. Respondents can function as 'everyday experts' on issues that are otherwise hard to get firm knowledge about. A case in point is petty corruption, where the experiences of citizens with having to pay bribes could be a superior source of information (see Naval, Walter, & de Miguel, 2008; Razafindrakoto & Roubaud, 2010). Another would be information about whether citizens experience or participate in political violence (see Bhavnani & Backer, 2007).
However, there are also noteworthy problems associated with the use of data from representative surveys to measure democracy. Most citizens lack nuanced knowledge about the general dynamics and performance of particular political institutions. Gut feelings and personal opinions are thus likely to influence the scores. Most drawbacks of judgement-based data apply more strongly to representative survey data than to in-house coded data and data based on expert surveys (cf. Marquardt et al., 2017). Experts and in-house coders generally have better backgrounds for carrying out such assessments. They generally possess a broader knowledge regarding the political history of other countries and data collection procedures, a higher degree of shared understanding about the meaning of particular concepts, and a strong scientific ethos (or at least an interest in maintaining their academic credibility). This implies that individual biases and dissimilar standards (both within and across countries) in the interpretation of questions and scales are more pronounced among ordinary citizens. Ordinary citizens also tend to be more susceptible to collective cultural biases (nation-wide inclinations), and the respondents in representative surveys are very unlikely to provide any form of systematic reasoning for their entries. Ordinary citizens might also be afraid to share their experiences or express their honest opinion, especially in the case of an oppressive regime (Tannenberg, 2017).
Does this mean that we should generally refrain from using representative surveys to measure democracy at all? Ordinary citizens might possess valuable knowledge based on their real-life experiences that could supplement that of experts. Here, it seems pertinent to distinguish between experience-based questions and perception-based questions. The former ask citizens about their own experiences regarding particular situations (e.g. how often they have been asked to pay a bribe or been subjected to violent assaults in the previous year). The latter are typically more abstract questions, asking about the lay of the land regarding democracy, civil liberties, corruption, etc.
The experience-based questions have greater potential for providing relevant information than the perception-based questions, which are likely to produce unreliable and biased democracy indicators. Combined with the relatively low coverage in terms of years and countries,11 it is therefore inadvisable to use perception-based data from representative surveys for democracy measurement. None of the evaluated measures are based on original data collection using this approach, but DB, EIU, UDS, and WGI rely on such data, either directly or indirectly (by including composite measures that use them).

Discussion
There are several ways of countering the disadvantages identified above. In relation to in-house coded and expert survey data, the documentation should ideally provide answers to the following questions: What evidence has been used and why? And how has the evidence been weighed and processed and why? That is, the criteria for identifying and selecting relevant sources and the criteria for extracting and using relevant information must be pinned down. This work can be done to different degrees of perfection, up to the point where every score is supplemented with a nuanced description of the evidence (using active citation; see Moravcsik, 2014), an account of how and why it has been weighted in certain ways (with relevance as the main criterion; see Bowman et al., 2005; Lustik, 1996; Møller & Skaaning, 2017), and a record of who has been involved, and how, in the data collection and processing (Schedler, 2012, p. 33).
Inconsistency and personal biases can be reduced by the construction and application of specific and justified definitions of what one attempts to measure and the scales used to distinguish between different levels of fulfilment. The clarification should preferably be presented as precisely as possible and linked to concrete (maybe even paradigmatic) examples. This would support the establishment of shared anchors for the assignment of values. Another useful tool is to reduce conceptual complexity through disaggregation. This would imply the coding of more concrete issues than just freedom of expression, including media censorship, freedom of private discussion, harassment of journalists, and monopoly of news media.
Other factors, such as the exposure of coders to extensive relevant variation, can also improve the consistency. As a rule of thumb, they are more likely to employ similar standards across and within cases when the following conditions are fulfilled: The coders assign scores to a diverse set of many countries; they are willing and allowed to revise scores; they score long time series; and they score the cases within a relatively short period.
If in-house coders or experts score the same cases, formal measurement models can produce replicable point estimates and estimates of uncertainty.12 One should note, however, that whereas it will almost always be good to increase the number of in-house coders (although there will be a diminishing return), more is not always better in the recruitment of expert coders because there will be a rather limited number of people with high levels of relevant expertise. Moreover, an increase in the number of coders will increase the costs attached to the data collection, thereby emphasizing the latent tradeoff between high quality data and coverage.13
Formal measurement models can also be used to combine data from different datasets based on different data-generating approaches (i.e. observational data, in-house coded data, expert survey data, and/or representative survey data) (Bollen & Paxton, 2000, p. 79). The advantage of such composite measures is their utilization of information from several variables. The combination of information from different data types can increase the ability to capture related, but distinct, aspects of the variable in question. In addition, it can reduce the impact of idiosyncratic measurement errors associated with individual indicators. The use of multiple indicators for the same phenomenon also facilitates an assessment of how precise the point estimates are through the construction of confidence levels (see Fariss, 2014; Pemstein et al., 2010). This integrative approach is used (in full or in part) to construct several of the democracy measures (see Table 1). By reducing some problems, however, it risks introducing or increasing others. The integration can lead to an accumulation of the problems associated with the individual indicators rather than resolving them. Moreover, the products tend to be more complex. This means that the relationship between measures and the concepts they should capture becomes more blurred.
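The basic logic of turning several coders' scores for the same cases into point estimates with uncertainty can be illustrated with a deliberately simple sketch. It is not the Bayesian latent variable approach of the cited studies; it merely assumes an additive model in which each score equals the true level plus a coder-specific offset plus noise, and that every coder scores every case. The ratings, the coder and case names, and the function estimate_scores are all invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings on a 0-4 scale: coder -> {case: score}.
# All values are invented; coder_b is systematically more generous.
ratings = {
    "coder_a": {"X": 3.0, "Y": 1.0, "Z": 2.0},
    "coder_b": {"X": 4.0, "Y": 2.0, "Z": 3.0},
    "coder_c": {"X": 3.5, "Y": 1.0, "Z": 2.5},
}

def estimate_scores(ratings):
    """Return {case: (point_estimate, standard_error)}.

    Assumption: score = true level + coder offset + noise, with all
    coders scoring all cases (a balanced design).
    """
    grand_mean = mean(s for r in ratings.values() for s in r.values())
    # Coder offset: distance of a coder's average from the grand mean.
    offsets = {c: mean(r.values()) - grand_mean for c, r in ratings.items()}
    cases = sorted({case for r in ratings.values() for case in r})
    estimates = {}
    for case in cases:
        # Remove each coder's offset before averaging across coders.
        adjusted = [r[case] - offsets[c] for c, r in ratings.items()]
        # Naive standard error; it ignores uncertainty in the offsets.
        se = stdev(adjusted) / sqrt(len(adjusted))
        estimates[case] = (mean(adjusted), se)
    return estimates

estimates = estimate_scores(ratings)
for case, (point, se) in estimates.items():
    print(f"{case}: {point:.2f} (SE {se:.2f})")
```

In this toy example, removing the coder offsets leaves the point estimates equal to the raw averages but shrinks the standard errors, because part of the disagreement between coders reflects stable differences in strictness rather than noise. Full measurement models additionally estimate coder reliability and propagate all sources of uncertainty.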
Extant democracy measures build on different kinds of data; some only employ in-house coded data, expert survey data, or observable indicators, while others use different combinations of two or more of these types and representative survey data. The identification of the pros and cons typically associated with the respective data types has demonstrated that the different methodological choices about this issue matter for the reliability and validity of democracy measures. Table 2 summarizes some of the most important strengths and weaknesses typically associated with the different data types. Some of the similarities and dissimilarities follow the overall fact-based and judgement-based distinction, while others do not. The overview reveals that the pros and cons associated with the respective kinds of data are not simply mirror images of each other.
The discussion reveals the simplicity of the bullet points in Table 2. They neglect the many nuances of coding rules and processes that can influence the quality of the data. The comparative advantages and disadvantages of the different data types vary in both kind and degree. The reliability and validity depend on the particular procedures used in the data generating process and the aspects of democracy one attempts to capture.
The discussion has also revealed that no type of data is superior to all of the others in all respects when it comes to measuring the fulfilment of democratic rights.
Hence, the arguments presented have challenged what many consider conventional wisdom, namely the general superiority of one kind of data: directly observable (fact-based) data. Actually, this belief tends to be a dogmatic doctrine resting on invalid assumptions. As neatly summarized by Schedler (2012, p. 21), 'Banning judgment from measurement is neither a feasible methodological imperative nor a desirable one', and:
If we were to renounce our judgmental faculties in the measurement of regime properties and regime dynamics, we would have to renounce the measurement of most of the most interesting regime properties and regime dynamics. If we truly had expelled judgment from data development, quantitative research on political regimes could not have blossomed as it has over the past decades. (Schedler, 2012, p. 33)
This point applies more to the measurement of thicker understandings of democracy, such as liberal democracy. Respect for civil liberties and adherence to the rule of law tend to be even harder to capture without judgement-based indicators than narrow electoral criteria (regular, inclusive, and competitive elections). Considerable effort has already been invested in improving democracy measures. More can still be done to increase the reliability and validity, however, and greater awareness about these issues among data users is required.