Automated Journalism: A Meta-Analysis of Readers’ Perceptions of Human-Written in Comparison to Automated News

This meta-analysis summarizes evidence on how readers perceive the credibility, quality, and readability of automated news in comparison to human-written news. Overall, the results, which are based on experimental and descriptive evidence from 12 studies with a total of 4,473 participants, showed no difference in readers’ perceptions of credibility, a small advantage for human-written news in terms of quality, and a huge advantage for human-written news with respect to readability. Experimental comparisons further suggest that participants provided higher ratings for credibility, quality, and readability simply when they were told that they were reading a human-written article. These findings may lead news organizations to refrain from disclosing that a story was automatically generated, and thus underscore ethical challenges that arise from automated journalism.


Introduction
Automated journalism, also known as algorithmic journalism (Dörr, 2016) or robot journalism (Clerwall, 2014), refers to the process by which algorithms are used to automatically generate news stories from structured, machine-readable data (Graefe, 2016).
The idea of news automation is not new. Half a century ago, Glahn (1970) described a process for automatically generating what he called "computer-produced worded weather forecasts." His idea was to create prewritten statements that describe different weather conditions, each of which corresponds to a particular output of a weather forecasting model (e.g., a combination of wind speed, precipitation, and temperature). This process is similar to today's template-based solutions offered by software providers, in which a set of predefined rules determines which prewritten statements are selected to create a story (Graefe, 2016).
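The rule-based selection of prewritten statements described above can be illustrated with a minimal sketch. All thresholds, function names, and wording here are invented for illustration and do not reflect any particular provider's system:

```python
# Minimal sketch of template-based text generation as described above.
# All rule thresholds and prewritten statements are invented for illustration.

def weather_sentence(wind_kmh: float, precip_mm: float, temp_c: float) -> str:
    """Select prewritten statements based on the output of a forecasting model."""
    # Each rule maps a range of model output to a prewritten statement.
    if precip_mm > 5:
        sky = "Expect heavy rain"
    elif precip_mm > 0:
        sky = "Expect light showers"
    else:
        sky = "Expect dry conditions"

    wind = "with strong winds" if wind_kmh > 30 else "with light winds"
    return f"{sky} {wind} and a high of {temp_c:.0f} °C."

print(weather_sentence(wind_kmh=35, precip_mm=0.0, temp_c=22))
# → Expect dry conditions with strong winds and a high of 22 °C.
```

Real systems combine many more rules and variables, but the underlying principle of mapping structured data to prewritten text fragments is the same.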
Another domain with a long history of automated text generation is financial news. In 2014, when the Associated Press gained much public attention for its decision to automate earnings reports (White, 2015), Thomson Financial (today part of Thomson Reuters) had already been automating such stories for nearly a decade (van Duyn, 2006).
It is no coincidence that weather and finance were the first domains to adopt news automation. In both domains, structured data, a prerequisite for news automation (Graefe, 2016), are readily available. Furthermore, data quality is high in these applications: weather data are measured through sensors with relatively low measurement error, and the accuracy of company earnings or stock prices is critical for consumers of financial data.
What is new is the increasing abundance of structured and machine-readable data in many other domains. Governments are launching open data initiatives, sensors are constantly tracking environmental or health data, and users are leaving traces with virtually anything they do online. Such data can be used to generate automated news stories and thus serve as one of the technology's major drivers. Another important driver is economic pressure: News organizations need to save costs, increase news quantity (e.g., covering niche topics), and reach new target audiences (Graefe, 2016).
The promises of automation for increasing efficiency are manifold. As outlined by Graefe (2016), automating routine tasks has the potential to save resources and thus leave journalists more time for more important work, such as fact-checking or investigative reporting. Furthermore, automation can speed up news production, essentially enabling publication as soon as the underlying data become available. Finally, algorithms tend to make fewer errors than human journalists and can personalize stories towards readers' individual needs, if necessary in multiple languages.
Nevertheless, Dörr (2016) found news automation to be in an early market expansion phase at best. This situation does not seem to have changed much over the past four years. Providers of automated text generation still list few media organizations as their clients, although this may be due to commercial confidentiality. That said, it is difficult to find routine text automation in high-profile publications, apart from frequently cited one-off or experimental projects such as Heliograf (The Washington Post) or ReporterMate (the Australian edition of The Guardian). Other major publications such as The New York Times have stated that they are not planning to automatically generate news, despite having experimented with automation technology to personalize newsletters or moderate readers' comments (Peiser, 2019).
One reason why news organizations refrain from using the technology, despite its economic potential, may be concern that their readers would disapprove of automated news. According to the Modality-Agency-Interactivity-Navigability (MAIN) model (Sundar, 2008), readers may hold conflicting perceptions of automatically generated news. On the one hand, they may prefer human-written articles because they regard journalists as subject-matter experts (authority heuristic), or because they feel that they are communicating with a human rather than a machine (social presence heuristic). On the other hand, the machine heuristic suggests that readers regard automated news as free of ideological bias and thus more objective.
To answer such questions, researchers in different countries have conducted experimental studies to analyze how readers perceive automated news in comparison to human-written news. While sharing the common goal to better understand readers' perceptions of automated news, these studies often differed in their design.
For example, some studies showed readers the same text and manipulated the byline as either written by a human or by an algorithm, whereas others revealed the true source of the articles. Yet another group of studies asked participants to rate either a human-written or an automated text, without revealing any information about who wrote the article.
The present meta-analysis summarizes available evidence to date on readers' perceptions of automated news, drawing on 11 articles published in peer-reviewed journals between 2017 and 2020. Our goal is to give readers quick and easy access to prior knowledge. We provide an overview of the countries, domains, and topics for which evidence is available, of the designs used to study the problem, and of how researchers recruited study participants. More importantly, we provide effect sizes aggregated across studies, while distinguishing between descriptive and experimental evidence as well as between effects that can be attributed to the article source (i.e., the author) and the message itself.

Article Search
We included only studies published in scientific peer-reviewed journals in the English language. Studies had to provide experimental evidence on readers' perceptions of human-written news in comparison to automated news with respect to credibility, quality, and readability. These are three of the four constructs that Sundar (1999) identified as central when people evaluate news content (the fourth one, representativeness, was omitted as it applies to news sections rather than single articles).
Our Google Scholar search for [('automated journalism' OR 'robot journalism') AND experiment AND perception] in October 2019 yielded 211 articles. After reading the title and abstract of each article, 34 articles were identified as potentially relevant and were thus read in full by at least one of the authors. The articles by Jia (2020) and Tandoc, Yao, and Wu (2020) were added later. A total of 11 articles matched the inclusion criteria outlined above. Table 1 lists the 11 articles included in our meta-analysis. Three articles were published in Digital Journalism, and two each in Journalism and Computers in Human Behavior. The remaining four articles appeared in four different journals, namely International Journal of Communication, Journalism Practice, Journalism & Mass Communication Quarterly, and IEEE Access.

Studies
We only included studies with a particular study design in our meta-analysis. These studies presented recipients with a short news story in which the actual author (journalist or algorithm), the attributed author (journalist or algorithm), or both were experimentally manipulated. Recipients then rated the article they had just read on at least one of the dimensions of credibility, quality, and readability.
Given that we were interested in readers' perceptions of human-written vs. automated news, we excluded experiments that used journalists as recipients (e.g., Jung, Song, Kim, Im, & Oh, 2017, Experiment 2) or analyzed hybrids of human-written and automated news (e.g., Waddell, 2019a). We also excluded studies that did not report effect sizes (e.g., Clerwall, 2014) or used a different experimental setup (e.g., Haim & Graefe, 2017, Experiment 2).
We ended up with 12 studies included in the 11 articles (cf. Table 1).

Coding
For each experimental study, one author coded characteristics of the study participants, the stimulus material, the experimental design, and the study results (cf. Table 1). If the coder was uncertain about a particular coding, the issue was resolved through discussion with the second author. The coding sheet is available at the Harvard Dataverse (Graefe, 2020).

Participants
We coded the number of participants, the age and gender distribution, the country/region participants came from, and how participants were recruited. Across the 12 experiments, a total of 4,473 people participated, of whom 50% were female. The average age was 36 years. Participants were from the USA (all of whom were recruited through Amazon Mechanical Turk), Germany (recruited through the Sosci Panel administered by the German Communication Association), South Korea, China, Singapore, and other European countries.

Stimulus
We coded the domain of the news article, the article topic, and the article language. Sports news were used most often (eight studies), followed by financial (six) and political (four) news. Two studies focused on breaking news (earthquake alerts), and one study each used texts from the domains of entertainment and other news. Six experiments used articles written in English, two each in German and Korean, and one each in Finnish and Chinese. Table 1 shows the design of each study, particularly regarding our key variables of interest, namely who the actual author of the article was and who was declared as the author (author attribution). Table 1 also lists additional experimental manipulations where applicable.

Outcome Variables
Across the 12 experiments, credibility was measured most often (nine times), followed by quality (eight times) and readability (five times). While the specific operationalization of the three constructs varied somewhat across studies, the measures were intended to capture the same basic constructs. Eight of the 12 experiments reported effect sizes on a 5-point scale, three used a 7-point scale, and one used a 10-point scale. For each outcome variable, we coded mean ratings and standard errors and/or standard deviations.

Effect Size Calculation
For each experimental comparison of human-written and automated news, we calculated Cohen's d, the standardized mean difference between the two groups:

d = (M_HW − M_A) / SD_pooled,

where M_HW refers to participants' mean rating of human-written news, M_A refers to the mean rating of automated news, and SD_pooled is the pooled standard deviation of the two groups (Cohen, 1988). Hence, positive values of d imply that human-written news were rated better than automated news, and vice versa. Meta-analysis effect sizes were calculated as averages across the d values of the available studies, weighted by the inverse of their variance. When referring to magnitudes of effect sizes, we adopted the descriptors suggested by Cohen (1988) and refined by Sawilowsky (2009).
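The two calculations described above, Cohen's d with a pooled standard deviation and the inverse-variance weighted average, can be sketched as follows. This is a minimal illustration of the standard formulas, not the authors' actual analysis code, and the example numbers are invented:

```python
import math

def cohens_d(mean_hw: float, sd_hw: float, n_hw: int,
             mean_a: float, sd_a: float, n_a: int) -> float:
    """Standardized mean difference between human-written (HW) and automated (A) ratings."""
    pooled_sd = math.sqrt(((n_hw - 1) * sd_hw**2 + (n_a - 1) * sd_a**2)
                          / (n_hw + n_a - 2))
    return (mean_hw - mean_a) / pooled_sd  # positive d favors human-written news

def weighted_mean_d(ds, variances):
    """Inverse-variance weighted average effect size across studies."""
    weights = [1 / v for v in variances]
    return sum(w * d for w, d in zip(weights, ds)) / sum(weights)

# Invented example: two hypothetical studies rating the same construct
d1 = cohens_d(3.8, 0.9, 100, 3.5, 0.9, 100)   # ≈ 0.33
d2 = cohens_d(3.6, 1.0, 150, 3.6, 1.0, 150)   # 0.0
print(round(weighted_mean_d([d1, d2], [0.02, 0.015]), 2))  # → 0.14
```

Weighting by the inverse of the variance gives more precise estimates (typically from larger studies) a larger influence on the aggregated effect size.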

Types of Evidence
We distinguish between experimental and descriptive evidence in our analysis.

Experimental Evidence
Experimental evidence aims to establish causal effects by isolating the effects of the independent variable (i.e., the author or the attribution) through experimental manipulation.
Studies that aim to isolate the effect of the article source would show all recipients the same text (either written by a human or an algorithm). However, for some recipients, the text would be declared as written by a human, whereas for other recipients, that very same text would be declared as automatically generated.
Studies that aim to analyze the effect of the content (i.e., the message) would present recipients with either a human-written or an automated text but would not reveal the source (i.e., the texts had no byline).

Descriptive Evidence
Comparisons that provided descriptive evidence showed recipients news stories that were either written by a human or automatically generated, and truthfully declared the source. That is, human-written news would correctly be declared as written by a journalist, and automated news would correctly be labelled as generated by an algorithm. Then, the researchers would ask participants to rate these texts. These comparisons do not allow for drawing causal inferences on the effects of the source or the message. However, for perceptions of credibility, researchers often use different scales (i.e., source credibility and message credibility), which were specifically designed to separate the effects. In contrast, no scales are available to distinguish the effects of the message and the source on perceived quality or readability. We were thus unable to separate the effects of source and message in these cases.

Main Effects
Figure 1 shows the main effects for each of the three constructs across all available comparisons, not differentiating between effects of the source and the message. Overall, there was no difference in how readers perceived the credibility of human-written and automated news (d = 0.0; SE = 0.02). Although human-written news were rated somewhat better than automated news with respect to quality, differences were small (d = 0.5; SE = 0.03). In terms of readability, the results showed a huge effect in that readers clearly preferred human-written over automated news (d = 2.8; SE = 0.04).
Interestingly, however, the direction of the effects for credibility and quality differed depending on the type of evidence. For both credibility (d = 0.3; SE = 0.03) and quality (d = 0.8; SE = 0.03), experimental evidence favored human-written over automated news. In comparison, descriptive studies showed the opposite effect: Automated news were favored over human-written news for both credibility (d = −0.5; SE = 0.04) and quality (d = −0.6; SE = 0.06).

Credibility
Figure 2 distinguishes between comparisons that provide evidence on the effects of the source and the effects of the message.

Source Credibility
One factor that may affect readers' perception of news is the source or, more specifically, the author. Overall, the results show a small difference between readers' perceptions of source credibility: human-written news were rated somewhat higher than automated news (d = 0.3; SE = 0.04). That said, the direction of effects differed again depending on the type of evidence. While experimental evidence showed a medium-sized effect in favor of human-written news (d = 0.5; SE = 0.04), descriptive evidence revealed a small effect in favor of automated news (d = −0.3; SE = 0.08).

Message Credibility
With respect to message credibility, automated news were rated somewhat more favorably across all comparisons (d = −0.3; SE = 0.03). Yet, again, the effect was solely carried by descriptive evidence (d = −0.6; SE = 0.05). Experimental evidence found no difference (d = 0.0; SE = 0.04).

Quality
Figure 3 distinguishes between experimental comparisons that provide evidence on the effects of the source and the effects of the message as well as descriptive evidence that does not allow for differentiating between effects of source and message on recipients' perceptions of quality. Experimental evidence suggests that the article source has a small effect on perceptions of quality in that human-written news are rated somewhat better than automated news (d = 0.3; SE = 0.04). Experimental comparisons that provided evidence on the effects of the message found a very large effect in favor of human-written news (d = 1.6; SE = 0.05). Descriptive evidence, which does not allow for distinguishing between effects of the source and the message, found a medium-sized advantage for automated news with respect to perceived quality (d = −0.6; SE = 0.06).

Readability
Figure 4 distinguishes between experimental comparisons that provide evidence on the effects of the source and the message as well as descriptive evidence that does not allow for differentiating between effects of source and message on recipients' perceptions of readability.
Regardless of the type of evidence, the results showed a clear advantage for human-written articles. For experimental evidence on the effects of the source (d = 1.8; SE = 0.05) and the message (d = 3.8; SE = 0.07), effect sizes were very large and huge, respectively. Descriptive evidence on the combined effects of source and message showed a huge effect (d = 5.1; SE = 0.13).

Discussion
This meta-analysis aggregated available empirical evidence on readers' perception of the credibility, quality, and readability of automated news. Overall, the results showed zero difference in perceived credibility of human-written and automated news, a small advantage for human-written news with respect to perceived quality, and a huge advantage for human-written news with respect to readability.
One finding that stood out was that the direction of effects differed depending on the type of evidence. Experimental evidence on the effects of the source found advantages for human-written news with respect to quality (small effect), credibility (medium-sized), and readability (very large). In other words, regardless of the actual source, participants assigned higher ratings simply if they thought that they read a human-written article. The results thus support the authority heuristic and the social presence heuristic, while contradicting the machine heuristic (Sundar, 2008). Given these findings, news organizations may worry that their readers would disapprove of automated news and therefore refrain from disclosing that a story was automatically generated (e.g., by not providing a byline). This underscores the ethical challenges that arise from automated journalism (Dörr & Hollnbuchner, 2017).

Experimental evidence further showed advantages for human-written news with respect to the effect of the message (i.e., the article content). If participants did not know what they were reading, they assigned higher ratings to human-written news compared to automated news with respect to quality (very large effect) and readability (huge effect). There was, however, no effect for credibility. Obviously, these results depend entirely on the actual articles used in these comparisons. We thus refrain from deriving any conclusions or practical implications from these results and expect that the human-written articles in these particular comparisons may have simply been better than their automated counterparts. The extent to which these articles are representative of the relative quality of automated and human-written news at the time is unclear.
In contrast, descriptive evidence showed opposite results with respect to how article source and message affect perceptions of credibility and quality. That is, automated news were perceived as more credible and of higher quality than the human-written counterparts in studies that asked recipients to evaluate articles whose source was truthfully declared. Given that these studies do not allow for making causal inferences, it is difficult to draw practical implications. In particular, any differences in effect sizes could simply be due to differences in the actual quality of the articles themselves.
Our analysis thus demonstrates the importance of distinguishing between the type of evidence (descriptive vs. experimental) as well as the origin of the effect (source or message). Otherwise, interesting findings, such as the positive effect of human authors on people's perceptions, may get lost in the aggregate. That said, effects of the source with respect to both perceived credibility and quality were small. News organizations may not need to worry too much that readers could perceive automated news as less credible, or more generally as being of lower quality, than human-written news, assuming of course that the articles' actual quality is similar.
Differences with respect to readability, however, were huge. On the one hand, one could assume that poor readability is a critical barrier to readers' willingness to consume automated news. On the other hand, it should be noted that automation is most useful for routine and repetitive tasks that require writing a large number of stories (e.g., weather reports, corporate earnings stories). Such routine writing is often little more than a simple recitation of facts that requires neither flowery narration nor storytelling. In fact, in certain domains such as financial news, sophisticated writing may even be harmful, as consumers generally want the hard facts as quickly and clearly as possible. Another potential benefit of automation is the possibility to cover topics for very small target groups for which no news were previously available (e.g., lower league games for niche sports, earthquake alerts, fine dust monitoring, etc.). For such topics, readers may be happy to get any news at all. As a result, they may not be too concerned with readability, especially as the construct is commonly measured in the literature (e.g., with items such as 'entertaining,' 'interesting,' 'vivid,' or 'well written'). Future research should analyze the perceptions of readers who represent the actual target group (i.e., people who would actually consume automated news).
Needless to say, our results provide merely a snapshot of the current state of news automation, drawing on evidence from 11 articles published between 2017 and 2020. Readers' perceptions may change over time, and they may change fast. Assuming that automated news becomes more common, readers would get more accustomed to such content, which could ultimately affect their perceptions. Also, the technology for creating automated news, as well as the availability of data, is likely to further improve over time, which we expect to positively affect both the quality and readability of automated news. Advances in statistical analysis, in combination with more data, should make it possible to add more context (e.g., adding weather data to exit polling texts) and analytical depth (e.g., by analyzing historical data, making predictions, etc.), which should improve the perceived quality of such texts. Similarly, we would expect natural language generation to further improve, with positive effects on perceived readability. Future research should continue monitoring readers' perception of automated news, especially if and how improvements in the objective quality of the texts affect their perceived quality.
The latter relationship has generally been overlooked in research to date. Available studies have merely analyzed if, and to what extent, readers' perceptions of automated and human-written news differ. Yet, we do not know which factors drive these perceptions. What is it that makes an article perceived as more or less credible or readable? Such information would be valuable for developers of automated news to improve the (perceived) quality of the texts.