A chi-squared test with few observations is not a particularly powerful statistical test (note that it is termed both the chi-square and the chi-squared test depending on the discipline and source). Nonetheless, this test is useful in systematic reviews to confirm whether observed patterns in how frequently a particular dimension of a topic has been studied are statistically different (at least according to about 4 in 10 referees I have encountered). It serves not as a vote-counting tool but as a means for the referees and readers of the review to assess whether the counts of approaches, places, species, or some other measure used in a set of primary studies differed. The mistaken rule-of-thumb is that fewer than 5 counts per cell violates the assumptions of the chi-squared test. However, this intriguing post is a reminder that it is not the observed value but the expected value that must be at least 5 (blog post on topic and statistical article describing assumption). I propose that this is a reasonable and logical rule-of-thumb for some forms of scientific synthesis, such as systematic reviews, that explore patterns of research within a set of studies rather than the strength of evidence or effect sizes.
An appropriate rule-of-thumb for when you should report a chi-squared test statistic in a systematic review is thus as follows.
When doing a systematic review that includes quantitative summaries of the frequencies of various study dimensions, the total number of studies summarized (dividend) divided by the number of levels contrasted for the specific dimension tested (divisor) should be at least 5 (quotient). If the studies were spread evenly across the levels, this quotient is the expected count per level, so you are simply calculating whether the expected values can even reach 5 given your set of studies and the categorical analysis of the frequency of a specific study dimension applied during your review process.
total number of studies/number of levels contrasted for specific study set dimension >= 5
[In R, I used nrow(main dataframe)/nrow(frequency dataframe for dimension); however, it was a bit clunky. You could use the ‘length’ function, or write a new function and use a ‘for loop’ over all the factors you are likely to test.]
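The loop-over-factors idea can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than the R used for the review, and the function names (`expected_per_level`, `passes_rule`, `screen_dimensions`) and the small study table are hypothetical, not taken from the review's repository.

```python
from collections import Counter


def expected_per_level(n_studies, n_levels):
    """Expected count per level if studies were spread evenly across levels."""
    if n_levels <= 0:
        raise ValueError("n_levels must be positive")
    return n_studies / n_levels


def passes_rule(n_studies, n_levels, minimum=5):
    """True when total studies / number of levels is at least `minimum`."""
    return expected_per_level(n_studies, n_levels) >= minimum


def screen_dimensions(studies, dimensions, minimum=5):
    """Apply the rule to each study dimension (one dict per primary study)."""
    n_studies = len(studies)
    results = {}
    for dim in dimensions:
        # Count how many distinct levels were recorded for this dimension.
        levels = Counter(study[dim] for study in studies)
        results[dim] = passes_rule(n_studies, len(levels), minimum)
    return results


# Hypothetical study set: 12 studies, 2 methods, 4 countries (made-up values).
methods = ["field"] * 7 + ["lab"] * 5
countries = ["CA", "US", "MX", "BR"] * 3
studies = [{"method": m, "country": c} for m, c in zip(methods, countries)]

results = screen_dimensions(studies, ["method", "country"])
for dim, ok in results.items():
    print(dim, "report chi-squared test" if ok else "report frequencies only")
```

In this made-up table, method has 2 levels (12/2 = 6, so the rule is met) while country has 4 levels (12/4 = 3, so it is not), which mirrors the kind of decision described for dimensions with many levels.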
Statistical assumptions aside, it is also reasonable to propose that a practical rule-of-thumb for literature syntheses (systematic reviews and meta-analyses) requires at least 5 studies completed that test each specific level of the factor or attribute summarized.
For example, my colleagues and I were recently doing a systematic review that captured a total of 49 independent primary studies (GitHub repo). We wanted to report whether the frequency with which the topic was tested differed among the specific hypotheses (as listed by primary authors), and there were a total of 7 different hypotheses tested within this set of studies. The division rule-of-thumb for statistical reporting in a review was applied, 49/7 = 7, so we elected to report a chi-squared test in the Results of the manuscript. Other interesting dimensions of study for the topic, such as country of study or taxa, had many more levels and violated this rule. In these instances, we simply reported the frequencies with which these aspects were studied in the Results without supporting statistics (or we used much simpler classification strategies).
A systematic review is a form of formalized synthesis in ecology, and these syntheses typically do not include effect size estimates (other disciplines use the term systematic review interchangeably with meta-analysis, but we do not do so in ecology). For these more descriptive review formats, this rule seems appropriate for describing differences in the synthesis of a set of studies topologically, i.e. summarizing information about the set of studies, like the meta-data of the data but not the primary data (here is the GitHub repo we used for the specific systematic review that led to this rule for our team). This fuzzy rule led to a more interesting general insight. An overly detailed approach to the synthesis of a set of studies likely defeats the purpose of the synthesis.
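To make the 49/7 example concrete, here is a minimal sketch of the goodness-of-fit statistic against equal expected frequencies. The per-hypothesis counts are invented for illustration (the post does not report them); only the totals, 49 studies and 7 hypotheses, come from the example above. The code is plain Python, not the R used in the review, and `chi_squared_uniform` is a hypothetical helper name.

```python
def chi_squared_uniform(observed):
    """Goodness-of-fit statistic against equal expected counts per level.

    Returns (statistic, degrees_of_freedom, expected_count_per_level).
    """
    n_levels = len(observed)
    if n_levels < 2:
        raise ValueError("need at least two levels to contrast")
    expected = sum(observed) / n_levels
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, n_levels - 1, expected


# HYPOTHETICAL counts of studies per hypothesis: 7 levels summing to 49.
observed = [12, 9, 8, 7, 6, 4, 3]

statistic, df, expected = chi_squared_uniform(observed)

# Upper 5% critical value of the chi-squared distribution with 6 degrees
# of freedom, taken from standard tables.
CRITICAL_0_05_DF6 = 12.592

print(f"expected per level = {expected}")
print(f"chi-squared = {statistic:.2f}, df = {df}")
print("differs from even spread at 0.05:", statistic > CRITICAL_0_05_DF6)
```

Two of the invented observed counts are below 5, yet the expected count is 7 for every level, which is the distinction the rule rests on. In practice a statistics library would also return a p-value (for instance `chisq.test` in R or `scipy.stats.chisquare` in Python); the hand calculation here is only meant to show where the expected values come from.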