In 1983-84 a massive study, dubbed “The China Study”, was undertaken in China to catalogue the health and habits of rural Chinese living in 65 counties.
“Within each of the 65 counties, 2 villages were selected and 50 families in each were randomly chosen for study. One adult from each household (half men and half women), 6500 for the entire survey, participated. Blood, urine and food samples were obtained for later analysis, while questionnaire and 3-day diet information was recorded.
A total of 367 items of information on these 6500 families eventually were judged to be reliable. These 1983-84 diet and lifestyle data included the 1973-75 mortality rates for about 4 dozen different kinds of cancers and other diseases.”
The assumption of the researchers was that “rich Western diets (high in fat and meat, low in dietary fiber) were strongly associated (correlated) with incidence of colon and breast cancer.” This is a view commonly held by most medical professionals today.
In 2005 T. Colin Campbell, one of the researchers involved in the study, published a book called “The China Study: Startling Implications for Diet, Weight Loss, and Long-Term Health” which quickly became a “bible” of the vegan/vegetarian movement. Campbell became interested in the idea that protein (specifically animal protein) might cause cancer after observing in the Philippines that the children from the wealthiest families, who ate the most protein, had the highest rates of liver cancer. The book “The China Study” presents Campbell’s proof that animal protein does indeed cause (increase the risk of) cancer as well as a variety of other diseases like heart disease, autoimmune diseases, etc.
The book is divided into two sections. In the first section Campbell discusses experiments he performed on rats showing that, after exposure to aflatoxin (a cancer-causing toxin produced by mold), rats fed a high-protein diet suffered from more cancer than rats fed a low-protein diet. A very interesting series of experiments. However, it’s important to recognize that the animal protein that Campbell fed the rats was casein. Casein is one of the protein components of dairy and is not naturally found by itself. Other research suggests that casein may indeed have a positive (bad) influence on cancers, but that whey, the other component of dairy products, has a strong negative (good) influence. Campbell then incorrectly goes on to label all animal protein bad on the basis of his experiments with one particular type of non-naturally occurring protein, isolated casein. And he generalizes from rats to humans, which is very often done in the scientific literature and needs to be taken with a large “grain of salt”. Rats are not furry little humans. While we share many of the same genes, enzymes and biochemistry with rats, we also have a number of important differences, which means what is good (or bad) for rats is not always good (or bad) for humans.
In the second part of the book, Campbell then draws on the mass of data gathered in the China Study to prove his assumption that animal protein does “very bad things.”
Since 2005 a small minority have argued that Campbell “cherry picked” the data from the China Study to prove his point. That is, he started with the conclusion he wanted, that animal protein caused disease, and then sifted through the data looking for data that would prove his point. A new critique proves that he did that very thing.
Richard Nikoley, who runs the popular blog Free the Animal, broke the story to the blogosphere that a blogger named Denise Minger, a self-proclaimed statistics nut, had set out to see if the conclusions of the China Study (the book) were true. She took the data from “the China Study” (the study) and began laboriously combing through it to see if the assertions that animal protein causes disease were true. After analyzing the data her conclusion is that Campbell’s conclusion is WRONG.
If her site hasn’t crashed from the scads of new visitors, I highly recommend you check it out at http://rawfoodsos.com/
Even more interesting, Minger turned up a STRONG correlation between wheat consumption and heart disease. That is, those people who ate the most wheat suffered from more heart disease. Something that was never mentioned in the China Study (the book).
It’s also important to recognize that the data itself, while interesting, is of only limited value. The China Study (the study) is what is known as an epidemiological study; that is, information was gathered on people, and statistical analyses were done to attempt to relate various factors (in this case what they ate) to the diseases they suffered or died from. As Michael Pollan talks about, this is the most common form of nutritional study performed. Almost without exception, when you see a new headline saying “X food linked to Y condition” it is an epidemiological study. Examples include headlines like “processed meat intake linked to increased risks of cancer” and “eggs associated with increased OR not associated with increased risk of…” Epidemiologic studies can show correlation, not causation.
Correlation means two things appear together, but it cannot say whether one thing causes the other. Causation means that one thing causes another. If I hold a match under my finger it burns my finger; that is causation. Correlation is an entirely different thing. For example, we can do a study of obese people and find that increasing belt size is related to obesity; that is, the fatter the person is, the larger the belt they wear. And the statistical significance is very strong; it holds true in almost every case. This would be a correlation between obesity and belt size. The mistake that is made in the nutritional community would be to infer causation from that correlation, that is, to conclude that the bigger the belt a person chooses to wear, the fatter they will become (belts cause obesity). This sounds absurd and it is. But when we look at epidemiologic data (which is the majority of nutritional research currently being done) and we see that consumption of X and disease Y have been correlated with one another, we want to jump to the conclusion that X causes Y, and the data cannot tell us that. Epidemiologic data is a useful starting point. We can take it, look at correlations that emerge from the data, and then design more focused studies to actually try to test causation.
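To make the belt example concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the waist and belt numbers, the `pearson` helper, and the assumption that people simply buy belts that fit. It shows a strong correlation in "observed" data that disappears once belts are handed out at random, which is roughly what a focused experiment would be testing.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented world: waist size drives belt size, never the reverse.
waist = [rng.gauss(90, 12) for _ in range(2000)]  # cm, made-up numbers
belt = [w + rng.gauss(0, 3) for w in waist]       # people buy belts that fit

r_observed = pearson(belt, waist)

# "Experiment": hand out belts at random. Waists do not change,
# so the strong correlation seen above is gone.
random_belt = [rng.gauss(90, 12) for _ in waist]
r_experiment = pearson(random_belt, waist)

print(f"observed r = {r_observed:.2f}, random assignment r = {r_experiment:.2f}")
```

The observed correlation alone cannot distinguish "belts cause obesity" from "obesity causes bigger belts"; only the random assignment does.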
If you’re still not clear on the difference between correlation and causation and why it’s so important to make the distinction I recommend you check out the following two posts.
The first which has nothing to do with the China Study is an excellent post by Dr. Michael Eades http://www.proteinpower.com/drmike/statistics/observational-studies-2/
The second is by Dr. Kurt Harris on the China Study at http://www.paleonu.com/panu-weblog/2010/7/8/polish-a-turd-and-find-a-diamond.html
We owe Denise Minger a debt of gratitude for slogging through reams of data to look at it with an unbiased eye. What she shows is that even the correlations that Campbell drew from the China Study are wrong. If eating vegan/vegetarian works for you, then keep at it, but please stop pointing to the China Study (the book) to justify yourself. It’s wrong. But don’t believe what someone else says, go check it out yourself.
Denise conducted the analysis incorrectly and came to a heavily flawed conclusion. Denise is a great writer; however, this is about science. Here is the correct procedure, from a cancer epidemiologist.
Hi Denise,
As promised, I’m posting my response to your email [yes, she emailed me] on your site. You asked that I provide some tips on where to start and how to proceed. BTW, you mentioned “epidemiology secrets” and I just want to say: no “secrets”!! Epidemiology is just critical thinking, but with numbers. It’s no different from many other disciplines. Maybe some time you can help me with writing (scientists are generally terrible writers, hehe).
Note: I’ve included some comments on what went wrong and how it can be corrected merely for demonstrative purposes – not at all malicious attacks, OK? This is how we all learn after all. In caps, I will highlight steps in the action plan for you.
STEP 0: Do a literature search. I find it helpful to keep an excel spreadsheet with columns for author, title, journal, year, summary of paper, strengths of the study, weaknesses, and concluding remarks. This is essential, as one shouldn’t just blindly go into an analysis without having at least some background information on the subject matter. No need to be an expert, but good to know what’s already out there, and what needs to be done.
1. Correlations:
For this discussion, the outcome will be colorectal cancer, since you used it on your post. Similarly, the primary exposure of interest will be total cholesterol. By basing your conclusions on uncorrected correlations alone, you’ve made a huge leap that doesn’t have much ground to stand on. The simple correlations are biased, as you yourself pointed out when evaluating total cholesterol, schistosomiasis, and colorectal cancer. As such, if you don’t adjust for potential confounders via multiple regression, the association you observe is biased. We almost always need to adjust for confounders, and this is very true in your case.
STEP 1: It’s a good habit to evaluate the correlations between all exposures and also between all exposures and the outcome at the individual level. So, for *every* analysis you plan on doing, create scatterplots for every X against X and every X against Y, using the *individual* data (where possible), and provide the correlation + 95% confidence interval for each.
STEP 2: Create histograms for every exposure that is categoric and density plots (or you can create histograms with very narrow bars) for every exposure that is continuous. This will tell you how the variables are distributed and what the appropriate summary statistics for them would be. For example, if total cholesterol is not normally distributed (i.e., does not follow a bell curve) then *median* total cholesterol might be a better summary statistic than *mean* total cholesterol (good to know when you present descriptive statistics of the data you’re using). Sometimes it’s useful to present different stats for a single variable.
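As a rough illustration of "the correlation + 95% confidence interval", here is a small Python sketch. It is not the commenter's code. It assumes the usual Fisher z-transform approximation for the interval, and the function name `pearson_with_ci` and the five data points are made up.

```python
import math

def pearson_with_ci(xs, ys, z_crit=1.96):
    """Return (r, lower, upper): Pearson r with an approximate 95% CI
    from the Fisher z-transform. Needs n >= 4 and |r| < 1."""
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need two equal-length sequences with n >= 4")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r)            # Fisher transform of r
    se = 1 / math.sqrt(n - 3)    # approximate standard error of z
    return r, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Made-up exposure/outcome pairs for five hypothetical individuals.
exposure = [1, 2, 3, 4, 5]
outcome = [2, 4, 5, 4, 5]

r, lower, upper = pearson_with_ci(exposure, outcome)
print(f"r = {r:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

With so few points the interval is wide and includes zero even though r looks large, which is why reporting the interval alongside the correlation matters.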
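A quick sketch of why the shape of the distribution matters for the summary statistic. The cholesterol values below are invented (one extreme reading is added on purpose), and `text_histogram` is a hypothetical helper standing in for a real plotting tool.

```python
import statistics

# Made-up total cholesterol values (mg/dL) with one extreme reading.
cholesterol = [150, 155, 160, 160, 165, 170, 175, 180, 190, 320]

mean_chol = statistics.mean(cholesterol)
median_chol = statistics.median(cholesterol)

def text_histogram(values, bin_width):
    """Map each bin's lower edge to the number of values falling in it."""
    counts = {}
    for v in values:
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Crude text histogram: one '#' per observation in each bin.
for lower, count in text_histogram(cholesterol, 25).items():
    print(f"{lower:>3}-{lower + 24:<3} {'#' * count}")

print(f"mean = {mean_chol}, median = {median_chol}")
```

The single high value pulls the mean well above the median, so the median describes the typical person in this toy sample better.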
2. Individual data vs. aggregated data:
You stated you didn’t see much curvature, but keep in mind that you were presenting aggregated data (e.g. average total cholesterol for all individuals) instead of individual-level data (the exposure and outcome for a single individual). Consequently, there was a big loss in information, and you can’t make accurate decisions on how to model your data if you plot aggregated data. Related to this, your analysis was ecologic (used aggregated/grouped data) but you made individual-level conclusions when you used the term “risk factor.” This is referred to as an ecologic fallacy – and it’s just that. A fallacy. For example, all we can say based on your cholesterol-colorectal cancer example (the one that doesn’t account for schistosomiasis) is that the counties with higher mean total cholesterol tend to have higher incidence rates of colorectal cancer. We can’t make the leap to calling cholesterol a *risk factor* for colorectal cancer.
STEP 3: Don’t aggregate your data in your analysis. Why? You lose A LOT of information when you aggregate data and you can bias your results. So keep that data at the individual-level. For descriptive tables, by all means, aggregated data is necessary for obvious reasons. But in your analysis, individual-level data when you’ve got it is essential.
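The ecologic fallacy can be shown with a deliberately artificial dataset. In the Python sketch below every number is invented: five hypothetical counties, seven people each, built so that the county averages rise together while the relationship inside every county runs the other way.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean(values):
    return sum(values) / len(values)

# Invented (exposure, outcome) pairs. Within each county the outcome
# falls as exposure rises; across counties both averages rise together.
counties = {
    f"county_{k}": [(10 * k + d, 10 * k - d) for d in range(-12, 13, 4)]
    for k in range(5)
}

# Aggregated (ecologic) analysis: one point per county.
county_x = [mean([x for x, _ in people]) for people in counties.values()]
county_y = [mean([y for _, y in people]) for people in counties.values()]
r_county = pearson(county_x, county_y)

# Individual-level analysis inside each county.
r_within = {
    name: pearson([x for x, _ in people], [y for _, y in people])
    for name, people in counties.items()
}

print(f"county-level r = {r_county:.2f}")
print({name: round(r, 2) for name, r in r_within.items()})
```

The county averages are perfectly positively correlated, yet no individual-level conclusion follows from that: inside every county the association is negative.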
3. The right regression model:
One of your outcomes was incidence rates of colorectal cancer. When you do your analysis with individual-level data, with incidence rates of colorectal cancer as your outcome, linear regression = WRONG model. Make sure you know which models to use and when. To start – when modeling “raw” rates (case counts and person time), we almost always use Poisson regression, and often we need to account for overdispersion as well. Get to know some of the other common regression models as well.
STEP 4: Write out all of the primary exposures of interest you want to investigate and the corresponding outcome of interest and how you’re setting up your outcome variable (are you interested in colorectal cancer *incidence rates*, *prevalence*, a simple yes/no the person has colorectal cancer?)
STEP 5: Write out what the appropriate regression model would be for the different analyses you plan to conduct.
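For readers who have not met Poisson regression on rates, here is a bare-bones Python sketch. It is an illustration only, not the commenter's procedure. It assumes a single exposure, uses log person-time as an offset, and fits by Newton's method; the function `fit_poisson_rate` and the counts are hypothetical. It does not handle overdispersion or confounders, and real work would use a statistics package.

```python
import math

def fit_poisson_rate(x, cases, person_time, max_iter=50, tol=1e-10):
    """Fit log(E[cases]) = log(person_time) + b0 + b1 * x by Newton's method.
    Returns (b0, b1); exp(b1) is the rate ratio per one-unit increase in x."""
    b0 = math.log(sum(cases) / sum(person_time))  # start at the overall rate
    b1 = 0.0
    for _ in range(max_iter):
        mu = [t * math.exp(b0 + b1 * xi) for xi, t in zip(x, person_time)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(cases, mu))
        g1 = sum(xi * (y - m) for xi, y, m in zip(x, cases, mu))
        # Information matrix (negative Hessian), symmetric 2x2.
        h00 = sum(mu)
        h01 = sum(xi * m for xi, m in zip(x, mu))
        h11 = sum(xi * xi * m for xi, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Invented data: four exposure levels, 1000 person-years each,
# with case counts that double at each level.
exposure_level = [0, 1, 2, 3]
cases = [10, 20, 40, 80]
person_years = [1000, 1000, 1000, 1000]

b0, b1 = fit_poisson_rate(exposure_level, cases, person_years)
print(f"baseline rate = {math.exp(b0):.4f} per person-year")
print(f"rate ratio per exposure level = {math.exp(b1):.3f}")
```

Because the model is multiplicative on the rate scale, the fitted quantity of interest is the rate ratio `exp(b1)`, not a slope in cases per unit of exposure as linear regression would give.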
4. Confounders:
These are factors that are related to the exposure and the outcome of interest such that *not* adjusting for them will produce a biased association between exposure and outcome. As you saw, schistosomiasis might be a confounder. And in fact, county might be too – and is actually upstream of schistosomiasis in some sense, right? Two confounders that almost *always* must be included in a model are AGE and SEX (provided your analysis isn’t restricted to one sex). This is especially true for chronic disease (e.g. cardiovascular disease and cancer). In this particular case, body mass index (BMI) would be very important to include as well. County may also be important.
STEP 6: For every analysis you do, write out all potential confounders you can think of and why. You know the data better than I do as you’ve worked with it extensively. And, from STEP 0, you’ll know your context.
STEP 7: Write out *how* the confounders are related to the exposure and outcome. Is the confounder protective (i.e. decrease risk) for the outcome? Or is it a risk factor? How is it associated with the primary exposure of interest? This is where those scatterplots in STEP 1 come in handy! The purpose of this is to give you an idea of *how* an observed association might be biased if you *don’t* adjust for certain confounders. It is tedious, but thorough and, like STEP 6, will allow you to approach your analyses with more contextual background.
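To see how an unadjusted association can be biased, here is a small Python sketch with invented counts. Note that it swaps in stratification (a Mantel-Haenszel pooled risk ratio) as a simpler stand-in for the multiple regression the comment recommends; the strata, counts, and function names are all hypothetical.

```python
def risk_ratio(exp_cases, exp_total, unexp_cases, unexp_total):
    """Risk in the exposed divided by risk in the unexposed."""
    return (exp_cases / exp_total) / (unexp_cases / unexp_total)

def mantel_haenszel_rr(strata):
    """Pooled risk ratio across strata of a confounder.
    Each stratum is (exp_cases, exp_total, unexp_cases, unexp_total)."""
    num = den = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        num += a * n0 / total
        den += c * n1 / total
    return num / den

# Invented counts. Within each stratum the exposed and unexposed have the
# same risk, but the exposed are concentrated in the high-risk stratum.
strata = [
    (5, 100, 20, 400),   # confounder absent: 5% risk in both groups
    (80, 400, 20, 100),  # confounder present: 20% risk in both groups
]

# Crude analysis: ignore the confounder and pool everyone.
crude = risk_ratio(
    sum(s[0] for s in strata), sum(s[1] for s in strata),
    sum(s[2] for s in strata), sum(s[3] for s in strata),
)
adjusted = mantel_haenszel_rr(strata)

print(f"crude RR = {crude:.3f}, adjusted RR = {adjusted:.3f}")
```

The crude risk ratio suggests the exposure doubles the risk, while the stratified estimate shows no association at all; the whole crude effect comes from the confounder.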
5. “Cleaning” and “recoding” your data:
Raw data is not *in and of itself* a bad thing. It is simply the data in its original form. But in order to be useful for analysis we often need to “clean” it and “recode” it. When I say “clean” it, I mean setting up the *dataset* that is free (to the greatest extent possible) of unnecessary data (for example, if you’re interested in ovarian cancer, you wouldn’t include men), or mistakes (for example, if an individual in the data was coded as being a man with ovarian cancer, this is clearly wrong). In this case, you might either omit it since you don’t have a way to check which is correct or, based on other data for that individual choose to change “man” to “woman” or “ovarian cancer” to “no ovarian cancer.” “Recoding” means setting up the *variables* to be useful. For example, we might recode BMI in categories of underweight, normal, overweight, and obese rather than leave it as continuous. Some variables may already be categoric, if the corresponding data were collected that way.
STEP 8: Clean your data. You will likely need to set up multiple datasets.
STEP 9: Write out *how* you’ve cleaned your data. (This is good record keeping.)
STEP 10: Recode your data. This might include combining variables too.
STEP 11: Create a “data dictionary” similar to the one on the Oxford site. But in addition, include a description of how you’ve coded your data (eg. 1=underweight, 2=normal, 3=overweight, 4=obese). Again, good for record keeping, but also “keeps you honest” so others know how you set up your data. This will often be apparent when you present your results, but not always. It’s a good habit to keep track of this, in any event.
STEP 12: Replot all newly *categorized* variables against the outcome(s) of interest. Why? Because the categorized data may reveal non-linear relationships with the outcome (in fact, this is a strength of categorizing data – that we can account for some non-linear relationships). For example, underweight might be a risk for something, whereas normal BMI is protective, while overweight and obese are a risk (“U-shaped”).
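Steps 8 to 11 can be sketched in a few lines of Python. This is an illustrative assumption of how one might do it, not a prescribed method: the records are invented, the field names are hypothetical, and the BMI cut-points used are the standard WHO ones (18.5, 25, 30). The codes follow the 1 to 4 scheme from STEP 11.

```python
# Data dictionary entry for the recoded variable (STEP 11).
BMI_CODES = {1: "underweight", 2: "normal", 3: "overweight", 4: "obese"}

def recode_bmi(bmi):
    """Recode continuous BMI into the categories listed in BMI_CODES."""
    if bmi < 18.5:
        return 1
    if bmi < 25:
        return 2
    if bmi < 30:
        return 3
    return 4

def clean(records):
    """Separate internally inconsistent records from usable ones.
    Returns (kept, dropped) so the cleaning can be written up (STEP 9)."""
    kept, dropped = [], []
    for rec in records:
        if rec["sex"] == "M" and rec["ovarian_cancer"]:
            dropped.append(rec)  # impossible combination, cannot be checked
        else:
            kept.append(rec)
    return kept, dropped

# Invented records for illustration only.
records = [
    {"id": 1, "sex": "F", "bmi": 17.0, "ovarian_cancer": False},
    {"id": 2, "sex": "F", "bmi": 22.4, "ovarian_cancer": True},
    {"id": 3, "sex": "M", "bmi": 27.1, "ovarian_cancer": True},
    {"id": 4, "sex": "M", "bmi": 31.5, "ovarian_cancer": False},
]

kept, dropped = clean(records)
recoded = [dict(rec, bmi_cat=recode_bmi(rec["bmi"])) for rec in kept]

for rec in recoded:
    print(rec["id"], rec["bmi"], BMI_CODES[rec["bmi_cat"]])
print("dropped ids:", [rec["id"] for rec in dropped])
```

Keeping the dropped records and the code table alongside the cleaned data is one simple way to make the cleaning decisions visible to others.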
6. Exploration of your data through descriptive statistics:
Almost all scientific papers start out with a “Table 1” which presents a description of the data. It tells us things like: What’s the % of women and men in our data? What is the proportion of people with and without the exposure and with and without the outcome?
STEP 13: Create descriptive tables of all relevant variables. This includes your primary exposure of interest, confounders, and outcome. Obviously, you will have different tables for each analysis as you’re interested in different primary exposures (cholesterol? meat? total caloric intake?) and outcomes (cardiovascular disease? colorectal cancer? bladder cancer?). To save time, you might include all relevant exposures and confounders in rows, and cross-classify them with all outcomes of interest in columns.
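A "Table 1" is at heart a set of counts and percentages. The Python sketch below cross-classifies one row variable against one outcome; the eight records, the field names, and the helper `table_one` are made up for illustration.

```python
from collections import Counter

def table_one(records, row_var, col_var):
    """Cross-classify two categorical variables.
    Returns {(row, col): (count, percent of that column)}."""
    cell = Counter((r[row_var], r[col_var]) for r in records)
    col_total = Counter(r[col_var] for r in records)
    return {key: (n, 100 * n / col_total[key[1]]) for key, n in cell.items()}

# Invented individual-level records.
records = (
    [{"sex": "F", "cancer": "no"}] * 3
    + [{"sex": "F", "cancer": "yes"}] * 1
    + [{"sex": "M", "cancer": "no"}] * 2
    + [{"sex": "M", "cancer": "yes"}] * 2
)

table = table_one(records, "sex", "cancer")
for (sex, cancer), (n, pct) in sorted(table.items()):
    print(f"sex={sex} cancer={cancer}: n={n} ({pct:.0f}% of cancer={cancer})")
```

Repeating the call with each exposure and confounder as the row variable gives the rows of the combined table described in STEP 13.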
7. Analysis:
The fun part.
STEP 14: Run your models. Keep track of what you include in your models b/c oftentimes we will evaluate several models for each analysis depending on what’s called “fit statistics.” Since you are familiar with p-values and I assume interpretation of beta coefficients, use these to help inform you of which variables to include in your final model *within the context of the analysis at hand* (this is key – if you have reason to believe that a confounder is important to include, keep it in the model even if it’s non-significant).
STEP 15: Create tables for results from *all* analyses (including the models you decide to can in favor of another one), along with the regression model used for each. This is much more transparent than simply presenting your final model.
There’s more “post-analysis” stuff that should be done, but really Steps 1-15 are pretty thorough.
8. Publish:
I can’t stress this enough. This is a long-term goal for sure, especially as you will likely end up with multiple papers! But once you think you’ve got the data set-up and analyses down, you need to write it up and send it on for peer-review. Peer-review is not perfect for sure, but it is the best measure we have for good science. It gives credibility to your efforts. Besides, you *do* want to be acknowledged for your efforts, right? By publishing in a peer-reviewed journal, you’re more likely to gain more widely publicized attention, which I think should be the goal of most epidemiological studies; we want to improve public health through informing not only our peers, but also the public.
As a last note, I know this is a huge undertaking, but these are steps to a thorough analysis. I have no doubt you’re capable of tackling it.
Best wishes.
PS. I’m sure you already planned to do this, but make all of the above available. With your large readership you can make this a collaborative effort.
Hi Freelee,
I am the first to admit that I am not a statistician, and I’m sure there are things that could be done to improve Denise’s critique, which she freely admits. The point, in my mind, is not that her critique is “right” but that it shows enough holes in Campbell’s interpretation of the data to render it invalid.
Thank you for your comments.
Best,
Dr. Tim Gerstmar