Angry by Choice: Reviewing grants for NIH vs NSF: a comparison

During my career, I have reviewed grant proposals for both the National Institutes of Health (NIH) and the National Science Foundation (NSF). The standard NIH research proposal is called the R01, generally giving 5 years of funding to a research lab; the standard NSF proposal generally gives 3 years of support to a research lab. By and large, an NIH award will provide more funding than an NSF award on a per-year basis.

How the process works in general terms:
After an investigator (or investigators) writes a proposal to either agency, the proposal is assigned to a study section or panel for review. The study section/panel is composed of expert researchers in the general research area the proposal is about. Specific experts are recruited based on the specific proposals submitted such that there is at least one expert working in the area of each proposal. In general, reviewers receive a stack of 8-15 proposals to review. Reviewing takes a lot of time and energy, with reviewers often referring to the literature to get up-to-date on specific topics. Ultimately, the reviewers all gather together in a room to discuss the proposals and make recommendations for which proposals get funded and which do not. At the NIH, reviewers have much more influence on which proposals obtain funding than at NSF. In part, this is because NSF is legally bound to ensure funding is spread across the country and to different types of institutions, so the NSF program officers have to weigh the reviewer recommendations against these other criteria to make funding decisions. (To be clear, proposals NSF reviewers find to be fundamentally flawed are not funded simply to spread the wealth.)

A generic stack of grants to review

Once the initial reviews are written, though, the process is fundamentally different between NIH and NSF. I believe the NSF model is profoundly better than the NIH model, and I'll explain why using a specific rationale that I think is readily justified, along with anecdotes, which I realize do not count as data and are therefore less reliable. (Full disclosure: I have reviewed grants for both institutions, have submitted proposals to both institutions, and have been funded by NIH but not NSF.)

What happens at NIH pre-meeting:
When you review a proposal you score it on a variety of criteria using a 1-10 point scale (1 being the best). You also give your proposal an overall score. For your stack of proposals, you are supposed to spread out your scores such that you don't give every proposal 1s across the board. A reviewer has to note both the strengths and weaknesses of the proposal for each of the criteria, which is the basis of the review. Any given proposal is reviewed by 3 reviewers (sometimes more, but generally not). Once all the proposals are reviewed and scored, this information is sent to NIH, and only then does it become available to the other reviewers. Thus, a reviewer cannot 'cheat' and see what the other reviewers think before writing their own critique.

Say a proposal titled 'XYZ' is reviewed by reviewers Dr. 123, Dr. 456, and Dr. 789; a different proposal titled 'ABC' would be reviewed by Dr. 123, Dr. 045, and Dr. 232, and a third proposal titled 'JKL' by Dr. 123, Dr. 045, and Dr. 789. The point here is that different groups of reviewers are reviewing different proposals. However, there is generally overlap of reviewers because they share similar expertise. Say the study section is on signal transduction in eukaryotic systems: there might be a group of 6 experts who work with mouse models, another 8 experts who work in fungal systems, and 6 more experts who work with Drosophila. So generally speaking, every proposal using mouse models (and likely other mammalian models) would be reviewed by 3 out of the 6 experts who work with mouse models. Say there are 15 proposals in the study section (out of 90) studying mammalian signal transduction; then you should see that these are being reviewed by a specific cohort of the entire study section. The same goes for the proposals using fungi as a model and the proposals using invertebrates as a model.

What happens at NIH during the meeting:
At NIH, once all the proposals are scored, they are ranked, with the lowest overall score (based on the 3 reviewers) being ranked first. Depending on the cohort your proposal falls into, this could work for or against you. Some reviewers score more generously (are more likely to give 1s) than others. So one reviewer's 2.2 may be another reviewer's 1.3, even if they are equally enthusiastic about their respective proposals. Based on the luck of the (reviewer) draw, your proposal might be scored as the 10th best, but with a different draw that same proposal might be scored as the 1st (best) proposal for the entire study section. Here's the outcome of this situation: proposals are discussed in their rank order, so the lowest scoring (best) proposal is discussed first, the second best is discussed 2nd, and the third best is discussed 3rd. Of the 90 or so proposals submitted, only the top third is actually discussed by the entire group; the other two-thirds are 'triaged' (i.e. not discussed). Of those discussed, only a handful (0-4) are actually funded. As the group discusses a proposal (starting with the 'best'), it is described to the entire panel, most of whom have not read the proposal, at least not in any depth. Usually the first page (the Specific Aims) is read by everyone, but generally not much else of the proposal. After the brief presentation where the reviewers go over their strengths and weaknesses, any member of the panel can ask questions or comment. If a reviewer gave a proposal a 1.3 but stated nothing but weaknesses, the question would inevitably arise: 'why did you score this so high?' Once the discussion is complete, the three reviewers give revised scores (generally they change little, and if they do, they move towards the mean). The entire panel then enters their own score for the proposal, which is generally the average of the three reviewers. Then the panel moves on to the next grant.
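To make the luck-of-the-draw point concrete, here is a minimal Python sketch of the rank-then-triage step as I described it above. Everything in it is made up for illustration: the proposal names, the scores, the six-proposal panel, and the helper name `rank_and_triage` are my assumptions, not anything NIH actually runs.

```python
from statistics import mean


def rank_and_triage(scores, discuss_fraction=1 / 3):
    """Rank proposals by mean reviewer score (lower is better) and
    split them into a discussed top fraction and a triaged remainder.

    `scores` maps a proposal name to its list of overall reviewer scores.
    """
    averaged = {name: mean(vals) for name, vals in scores.items()}
    ranked = sorted(averaged, key=averaged.get)
    n_discussed = max(1, round(len(ranked) * discuss_fraction))
    return ranked[:n_discussed], ranked[n_discussed:]


# Hypothetical preliminary scores from three reviewers per proposal.
preliminary = {
    "A": [1.3, 1.5, 1.4],
    "B": [2.2, 2.0, 2.1],  # reviewed by a cohort of tough scorers
    "C": [1.8, 1.9, 1.7],
    "D": [3.0, 2.5, 2.8],
    "E": [4.0, 3.5, 3.0],
    "F": [2.4, 2.6, 2.5],
}
discussed, triaged = rank_and_triage(preliminary)

# Same proposal B, same enthusiasm, but drawn by reviewers whose
# habitual scores run 0.9 points lower (an assumed offset).
lenient_draw = dict(preliminary, B=[s - 0.9 for s in preliminary["B"]])
discussed_lenient, triaged_lenient = rank_and_triage(lenient_draw)
```

With these invented numbers, proposal B is triaged under the first draw and is discussed first under the second, even though nothing about the proposal itself changed.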

NIH (FYI panels take place at a hotel not here)

How can this go wrong?
First, there is the psychological issue that the panelists know they are discussing the proposals from 'best' to 'worst'. Even though a reviewer may love their proposal, which was ranked 10th, it is not discussed until after nine other proposals. These reviewers may lower their ultimate score to reflect this, but the entire panel knows it was 10th and the reviewers are changing their scores to make it not be 10th, rightly or wrongly.

Compounding this issue is that the majority of grants are solid, good proposals that should be funded. Let me rephrase that: the top 20 proposals (or so) in a study section are solid, excellent proposals. Hell, you could take the top 10-15 and (in general terms) the scientific/impact difference between them is negligible, yet only the top few have any chance of funding. This means that once you reach a certain point, funding is really a luck issue and nothing more. In fact, there have been suggestions of putting the top proposals into a lottery to determine funding. This is not a new problem. When I was trying to obtain my first R01, I submitted my last attempt at funding for one project (you had three attempts). Based on previous critiques and scores, I was confident of funding. However, one of the main parts of an Aim had been completed and published during the time I spent on the first and second submissions. It would be stupid to propose doing published stuff, so I changed that Aim to focus on the follow-up studies based on what we had published. The third submission was triaged (it was actually discussed because I was a new investigator, but was scored in the triaged range), and the biggest issue was that I had reworked an Aim and 'we have not had a chance to fix it.' (Quote from the actual reviewer.) My program officer recommended I resubmit with a new title as a new proposal, which of course I did. This 'first' submission was funded and received one of the lowest (best) possible scores. My point is how arbitrary the system can be.

Second, people suck. On a study section I have served on (ad hoc) numerous times, there are two distinct factions based on the type of organism each faction studies. Some members of one of these factions would read and critique the high-ranking proposals of the other faction in order to present 'issues' and 'faults' during the discussion session. This is completely valid if everything is equal, but this was done with the goal of diminishing the high-ranking proposals of the other faction in order to increase the standing of proposals from their own faction. (In other words, it was not done to critically evaluate the science across the board but in a strategic way to help their colleagues and their field.)

Third (and most importantly in my opinion), the scoring is done blind. Apart from the three reviewers, the rest of the panel scores the proposal in secret (again, based on my discussions with panel members, it is usually the average of the three reviewers). Once a proposal is scored, it is not brought up or discussed again.

What happens at NSF pre-meeting:
It's pretty much the same as described above for NIH; however, there is not a 1-10 scale but a qualitative scale (Excellent, Very good, Good, Not competitive). There are still 3 reviewers, there are still strengths and weaknesses, and there are still cohorts based on areas of expertise. NSF proposals are broken up into two sections: the 'intellectual merit' (basically the science being proposed) and the 'broader impacts' (basically how this benefits society). Each of these sections is a critical part of the review: have excellent intellectual merit but no real broader impacts, and your proposal will not score well.

What happens at NSF during the meeting:
At the surface level, it's similar to NIH. However, the proposals are not pre-ranked/pre-scored. The order of discussion is based on reviewer availability, as some ad hocs call in. It's also based on the leadership of NSF, some of whom may be interested in a specific area and want to sit in to hear the discussion in an area they are familiar with. (At NSF the program officers are practicing scientists who have taken a multi-year leave from their research institution to serve at NSF; the upper leadership are usually 'permanent' staff at NSF.)

NSF headquarters (FYI panels take place here!)

Proposals are discussed by the reviewers and then a general discussion takes place. This discussion is more robust than what I have observed on NIH study sections. Once the discussion is done, the panel, not the reviewers, suggests a category to put the proposal in (again either Excellent, Very good, Good, or Not competitive). (The Very good and Good categories are further broken up into two groups, I and II, to distinguish the very very good and the not so good goods.) After the panelists make a recommendation, the reviewers can agree/disagree and another mini-discussion can ensue. Regardless, the proposal under consideration is placed on the board. (This is an Excel spreadsheet projected on the wall.) We then move on to the next proposal.

A key difference is that once a proposal is placed, it can be discussed further. This is particularly important when two similar proposals get profoundly different rankings. We can then discuss why. At NSF proposals can move around a lot. Furthermore, everyone at the table has to agree on the categories and position within the categories of every proposal (we still decide on the most excellent, the 2nd most excellent, the 3rd most excellent, etc.). I may not agree with the ultimate position of every proposal on the board, but as a group we are in agreement.
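For the programmers in the audience, here is a rough Python sketch of how I think of the NSF board. It is only an illustration of the idea that a placed proposal can still be moved; the `Board` class, its method names, and the exact labels for the I and II sub-groups are my own assumptions (the real thing is just a spreadsheet on the wall).

```python
# Category columns from best to worst; the "I"/"II" labels are assumed.
CATEGORIES = [
    "Excellent",
    "Very good I",
    "Very good II",
    "Good I",
    "Good II",
    "Not competitive",
]


class Board:
    """A toy model of the projected spreadsheet: ordered columns of proposals."""

    def __init__(self):
        self.columns = {category: [] for category in CATEGORIES}

    def place(self, proposal, category, position=None):
        """Place (or re-place) a proposal; default is the bottom of the column."""
        self._remove(proposal)
        column = self.columns[category]
        column.insert(len(column) if position is None else position, proposal)

    def _remove(self, proposal):
        # A proposal lives in at most one column, so drop any earlier placement.
        for column in self.columns.values():
            if proposal in column:
                column.remove(proposal)

    def ranking(self):
        """Overall order: best category first, top of each column first."""
        return [p for category in CATEGORIES for p in self.columns[category]]


board = Board()
board.place("XYZ", "Very good I")
board.place("ABC", "Excellent")
board.place("JKL", "Good I")
first_pass = board.ranking()

# Later, the panel revisits JKL after noticing it resembles XYZ
# and agrees it belongs above it.
board.place("JKL", "Very good I", position=0)
final = board.ranking()
```

The contrast with the NIH process is that nothing here is final until the whole panel signs off on the board, so an early placement can be corrected once later proposals provide context.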

How can this go wrong?
First, psychology still exists. If the reviewers score a proposal Excellent or Not competitive, as a panelist you are influenced by this. I did not read every proposal (although often I and the other panelists will read particularly contentious proposals at night before we meet the second day). Regardless, those initial critiques carry weight even if we know it happens and try to avoid it.

Second, people still and always will suck (#Trump2016). My most recent NSF full proposal was not funded and one of the reviews referred to me as 'she' and 'her' whereas the other reviewers referred to me as 'the investigator' or 'Dr. XYZ' (this is standard boilerplate when talking about the researcher). I'm not saying the reviewer was biased against me because they thought I was a woman, but it's possible. This was the only time in close to two decades that I've read such a condescending review, one that attempted to explain to my feeble girl-brain what science is and how it's done by 'real' scientists. (Full disclosure: I'm not a woman, and it doesn't matter anyway.) After talking with my program officer, I learned my proposal was the one on the fence between funding and no funding, which unfortunately fell on the side of no funding. And here is why I think the NSF system is better...

Why the NSF system is better:
Regarding my unfunded NSF proposal: It is my fault it wasn't funded. I could complain about sexism and bias, but if my proposal had been slightly stronger, the other reviewers would have gone to bat for me more, the panel would have placed my proposal higher, and I would have been funded. This is not the case at NIH, where one slightly unenthusiastic review can tank your proposal. I expect that when my NIH proposal was dinged for getting too much done and rewriting an Aim the panel hadn't yet corrected, there was some brief discussion of this not being a reasonable critique (if the reviewer didn't actually say this out loud, it wouldn't be discussed, period) and then the scores were adjusted somewhat. The reviewers who supported my proposal increased their scores slightly to show some semblance of reviewer cohesiveness, the reviewer who was an idiot decreased their score somewhat to 'fix' the BS critique, and the panel scored to the mean, which amounted to triage. If the proposal had to be placed on a board and put in context with other proposals, then I doubt it would have been triaged and expect it would have been funded based on the score of the subsequent proposal.

In conclusion, I like the NSF system more because it is more transparent, accountable, and self-correcting.

Some potential confounding factors:
Success rates: There is really no difference in success rates between NIH and NSF, they both suck (#Trump2016) and essentially a lottery system of top proposals seems appropriate (although NSF has additional criteria that impact who gets awarded).

Number of grants: My experience is that there is really no difference when it comes to the meetings. Some NSF programs have a preproposal (essentially their triage step) and then review 30 or so full proposals, which is about the number an NIH study section discusses in full. I'll point out that every preproposal is reviewed too; there is no 'it wasn't good enough to discuss' category.

Probably others I cannot think of now.

Also, I know none of this is #Trump2016's fault, but it is my go-to hashtag to express contempt at the shortsightedness of one party (Republicans).
