Multicenter determination of optimal interobserver agreement using the Fuhrman grading system for renal cell carcinoma
Assessment of 241 patients with > 15-year follow-up
Abstract
BACKGROUND
The Fuhrman system is the most widely used nuclear grading system for renal cell carcinoma (RCC). Although Fuhrman nuclear grade is widely accepted as a significant prognostic factor, its reproducibility, as reported in the limited number of series available in the literature, appears to be low.
METHODS
Between 1980 and 1990, 255 cases of RCC (pT1–3bN0M0) were treated with radical nephrectomy at the Department of Urology, University Hospital, Strasbourg, France. In a retrospective multicenter study, 3 pathologists independently classified 241 of these 255 cases according to the Fuhrman grading system. The authors searched for optimal interobserver agreement by collapsing the grading system to a three-tiered scheme and then to a two-tiered scheme. In addition, overall survival curves were generated according to the classic four-tiered scheme and also according to the best collapsed scheme. The kappa index was used to assess the level of agreement between each pair of observers, and the Cox model was used for multivariate survival analyses.
RESULTS
The mean interobserver κ value was 0.22 (range, 0.09–0.36). The best concordance was obtained by collapsing to a system in which low-grade (Grade 1–2) disease was compared with high-grade (Grade 3–4) disease. Doing so improved the mean interobserver κ value to 0.44 (range, 0.32–0.55). Fuhrman grade was an independent prognostic factor for all 3 pathologists (P = 0.01, P < 0.0001, and P = 0.004, respectively), and nuclear grade continued to have independent prognostic value after the optimal collapsing algorithm was performed (P = 0.004, P = 0.0003, and P = 0.005).
CONCLUSIONS
Collapsing of the Fuhrman grading system to a two-tiered scheme led to an improvement in interobserver agreement while preserving the independent prognostic value of nuclear grade. Cancer 2005. © 2004 American Cancer Society.
Renal cell carcinoma (RCC) accounts for 90% of all primary malignant renal tumors found in adults. The precise identification of prognostic factors is an essential step in the evaluation of RCC, as approximately 40% of all affected patients die of disease progression and 30–40% of all surgically resectable RCCs recur during surveillance.1 With regard to histopathologic factors, many histologic tumor grading systems have been advanced to predict outcome for patients with RCC, and almost all have exhibited prognostic value. Nonetheless, none of these systems is optimal, and all suffer to some extent from problems with reproducibility and interobserver variability. According to the International Union Against Cancer and the American Joint Committee on Cancer, nuclear grade is the optimal factor for predicting outcome,2 and the most widely used nuclear grading scheme in the world is the Fuhrman system.1, 3
The aim of the current study was to assess interobserver agreement among three pathologists using the Fuhrman grading system and to subsequently identify the grade-collapsing scheme that improved interobserver concordance to the greatest extent.
MATERIALS AND METHODS
Between January 1980 and December 1990, we performed 255 radical nephrectomies for pT1–3bN0M0 RCC4 in a single urology department. All histologic sections were fixed under the same conditions (buffered formalin), embedded in paraffin, and stained with hematoxylin and eosin. Slides from 241 cases (1–10 slides per case; median, 4 slides) were reviewed independently and blindly by 3 pathologists (V.L., M.d.F., and V.M.). These pathologists worked in different cities, and each was trained at a different university. None of the three performed the original analysis, none was aware of the others' results, and none had any information regarding clinical characteristics or patient follow-up. Five cases were excluded because of disagreement regarding tumor characteristics; in each case, one pathologist considered the tumor to be an oncocytoma, rather than an RCC. In another nine cases, the supply of tumor material was exhausted due to the small size of the tumor.
Nuclear grade was determined using the criteria proposed by Fuhrman et al.3 Grade 1 tumors were composed of cells with small (∼10 μm), round, uniform nuclei and inconspicuous or absent nucleoli; Grade 2 tumor cells had larger (∼15 μm) nuclei with irregular outlines and nucleoli that were visible under high-power (400×) microscopy; Grade 3 tumor cells had even larger nuclei (∼20 μm) with obviously irregular outlines and prominent nucleoli even under low-power (100×) microscopy; and Grade 4 tumors exhibited features similar to those of Grade 3 tumors but also had bizarre, often multilobed nuclei and heavy chromatin clumps.
The characteristics of the study population are summarized in Table 1. All patients underwent follow-up examination every 3 months during the first year after nephrectomy, every 4 months during the second and third years, every 6 months during the fourth and fifth years, and then yearly until death. Clinical and biologic assays (hemoglobin concentration, plasma ionography, urea, creatinine, erythrocyte sedimentation, serum calcium, and alkaline phosphatase assays), along with radiologic examination (yearly thoracic and abdominal computed tomography scans and X-rays, with chest and abdominal ultrasonography during the interval), were performed at each consultation. The median follow-up duration was 15.3 years (range, 0–23 years), and 5 patients were lost to follow-up. In addition, 137 patients died during follow-up.
Statistical analyses were performed using Statview (SAS Institute, Cary, NC) and SPSS (SPSS Inc., Chicago, IL) software. Agreement between pairs of pathologists in terms of Fuhrman grading was assessed using the κ index, which corrects for chance agreement. A κ value of 0 indicates a level of agreement that would be expected strictly on the basis of chance, whereas a value of 1.00 indicates perfect agreement; negative κ values indicate less agreement than would be expected on the basis of chance. With regard to positive κ values, the following interpretations are generally accepted: fair agreement, 0.00–0.20; moderate agreement, 0.21–0.45; substantial agreement, 0.46–0.75; near-perfect agreement, 0.76–0.99; and perfect agreement, 1.00.5 A chi-square test for independence (equivalent to a test for homogeneity) was used to verify homogeneity within the pathologists' marginal distributions of tumor grades.
Characteristic | No. of patients (%) |
---|---|
Gender | |
Male | 161 (66.8) |
Female | 80 (33.2) |
TNM (2002) status | |
pT1a | 60 (24.9) |
pT1b | 84 (34.9) |
pT2 | 22 (9.1) |
pT3a | 26 (10.8) |
pT3b | 49 (20.3) |
Histologic type | |
Conventional | 212 (88.0) |
Papillary type 1 | 8 (3.3) |
Papillary type 2 | 15 (6.2) |
Chromophobic | 6 (2.5) |
Microscopic venous invasion | |
Yes | 70 (29.0) |
No | 171 (71.0) |
Age | |
Mean (SD) | 61.0 (11.5) |
Range | 23–90 |
Size (cm) | |
Mean (SD) | 5.8 (2.6) |
Range | 1.3–14 |
- SD: standard deviation.
A κ value was calculated for each pair of pathologists, and the mean κ value was also recorded. To improve the level of agreement, we collapsed the original four-tiered grading system into a three-tiered scheme and then into a two-tiered scheme, with κ values (including means) being recalculated for each collapsed scheme.
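The pairwise calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (the analyses were run in Statview and SPSS). It assumes the unweighted form of Cohen's κ, (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's marginal frequencies; the text does not state whether a weighted variant was used. The function names, the `toy` ratings, and the `LOW_HIGH` mapping are hypothetical and do not reproduce the study data.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' labels on the same cases."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must grade the same non-empty case list")
    n = len(a)
    # Observed agreement: fraction of cases given the same label by both raters.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


def collapse(grades, scheme):
    """Map Fuhrman grades 1-4 onto tiers using a grade -> tier dictionary."""
    return [scheme[g] for g in grades]


def pairwise_kappas(ratings, scheme=None):
    """Return ({(rater_i, rater_j): kappa}, mean kappa) over all rater pairs."""
    if scheme is not None:
        ratings = {name: collapse(g, scheme) for name, g in ratings.items()}
    kappas = {
        (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(sorted(ratings), 2)
    }
    return kappas, sum(kappas.values()) / len(kappas)


# Hypothetical grades for 8 cases; these are NOT the study data.
toy = {
    "P1": [1, 2, 2, 3, 3, 4, 2, 1],
    "P2": [2, 2, 3, 3, 4, 4, 1, 1],
    "P3": [1, 1, 2, 3, 3, 4, 2, 2],
}
LOW_HIGH = {1: "low", 2: "low", 3: "high", 4: "high"}

four_tier, mean_four = pairwise_kappas(toy)
two_tier, mean_two = pairwise_kappas(toy, LOW_HIGH)
```

On these toy ratings, collapsing to two tiers raises the mean κ because most disagreements fall between adjacent grades within the same tier; with real data, that improvement has to be checked rather than assumed.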
Although the κ value increases as the number of grading categories is reduced, some information may be lost in the process. Therefore, using survival data, we verified the prognostic value of the original Fuhrman grading system as well as the value of the collapsed system that yielded the best κ value. Univariate survival analyses were performed using the Kaplan–Meier method and the log-rank test, and multivariate survival analyses were performed using the Cox proportional hazards model with stepwise selection of variables. Death due to any cause was defined as the endpoint for these analyses, which were performed for each pathologist and also with stratification according to the following parameters: gender, age, tumor size, and pathologic T status (2002 TNM classification).5
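For readers unfamiliar with the univariate step, the following is a minimal from-scratch sketch of the Kaplan–Meier (product-limit) estimator with death from any cause as the event. It is an illustration only: the study used Statview and SPSS, the log-rank test and the Cox model are not shown, and the function name and the follow-up times below are hypothetical rather than study data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time at which a death occurred.

    times  -- follow-up duration for each patient
    events -- 1 if the patient died at that time, 0 if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (years) for two grade groups; NOT the study data.
low_grade = kaplan_meier([2, 5, 8, 12, 15, 15], [1, 0, 1, 0, 0, 0])
high_grade = kaplan_meier([1, 2, 3, 6, 9, 15], [1, 1, 1, 0, 1, 0])
```

Each returned pair gives a step of the survival curve; censored patients shrink the risk set without lowering the estimate, which is what allows patients with incomplete follow-up to contribute.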
RESULTS
The Fuhrman grade distribution (n = 241) for each pathologist is presented in Figure 1. The same grade was obtained by all 3 pathologists in 58 cases (24%). Using the original classification scheme, the κ indexes for agreement between Pathologists 1 and 2, Pathologists 1 and 3, and Pathologists 2 and 3 were 0.09, 0.21, and 0.36, respectively. The mean κ value was 0.22, which is indicative of low-to-moderate agreement.

Figure 1. Fuhrman grade distribution (n = 241) for each of the 3 pathologists in the current study.
The mean κ values for the various collapsed grading schemes are presented in Table 2. Ultimately, the highest mean κ value was yielded by a 2-tiered scheme in which low-grade (Grade 1–2) tumors were distinguished from high-grade (Grade 3–4) tumors. Using this scheme, the individual interobserver κ values were 0.32, 0.45, and 0.55 (mean, 0.44), and agreement among all 3 pathologists occurred in 142 cases (58.9%).
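As a quick arithmetic check, assuming that the mean κ is the unweighted arithmetic mean of the three pairwise values (the text does not indicate otherwise), the reported means and concordance rates can be reproduced as follows. The subscript labels are ours.

```latex
\bar{\kappa}_{\text{4-tier}} = \frac{0.09 + 0.21 + 0.36}{3} = \frac{0.66}{3} = 0.22,
\qquad
\bar{\kappa}_{\text{2-tier}} = \frac{0.32 + 0.45 + 0.55}{3} = \frac{1.32}{3} = 0.44

\frac{58}{241} \approx 24.1\%,
\qquad
\frac{142}{241} \approx 58.9\%
```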
Three-tiered schemes | | | | Two-tiered schemes | | |
---|---|---|---|---|---|---|
Low-grade | Intermediate-grade | High-grade | Mean κ | Low-grade | High-grade | Mean κ |
1 | 2 | 3 + 4 | 0.29 | 1 | 2 + 3 + 4 | 0.31 |
1 | 2 + 3 | 4 | 0.26 | 1 + 2 | 3 + 4 | 0.44 |
1 + 2 | 3 | 4 | 0.34 | 1 + 2 + 3 | 4 | 0.32 |
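The six schemes in Table 2 are exactly the ways of splitting the ordered grades 1–4 into three or two groups of adjacent grades. The sketch below enumerates them and selects the scheme with the highest reported mean κ. It assumes that only groupings of adjacent grades are of interest (non-adjacent groupings such as 1 + 3 are not listed in the table); the function name is hypothetical, and the κ values are those reported in Table 2, not recomputed.

```python
from itertools import combinations

GRADES = (1, 2, 3, 4)


def contiguous_schemes(grades, n_tiers):
    """Yield every split of the ordered grades into n_tiers adjacent groups."""
    # Choosing n_tiers - 1 cut positions between neighbouring grades fixes a scheme.
    for cuts in combinations(range(1, len(grades)), n_tiers - 1):
        bounds = (0, *cuts, len(grades))
        yield tuple(grades[i:j] for i, j in zip(bounds, bounds[1:]))


three_tier = list(contiguous_schemes(GRADES, 3))
two_tier = list(contiguous_schemes(GRADES, 2))

# Mean kappa values as reported in Table 2, in the same order as the schemes.
reported = dict(zip(three_tier + two_tier, (0.29, 0.26, 0.34, 0.31, 0.44, 0.32)))
best = max(reported, key=reported.get)
```

Here `best` is `((1, 2), (3, 4))`, the low-grade versus high-grade split identified in the text.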
Assessment of the Fuhrman grading system as an independent prognostic factor using the Cox model produced the following P values: 0.0128, < 10⁻⁴, and 0.0041 for Pathologists 1, 2, and 3, respectively. Using the optimal collapsed scheme (Grades 1–2 vs. Grades 3–4), the corresponding P values were found to be 0.0044, 0.0003, and 0.0051, respectively. Survival curves are shown in Figure 2.

Figure 2. Kaplan–Meier curves showing the association between histologic grade and survival for each pathologist. The top row corresponds to the original Fuhrman grading system, and the bottom row corresponds to the optimal two-tiered scheme identified in the current study.
DISCUSSION
Only pT1–3bN0M0 RCCs have been included in the current study.5 We have excluded from our analysis pT3c–4N0M0 lesions, which raise specific problems regarding surgical techniques and survival. In the current cohort, one of the main goals was to improve our ability to identify those who had a high risk of recurrence and who could therefore benefit from adjuvant therapy. Recently, Zisman et al.6 and Kattan et al.7 have suggested considering several clinical and histologic factors in combination to distinguish patients according to their probability of recurrence. Multicenter trials and comparison of data from different institutions both require some type of homogeneous stratification of patients, and whereas clinical and histologic parameters (e.g., symptoms, performance status, histologic pattern, tumor size, TNM status) are objective factors, the Fuhrman grading system, the most widely used system of its type, remains limited in this role by its more subjective nature.
Interobserver agreement in the current study was moderate (combined mean κ = 0.22, corresponding to a 24% concordance rate among the 3 pathologists) when the 4-grade Fuhrman system was used. This finding is consistent with data regarding agreement among groups of 4 pathologists as reported by Lanigan et al.8 and Al-Aytani et al.9 (mean κ = 0.33 and 0.29, respectively). These results differ markedly from those reported by Bretheau et al.,10 who found almost perfect agreement (concordance rate, 95%) between a pair of pathologists; however, the fact that both of these pathologists were located within the same district could partially explain the high level of agreement that was observed. In a previous, unpublished study, we also found a higher rate of concordance (61%) between 2 independent pathologists from within our department.
The moderate level of interobserver agreement in the current study can be explained by a number of factors. For example, RCC is a heterogeneous tumor that is usually composed of cells of different grades, rather than cells that all have the same grade; in other words, intimately admixed cells with varying degrees of atypia are the rule rather than the exception.11 The Fuhrman grading system uses the lowest- or highest-grade focus present to classify tumors,3 but the creators of the system did not define the minimum proportion of highest-grade area that warranted assignment of this elevated grade to the tumor. The Fuhrman grade depends on the outlines of cell nuclei, the presence or absence of nucleoli, and (in part) nuclear size. The use of nuclear size would be expected to introduce an element of objectivity into the assessment, but in practice, it appears that pathologists estimate size as subjectively as they do other criteria.8 This subjectivity also explains the moderate level of intraobserver agreement reported by Al-Aytani et al.9 when tumors were independently graded by 4 pathologists on 2 separate occasions, with a minimum of 3 months between duplicate readings. Another apparent finding from the current study was that, for certain subjective reasons, increasing experience on the part of the pathologist was associated with the assignment of lower grades.
Similar Fuhrman grade distributions have been reported by Fuhrman et al.,3 Medeiros et al.,12 Bretheau et al.,10 and Ficarra et al.13 (Grade 1, 14%, 7%, 28%, and 25%, respectively; Grade 2, 50%, 34%, 31%, and 35%, respectively; Grade 3, 26%, 37%, 31%, and 33%, respectively; and Grade 4, 10%, 22%, 10%, and 7%, respectively). In the current study, relatively few extreme grades were assigned, whereas the two intermediate grades were the most commonly assigned for two of three pathologists. It would be disadvantageous to use highly selective and stringent grading criteria for both extreme groups. Doing so would create a lowest grade that was associated with a high likelihood of having a favorable outcome after surgery and a highest grade that was associated with a high likelihood of having a poor outcome, leaving the majority of patients with some type of intermediate classification. Thus, for patients with intermediate-grade disease, the predictive accuracy of the system with regard to survival or metastasis would suffer.11
Using kappa analysis, we identified the collapsed grading schemes that yielded optimal concordance among the three pathologists in the current study. With regard to 3-tiered schemes, the most satisfactory results were obtained by combining Grades 1 and 2 and leaving Grade 3 and Grade 4 uncombined (combined mean κ = 0.34); however, this scheme did not preserve the high discriminatory power (as in the original grading system) of the extreme grades.
In the current study, the best concordance was obtained by using a 2-tiered system in which low-grade (Grade 1–2) lesions were separated from high-grade (Grade 3–4) lesions (combined mean κ = 0.44, corresponding to 59% concordance among the 3 pathologists). Our findings agree with those of Al-Aytani et al.,9 who reported a mean κ value of 0.45. The modest improvement in κ values after the collapsing of nuclear grades can be attributed to the fact that intermediate-grade tumors (Grades 2–3) made up the bulk of the current series.
Unlike Al-Aytani et al.,9 our aim was to improve interobserver agreement using data that were not obtained from previous survival studies. Nonetheless, we obtained the same cutoff points as were found in a number of previous studies that evaluated the Fuhrman grading system as a means for predicting survival.10, 14-17 In contrast, Green et al.18 found a significant difference in survival between patients with Grade 1–3 RCC and patients with Grade 4 RCC. Among patients with intracapsular RCC, Di Silverio et al.19 found a significant difference in terms of disease progression between those who had Grade 1–2 disease and those who had Grade 3 disease; this cutoff point was the same as was found by Medeiros et al.,12 who investigated RCC cases in all stages. On univariate analysis, Minervini et al.20 found significant differences in disease-specific survival among patients with Grade 1 disease, patients with Grade 2 disease, and patients with Grade 3–4 disease, and similar findings were made by Tsui et al.21 using Kaplan–Meier survival data.
Our approach also appears to be more relevant because, for each of the 3 pathologists, the original Fuhrman grading system possessed significant independent prognostic value with regard to locally advanced or less severe RCC (multivariate analysis: P = 0.01, P < 0.0001, and P = 0.004, respectively). Our confirmation of the prognostic value of the Fuhrman grading system over an extended follow-up period is in agreement with the findings of certain previous studies.11, 13
We have used survival analyses to verify the reliability of the collapsed grading scheme that yielded optimal concordance. Such collapsing necessarily leads to a loss of information and also to a loss of discriminatory power in each group. Despite this collapsing, however, nuclear grade remained independently and significantly predictive of overall survival on multivariate analysis for each of the 3 pathologists in the current study (P = 0.004, P = 0.0003, and P = 0.005, respectively).
In conclusion, it appears that collapsing of the Fuhrman grading system into a low-grade (Grade 1–2) group and a high-grade (Grade 3–4) group improves interobserver agreement without a significant loss of information regarding survival.
Acknowledgements
The authors thank Dr. Marin Wagner for his helpful collaboration and the pathology personnel who provided technical assistance with the current study.