This demo showcases the analysis process. For each task, we provide a corresponding analysis class.
# import analysis tools
from analysis import SUMStat, D2TStat, WMTStat
def truncate_print(l, n=10):
    """Print the first n items of a list."""
    for i, x in enumerate(l):
        if i == n:
            print('...')
            break
        print(x)
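For instance, calling it on a list of 20 numbers with n=5 prints the first five items followed by '...':
truncate_print(list(range(20)), n=5)  # prints 0 through 4, then '...'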
For all summarization datasets, including REALSumm, SummEval and Newsroom, the analysis tools are the same.
summ_stat = SUMStat('SUM/REALSumm/final_p.pkl') # The path to the scored file, _p means we have prompted metrics
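The same class also covers the other summarization datasets; a minimal sketch, assuming the SummEval and Newsroom scored files follow the same directory layout:
# hypothetical paths, assuming the same layout as REALSumm
summeval_stat = SUMStat('SUM/SummEval/final_p.pkl')
newsroom_stat = SUMStat('SUM/Newsroom/final_p.pkl')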
Let's see what metrics are available.
Since there are many, including P, R, and F variants of some metrics as well as prompted metrics, we only print a truncated list of metrics.
print('[All metrics]')
truncate_print(summ_stat.metrics) # change to print if you want to see all metrics
print('[Automatic metrics]')
truncate_print(summ_stat.auto_metrics)
print('[Human metrics]')
truncate_print(summ_stat.human_metrics)
[All metrics]
litepyramid_recall
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_src_hypo
bart_score_hypo_ref
bart_score_ref_hypo
bart_score_avg_f
bart_score_harm_f
...
[Automatic metrics]
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_src_hypo
bart_score_hypo_ref
bart_score_ref_hypo
bart_score_avg_f
bart_score_harm_f
bart_score_cnn_src_hypo
...
[Human metrics]
litepyramid_recall
We can choose the metrics we are interested in to conduct the analysis.
For example, in REALSumm we use recall-based metrics (e.g. bert_score_r, rouge1_r, bart_score_cnn_hypo_ref, ...).
For other datasets, we use F-based metrics (for metrics that only consider hypo and ref) and src->hypo variants (for generation-based metrics like bart_score and prism).
valid_metrics = [
    'rouge1_r',
    'rouge2_r',
    'rougel_r',
    'bert_score_r',
    'mover_score',
    'prism_hypo_ref',
    'bart_score_cnn_hypo_ref'
]
# The first argument is the human metric considered.
# The second argument is a list of automatic metrics to consider; omit it to consider all automatic metrics.
summ_stat.evaluate_summary('litepyramid_recall', valid_metrics)
Human metric: litepyramid_recall
metric                     spearman    kendalltau
-----------------------  ----------  ------------
rouge1_r                    0.497526      0.407974
rougel_r                    0.488254      0.402523
bart_score_cnn_hypo_ref     0.474608      0.374497
bert_score_r                0.440398      0.346489
rouge2_r                    0.4233        0.353119
prism_hypo_ref              0.411005      0.323994
mover_score                 0.372353      0.290156
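As noted in the comment above, the second argument can be omitted to evaluate against every automatic metric; a minimal sketch:
# no metric list given, so all automatic metrics are evaluated
summ_stat.evaluate_summary('litepyramid_recall')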
We can also see the performance of some prompt-based metrics.
valid_metrics = [
    'bart_score_cnn_hypo_ref_de_id est',
    'bart_score_cnn_hypo_ref_de_Videlicet',
    'bart_score_cnn_hypo_ref_de_To give an instance',
    'bart_score_cnn_hypo_ref_de_To give an example',
    'bart_score_cnn_hypo_ref_de_As an illustration'
]
summ_stat.evaluate_summary('litepyramid_recall', valid_metrics)
Human metric: litepyramid_recall
metric                                            spearman    kendalltau
----------------------------------------------  ----------  ------------
bart_score_cnn_hypo_ref_de_id est                 0.49539       0.392728
bart_score_cnn_hypo_ref_de_Videlicet              0.491011      0.388237
bart_score_cnn_hypo_ref_de_To give an instance    0.49081       0.387054
bart_score_cnn_hypo_ref_de_To give an example     0.489033      0.38625
bart_score_cnn_hypo_ref_de_As an illustration     0.488977      0.385511
To combine prompt-based metrics, run the following.
summ_stat.combine_prompt()
summ_stat.evaluate_summary('litepyramid_recall', ['bart_score_cnn_hypo_ref_de'])
Human metric: litepyramid_recall
metric                        spearman    kendalltau
--------------------------  ----------  ------------
bart_score_cnn_hypo_ref_de     0.48784      0.386398
To conduct bootstrap significance tests, we provide the sig_test_two() and sig_test() methods.
# The first two arguments are metrics that should be compared, the third argument is the human metric.
m1 = 'bart_score_cnn_hypo_ref'
m2 = 'bert_score_r'
result = summ_stat.sig_test_two(m1, m2, 'litepyramid_recall')
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [01:28<00:00, 11.34it/s]
bart_score_cnn_hypo_ref is significantly better than bert_score_r
# The first argument is a list of metrics considered
# The second argument is the human metric
summ_stat.sig_test(['rouge1_r', 'bart_score_cnn_hypo_ref', 'bert_score_r'], 'litepyramid_recall')
100%|██████████| 1000/1000 [01:28<00:00, 11.32it/s]
100%|██████████| 1000/1000 [01:24<00:00, 11.81it/s]
100%|██████████| 1000/1000 [01:26<00:00, 11.55it/s]
Best metrics are: ['rouge1_r']
We use the Rank19 and QAGS_CNN datasets to showcase some basic usage. The former uses accuracy as its evaluation metric while the latter uses Pearson correlation.
We first print out the factuality accuracy obtained using different metrics for the Rank19 dataset.
fact_stat = SUMStat('SUM/Rank19/final_p.pkl')
fact_stat.combine_prompt()
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_src_hypo',
    'bart_score_cnn_src_hypo',
    'bart_score_cnn_src_hypo_de'
]
# Print accuracy, take a list of metrics
fact_stat.get_fact_acc(valid_metrics)
metric                           acc
--------------------------  --------
bart_score_cnn_src_hypo     0.836461
bart_score_cnn_src_hypo_de  0.796247
prism_src_hypo              0.780161
bert_score_f                0.713137
mover_score                 0.713137
rouge2_f                    0.630027
rougel_f                    0.587131
rouge1_f                    0.568365
Below are some methods that facilitate the significance test.
m1 = 'bart_score_cnn_src_hypo'
m2 = 'bert_score_f'
result = fact_stat.fact_acc_sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:01<00:00, 744.17it/s]
bart_score_cnn_src_hypo is significantly better than bert_score_f
# Take a list of metrics, print the best metrics
fact_stat.fact_acc_sig_test(['bart_score_cnn_src_hypo', 'prism_src_hypo', 'bert_score_f'])
100%|██████████| 1000/1000 [00:00<00:00, 2082.68it/s]
100%|██████████| 1000/1000 [00:01<00:00, 666.78it/s]
100%|██████████| 1000/1000 [00:01<00:00, 614.94it/s]
Best metrics are: ['bart_score_cnn_src_hypo']
fact_stat = SUMStat('SUM/QAGS_CNN/final_p.pkl')
fact_stat.combine_prompt()
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_src_hypo',
    'bart_score_cnn_src_hypo',
    'bart_score_cnn_src_hypo_de'
]
# Print Pearson correlation, take a list of metrics
fact_stat.get_fact_pearson(valid_metrics)
metric                        pearson
--------------------------  ---------
bart_score_cnn_src_hypo      0.734672
bart_score_cnn_src_hypo_de   0.718525
bert_score_f                 0.575994
prism_src_hypo               0.478689
rouge2_f                     0.459141
mover_score                  0.41414
rougel_f                     0.356889
rouge1_f                     0.337667
m1 = 'bart_score_cnn_src_hypo'
m2 = 'bert_score_f'
result = fact_stat.fact_pearson_sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:00<00:00, 1177.00it/s]
bart_score_cnn_src_hypo is significantly better than bert_score_f
# Take a list of metrics, print the best metrics
fact_stat.fact_pearson_sig_test(['bart_score_cnn_src_hypo', 'prism_src_hypo', 'bert_score_f'])
100%|██████████| 1000/1000 [00:00<00:00, 1986.75it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1156.13it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1173.93it/s]
Best metrics are: ['bart_score_cnn_src_hypo']
For all data-to-text datasets, including BAGEL, SFHOT and SFRES, the analysis tools are the same.
d2t_stat = D2TStat('D2T/BAGEL/final_p.pkl')
d2t_stat.combine_prompt() # combine the prompt-based results
Let's see what metrics are available. For data-to-text datasets, the human metrics are informativeness, naturalness and quality.
print('[All metrics]')
truncate_print(d2t_stat.metrics) # change to print if you want to see all metrics
print('[Automatic metrics]')
truncate_print(d2t_stat.auto_metrics)
print('[Human metrics]')
truncate_print(d2t_stat.human_metrics)
[All metrics]
informativeness
naturalness
quality
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_ref_hypo
bart_score_hypo_ref
bart_score_avg_f
...
[Automatic metrics]
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_ref_hypo
bart_score_hypo_ref
bart_score_avg_f
bart_score_harm_f
bart_score_cnn_ref_hypo
bart_score_cnn_hypo_ref
...
[Human metrics]
informativeness
naturalness
quality
We can print out the correlation with human judgments as follows.
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_avg',
    'bart_score_para_avg_f',
    'bart_score_para_avg_f_de'
]
# The first argument is the human metric, and the second is a list of metrics considered.
d2t_stat.evaluate_text('informativeness', valid_metrics)
Human metric: informativeness
metric                      spearman    kendalltau
------------------------  ----------  ------------
bart_score_para_avg_f_de    0.335997      0.248525
bart_score_para_avg_f       0.329896      0.246686
prism_avg                   0.304946      0.224797
bert_score_f                0.289118      0.217179
mover_score                 0.283694      0.20884
rouge1_f                    0.234338      0.177972
rouge2_f                    0.198585      0.151011
rougel_f                    0.188592      0.145508
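The other human metrics (naturalness and quality) can be evaluated the same way; for example:
# same call, different human metric
d2t_stat.evaluate_text('naturalness', valid_metrics)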
To perform a significance test, use the sig_test_two() method.
m1 = 'bart_score_para_avg_f'
m2 = 'prism_avg'
# The first two arguments are metrics that should be compared, the third argument is the human metric.
result = d2t_stat.sig_test_two(m1, m2, 'informativeness')
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [01:19<00:00, 12.54it/s]
bart_score_para_avg_f is significantly better than prism_avg
For all language pairs, the analysis tools are the same. We begin by looking at reference length statistics.
wmt_stat = WMTStat('WMT/kk-en/final_p.pkl')
wmt_stat.print_ref_len()
Mean reference length: 17.75
Max reference length: 180
Min reference length: 1
20% percentile: 10.0
80% percentile: 25.0
90% percentile: 31.0
Next, we print out k-tau for all automatic metrics.
print('All metrics')
print(wmt_stat.metrics) # Print out all metrics
print('\n')
print('All k-tau')
wmt_stat.print_ktau()
print('\n')
print('k-tau for some metrics')
wmt_stat.print_ktau(['prism', 'bart_score_para'])
All metrics
['bleu', 'chrf', 'bleurt', 'prism', 'comet', 'bert_score', 'bart_score', 'bart_score_cnn', 'bart_score_para', 'bart_score_para_en_Such as', 'bart_score_para_de_Such as']

All k-tau
100%|██████████| 11/11 [00:00<00:00, 64.20it/s]
100%|██████████| 2/2 [00:00<00:00, 67.73it/s]
metric                         k-tau
--------------------------  --------
bart_score_para_de_Such as   0.38014
bart_score_para              0.378495
comet                        0.378289
bart_score_para_en_Such as   0.375822
bleurt                       0.371505
prism                        0.362048
bert_score                   0.350535
bart_score_cnn               0.347656
bart_score                   0.324424
chrf                         0.322985
bleu                         0.276316

k-tau for some metrics
metric             k-tau
---------------  --------
bart_score_para  0.378495
prism            0.362048
To print out the k-tau over a certain reference length range, run the following.
print('All k-tau')
wmt_stat.print_len_ktau(15, 25)
print('\n')
print('k-tau for some metrics')
wmt_stat.print_len_ktau(15, 25, ['prism', 'bart_score_para'])
100%|██████████| 9728/9728 [00:00<00:00, 648147.63it/s]
100%|██████████| 11/11 [00:00<00:00, 179.12it/s]
100%|██████████| 9728/9728 [00:00<00:00, 625838.84it/s]
100%|██████████| 2/2 [00:00<00:00, 194.46it/s]
All k-tau
Considered samples: 3545
metric                         k-tau
--------------------------  --------
comet                        0.351763
bart_score_para              0.335966
bert_score                   0.332581
bart_score_para_de_Such as   0.332017
bart_score_para_en_Such as   0.331453
prism                        0.322426
bleurt                       0.321862
chrf                         0.318477
bart_score                   0.305501
bleu                         0.300987
bart_score_cnn               0.300987

k-tau for some metrics
Considered samples: 3545
metric             k-tau
---------------  --------
bart_score_para  0.335966
prism            0.322426
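The length range is up to you; for instance, given the percentiles printed earlier (80% percentile 25.0, 90% percentile 31.0), a sketch that focuses on longer references might look like:
# k-tau over references of length 25 to 31 (an illustrative range based on the percentiles above)
wmt_stat.print_len_ktau(25, 31, ['prism', 'bart_score_para'])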
To perform a significance test, use sig_test_two().
m1 = 'bart_score_para'
m2 = 'bleurt'
# The two arguments are the metrics to be compared.
result = wmt_stat.sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:33<00:00, 29.77it/s]
bart_score_para is significantly better than bleurt