This demo showcases the analysis process. For each task, we provide a corresponding analysis class.
# import analysis tools
from analysis import SUMStat, D2TStat, WMTStat
def truncate_print(l, n=10):
    """Print the first n items of a list."""
    for i, x in enumerate(l):
        if i == n:
            print('...')
            break
        print(x)
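For instance, calling it on a list of 20 numbers with n=5 prints the first five items followed by '...':
truncate_print(list(range(20)), n=5)  # prints 0 through 4, then '...'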
For all summarization datasets, including REALSumm, SummEval and Newsroom, the analysis tools are the same.
summ_stat = SUMStat('SUM/REALSumm/final_p.pkl') # The path to the scored file, _p means we have prompted metrics
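The same class also covers the other summarization datasets; a minimal sketch, assuming the SummEval and Newsroom scored files follow the same directory layout:
# hypothetical paths, assuming the same layout as REALSumm
summeval_stat = SUMStat('SUM/SummEval/final_p.pkl')
newsroom_stat = SUMStat('SUM/Newsroom/final_p.pkl')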
Let's see what metrics are available.
Since there are many, including P, R, and F variants of some metrics as well as prompted metrics, we only print a truncated list of metrics.
print('[All metrics]')
truncate_print(summ_stat.metrics) # change to print if you want to see all metrics
print('[Automatic metrics]')
truncate_print(summ_stat.auto_metrics)
print('[Human metrics]')
truncate_print(summ_stat.human_metrics)
[All metrics]
litepyramid_recall
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_src_hypo
bart_score_hypo_ref
bart_score_ref_hypo
bart_score_avg_f
bart_score_harm_f
...
[Automatic metrics]
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_src_hypo
bart_score_hypo_ref
bart_score_ref_hypo
bart_score_avg_f
bart_score_harm_f
bart_score_cnn_src_hypo
...
[Human metrics]
litepyramid_recall
We can choose the metrics we are interested in to conduct the analysis.
For example, in REALSumm we use recall-based metrics (e.g. bert_score_r, rouge1_r, bart_score_cnn_hypo_ref, ...).
For other datasets, we use F-based metrics (for metrics that only consider hypo and ref) and src->hypo variants (for generation-based metrics like bart_score and prism).
valid_metrics = [
    'rouge1_r',
    'rouge2_r',
    'rougel_r',
    'bert_score_r',
    'mover_score',
    'prism_hypo_ref',
    'bart_score_cnn_hypo_ref'
]
# The first argument is the human metric considered.
# The second argument is a list of automatic metrics to consider; omit it to consider all automatic metrics.
summ_stat.evaluate_summary('litepyramid_recall', valid_metrics)
Human metric: litepyramid_recall
metric                     spearman    kendalltau
-----------------------  ----------  ------------
rouge1_r                    0.497526      0.407974
rougel_r                    0.488254      0.402523
bart_score_cnn_hypo_ref     0.474608      0.374497
bert_score_r                0.440398      0.346489
rouge2_r                    0.4233        0.353119
prism_hypo_ref              0.411005      0.323994
mover_score                 0.372353      0.290156
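As noted in the comment above, the second argument can be omitted to evaluate against every automatic metric; a minimal sketch:
# no metric list given, so all automatic metrics are evaluated
summ_stat.evaluate_summary('litepyramid_recall')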
We can also see the performance of some prompt-based metrics.
valid_metrics = [
    'bart_score_cnn_hypo_ref_de_id est',
    'bart_score_cnn_hypo_ref_de_Videlicet',
    'bart_score_cnn_hypo_ref_de_To give an instance',
    'bart_score_cnn_hypo_ref_de_To give an example',
    'bart_score_cnn_hypo_ref_de_As an illustration'
]
summ_stat.evaluate_summary('litepyramid_recall', valid_metrics)
Human metric: litepyramid_recall
metric                                            spearman    kendalltau
----------------------------------------------  ----------  ------------
bart_score_cnn_hypo_ref_de_id est                 0.49539       0.392728
bart_score_cnn_hypo_ref_de_Videlicet              0.491011      0.388237
bart_score_cnn_hypo_ref_de_To give an instance    0.49081       0.387054
bart_score_cnn_hypo_ref_de_To give an example     0.489033      0.38625
bart_score_cnn_hypo_ref_de_As an illustration     0.488977      0.385511
To combine prompt-based metrics, run the following.
summ_stat.combine_prompt()
summ_stat.evaluate_summary('litepyramid_recall', ['bart_score_cnn_hypo_ref_de'])
Human metric: litepyramid_recall
metric                        spearman    kendalltau
--------------------------  ----------  ------------
bart_score_cnn_hypo_ref_de     0.48784      0.386398
To conduct bootstrap significance tests, we provide the sig_test_two() and sig_test() methods.
# The first two arguments are metrics that should be compared, the third argument is the human metric.
m1 = 'bart_score_cnn_hypo_ref'
m2 = 'bert_score_r'
result = summ_stat.sig_test_two(m1, m2, 'litepyramid_recall')
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [01:28<00:00, 11.34it/s]
bart_score_cnn_hypo_ref is significantly better than bert_score_r
# The first argument is a list of metrics considered
# The second argument is the human metric
summ_stat.sig_test(['rouge1_r', 'bart_score_cnn_hypo_ref', 'bert_score_r'], 'litepyramid_recall')
100%|██████████| 1000/1000 [01:28<00:00, 11.32it/s]
100%|██████████| 1000/1000 [01:24<00:00, 11.81it/s]
100%|██████████| 1000/1000 [01:26<00:00, 11.55it/s]
Best metrics are: ['rouge1_r']
We use the Rank19 and QAGS_CNN datasets to showcase some basic usage. The former uses accuracy as its evaluation metric while the latter uses Pearson correlation.
We first print out the factuality accuracy obtained using different metrics for the Rank19 dataset.
fact_stat = SUMStat('SUM/Rank19/final_p.pkl')
fact_stat.combine_prompt()
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_src_hypo',
    'bart_score_cnn_src_hypo',
    'bart_score_cnn_src_hypo_de'
]
# Print accuracy, take a list of metrics
fact_stat.get_fact_acc(valid_metrics)
metric                           acc
--------------------------  --------
bart_score_cnn_src_hypo     0.836461
bart_score_cnn_src_hypo_de  0.796247
prism_src_hypo              0.780161
bert_score_f                0.713137
mover_score                 0.713137
rouge2_f                    0.630027
rougel_f                    0.587131
rouge1_f                    0.568365
Below are some methods that facilitate the significance test.
m1 = 'bart_score_cnn_src_hypo'
m2 = 'bert_score_f'
result = fact_stat.fact_acc_sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:01<00:00, 744.17it/s]
bart_score_cnn_src_hypo is significantly better than bert_score_f
# Take a list of metrics, print the best metrics
fact_stat.fact_acc_sig_test(['bart_score_cnn_src_hypo', 'prism_src_hypo', 'bert_score_f'])
100%|██████████| 1000/1000 [00:00<00:00, 2082.68it/s]
100%|██████████| 1000/1000 [00:01<00:00, 666.78it/s]
100%|██████████| 1000/1000 [00:01<00:00, 614.94it/s]
Best metrics are: ['bart_score_cnn_src_hypo']
fact_stat = SUMStat('SUM/QAGS_CNN/final_p.pkl')
fact_stat.combine_prompt()
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_src_hypo',
    'bart_score_cnn_src_hypo',
    'bart_score_cnn_src_hypo_de'
]
# Print Pearson correlation, take a list of metrics
fact_stat.get_fact_pearson(valid_metrics)
metric                        pearson
--------------------------  ---------
bart_score_cnn_src_hypo      0.734672
bart_score_cnn_src_hypo_de   0.718525
bert_score_f                 0.575994
prism_src_hypo               0.478689
rouge2_f                     0.459141
mover_score                  0.41414
rougel_f                     0.356889
rouge1_f                     0.337667
m1 = 'bart_score_cnn_src_hypo'
m2 = 'bert_score_f'
result = fact_stat.fact_pearson_sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:00<00:00, 1177.00it/s]
bart_score_cnn_src_hypo is significantly better than bert_score_f
# Take a list of metrics, print the best metrics
fact_stat.fact_pearson_sig_test(['bart_score_cnn_src_hypo', 'prism_src_hypo', 'bert_score_f'])
100%|██████████| 1000/1000 [00:00<00:00, 1986.75it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1156.13it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1173.93it/s]
Best metrics are: ['bart_score_cnn_src_hypo']
For all data-to-text datasets, including BAGEL, SFHOT and SFRES, the analysis tools are the same.
d2t_stat = D2TStat('D2T/BAGEL/final_p.pkl')
d2t_stat.combine_prompt() # combine the prompt-based results
Let's see what metrics are available. For data-to-text datasets, the human metrics are informativeness, naturalness and quality.
print('[All metrics]')
truncate_print(d2t_stat.metrics) # change to print if you want to see all metrics
print('[Automatic metrics]')
truncate_print(d2t_stat.auto_metrics)
print('[Human metrics]')
truncate_print(d2t_stat.human_metrics)
[All metrics]
informativeness
naturalness
quality
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_ref_hypo
bart_score_hypo_ref
bart_score_avg_f
...
[Automatic metrics]
bert_score_p
bert_score_r
bert_score_f
mover_score
bart_score_ref_hypo
bart_score_hypo_ref
bart_score_avg_f
bart_score_harm_f
bart_score_cnn_ref_hypo
bart_score_cnn_hypo_ref
...
[Human metrics]
informativeness
naturalness
quality
We can print out the correlation with human judgments as follows.
# Set valid metrics
valid_metrics = [
    'rouge1_f',
    'rouge2_f',
    'rougel_f',
    'bert_score_f',
    'mover_score',
    'prism_avg',
    'bart_score_para_avg_f',
    'bart_score_para_avg_f_de'
]
# The first argument is the human metric, and the second is a list of metrics considered.
d2t_stat.evaluate_text('informativeness', valid_metrics)
Human metric: informativeness
metric                      spearman    kendalltau
------------------------  ----------  ------------
bart_score_para_avg_f_de    0.335997      0.248525
bart_score_para_avg_f       0.329896      0.246686
prism_avg                   0.304946      0.224797
bert_score_f                0.289118      0.217179
mover_score                 0.283694      0.20884
rouge1_f                    0.234338      0.177972
rouge2_f                    0.198585      0.151011
rougel_f                    0.188592      0.145508
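The other human metrics (naturalness and quality) can be evaluated the same way; for example:
# same call, different human metric
d2t_stat.evaluate_text('naturalness', valid_metrics)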
To perform a significance test, use the sig_test_two() method.
m1 = 'bart_score_para_avg_f'
m2 = 'prism_avg'
# The first two arguments are metrics that should be compared, the third argument is the human metric.
result = d2t_stat.sig_test_two(m1, m2, 'informativeness')
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [01:19<00:00, 12.54it/s]
bart_score_para_avg_f is significantly better than prism_avg
For all language pairs, the analysis tools are the same. We begin by looking at reference length statistics.
wmt_stat = WMTStat('WMT/kk-en/final_p.pkl')
wmt_stat.print_ref_len()
Mean reference length: 17.75
Max reference length: 180
Min reference length: 1
20% percentile: 10.0
80% percentile: 25.0
90% percentile: 31.0
Next, we print out k-tau for all automatic metrics.
print('All metrics')
print(wmt_stat.metrics) # Print out all metrics
print('\n')
print('All k-tau')
wmt_stat.print_ktau()
print('\n')
print('k-tau for some metrics')
wmt_stat.print_ktau(['prism', 'bart_score_para'])
All metrics
['bleu', 'chrf', 'bleurt', 'prism', 'comet', 'bert_score', 'bart_score', 'bart_score_cnn', 'bart_score_para', 'bart_score_para_en_Such as', 'bart_score_para_de_Such as']

All k-tau
100%|██████████| 11/11 [00:00<00:00, 64.20it/s]
100%|██████████| 2/2 [00:00<00:00, 67.73it/s]
metric                         k-tau
--------------------------  --------
bart_score_para_de_Such as   0.38014
bart_score_para              0.378495
comet                        0.378289
bart_score_para_en_Such as   0.375822
bleurt                       0.371505
prism                        0.362048
bert_score                   0.350535
bart_score_cnn               0.347656
bart_score                   0.324424
chrf                         0.322985
bleu                         0.276316

k-tau for some metrics
metric             k-tau
---------------  --------
bart_score_para  0.378495
prism            0.362048
To print out the k-tau over a certain reference length range, run the following.
print('All k-tau')
wmt_stat.print_len_ktau(15, 25)
print('\n')
print('k-tau for some metrics')
wmt_stat.print_len_ktau(15, 25, ['prism', 'bart_score_para'])
100%|██████████| 9728/9728 [00:00<00:00, 648147.63it/s]
100%|██████████| 11/11 [00:00<00:00, 179.12it/s]
100%|██████████| 9728/9728 [00:00<00:00, 625838.84it/s]
100%|██████████| 2/2 [00:00<00:00, 194.46it/s]
All k-tau
Considered samples: 3545
metric                         k-tau
--------------------------  --------
comet                        0.351763
bart_score_para              0.335966
bert_score                   0.332581
bart_score_para_de_Such as   0.332017
bart_score_para_en_Such as   0.331453
prism                        0.322426
bleurt                       0.321862
chrf                         0.318477
bart_score                   0.305501
bleu                         0.300987
bart_score_cnn               0.300987

k-tau for some metrics
Considered samples: 3545
metric             k-tau
---------------  --------
bart_score_para  0.335966
prism            0.322426
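The length range is up to you; for instance, given the percentiles printed earlier (80% percentile 25.0, 90% percentile 31.0), a sketch that focuses on longer references might look like:
# k-tau over references of length 25 to 31 (an illustrative range based on the percentiles above)
wmt_stat.print_len_ktau(25, 31, ['prism', 'bart_score_para'])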
To perform a significance test, use sig_test_two().
m1 = 'bart_score_para'
m2 = 'bleurt'
# The two arguments are the metrics to be compared.
result = wmt_stat.sig_test_two(m1, m2)
if result == 1:
    print(f'{m1} is significantly better than {m2}')
elif result == -1:
    print(f'{m2} is significantly better than {m1}')
else:
    print('cannot decide')
100%|██████████| 1000/1000 [00:33<00:00, 29.77it/s]
bart_score_para is significantly better than bleurt