Notebook Description¶
Often we want to check whether two files of seizure annotations contain the same annotations. For example, if you look through a week of recordings and annotate the seizures, comparing a classifier’s predictions with your annotations lets you count the false positives and (more importantly) the false negatives.
In [31]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
As an example, we will look at the difference between a raw predictions output and the same file after it has been manually checked and the false positives removed using the GUI.
In [42]:
checked_preds = pd.read_csv('./example_data/checked_predictions.csv', index_col=0)
raw_preds = pd.read_csv('./example_data/raw_predictions.csv', index_col=0)
checked_preds = pd.read_csv('/media/jonathan/My Passport-ELE -EEG/2019/analysis pyecog/01b_baseline_annotation_only.csv',
index_col=0)
raw_preds = pd.read_csv('/media/jonathan/My Passport-ELE -EEG/2019/analysis pyecog/prediction from Tawfeeq Library/predictions_baselineconverted01.csv',
index_col=0)
#
path= r'nathan\My Passport-ELE -EEG\2019/analysis pyecog/prediction from Ta'
raw_preds = pd.read_csv(path, index_col=0)
In [60]:
x = 5
print(x)
x = 10
print(x)
x
5
10
Out[60]:
10
In [ ]:
# Note to jonny 2019:
# The file output by the clf does not have the [] around the transmitter. Whereas
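If the two files format the transmitter differently (for example `119` in one and `[119]` in the other), any key built from that column will not match between them. A minimal sketch of one way to guard against this, assuming the column only ever holds a single id with or without brackets; the helper name `normalise_transmitter` is hypothetical and not part of pyecog:

```python
import pandas as pd

def normalise_transmitter(df):
    """Return a copy whose transmitter column is always of the form '[119]'."""
    out = df.copy()
    # Strip any existing brackets and surrounding spaces, then add brackets back.
    bare = out['transmitter'].astype(str).str.strip('[] ')
    out['transmitter'] = '[' + bare + ']'
    return out

# Made-up values covering the bare, bracketed and padded cases.
demo = pd.DataFrame({'transmitter': [119, '[142]', ' 28 ']})
print(normalise_transmitter(demo)['transmitter'].tolist())
```

Applying the same normalisation to both dataframes before comparing them keeps the comparison independent of which tool wrote the file.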
In [43]:
raw_preds.head()
Out[43]:
| | old_index | filename | start | end | duration | transmitter | real_start | real_end |
---|---|---|---|---|---|---|---|---|
0 | 0 | M1566408624_2019-08-21-18-30-24_tids_[142].h5 | 125.0 | 285.0 | 160.0 | [142] | 2019-08-21 18:32:29 | 2019-08-21 18:35:09 |
1 | 1 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 1965.0 | 2025.0 | 60.0 | [119] | 2019-08-22 09:03:09 | 2019-08-22 09:04:09 |
2 | 2 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2125.0 | 2195.0 | 70.0 | [119] | 2019-08-22 09:05:49 | 2019-08-22 09:06:59 |
3 | 3 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2290.0 | 2305.0 | 15.0 | [119] | 2019-08-22 09:08:34 | 2019-08-22 09:08:49 |
4 | 4 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2380.0 | 2425.0 | 45.0 | [119] | 2019-08-22 09:10:04 | 2019-08-22 09:10:49 |
In [44]:
raw_preds.shape
Out[44]:
(112, 8)
In [45]:
checked_preds.shape
Out[45]:
(77, 8)
In [46]:
checked_preds.head()
Out[46]:
| | old_index | filename | start | end | duration | transmitter | real_start | real_end |
---|---|---|---|---|---|---|---|---|
2 | 414 | M1567629024_2019-09-04-21-30-24_tids_[28, 33, ... | 860.18 | 939.98 | 79.80 | [119] | 04/09/2019 21:44 | 04/09/2019 21:46 |
4 | 407 | M1567611024_2019-09-04-16-30-24_tids_[28, 33, ... | 2976.58 | 3031.56 | 54.98 | [119] | 04/09/2019 17:20 | 04/09/2019 17:20 |
7 | 380 | M1567524624_2019-09-03-16-30-24_tids_[28, 33, ... | 26.26 | 68.79 | 42.53 | [119] | 03/09/2019 16:30 | 03/09/2019 16:31 |
9 | 377 | M1567521024_2019-09-03-15-30-24_tids_[28, 33, ... | 3298.90 | 3347.56 | 48.66 | [119] | 03/09/2019 16:25 | 03/09/2019 16:26 |
10 | 376 | M1567521024_2019-09-03-15-30-24_tids_[28, 33, ... | 2996.26 | 3057.73 | 61.47 | [119] | 03/09/2019 16:20 | 03/09/2019 16:21 |
In [47]:
print('So we expect there to be', raw_preds.shape[0]-checked_preds.shape[0], 'false positives')
So we expect there to be 35 false positives
Code for comparing the dataframes¶
In [48]:
def add_mcode_tid_col(df):
    '''Note this expects file start to be of format: M1513966209'''
    df['mcode_tid'] = df.filename.str.slice(0,11)+'_'+df.transmitter.astype(str)
    return df

def check_overlap(series1,series2):
    ''' pandas series should both have start and end attrs
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    start_a, end_a = float(series1.start), float(series1.end)
    start_b, end_b = float(series2.start), float(series2.end)
    overlap_bool = (start_a <= end_b) and (end_a>=start_b)
    return overlap_bool

def calculate_overlap(series1,series2):
    ''' pandas series should both have start and end attrs
    Only meaningful when the two ranges overlap (see check_overlap).
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    a, b = float(series1.start), float(series1.end)
    c, d = float(series2.start), float(series2.end)
    overlap = min([b-a,b-c,d-c,d-a])
    return overlap

def compare_dfs(prediction_df, annotation_df):
    '''
    Function to check how much of prediction_df is found in annotation_df

    Returns two dataframes:
    - preds_in_annotations_df: the predictions found within the annotations and the amount of overlap
      (probably corresponding to true positives). Has an overlap column (seconds).
    - preds_not_in_annotations_df: the predictions not found in the annotations
      (probably corresponding to false positives)

    Note:
    To check for false negatives, or missed seizures, pass in the actual annotations
    as the 'prediction_df' and the actual predictions as the 'annotation_df'.
    Here we would hope for no 'false positives', so the second dataframe will contain the
    missed seizures...
    '''
    # first add a 'mcode_tid' col: allows us to check for same hour and transmitter
    prediction_df = add_mcode_tid_col(prediction_df)
    annotation_df = add_mcode_tid_col(annotation_df)

    # Create empty dataframes that we will add to below
    # NB: DataFrame.append (used below) was removed in pandas 2.0, so this needs pandas < 2.0
    preds_in_annotations_df = pd.DataFrame(columns = prediction_df.columns)
    preds_not_in_annotations_df = pd.DataFrame(columns = prediction_df.columns)

    # loop over the predictions
    for _, prediction_row_series in prediction_df.iterrows():
        overlap_bool = False # whether the predicted seizure overlaps with any annotation at all
        # first check if the hour&transmitter is in the annotations
        if prediction_row_series.mcode_tid in annotation_df.mcode_tid.unique():
            # next find all annotations with same hour and tid as the prediction row
            # this will often just be one row, but if >1 seizures in a single hour will be more
            relevant_annotations_df = annotation_df[annotation_df.mcode_tid.isin([prediction_row_series.mcode_tid])]
            # finally check if the start and end columns overlap
            t_overlap = 0 # store the overlap time between preds and seizures
            for _, annotation_row_series in relevant_annotations_df.iterrows():
                row_overlap = check_overlap(prediction_row_series,
                                            annotation_row_series) # in the case that two seizures, want to add...
                overlap_bool += row_overlap
                if row_overlap: # is this robust to two seizures?
                    t_overlap += calculate_overlap(prediction_row_series,
                                                   annotation_row_series)
        if overlap_bool>0:
            prediction_row_series['overlap'] = t_overlap
            preds_in_annotations_df = preds_in_annotations_df.append(prediction_row_series)
        else:
            preds_not_in_annotations_df = preds_not_in_annotations_df.append(prediction_row_series)
    return preds_in_annotations_df, preds_not_in_annotations_df

true_positives, false_positives = compare_dfs(raw_preds,checked_preds)
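To see what the two overlap helpers compute, here is a small standalone sketch that uses plain numbers instead of pandas rows. The function names `intervals_overlap` and `overlap_seconds` are illustrative stand-ins for `check_overlap` and `calculate_overlap`, and the times are made up for the example:

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if the closed ranges [a_start, a_end] and [b_start, b_end] touch or overlap."""
    return a_start <= b_end and a_end >= b_start

def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the shared span; only meaningful when the ranges overlap."""
    return min(a_end - a_start, a_end - b_start, b_end - b_start, b_end - a_start)

# Prediction 100-160 s, annotation 120-180 s: they share 120-160 s.
print(intervals_overlap(100, 160, 120, 180))  # True
print(overlap_seconds(100, 160, 120, 180))    # 40

# Prediction 100-160 s, annotation 200-250 s: no shared span.
print(intervals_overlap(100, 160, 200, 250))  # False
```

For ranges that do not overlap the `min` expression goes negative, which is why the overlap time is only accumulated after the boolean check has passed.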
In [49]:
true_positives.shape, false_positives.shape
Out[49]:
((67, 10), (45, 9))
In [50]:
true_positives.head()
Out[50]:
| | old_index | filename | start | end | duration | transmitter | real_start | real_end | mcode_tid | overlap |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1.0 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 1965.0 | 2025.0 | 60.0 | [119] | 2019-08-22 09:03:09 | 2019-08-22 09:04:09 | M1566459024_[119] | 58.43 |
2 | 2.0 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2125.0 | 2195.0 | 70.0 | [119] | 2019-08-22 09:05:49 | 2019-08-22 09:06:59 | M1566459024_[119] | 39.32 |
3 | 3.0 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2290.0 | 2305.0 | 15.0 | [119] | 2019-08-22 09:08:34 | 2019-08-22 09:08:49 | M1566459024_[119] | 15.00 |
4 | 4.0 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2380.0 | 2425.0 | 45.0 | [119] | 2019-08-22 09:10:04 | 2019-08-22 09:10:49 | M1566459024_[119] | 43.08 |
5 | 5.0 | M1566459024_2019-08-22-08-30-24_tids_[119].h5 | 2585.0 | 2635.0 | 50.0 | [119] | 2019-08-22 09:13:29 | 2019-08-22 09:14:19 | M1566459024_[119] | 50.00 |
In [51]:
false_positives.head()
Out[51]:
| | old_index | filename | start | end | duration | transmitter | real_start | real_end | mcode_tid |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | M1566408624_2019-08-21-18-30-24_tids_[142].h5 | 125.0 | 285.0 | 160.0 | [142] | 2019-08-21 18:32:29 | 2019-08-21 18:35:09 | M1566408624_[142] |
21 | 21.0 | M1566765024_2019-08-25-21-30-24_tids_[119].h5 | 875.0 | 950.0 | 75.0 | [119] | 2019-08-25 21:44:59 | 2019-08-25 21:46:14 | M1566765024_[119] |
23 | 23.0 | M1566822624_2019-08-26-13-30-24_tids_[119].h5 | 2545.0 | 2625.0 | 80.0 | [119] | 2019-08-26 14:12:49 | 2019-08-26 14:14:09 | M1566822624_[119] |
24 | 24.0 | M1566822624_2019-08-26-13-30-24_tids_[119].h5 | 2715.0 | 2760.0 | 45.0 | [119] | 2019-08-26 14:15:39 | 2019-08-26 14:16:24 | M1566822624_[119] |
25 | 25.0 | M1566822624_2019-08-26-13-30-24_tids_[119].h5 | 2870.0 | 2910.0 | 40.0 | [119] | 2019-08-26 14:18:14 | 2019-08-26 14:18:54 | M1566822624_[119] |
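The row-by-row loop in `compare_dfs` relies on `DataFrame.append`, which is not available in pandas 2.0 and later. As a possible alternative, the same split can be sketched with a merge on `mcode_tid`. This is not the notebook's method: the function name `split_by_overlap` is hypothetical, the demo values are made up, and the sketch does not compute the overlap time.

```python
import pandas as pd

def split_by_overlap(prediction_df, annotation_df):
    """Split predictions into rows that overlap any annotation and rows that do not.

    Both frames need 'mcode_tid', 'start' and 'end' columns.
    """
    preds = prediction_df.reset_index(drop=True)
    preds['pred_id'] = preds.index  # temporary id so matches can be traced back
    # Pair every prediction with every annotation sharing its mcode_tid.
    pairs = preds.merge(annotation_df[['mcode_tid', 'start', 'end']],
                        on='mcode_tid', suffixes=('', '_ann'))
    # Same closed-range test as check_overlap.
    hits = pairs[(pairs['start'] <= pairs['end_ann']) &
                 (pairs['end'] >= pairs['start_ann'])]
    matched = preds['pred_id'].isin(hits['pred_id'])
    return (preds[matched].drop(columns='pred_id'),
            preds[~matched].drop(columns='pred_id'))

demo_preds = pd.DataFrame({'mcode_tid': ['A', 'A', 'B'],
                           'start': [100, 300, 10],
                           'end': [160, 350, 20]})
demo_annots = pd.DataFrame({'mcode_tid': ['A', 'C'],
                            'start': [120, 10],
                            'end': [180, 20]})
found, not_found = split_by_overlap(demo_preds, demo_annots)
print(len(found), len(not_found))
```

Because every prediction lands in exactly one of the two outputs, the two row counts should always add up to the number of predictions passed in.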
Save the false positives to check through using the GUI¶
- they might not all be false positives!
In [11]:
savename = 'predictions_not_in_annotations.csv'
false_positives.to_csv(savename,header=False, index=False)
Here flip the order of dataframes:¶
Note:
To check for false negatives, or missed seizures, pass in the actual annotations
as the 'prediction_df' and the actual predictions as the 'annotation_df'.
Here we would hope for no 'false positives', so the second dataframe will contain the
missed seizures...
Pass in the checked predictions as the predictions. This is similar to the case where you are checking for missed seizures.
In [52]:
true_positives, false_positives = compare_dfs(checked_preds,raw_preds)
In [53]:
true_positives.shape
Out[53]:
(69, 10)
In [56]:
false_positives.shape
savename = 'annotations_not_in_predictions.csv'
false_positives.to_csv(savename,header=True, index=False)
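With the example data we would expect no 'false positives' here. Any rows that do appear are annotations that the predictions missed (provided the predictions cover the same time period as the annotations). In the run shown here, 8 of the 77 annotations have no overlapping prediction.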
In [57]:
false_positives.shape
Out[57]:
(8, 9)
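The row counts from the two comparisons can be turned into summary figures. The sketch below assumes the checked file is treated as ground truth and that both files cover the same recordings; the function name `precision_recall` is illustrative, and the numbers are the shapes printed in the cells above (67 of 112 predictions matched an annotation, 69 of 77 annotations matched a prediction).

```python
def precision_recall(n_preds_matched, n_preds_total, n_annots_matched, n_annots_total):
    """Fraction of predictions that hit an annotation, and of annotations that were hit."""
    precision = n_preds_matched / n_preds_total
    recall = n_annots_matched / n_annots_total
    return precision, recall

precision, recall = precision_recall(67, 112, 69, 77)
print(f'precision = {precision:.2f}, recall = {recall:.2f}')
# precision = 0.60, recall = 0.90
```

These are counts of overlapping events rather than of overlapping seconds, so a prediction that only grazes an annotation counts the same as one that covers it fully.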