Making sense of Endomondo's calorie estimation

The other day I got curious about how Endomondo estimates energy expenditure during exercise.

On their website, they mention some paywalled paper but give no specifics, so I figured it'd be interesting to reverse engineer it myself. I extracted my Endomondo data from their JSON export and plotted a regression.

I'm using a Wahoo TickrX chest strap monitor, so the HR data coming from it is pretty decent.

First, I'm importing the dataframe from the Python package I'm using to interact with my data. I write about it here.

All the data is provided by this package, but otherwise it's just a regular Pandas dataframe, so hopefully that won't confuse you.

In [1]:

from my.workouts.dataframes import endomondo
df = endomondo()

WARNING:workout-provider:Unhandled: Cycling
WARNING:workout-provider:Unhandled: Cycling
WARNING:workout-provider:Unhandled: Snowboarding

Some sample data:

In [2]:

display(df[df['dt'].apply(lambda dt: str(dt.date())) == '2019-04-21'])

	dt	sport	heartbeats	kcal	error
384	2019-04-21 10:11:28+00:00	Rope jumping	3873.500000	310.0	None
385	2019-04-21 10:47:58+00:00	Running	2860.666667	248.0	None

Sport type is entered manually when you start recording exercise activity in Endomondo.

Heartbeats were calculated as average HR multiplied by the duration of the exercise.
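As a minimal sketch of that calculation (the function name and the example numbers are made up for illustration and are not taken from the data provider):

```python
def total_heartbeats(avg_hr_bpm: float, duration_seconds: float) -> float:
    """Approximate total heartbeats as average HR (beats/min) times duration in minutes."""
    return avg_hr_bpm * duration_seconds / 60


# hypothetical workout: 25 minutes at an average of 150 bpm
beats = total_heartbeats(avg_hr_bpm=150, duration_seconds=25 * 60)
print(beats)
```

This is only an approximation of the true beat count, but it is the quantity used on the x axis of the plot further down.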

The error column is a neat way of propagating exceptions from the data provider. E.g. I only have HR data for the last couple of years or so, so the data provider has no HR points for older Endomondo entries. While I could filter out these entries in the data provider, they might still be useful for other plots and analysis pipelines (e.g. if I was only interested in kcals and didn't care about heartbeats).

Instead, I'm just being defensive and propagating exceptions up through the dataframe, leaving it up to the user to handle them.
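Here is a rough sketch of what such a defensive provider could look like. It is not the actual code of my package: the `workouts` function, the `HANDLED_SPORTS` set and the input field names (`avg_hr`, `duration_min`) are all hypothetical and only illustrate the pattern.

```python
from typing import Iterator

# hypothetical set of activities the provider knows how to handle
HANDLED_SPORTS = {'Running', 'Rope jumping', 'Table tennis'}


def workouts(raw_entries: list) -> Iterator[dict]:
    """Yield one row per entry; problems end up in 'error' instead of being raised."""
    for entry in raw_entries:
        row = {'dt': entry.get('dt'), 'sport': None, 'heartbeats': None, 'kcal': None, 'error': None}
        try:
            sport = entry['sport']
            if sport not in HANDLED_SPORTS:
                raise RuntimeError(f'Unhandled activity: {sport}')
            row['sport'] = sport
            row['kcal'] = entry['kcal']
            if 'avg_hr' not in entry:
                raise RuntimeError('no hr')
            row['heartbeats'] = entry['avg_hr'] * entry['duration_min']
        except Exception as e:
            # keep whatever was filled in so far and record the reason
            row['error'] = str(e)
        yield row


# made-up input entries
sample = [
    {'dt': '2019-04-21', 'sport': 'Running', 'kcal': 248.0, 'avg_hr': 150, 'duration_min': 25},
    {'dt': '2015-03-06', 'sport': 'Running', 'kcal': 397.0},
    {'dt': '2018-05-28', 'sport': 'Cycling', 'kcal': 500.0},
]
rows = list(workouts(sample))
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, and the consumer then decides whether to drop, inspect or keep the rows with a non-empty `error`.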

In [3]:

display(df[df['dt'].apply(lambda dt: str(dt.date())).isin(['2015-03-06', '2018-05-28'])])

	dt	sport	heartbeats	kcal	error
17	2015-03-06 05:50:38+00:00	Running	NaN	397.0	no hr
18	2015-03-06 13:20:06+00:00	Table tennis	NaN	127.0	no hr
297	2018-05-28 10:11:45+00:00	NaN	NaN	NaN	Unhandled activity: Cycling
298	2018-05-28 12:58:33+00:00	NaN	NaN	NaN	Unhandled activity: Cycling

So, first we filter out the entries with errors:

In [4]:

df = df[df['error'].isnull() & (df['sport'] != 'Other')]

As well as sports with fewer than 10 recorded workouts, which would otherwise end up as outliers:

In [5]:

df = df.groupby(['sport']).filter(lambda grp: len(grp) >= 10)

Hack to make seaborn plots deterministic:

In [6]:

import seaborn as sns
if sns.algorithms.bootstrap.__module__ == 'seaborn.algorithms':
    # prevents nondeterminism in plots https://github.com/mwaskom/seaborn/issues/1924
    # we only want to do it once
    def bootstrap_hacked(*args, bootstrap_orig = sns.algorithms.bootstrap, **kwargs):
        kwargs['seed'] = 0
        return bootstrap_orig(*args, **kwargs)
    
    sns.algorithms.bootstrap = bootstrap_hacked

In [7]:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import seaborn as sns
sns.set(font_scale=1.5)

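# number of workouts per sport, used for the legend labels
# (grouping by a plain column name keeps the keys as strings rather than 1-tuples in newer pandas)
sports = {
    g: len(f) for g, f in df.groupby('sport')
}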

g = sns.lmplot(
    data=df,
    x='heartbeats',
    y='kcal',
    hue='sport', 
    hue_order=sports.keys(),
    legend_out=False,
    height=15,
    palette='colorblind',
    truncate=False, # kind of sets same span for the reglines
)
ax = g.ax
ax.set_title('Dependency of energy spent during exercise on number of heartbeats')

ax.set_xlim((0, None))
ax.set_xlabel('Total heartbeats, measured by chest strap HR monitor')

ax.set_ylim((0, None))
ax.set_ylabel('Kcal,\nEndomondo\nestimate', rotation=0, y=1.0)

# https://stackoverflow.com/a/55108651/706389
plt.legend(
    title='Sport',
    labels=[f'{s} ({cnt} points)' for s, cnt in sports.items()],
    loc='upper left',
)
pass

Unsurprisingly, it looks like a simple linear model (considering my weight and age have barely changed).

What I find unexpected is that the slope/regression coefficient (i.e. calories burnt per heartbeat) is more or less the same across sports. For me, running feels way more intense than any other cardio I'm doing, definitely way harder than skiing! There are two possibilities here:

Endomondo can't capture dynamic muscle activity and isn't even trying to use exercise type provided by the user for a better estimate.
Energy is mostly burnt by the heart and other muscles don't actually matter or have a very minor impact.

Let's try and check the latter via a back-of-the-envelope calculation.

In order to run, you use your chemical energy to move your body up and forward. For simplicity, let's only consider the 'up' movements that go against gravity; it feels like these would dominate the energy expenditure. So let's model running as a sequence of vertical jumps. My estimate would be that when you run, your jumps are about 5 cm in height.

We can find out how much energy each jump takes using the formula $\Delta U = m g \Delta h$.

In [8]:

g = 9.82 # m/s^2, approximate Earth gravity (the standard value is 9.80665)
weight = 65 # kg
stride_height = 5 / 100 # convert cm to m

strides_per_minute = 160 # ish, varies for different people
duration = 60 # minutes
joules_in_kcal = 4184 

energy_per_stride = weight * g * stride_height

leg_energy_kcal = energy_per_stride *  strides_per_minute * duration / joules_in_kcal
print(leg_energy_kcal)

73.22753346080309

So, about 73 kcal for an hour of running is fairly low in comparison with the typical numbers Endomondo reports for my exercise.

This is a very rough calculation of course:

In reality movements during running are more complex, so it could be an underestimate
On the other hand, feet can also act as springs, so not all the energy spent on a stride is lost, and it could be an overestimate
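To get a feeling for how sensitive the estimate is to these assumptions, here is a small sketch that repeats the calculation for a few stride heights. The `elastic_return` parameter (the fraction of energy recovered by the spring effect) is something I made up for illustration; I don't have a measured value for it.

```python
G = 9.82  # m/s^2, same value as in the calculation above
JOULES_IN_KCAL = 4184


def lift_energy_kcal(weight_kg, stride_height_m, strides_per_minute, minutes, elastic_return=0.0):
    """Energy spent lifting the body against gravity, in kcal.

    elastic_return is a hypothetical fraction (0..1) of the energy
    that is recovered on each stride rather than lost.
    """
    energy_per_stride = weight_kg * G * stride_height_m * (1 - elastic_return)
    return energy_per_stride * strides_per_minute * minutes / JOULES_IN_KCAL


for height_cm in (3, 5, 8, 10):
    kcal = lift_energy_kcal(65, height_cm / 100, 160, 60)
    print(f'{height_cm} cm: {kcal:.0f} kcal')
```

The estimate scales linearly with stride height, so even a 10 cm jump on every stride only doubles the 5 cm figure.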

As for the actual value of the regression coefficient: seaborn doesn't let you display it on the plot, so we use sklearn to compute it for us:

In [9]:

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['heartbeats']], df['kcal'])

[coef] = reg.coef_
free = reg.intercept_

print(f"Regression coefficient: {coef:.3f}")
print(f"Free term: {free:.3f}")

Regression coefficient: 0.095
Free term: -12.640

Basically, that means I burn about 0.1 kcal per heartbeat during exercise. Ideally, the free term should be equal to 0 (as a sanity check: having no heartbeats shouldn't result in any calorie loss), and about -13 is close enough.

Also, a fun calculation: what if we apply the model we got to a normal, resting heart rate?
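For a single feature, the ordinary least squares fit has a simple closed form, so the coefficient and the free term can also be computed by hand. Below is a sketch on synthetic points (made-up numbers lying exactly on a line, not my actual data); on the same input it should agree with what `LinearRegression` returns.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# synthetic points on the line kcal = 0.1 * heartbeats - 10
heartbeats = [2000, 3000, 4000, 5000]
kcal = [0.1 * hb - 10 for hb in heartbeats]

slope, intercept = fit_line(heartbeats, kcal)
print(f'{slope:.3f} {intercept:.3f}')
```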

In [10]:

normal_bpm = 60
minutes_in_day = 24 * 60

print(f'{coef * normal_bpm * minutes_in_day:.3f}')

8165.422

8K Kcals per day? A bit too much for an average person. I wouldn't draw any conclusions from that one though :)
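For completeness, a small sketch of the same extrapolation with the free term included. It reuses the rounded values printed earlier (0.095 and -12.640), so the result differs slightly from the number above, which was computed with the unrounded coefficient.

```python
coef = 0.095    # rounded regression coefficient from the output above
free = -12.640  # rounded free term from the output above

resting_bpm = 60
minutes_in_day = 24 * 60
beats_per_day = resting_bpm * minutes_in_day

# full linear model: kcal = coef * heartbeats + free
with_intercept = coef * beats_per_day + free
print(beats_per_day)
print(f'{with_intercept:.0f}')
```

The free term barely changes the result. Note also that a full day at 60 bpm is 86400 heartbeats, far more than the few thousand per workout in the sample rows shown earlier, so this is an extrapolation well outside the range the model was fitted on.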

You can find the source of this notebook here.

Discussion:

/r/dataisbeautiful