What data I collect on myself, and why?
This is a list of the personal data sources I use or plan to use, along with rough guides on how to get your hands on that data if you want it as well.
It's still incomplete and I'm going to update it regularly.
My goal is automating data collection to the maximum extent possible and making it work in the background, so one can set up pipelines once and hopefully never think about it again.
This is kind of a follow-up on my previous post on the sad state of personal data, and part of my personal way of getting around this sad state.
If you're terrified by the long list, you can jump straight to the "Data consumers" section to find out how I use it.
Table of Contents
- 1. Why do you collect X? How do you use your data?
- 2. What do I collect/want to collect?
- Amazon
- Arbtt (desktop time tracker)
- Bitbucket (repositories)
- Bluemaestro (environment sensor)
- Blood
- Browser history (Firefox/Chrome)
- Emfit QS (sleep tracker)
- Endomondo
- Facebook
- Facebook Messenger
- Feedbin
- Feedly
- Fitbit
- Foursquare/Swarm
- Github (repositories)
- Github (events)
- Gmail
- Goodreads
- Google takeout
- Hackernews (TODO)
- HSBC bank
- Hypothesis
- Instapaper
- Jawbone
- Kindle
- Kobo reader
- Last.fm
- Monzo bank
- Nomie
- Nutrition
- Photos
- PDF annotations
- Pinboard
- Plaintext notes
- Pocket
- Reddit
- Remember the Milk
- Rescuetime
- Shell history
- Sleep
- Sms/calls
- Spotify
- Stackexchange
- Taplog
- Telegram
- Twitter
- VK.com
- Weight
- Whatsapp (TODO)
- 23andme
- 3. Data consumers
¶1 Why do you collect X? How do you use your data?
All things considered, I think it's a fair question! Why bother with all this infrastructure and hoard the data if you never use it?
In the next section, I will elaborate on each specific data source, but to start with I'll list the rationales that all of them share:
¶backup
It may feel unnecessary, but shit happens. What if your device dies, your account gets suspended for some reason, or the company goes bust?
¶lifelogging
Most data in digital form comes with timestamps, so it automatically, without any manual effort, constitutes data for your timeline.
I want to remember more, be able to review my past and bring back and reflect on memories. Practicing lifelogging helps with that.
It feels very wrong that things can be forgotten and lost forever. It's understandable from the neuroscience point of view, i.e. the brain has limited capacity and it would be too distracting to remember everything all the time. That said, I want to have a choice whether to forget or remember events, and I'd like to be able to potentially access forgotten ones.
¶quantified self
Most collected digital data is somewhat quantitative and can be used to analyze your body or mind.
¶2 What do I collect/want to collect?
As I mentioned, most of the collected data serves as a means of backup/lifelogging/quantified self, so I won't repeat these reasons in the 'Why' sections.
All my data collection pipelines are automatic unless mentioned otherwise.
Some scripts are still private, so if you want to know more, let me know and I can prioritize sharing them.
¶Amazon
How: jbms/finance-dl
Why:
¶Arbtt (desktop time tracker)
¶Bitbucket (repositories)
How: samkuehn/bitbucket-backup
Why:
- proved especially useful considering Atlassian is going to wipe Mercurial repositories
I've got lots of private Mercurial repositories with university homework and other early projects, and it's sad to think of the people who will lose theirs during this wipe.
¶Bluemaestro (environment sensor)
How: the sensor syncs with the phone app via Bluetooth; /data/data/com.bluemaestro.tempo_utility/databases/ is regularly copied to grab the data.
Why:
- temperature during sleep data for the dashboard
- lifelogging: capturing weather conditions
E.g. I can potentially see temperature/humidity readings alongside my photos from hiking or skiing.
¶Blood
How: via Thriva; data imported manually into an org-mode table (I don't do it too frequently, so it wasn't worth automating the scraping).
I also tracked glucose and ketones (with Freestyle Libre) for a few days out of curiosity, and didn't bother automating that either.
Why:
- contributes to the dashboard, could be a good way of establishing your baselines
¶Browser history (Firefox/Chrome)
How: custom scripts, copying the underlying sqlite databases directly, running on my computers and phone.
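As an illustration, here's a minimal sketch of what such a script could look like for Firefox; the profile path and backup location are placeholders, and the scripts I actually run are a bit more involved:

import sqlite3
from contextlib import closing
from datetime import datetime
from pathlib import Path

# hypothetical paths -- adjust to your profile and wherever you keep backups
SOURCE = Path("~/.mozilla/firefox/abcd1234.default/places.sqlite").expanduser()
BACKUPS = Path("~/backups/browser-history").expanduser()

def snapshot() -> Path:
    BACKUPS.mkdir(parents=True, exist_ok=True)
    target = BACKUPS / f"places-{datetime.now():%Y%m%d-%H%M%S}.sqlite"
    # sqlite's online backup API gives a consistent copy even if
    # the browser happens to be writing at the same time
    with closing(sqlite3.connect(SOURCE)) as src, closing(sqlite3.connect(target)) as dst:
        src.backup(dst)
    return target

if __name__ == "__main__":
    print(f"saved {snapshot()}")

The same idea works for Chrome's History database and for databases pulled off the phone; only the source path changes.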
Why:
¶Emfit QS (sleep tracker)
Emfit QS is kind of a medical-grade sleep tracker. It's more expensive than wristband ones (e.g. Fitbit, Jawbone) but also more reliable and gives you more data.
How: emfitexport.
Why:
- sleep data for the dashboard
¶Endomondo
How: Endomondo collects GPS and HR data (via a Wahoo Tickr X strap); then, karlicoss/endoexport.
Why:
- exercise data for the dashboard
¶Facebook
How: manual archive export.
I barely use Facebook, so I don't even bother doing it regularly.
¶Facebook Messenger
¶Feedbin
¶Feedly
¶Fitbit
How: manual CSV export, as I only used it for a few weeks before the sync stopped working and I had to return it. However, automating it via the API seems possible.
Why:
- activity data for the #dashboard
¶Foursquare/Swarm
How: via API
¶Github (repositories)
How: github-backup
Why:
- capable of exporting starred repositories as well, so if the authors delete them I will still have them
¶Github (events)
How: manually requested archive (once); after that, automatic exports via karlicoss/ghexport.
Why:
- better browsing history
- better search in comments/open issues, etc.
¶Gmail
How: imap-backup, Google Takeout
Why:
- this is arguably the most important thing you should export considering how heavily everything relies on email
- better search
- better browsing history
¶Goodreads
¶Google takeout
How: semi-automatic.
- only manual step: enable scheduled exports (you can schedule 6 per year at a time), and choose to keep it on Google Drive in export settings
- mount your Google Drive (e.g. via google-drive-ocamlfuse)
- keep a script that checks the mounted Google Drive for fresh takeouts and copies them somewhere safe (a sketch of such a script is below)
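As an illustration, here's a minimal sketch of that last step; the mount point, folder name and archive naming are assumptions, and in practice it would run from cron:

import shutil
from pathlib import Path

# hypothetical locations -- adjust to where your Drive is mounted
# and where you want to keep the archives
MOUNTED_TAKEOUTS = Path("~/gdrive/Takeout").expanduser()
ARCHIVE = Path("~/backups/takeout").expanduser()

def sync_takeouts() -> None:
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for takeout in sorted(MOUNTED_TAKEOUTS.glob("takeout-*.zip")):
        target = ARCHIVE / takeout.name
        if target.exists():
            continue  # already grabbed this one
        print(f"copying {takeout} -> {target}")
        shutil.copy2(takeout, target)  # or shutil.move if you prefer to keep Drive clean

if __name__ == "__main__":
    sync_takeouts()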
Why:
- Google collects lots of data, which you could put to some good use. However, old data is getting wiped, so it's important to export Takeout regularly.
- better browsing history
- (potentially) search history for promnesia
- search in youtube watch history
- location data for lifelogging and the dashboard (activity)
¶Hackernews (TODO)
How: I haven't got around to it yet. It's going to require:
- extracting upvotes/saved items via web scraping since Hackernews doesn't offer an API for that. Hopefully, there is an existing library for that.
I'm also using the Materialistic app, which has its own 'saved' posts and doesn't synchronize them with Hackernews.
Exporting those is going to require copying the database directly from the app's private storage.
Why: same reasons as Reddit.
¶HSBC bank
How: manual exports of monthly PDFs with transactions. They don't really offer an API, so unless you want to scrape the web interface and deal with 2FA, this seems to be the best you can do.
Why:
- personal finance; used it with karlicoss/hsbc-parser to feed into hledger
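For a flavour of the parsing step: once you run a statement through pdftotext it's just text, so something like the sketch below gets you tuples you can turn into hledger entries. The regex is made up for illustration; the real statement layout is messier, which is what hsbc-parser deals with.

import re
import subprocess
from pathlib import Path

# made-up pattern: date, description, amount -- real statements need more care
LINE_RE = re.compile(r"(\d{2} \w{3} \d{2})\s+(.+?)\s+(\d+\.\d{2})$")

def parse_statement(pdf: Path):
    text = subprocess.run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in text.splitlines():
        m = LINE_RE.search(line.strip())
        if m:
            yield m.groups()  # (date, description, amount)

if __name__ == "__main__":
    for date, description, amount in parse_statement(Path("statement-2019-01.pdf")):
        print(date, description, amount)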
¶Instapaper
Why:
- better search
- better browsing history, in particular implementing overlay with highlights
- quick todos via orger
¶Jawbone
How: via API. Jawbone is dead now, so if you haven't exported your data already, it's likely lost forever.
Why:
- sleep data for the dashboard
¶Kindle
How: manually exported MyClippings.txt from the Kindle. This could potentially be automated similarly to Kobo.
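As an illustration, here's a rough sketch of parsing MyClippings.txt. The format is informal (entries separated by '==========' lines, with the title on the first line, a metadata line, and the highlight text at the end), so treat the parsing assumptions accordingly:

from pathlib import Path
from typing import Iterator, NamedTuple

class Clipping(NamedTuple):
    title: str
    meta: str  # the 'Your Highlight on page ... | Added on ...' line
    text: str

def parse_clippings(path: Path) -> Iterator[Clipping]:
    raw = path.read_text(encoding="utf-8-sig")  # the file tends to start with a BOM
    for entry in raw.split("=========="):
        lines = [l.strip() for l in entry.strip().splitlines() if l.strip()]
        if len(lines) < 3:
            continue  # empty or malformed entry
        yield Clipping(title=lines[0], meta=lines[1], text=" ".join(lines[2:]))

if __name__ == "__main__":
    for c in parse_clippings(Path("MyClippings.txt")):
        print(c.title, "->", c.text[:60])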
Why:
¶Kobo reader
How: almost automatic via karlicoss/kobuddy. Manual step: having to connect your reader via USB now and then.
Why:
- better search
- spaced repetition for unfamiliar words/new concepts via orger
¶Last.fm
¶Monzo bank
¶Nomie
How: regular copies of /data/data/io.nomie.pro/files/_pouch_events and /data/data/io.nomie.pro/files/_pouch_trackers
Why:
- could be a great tool for detailed lifelogging if you're into it
¶Nutrition
For about a year, I tracked nutrition data for almost everything I ingested.
How: I found most existing apps/projects clumsy and unsatisfactory, so I developed my own system. It's not even a proper app but something simpler: basically a domain-specific language in Python.
The tracking process was simply editing a Python file and adding entries like:
# file: food_2017.py
july_09 = F(
    [ # lunch
        spinach * bag,
        tuna_spring_water * can,    # can size for this tuna is 120g
        beans_broad_wt * can * 0.5, # half can. can size for broad beans is 200g
        onion_red_tsc * gr(115),    # grams, explicit
        cheese_salad_tsc * 100,     # grams, implicit as it makes sense for cheese
        lime,                       # 1 fruit, implicit
    ],
    [ # dinner
        ...
    ],
    tea_black * 10,      # cups, implicit
    wine_red * ml * 150, # ml, explicit
)
july_10 = ...  # more logs
The comments are added for clarity, of course; normally it'd be more compact.
Then some code was used for processing, calculating, visualizing, etc.
Having a real programming language instead of an app let me make it very flexible and expressive, e.g.:
- I could define composite dishes as Python objects and then easily reuse them (see the sketch after this list).
E.g. if I made four servings of soup on 10.08.2018, ate one immediately and froze the other three, I would define something like soup_20180810 = [...] and then simply reuse soup_20180810 when eating it again (the date is easy to find out, as I label food when putting it in the freezer anyway).
- I could make many things implicit, making it pretty expressive without spending time on unnecessary typing
- I rarely had to type in nutrient composition manually: I just pasted the product link from the supermarket website and had a script automatically parse the nutrient information
- For micronutrients (that usually aren't listed on labels) I used the USDA sqlite database
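To give an idea of how little machinery this needs, here's a minimal, hypothetical sketch of such a DSL. The class names, food names and numbers are made up for illustration, and the real thing tracks full nutrient profiles rather than just calories:

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Food:
    name: str
    kcal_per_unit: float  # per whatever unit makes sense for this food

    def __mul__(self, quantity: float) -> "Entry":
        return Entry(self, quantity)

@dataclass
class Entry:
    food: Food
    quantity: float

    @property
    def kcal(self) -> float:
        return self.food.kcal_per_unit * self.quantity

Meal = List[Entry]

def F(*groups: Union[Entry, Meal]) -> Meal:
    # flatten meals (lists) and standalone entries into a single day log
    day: Meal = []
    for g in groups:
        day.extend(g if isinstance(g, list) else [g])
    return day

# made-up foods and numbers, purely for illustration
spinach           = Food("spinach",              kcal_per_unit=55.0)   # per bag
tuna_spring_water = Food("tuna in spring water", kcal_per_unit=120.0)  # per can
tea_black         = Food("black tea",            kcal_per_unit=2.0)    # per cup

# a composite dish is just a reusable list of entries
soup_20180810 = [spinach * 0.5, tuna_spring_water * 1]

july_09 = F(
    [spinach * 1, tuna_spring_water * 1],  # lunch
    soup_20180810,                         # dinner: leftovers from the freezer
    tea_black * 10,
)
print(f"total: {sum(e.kcal for e in july_09):.0f} kcal")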
The hard part was actually not the data entry, but rather not having nutrition information when eating out. That year I was mostly cooking my own food, so tracking was fairly easy.
Also, I was more interested in lower bounds (e.g. "do I consume at least the recommended amount of micronutrients?"), so occasionally missing a meal in the log was fine for me.
Why:
- I mostly wanted to learn about food composition and how it relates to my diet, and I did.
- That logging motivated me to learn about different foods and try them out while keeping my dishes balanced. I cooked so many different things, made my diet way more varied and became less picky.
- I stopped because cooking did take some time, and I realized that as long as I actually vary my food and try to eat everything now and then, I hit all the recommended amounts of micronutrients. It's kind of an obvious thing that everyone recommends, but hearing it as common wisdom is one thing; coming to the same conclusion from your own data is something completely different.
- nutritional information contributes to dashboard
¶Photos
How: no extra effort required if you sync/organize your photos and videos now and then.
Why:
- an obvious source of lifelogging data; in addition, photos come with GPS data
¶PDF annotations
As in, native PDF annotations.
How: nothing needs to be done, PDFs are local to your computer. You do need some tools to crawl your filesystem and extract the annotations.
Why:
- the experience of using your PDF annotations (e.g. searching) is extremely poor; I'm improving this by using orger
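As an illustration, here's a minimal sketch of such a crawler, assuming PyMuPDF (the fitz module). It walks a directory and prints every annotation comment it finds; extracting the highlighted text itself takes a bit more work, since you have to pull out the words under the annotation rectangle:

from pathlib import Path

import fitz  # PyMuPDF

def iter_annotations(root: Path):
    for pdf in sorted(root.rglob("*.pdf")):
        doc = fitz.open(str(pdf))
        for page in doc:
            for annot in page.annots():
                comment = (annot.info.get("content") or "").strip()
                if comment:
                    yield pdf, page.number + 1, comment
        doc.close()

if __name__ == "__main__":
    for path, page_no, comment in iter_annotations(Path("~/books").expanduser()):
        print(f"{path} (page {page_no}): {comment}")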
¶Pinboard
¶Plaintext notes
Mostly this refers to org-mode files, which I use for notekeeping and logging.
How: nothing needs to be done, they are local.
Why:
- search comes for free, it's already local
- better browsing history
¶Pocket
How: karlicoss/pockexport
Why:
- better search
- better browsing history, in particular implementing overlay with highlights
¶Reddit
How: karlicoss/rexport
Why:
- better search
- better browsing history
- org-mode interface for processing saved Reddit posts/comments, via orger
¶Remember the Milk
How: ical export from the API.
Why:
- I stopped using RTM in favor of org-mode, but I can still easily find my old tasks and notes, which allowed for a smooth transition.
¶Rescuetime
¶Shell history
How: many shells support keeping timestamps alongside the commands in your history.
See e.g. "Remember all your bash history forever".
Why:
- potentially can be useful for detailed lifelogging
¶Sleep
Apart from automatic collection of HR data, etc., I collect some extra stats like:
- whether I woke up on my own or after alarm
- whether I still feel sleepy shortly after waking up
- whether I had dreams (and I log dreams if I did)
- I log every time I feel sleepy throughout the day
How: org-mode, via org-capture into a table. Alternatively, you could use a spreadsheet for that.
Why:
- I think it's important to find connections between subjective feelings and objective stats like amount of exercise, sleep HR, etc., so I'm trying to find correlations using my dashboard
- dreams are quite a fun part of lifelogging
¶Sms/calls
How: SMS Backup & Restore app, automatic exports.
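The app writes XML archives; as an illustration, here's a rough sketch of reading them. The attribute names (date in epoch milliseconds, address, body) are what I believe the exports use, so treat them as assumptions:

import xml.etree.ElementTree as ET
from datetime import datetime
from pathlib import Path

def parse_sms_backup(path: Path):
    root = ET.parse(path).getroot()
    for sms in root.iter("sms"):
        # 'date' is epoch milliseconds in the exports I've seen
        dt = datetime.fromtimestamp(int(sms.get("date")) / 1000)
        yield dt, sms.get("address"), sms.get("body")

if __name__ == "__main__":
    for dt, address, body in parse_sms_backup(Path("sms-20200101.xml")):
        print(dt, address, (body or "")[:40])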
¶Spotify
How: export script, using plamere/spotipy
Why:
- potentially can be useful for better search in music listening history
- can be used for custom recommendation algorithms
¶Stackexchange
¶Taplog
(not using it anymore, in favor of org-mode)
How: regular copying of /data/data/com.waterbear.taglog/databases/Buttons Database
Why:
- a quick way of single tap logging (e.g. weight/sleep/exercise etc), contributes to the dashboard
¶Telegram
¶Twitter
How: Twitter archive (manually, once); after that, regular automatic exports via the API.
Why:
¶VK.com
How: Totktonada/vk_messages_backup.
Sadly, VK broke their API, so the script stopped working. I barely use VK now anyway, so I'm not motivated enough to work around it.
Why:
¶Weight
¶Whatsapp (TODO)
I barely use it, so I haven't bothered yet.
How: Whatsapp doesn't offer an API, so this is potentially going to require grabbing the sqlite database from the Android app (/data/data/com.whatsapp/databases/msgstore.db).
Why:
¶23andme
How: manual raw data export from the 23andme website. Hopefully your genome doesn't change often enough to warrant automatic exports!
Why:
- I was planning to set up some sort of automatic search for new genome insights against open source analysis tools.
I haven't really had time to think about it yet, and it feels like a hard project outside my realm of competence.
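The raw export is, as far as I can tell, a simple tab-separated text file ('#' comment lines, then rsid/chromosome/position/genotype columns), so even a toy cross-check against a hand-maintained list of SNPs is only a few lines. The rsids below are placeholders, not suggestions:

from pathlib import Path
from typing import Dict

def load_genotypes(path: Path) -> Dict[str, str]:
    genotypes = {}
    for line in path.read_text().splitlines():
        if not line or line.startswith("#"):
            continue  # skip header comments
        rsid, chromosome, position, genotype = line.split("\t")
        genotypes[rsid] = genotype
    return genotypes

# placeholder rsids -- fill these in from whatever analysis source you trust
SNPS_OF_INTEREST = ["rs0000001", "rs0000002"]

if __name__ == "__main__":
    genome = load_genotypes(Path("23andme-raw-data.txt"))
    for rsid in SNPS_OF_INTEREST:
        print(rsid, genome.get(rsid, "not present on this chip"))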
¶3 Data consumers
¶Instant search
Typical search interfaces make me unhappy as they are siloed, slow, awkward to use and don't work offline. So I built my own ways around it! I write about it in detail here.
In essence, I'm mirroring most of my online data like chat logs, comments, etc., as plaintext. I can overview it in any text editor, and incrementally search over all of it with a single keypress.
¶orger
orger is a tool that helps you generate an org-mode representation of your data.
It lets you benefit from the existing tooling and infrastructure around org-mode, the most famous being Emacs.
I'm using it for:
- searching, overviewing and navigating the data
- creating tasks straight from the apps (e.g. Reddit/Telegram)
- spaced repetition via org-drill
Orger comes with a number of existing modules, but it should be easy to adapt it to your own data source if you need something else.
¶promnesia
promnesia is a browser extension I'm working on to escape silos by unifying annotations and browsing history from different data sources.
I've been using it for more than a year now, and I'm working on the final touches to properly release it for other people.
¶dashboard
As a big fan of #quantified-self, I'm working on a personal health, sleep and exercise dashboard built from various data sources.
I'm working on making it public; you can see some screenshots here.
¶timeline
Timeline is a #lifelogging project I'm working on.
I want to see all of my digital history, search in it, filter it, easily jump to a specific point in time and see the context around it. That way it works as a sort of external memory.
Ideally, it would look similar to Andrew Louis's Memex, or might even reuse his interface if he open sources it. I highly recommend watching his talk for inspiration.