
Powerful Accessories to Excel – Introduction

Dear Readers,

I want to start a quick tutorial series aimed at Excel users, highlighting cool technologies that are relatively easy to learn and provide a lot of benefit.  Basically, they give you a lot of bang for your “effort” buck.  The technologies I want to cover, in order of ease of learning, are:

Bash/Batch – Bash and Batch are shells for Linux/Mac and Windows respectively.  Shells interact with your operating system to let you do things like read files, manipulate data, create remote login sessions and schedule tasks on your computer.  One of the great things about Bash and Batch is that once you have studied them long enough, you can start using “cloud” computers, which do the work while you are sleeping.

SQL/RDBMS – Relational databases allow you to store data compactly on your computer and provide a convenient interface for manipulating it.  One great reason to pick up this tool is that it lets you specify relationships between data (Student – Class – Teacher – School being a good example) and then make statements like: I want only the classes taught by John Naughton that are considered math classes, and only the students whose first name starts with A.  Another great thing about databases is scale: you can work with hundreds of millions of rows.  Since SQL is a standard for data access, there are also SQL-like interfaces to big data tools like Hadoop and Spark, software that lets you take dozens of computers and have them process things in parallel.

Python – Python is a general-purpose programming language with countless libraries for different purposes.  Do you want to develop a website or automate networking?  Python can do this.  Do you want to scrape a website, traverse a social network or analyze Wikipedia?  Python is a great tool for it.  Do you want to be on the cutting edge of machine learning and AI, with a direct interface to the more advanced libraries?  Python…  Python can also be used for data manipulation and processing, which is what I will focus on.
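To make the SQL and Python items a bit more concrete, here is a minimal sketch of the Student – Class – Teacher query described above, run from Python with the built-in sqlite3 module.  The table names, column names and sample rows (apart from John Naughton) are made up for illustration, and the function name math_classes_for is hypothetical; a real school database would look different.

```python
import sqlite3


def math_classes_for(teacher_name, first_name_prefix="A"):
    """Return (student, class) pairs for a teacher's math classes.

    The schema and rows below are illustrative assumptions, not a real dataset.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.executescript("""
        CREATE TABLE teacher (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE class (
            id INTEGER PRIMARY KEY, title TEXT, subject TEXT,
            teacher_id INTEGER REFERENCES teacher(id));
        CREATE TABLE student (id INTEGER PRIMARY KEY, first_name TEXT);
        CREATE TABLE enrollment (
            student_id INTEGER REFERENCES student(id),
            class_id INTEGER REFERENCES class(id));

        INSERT INTO teacher VALUES (1, 'John Naughton'), (2, 'Jane Doe');
        INSERT INTO class VALUES
            (1, 'Algebra', 'math', 1),
            (2, 'History 101', 'history', 1),
            (3, 'Geometry', 'math', 2);
        INSERT INTO student VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Aaron');
        INSERT INTO enrollment VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 3);
    """)
    # The joins follow the relationships: enrollment -> student, class -> teacher.
    rows = conn.execute(
        """
        SELECT s.first_name, c.title
        FROM enrollment AS e
        JOIN student AS s ON s.id = e.student_id
        JOIN class   AS c ON c.id = e.class_id
        JOIN teacher AS t ON t.id = c.teacher_id
        WHERE t.name = ?
          AND c.subject = 'math'
          AND s.first_name LIKE ?
        ORDER BY s.first_name, c.title
        """,
        (teacher_name, first_name_prefix + "%"),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(math_classes_for("John Naughton"))
```

The same kind of SELECT statement would work against a larger database engine; only the connection code would change.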

I focus on these three because they are the ones I know best.  Other noteworthy technologies include R, which I hope to cover in a future article.  During these tutorials, I might also mention other cool concepts or technologies.

If you are interested in this tutorial series, I will file the posts under the tutorial category.




Data as a Fitness Tool

Recently, I’ve been focusing on improving my life by becoming more fit and eating healthily.  I read somewhere that people who record what they eat and dedicate at least 4-5 hours of time in the gym tend to lose more weight and keep it off for longer periods of time.  I think it’s more about changing your attitude and realizing that being healthy is a lifestyle.

Since I am a data nerd of sorts, I decided it would be fun to approach this as a data management problem.  The first thing I did was get an app called MyFitnessPal, which you can download on both Android and iPhone (I have an iPhone).  It provides a relatively simple interface for logging food as well as a database/search utility for looking up food.  The app can also track exercise and water intake.  I have kept track of everything I’ve consumed since mid-November and have kept it up for 50-odd days now.

To make this more interesting, I also built a Google spreadsheet containing projections (a technique I learned as a fraud analyst) and used it to project my weight in the future.  The good news is that the calorie deficit I’ve logged, around 40,000 kcal, translates to a projected loss (about 11 lb) that is significantly lower than the 17 lb I’ve actually lost so far.  The bad news is that the projections I built are definitely off, typically by about 1-2 weeks.

The way I built the projections is to use a BMR (basal metabolic rate) calculation to get a base burn rate (before exercise of any sort).  I then add the food and exercise calories to this.  I take a rolling average over a few days, updated every day, to get a general sense of my net loss rate, and then apply that rate to the future, starting from the last weight measurement.  It’s been off by one week and one pound, which isn’t too bad.
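For the curious, the projection logic described above can be sketched in a few lines of Python.  This is not the spreadsheet itself: the Mifflin-St Jeor equation is just one common way to estimate BMR, the 3,500 kcal per pound figure is a widely used rule of thumb, and the function names and the 7-day window are assumptions made for this example.

```python
KCAL_PER_LB = 3500  # rule-of-thumb assumption: ~3,500 kcal per pound of body weight


def bmr_mifflin_st_jeor(weight_kg, height_cm, age_years, male=True):
    """Estimate basal metabolic rate (kcal/day) with the Mifflin-St Jeor equation."""
    base = 10 * weight_kg + 6.25 * height_cm - 5 * age_years
    return base + 5 if male else base - 161


def daily_net(food_kcal, exercise_kcal, bmr_kcal):
    """Net calories for one day; a negative number means a deficit."""
    return food_kcal - exercise_kcal - bmr_kcal


def project_weight(last_weight_lb, daily_net_kcal, days_ahead, window=7):
    """Project weight by applying the average net of the last `window` days."""
    recent = daily_net_kcal[-window:]
    avg_net = sum(recent) / len(recent)
    return last_weight_lb + avg_net * days_ahead / KCAL_PER_LB


if __name__ == "__main__":
    bmr = bmr_mifflin_st_jeor(weight_kg=80, height_cm=180, age_years=30)
    # Made-up log: 1,500 kcal eaten and 300 kcal of exercise, every day for a week.
    log = [daily_net(1500, 300, bmr) for _ in range(7)]
    print(project_weight(200, log, days_ahead=30))
```

As a sanity check on the rule of thumb, a 40,000 kcal deficit divided by 3,500 kcal per pound comes to roughly 11 lb.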

Of course, as with any good business or goal, good-quality data matters, so I have extra motivation to, as my nutritionist puts it, treat the matter like an accountant.  That motivation ends up translating into more accountability.

I am liberal with food measurements and conservative with exercise calories.  It’s better to be safe than sorry when considering margins of error.  One thing I try not to compromise on is getting to the gym or some physically intense event (like salsa dancing) 3 times a week for about 1-1.5 hours.  I force myself to do the activity even if I end up just walking for the duration.

Right now, I’m hoping to keep up this habit and see the results in a few months.  I’ve already changed how I view exercise and am looking at potential programs or new ways to exercise (in a more social manner) to reduce things like fatigue (too much cardio in between two days).  It seems that by making exercise a habit, I’m forced to deal with new problems, which require new solutions and experiences.  I’m definitely seeing benefits from doing this.



Machine Learning in Action: Part 1



I’m interested in learning more about computer programming.  Recently, I’ve picked up a book on algorithms and data structures and looked into Greenplum and Postgres.  I wanted a slight change of focus this weekend, so I picked up Machine Learning in Action.

The book has been great so far.  It’s written using Python and implements many common machine learning algorithms from scratch.  So far, I’ve gone through 2 chapters, one on KNN (k-nearest neighbors) and the other on ID3 decision trees.  The latter was a bit more challenging than the former, requiring quite a bit of recursion due to the tree structure involved in that method.  What I like about this book is that it does a lot of the implementations from scratch, which makes them easier to understand.  I still want to dig deeper into Shannon entropy and read up on it to get a better understanding.
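As a note to myself on the Shannon entropy mentioned above, here is a minimal sketch of how entropy and information gain can be computed for a list of class labels.  This is my own illustration rather than the book’s code; the function names and the tiny dataset are made up for the example.  ID3 chooses, at each node, the feature whose split gives the highest information gain.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Entropy in bits: H = -sum(p * log2(p)) over the class proportions p."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, labels, feature_index):
    """Drop in entropy after splitting the rows on one feature's values."""
    # Group the labels by the value each row has for the chosen feature.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(
        len(subset) / len(labels) * shannon_entropy(subset)
        for subset in subsets.values()
    )
    return shannon_entropy(labels) - remainder


if __name__ == "__main__":
    rows = [[1, 0], [1, 1], [0, 0], [0, 1]]  # two made-up binary features
    labels = ["yes", "yes", "no", "no"]
    print(shannon_entropy(labels))
    print(information_gain(rows, labels, 0))
    print(information_gain(rows, labels, 1))
```

In this toy dataset the first feature separates the labels perfectly and the second tells you nothing, so a tree builder would split on the first one.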

For those interested in the code behind the book, it can be found here:

Hopefully, I will get to try out the next few chapters.  Chapter 4 covers Bayesian methodology.