Regression Analysis and Bayesian Statistics

Just finished Regression Analysis at Georgia Tech and off to do some Bayesian Statistics.  I am now on the last 3 courses before I finish my degree.  Hopefully, I’ll be done by the end of 2022, fingers crossed.  So far, I’ve taken the following courses:

Courses Taken:

  1. Introduction to Analytic Modeling – high level theory of statistical modeling plus some R
  2. Introduction to Business for Analytics – survey of 4-5 business units and how they function
  3. Computing for Data Analytics – python plus numpy and pandas with some from scratch machine learning implementation
  4. Data Analytics in Business – high-level overview of regression, marketing, supply chain and investing using some lightweight stats.
  5. Data and Visual Analytics – One high-level project that results in a Machine Learning or BI end-to-end solution, plus a series of mini tutorials covering: a Python graph built from a movie API, Spark, SQL, a from-scratch Random Forest implementation and basic machine learning libraries in Python.
  6. Simulation – Lots of theory on simulation, statistics, probability and calculus, plus 2 small “mini-projects” where you implement some simulation-oriented concept.
  7. Regression – Everything is in the R language.  It covers linear regression and generalized linear models, and dips into more advanced concepts.  Lectures derive everything from scratch, which makes this easily the best regression course I’ve taken, and I’ve taken a few that cover that topic.

Courses Left:

My last 3 courses will be:

  1. Bayesian Statistics – This is supposed to cover similar topics to regression, but using the BUGS program and the Bayesian statistics paradigm.  Sounds fun, though by no means am I an expert on that topic.
  2. Computational Data Analytics – This is supposed to be like Computing for Data Analytics, but significantly harder.  There are no outlines or clues.  They just give you an algorithm and you have to implement it from scratch in Python.  Sounds like my type of course.
  3. High Dimensional Data Analytics – This is supposed to be a significantly harder course than Computational Data Analytics, which is already one of the harder courses.  It looks into situations where you have little data compared to the number of features present.  Supposedly, this course involves reading a bunch of research papers.  Sounds fun.

Plus I need to complete a practicum, which seems to be some kind of internship.

Practicum – Internship of Sorts

I’m not 100% sure what the practicum is about.  They have a few companies that I guess pitch a project for a semester and students work on it.  The fabulous Laurent, someone I met on Slack who seems quite bright, mentioned that most of the projects are BI related.  Some are Machine Learning related.  I’m thinking of aiming for a Machine Learning heavy project, as coding is one of my favorite activities and that’s why I’m taking the program in any case.  I might team up with Laurent and possibly see if I can get Shahin in from my Simulation and DVA classes.

Thoughts so Far

I think overall the program is good.  The intro courses are good if you don’t have exposure to analytics.  If you do, then you’ll find the electives far more interesting.  They get relatively deep compared to the introduction courses.  There is a good chance that I will take a few extra courses afterwards.  AI, Deep Learning and Reinforcement Learning all look interesting.

Overall, I like this program and think it’s been fun.  I’ve kept it kind of on the light side, taking only 1-2 courses a semester.  It hasn’t been unreasonable, but I’ll admit I’ve had moments where I’ve been a bit burned out or stressed.  Last term was an example of that, with a total of 8 exams during a 1.5 month period.

Afterwards, I feel like I’ll most likely take some computer science classes.  Self-learning is also important, but it’s nice not having to research and find good material, and to have something taught to you without picking up too many misconceptions or missing some big picture items.  Only con so far: not enough hands-on experience.  I miss the whole creating prototypes thing.  That’s half the fun in tech.

Publication

I submitted a publication for review in late June.  Still waiting to hear about it.  It involves an image-to-image comparison algorithm and some fancy work I did with Flask and PWA libraries.  Met some great people during the project and might look at doing a second publication with them.  I’ve mostly been focusing on Computer Vision applications.  I took the Georgia Tech Computer Vision curriculum and found the books.  I’ll probably read them in preparation when I have some down time.  One nice thing about working on a publication is you get to really play around with ideas and prototypes.  That’s been a ton of fun for me thus far.

Simulations and Poker

Simulations is Done

My Master’s in Analytics is 50% complete!  With Simulations done, I’ve gained valuable knowledge in: analyzing simulation inputs, outputs and comparing systems.  Plus, I got to use Arena.  Arena allows you to create simulations super quick.  During this class, I stretched a bit of my Python skills and built a Poker Simulator and got to design poker players.  More on that later.

Simulation Concepts

Simulations covers lots of concepts useful for future classes.  Just a quick list:

  1. Calculating expectation in discrete, continuous and multivariate forms.  Lots of integrals.  A close cousin is calculating variances.
  2. How random number generators work.  Check out the Mersenne Twister.  This is behind Python’s random library.
  3. How to run simulation libraries such as Arena.
  4. Probability distributions: Bernoulli, Binomial, Geometric, Uniform etc.  How to fit them to your data using MLE.  How to check if they are a good fit using goodness of fit tests.
  5. Comparing simulation systems with different confidence interval techniques: independent replications, batch means etc.
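
To make items 2 and 4 a bit more concrete, here is a minimal sketch (not course material) that draws geometric samples with Python’s random module, which is built on the Mersenne Twister, and then recovers the parameter with the maximum likelihood estimate.  The function names, the seed and the true value of p are my own assumptions for illustration.

```python
import random


def sample_geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # rng.random() < p counts as a success
        trials += 1
    return trials


def geometric_mle(samples):
    """MLE of p for a geometric distribution on {1, 2, ...}: 1 / sample mean."""
    return len(samples) / sum(samples)


rng = random.Random(42)  # seeded Mersenne Twister, so runs are repeatable
true_p = 0.3             # assumed value, chosen only for the demo
samples = [sample_geometric(true_p, rng) for _ in range(10_000)]
p_hat = geometric_mle(samples)
print(f"true p = {true_p}, MLE estimate = {p_hat:.3f}")
```

A goodness of fit test would be the natural next step: bin the samples, compare observed and expected counts, and see whether the fitted distribution is plausible.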

We covered more than the above.  The above represents some concepts taught in that class.
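
As one example from that list, here is a rough sketch of the batch means idea from item 5: chop one long, correlated run into batches and treat the batch averages as roughly independent observations.  The AR(1) style input data, the batch count and the use of a normal quantile instead of a t quantile are all simplifying assumptions on my part.  It needs Python 3.8 or newer for statistics.fmean and statistics.NormalDist.

```python
import random
import statistics


def batch_means_ci(observations, num_batches=20, confidence=0.95):
    """Confidence interval for the mean of one long, correlated run."""
    batch_size = len(observations) // num_batches
    batch_means = [
        statistics.fmean(observations[i * batch_size:(i + 1) * batch_size])
        for i in range(num_batches)
    ]
    grand_mean = statistics.fmean(batch_means)
    std_error = statistics.stdev(batch_means) / num_batches ** 0.5
    # Normal quantile as a simplification; a t quantile with
    # num_batches - 1 degrees of freedom is the more careful choice.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return grand_mean - z * std_error, grand_mean + z * std_error


# Made-up correlated output: each value leans on the previous one,
# with a long-run mean of 10.
rng = random.Random(7)
value, run = 10.0, []
for _ in range(20_000):
    value = 10.0 + 0.8 * (value - 10.0) + rng.gauss(0, 1)
    run.append(value)

low, high = batch_means_ci(run)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```

Independent replications work the same way, except each “batch” is a separate run with its own random seed.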

Finding a Teammate

I decided to team up with Shahin Shirazi, a fellow student.  Shahin and I worked together on our Data and Visual Analytics project.  Shahin focused on using Python to build a KNN clustering algorithm on Geospatial data.  I love programming and when I find a capable student I try to recruit them for a project.  Shahin is great!  We decided to team up for our simulation project.

Poker Project

Shahin and I decided to build a poker simulator.  We called it Decima after the goddess that measures out the lives of men and women.  Our Decima would measure out the performance of poker players instead.  Our repository can be found here: Decima.

Decima was developed as a single Python file with no parameters: poker.py.  This was to simplify the effort for the graders.  We used Python 3.7 for the project.  The project defines a set of simulations to run, see the below picture (poker.py file):

Tables, hands, balance and minimum balance represent global parameters.  In the above:

  • tables – represents the number of poker tables to run.  You can think of this as an independent replication.
  • hands – number of hands the dealer plays at a table.
  • balance – initial player balance.  Think of this as $100,000.  A sizeable house down payment or a Ferrari.  Real high stakes.  If you are Bezos, waiter tip, please tip more :).
  • minimum balance – how much a player needs to join a game.

The simulations parameter above defines a user-friendly name for each simulation and the type of players involved.  You can define as many simulations as you want.  Though if you define a lot of them, you might be waiting a while for the simulation to complete.  My official warning.
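
For a feel of the shape of such a configuration, here is a rough sketch.  Every name and value in it is an assumption made up for illustration; it is not the contents of poker.py.

```python
# Hypothetical global parameters (names and values are illustrative only).
TABLES = 100              # number of tables, i.e. independent replications
HANDS = 50                # hands the dealer plays at each table
BALANCE = 100_000         # initial balance for every player
MINIMUM_BALANCE = 50      # what a player needs to join a game

# Each simulation pairs a user-friendly name with the player types seated.
# The player type names here are placeholders.
SIMULATIONS = [
    {"name": "Callers vs raisers",
     "players": ["AlwaysCallPlayer", "AlwaysRaisePlayer"]},
    {"name": "Table full of callers",
     "players": ["AlwaysCallPlayer"] * 4},
]


def total_hands_dealt(simulations, tables, hands):
    """Rough work estimate: every simulation plays every hand at every table."""
    return len(simulations) * tables * hands


print(total_hands_dealt(SIMULATIONS, TABLES, HANDS))
```

The work grows with the product of simulations, tables and hands, which is why a long list of simulations means a long wait.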

Poker Player Types

We defined a bunch of players.  Some of them are rather complex, others not so much.  Below are 2 rudimentary ones:

These players always call or bet independent of anything else.   I wouldn’t say this is a “smart” strategy, but it’s great to explain our framework.  In the above cases, we have a class that inherits from GenericPlayer and has a bet_strategy method.  That method has a ton of parameters none of which are optional.  Within the body of the function, you have to call (only once): call_bet, raise_bet or fold_bet to cause the player to: call, raise or fold respectively.  That’s it.

You can make this as complicated as you would like by adding more methods or data structures.
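
A minimal sketch of that pattern follows.  The GenericPlayer stand-in, the bet_strategy parameters and the way actions are recorded are all assumptions for illustration; the real Decima classes take more parameters and do more bookkeeping.

```python
class GenericPlayer:
    """Stand-in base class; the real Decima base class is not reproduced here."""

    def __init__(self, name):
        self.name = name
        self.last_action = None

    def _act(self, action):
        # The framework expects exactly one action per bet_strategy call.
        if self.last_action is not None:
            raise RuntimeError("only one of call/raise/fold per decision")
        self.last_action = action

    def call_bet(self):
        self._act("call")

    def raise_bet(self):
        self._act("raise")

    def fold_bet(self):
        self._act("fold")

    def bet_strategy(self, hand, community_cards, pot, current_bet):
        # Hypothetical parameter list; subclasses decide what to do with it.
        raise NotImplementedError


class AlwaysCallPlayer(GenericPlayer):
    def bet_strategy(self, hand, community_cards, pot, current_bet):
        self.call_bet()  # ignores every input and just calls


class AlwaysRaisePlayer(GenericPlayer):
    def bet_strategy(self, hand, community_cards, pot, current_bet):
        self.raise_bet()  # ignores every input and just raises


caller = AlwaysCallPlayer("caller")
caller.bet_strategy(hand=[], community_cards=[], pot=100, current_bet=10)
print(caller.name, caller.last_action)
```

A smarter player would look at its hand, the pot and its opponents inside bet_strategy before picking one of the three actions.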

Decima Diagram:

I provide below a diagram of our program.  Read from top left to bottom right.  The bottom right are 3 CSVs that are produced as outputs for the system.  Poker balance is for game balances, poker hands is for decisions made in the poker game and poker table info is just metadata for the simulation itself.  All of the CSVs have those convenient foreign and primary keys that all jazzy database folks love and live by.  So feel free to join them.  Preferably with a good beverage at hand.

Cool Player Types we developed:

Shahin focused on players that were aware of their opponents.  His players almost always kicked my player’s butt.  Even when they weren’t aware of their opponents they kicked my player’s butt.   I guess this whole reinforcement learning thing has some added advantages.  That or trying to figure out your opponent when playing poker has some merit.  Maybe I’ll get him to update this post with all that poker butt-kicking goodness.

I focused on trying out some strategies involving data structures I found novel.  When I say novel, I mean: I read 4-5 academic papers, 3 of them mentioned this data structure and so I decided on pure faith that I could build it without considering any time constraints (then somehow pulled it off).  Wonderful.  What did I build?  A Monte Carlo Tree Search algorithm.

A Tree of Monte Carlo Simulations:

So evidently, in the video game and board game communities, they use this concept called a Monte Carlo tree search algorithm to determine what moves are best to play.  I had no idea.  It’s also, coincidentally, used in AlphaGo.  Though coincidentally here means it’s one of many algorithms, so I can’t even bother claiming any genius for implementing a half-assed version of it.  Still, kind of fun.  Here I poach the diagram from Wikipedia for your viewing pleasure:

It’s not as complicated as it looks.  Each node in the tree represents a player’s move.  Each child is an opponent’s response to the parent move.  So levels of a tree represent different players.  Well, at least until the player order repeats itself.  If you start at the root node and go down, you follow a sequence of moves for a board (video game) up until that point.

How does the algorithm work?  Select a node with a missing child.  Add said missing child.  Use the moves up to that point in time as an initial state for a Monte Carlo simulation.  Play the Monte Carlo simulation to completion and record the wins and total games played.  Propagate the score and total games up the ancestor chain.  That’s it!  Once you propagate everything up the ancestral chain, your win probabilities update and you can look at all moves to make a decision.  One hiccup: how do you choose a node to search if no children are missing?  You use the upper confidence bound algorithm.  TLDR: an algorithm that balances searching new nodes with exploiting nodes with a high win percent.
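
Here is a small sketch of those four steps on a toy game instead of poker: a single pile of stones, each player takes 1 to 3, and whoever takes the last stone wins.  This is not the Decima implementation.  The game, the class layout, the UCB1 constant and the iteration count are all assumptions picked to keep the example short.

```python
import math
import random


def legal_moves(stones):
    return [take for take in (1, 2, 3) if take <= stones]


def rollout(stones, player_just_moved, rng):
    """Play random moves to the end; whoever takes the last stone wins."""
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        player_just_moved = 3 - player_just_moved
    return player_just_moved


class Node:
    def __init__(self, stones, player_just_moved, parent=None, move=None):
        self.stones = stones
        self.player_just_moved = player_just_moved
        self.parent = parent
        self.move = move
        self.children = []
        self.untried = legal_moves(stones)  # the "missing children"
        self.wins = 0
        self.visits = 0

    def ucb1(self, child, c=math.sqrt(2)):
        """Win rate (exploit) plus a bonus for rarely visited children (explore)."""
        exploit = child.wins / child.visits
        explore = c * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore


def mcts_best_move(stones, iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(stones, player_just_moved=2)  # so player 1 moves first
    for _ in range(iterations):
        node = root
        # 1. Selection: while no child is missing, descend by upper confidence bound.
        while not node.untried and node.children:
            node = max(node.children, key=node.ucb1)
        # 2. Expansion: add one missing child.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, 3 - node.player_just_moved, node, move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node's state.
        winner = rollout(node.stones, node.player_just_moved, rng)
        # 4. Backpropagation: update wins and visits up the ancestor chain.
        while node is not None:
            node.visits += 1
            if winner == node.player_just_moved:
                node.wins += 1
            node = node.parent
    # Recommend the most visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move


print(mcts_best_move(5))
```

With 5 stones, taking 1 leaves the opponent on 4, which is a losing position for them in this game, so the search should settle on that move.  Wins are stored from the point of view of the player who just moved into a node, which is what lets the same UCB formula work at every level of the tree.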

Next up, Regression and more regression…

Doubling up this next summer term with a class on regression and a class that applies regression to different business units.  So double regression trouble!  Hopefully, I have a cool project to talk about.

Not Everything Appears as It Seems – Lessons on Computing from an ex-Strategy Consultant

How to confuse a spouse new to programming!  A two-step guide!

My wife looks over my shoulder a bit confused.  I’m working on a final project for my Node.js class (7th iteration) and she’s been watching since the first iteration.  This is what she sees:

She asks me calmly!

“Why does it seem like every time I look at your Taskitty application [Task tracking App] it seems to look the same or occasionally looks worse?  I feel like you are procrastinating.  Maybe stop playing video games or watching TV.”

I’m a bit stunned.  I’ve been putting a ton of work into the class.  Easily a dozen or so hours per week on top of my typical work schedule.  Then I realize something spectacular.  1. The entire class is iterative and 2. the professor doesn’t care how it looks in the browser (front-end web development).  He wants us to focus on learning different frameworks and is fine with us replicating the same overall look and feel.

The best analogy: imagine talking to a neighbor and finding out he’s put a ton of work into his house recently.  You look at the exterior and think “It looks like it’s actually deteriorating a bit.  A little bit of the siding is coming off and some of the wood seems a bit rotten”.  He then mentions that all his time has been focused on interior design and a kitchen remodel.  It’s that aha moment.

My wife was watching the exterior of the website, but was clueless that the kitchen and bathroom had been remodeled.  In fact, in my case the entire internals had been replaced without her even noticing.

The house before remodeling!

In the above cases, both version 4 and the final version of Taskitty are quite large.  The first version uses a framework called Express that uses routers and templates to create the presentation.  The best way to describe it is the diagram below:

Your browser requests a web page from my server.  The server uses a router to find a template for the web page.  It fills in the blank parts of the template using information from a database and also cookies/session information.

Three important things to note:

  1. Everything happens on my server.  Each web page has its own template with its own fill-in-the-blank variables.
  2. When my server is done, it sends the entire newly formed HTML document to your browser via the internet.
  3. Every web page requires you to contact my server first and all the processing happens on my server.
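
The flow above can be sketched in a few lines.  Express itself is a Node.js framework, so this is only a Python illustration of the same idea (router, template, fill in the blanks on the server), not Express code.  The routes, template text and data are made up.

```python
from string import Template

# One template per page, each with its own fill-in-the-blank variables.
TEMPLATES = {
    "/tasks": Template("<h1>Tasks for $user</h1><ul>$items</ul>"),
    "/about": Template("<h1>About Taskitty</h1>"),
}

# Stand-in for the database.
DATABASE = {"tasks": ["Buy milk", "Walk the cat"]}


def handle_request(path, session):
    """Router plus rendering: everything happens on the server."""
    template = TEMPLATES.get(path)
    if template is None:
        return "<h1>404 Not Found</h1>"
    items = "".join(f"<li>{task}</li>" for task in DATABASE["tasks"])
    # The finished HTML document is what gets sent to the browser.
    return template.substitute(user=session["user"], items=items)


print(handle_request("/tasks", {"user": "Chris"}))
```

Every page view is a fresh call to handle_request, which is the third point in the list: no page appears without a trip to the server.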

What changed inside the house!

The final version looks similar on the surface, but how it arrives at the same result is completely different.  Here is the diagram of how it works:

In the above case, the browser requests a page from my server.  My server sends a single web page bundled with a ton of components and JavaScript.  The single page application (SPA) sets up a router in your browser using JavaScript, which then selectively picks and chooses components to create a web page.  It can also independently request more information from the database.  Once the router has finished picking the components and getting data from the database, it modifies the single HTML web page that was sent and renders it for you in your web browser.  You can then use the router to reconfigure the web page as many times as you desire.

There are some huge implications here:

  1. When you first contact my server, you have to download not just a web page, but all the components and logic of the website.  That can be a very large initial download.  So the first connection is a bit slow.
  2. Once the web page and website logic exist in your browser, the only time you have to reach out to my server is to contact the database.  You can almost run the application without an internet connection.  That can really speed up the experience since internet connections are slow compared to your computer’s resources (usually).
  3. By shipping components, I can re-use them across web pages.  This saves a lot of time.
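
To see the first two implications in numbers, here is a toy Python model of the single page approach.  It is not how Angular or any real framework is written; the Server and BrowserApp classes, the routes and the data are all hypothetical, and the only point is to count trips to the server.

```python
class Server:
    """Hypothetical server that counts how often the browser contacts it."""

    def __init__(self):
        self.requests = 0
        self._tasks = ["Buy milk", "Walk the cat"]

    def fetch_bundle(self):
        # One large initial download: every component ships at once.
        self.requests += 1
        return {
            "/tasks": lambda tasks: "<ul>" + "".join(f"<li>{t}</li>" for t in tasks) + "</ul>",
            "/about": lambda _data: "<h1>About Taskitty</h1>",
        }

    def fetch_tasks(self):
        # Data still lives behind the server.
        self.requests += 1
        return list(self._tasks)


class BrowserApp:
    """Client-side router: picks components locally, fetches only data."""

    def __init__(self, server):
        self.server = server
        self.components = server.fetch_bundle()

    def navigate(self, path):
        component = self.components[path]
        data = self.server.fetch_tasks() if path == "/tasks" else None
        return component(data)


server = Server()
app = BrowserApp(server)
app.navigate("/about")
app.navigate("/about")
page = app.navigate("/tasks")
print(page, server.requests)
```

Three page views cost two server requests here: the initial bundle and one data fetch.  With the template approach every one of those views would have been its own round trip.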

The house walls are still an ugly Green!

My wife of course doesn’t have access to the code, nor can she tell how the internals work by looking at what the website generated.  She just sees the same old web page unchanged iteration after iteration.  The only thing she might notice is that the later web page feels faster and seems to function smoothly.  She can easily take this for granted, because both sites are relatively fast.

I think this is one way a developer’s experience can be divorced from an end user.  Sometimes we make huge changes, but things that are visible don’t change drastically.  So it appears no work has happened.

The “SQL” strikes back!

About 3 months after the class, my wife decides to be ambitious and start a class on Excel, SQL and Tableau.  One day, I watch her getting frustrated.  She tells me:

“I keep trying out these SQL commands in this database.  Every single time it tells me the function doesn’t exist.  I’ve tried different things and googled the message for the last 4 hours.  Nothing seems to solve it.”

Trying to be a helping hand, and having done SQL for almost a decade, I decide to look into the issue for her.  The first thing I ask is what SQL database she is running.  She tells me the class is in PostgreSQL.

She types SQL into a web page.  Runs it.  It works.  She opens up a SQL client on her computer.  Types the same thing in.  It tells her the function doesn’t exist.

My first gut reaction is she’s running an older version of the database.  So I tell her to type: select version().  That should tell me what version of PostgreSQL she is running.

Version() doesn’t exist!

Yes, the database tells me even the Version() function doesn’t exist.  Now, I’m on high alert.  Either it’s a very old database from several decades back, which is nonsensical since the firm she’s taking the course from is less than a decade old, or she’s using another database.

I google the error message by itself.  The first thing that pops up on Google.  SQLite.  So I ask my wife to try:

select sqlite_version();

She tells me it works!  What does it mean?

My wife’s website was in PostgreSQL and the one on her laptop was SQLite.  Postgres and SQLite have different functions for the same tasks.  So the SQL from postgres will often error out in SQLite.  The same is true in reverse.
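
You can reproduce the whole mystery with nothing but Python’s built-in sqlite3 module.  This is a small sketch using an in-memory database; it assumes nothing about the course’s actual setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The PostgreSQL-style version check is not a function SQLite knows about.
try:
    conn.execute("select version()")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print("select version() ->", error_message)

# SQLite's own version function works fine.
sqlite_version = conn.execute("select sqlite_version()").fetchone()[0]
print("select sqlite_version() ->", sqlite_version)

conn.close()
```

The first query fails with a “no such function” error and the second returns the SQLite library version, which is exactly the pattern that gave the database away.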

Mystery solved!

Except, my wife is now more confused than ever.

  1.  She now realizes there are different SQL databases.
  2. She doesn’t understand why they would teach Postgres and give a SQLite test database for practice.

I don’t blame her.  It’s very confusing.

Possible reasoning?

Having spent a lot of time in Postgres and having taken a few dozen courses/tutorials in Python, I realized I might know the answer.  SQLite is a database with a small footprint often used in IoT devices.  You can easily download it as a single file.  Postgres on the other hand requires an installer, setting up a database/users and creating tables.  You can definitely bundle it up and ship it as a single file if you are clever about it (Docker etc), but it’s much more complicated.

The problem in this course’s case is communication.  They should have made it clear that the databases weren’t the same, and that the reason they use SQLite for practice is it’s much easier to set up.  The only thing the students would have to do is google SQLite functions when using it.

Two things, both different, both look similar!

Here ends the tale of two confusingly superficially similar things, which are actually completely different.  Lesson: things that look the same might be completely different.  Communicating those differences can have a huge impact on user experience and prevent you from looking like a couch potato.

Web Certificate – 7 Days Left

7 Days until the Start of the End

Another personal update!

I have 7 days left of my month-long “vacation” from school.  This is the start of the end of my Harvard Extension School – Front-end Web Development certificate.  I will take a class that perfectly sums up my last year: 50% Data Science and 50% web development.  The class is called: Building Interactive Web Applications for Data Science.  I’m concurrently working towards a Master’s in Analytics at Georgia Tech, which will become the main focus afterwards.

What I’ve built so far!

CSCI E-12 – Fundamentals of Website Development:

  1. The final project combined flexbox, media queries, jQuery/JavaScript and light boxes.  I had to do some wire-framing and learn a bit about UX.

CSCI E-33A – Web Programming with Python and JavaScript: 

  1. Pokemon Listing website  – pokemon list and coding documentation containing basic HTML, CSS, SCSS and JavaScript
  2. Bookstore with Search:  Book listing for book store with search done in a mix of SQL, Flask and HTML/CSS.
  3. Slack Clone (Chat): Chat messenger that looks exactly like Slack using a mix of Flask and socket.io.
  4. Pizza Shop – E-commerce: This is a Pizza shop application with a shopping cart built with a mix of Django and Django forms.
  5. Property Management App: This is a property management app.  Search houses in Zillow, add them to your properties and watch monthly mortgage payments/interest auto-calculated including payment schedule.  Built using Django and calling Zillow API.

CSCI E-31 – Web Application Development using Node.js

  1. Task Tracking Application: this was incrementally built starting with a static Express application in Node.js, adding MongoDB as the data source, RESTful APIs for task data and finally converting the front-end to use Angular.js.

What’s Next?

Building Interactive Web Applications for Data Science combines my interest in Data Science and Data Analytics with recent skills I’ve acquired in web development.  The class focuses on Flask deployed to AWS via Docker.  It requires the student to build Machine Learning models and D3.js visualizations that interact with the Web Application.  This should be interesting.

This ties into 2 other classes I’ve taken at Georgia Tech: CSE 6040 – Intro to Computing for Data Analysis and ISYE 6501 – Intro to Analytics Modeling.  The former explores Python, Pandas, NumPy and Machine Learning implementation.  The latter focuses on analytics techniques and introduces around a dozen different types of models.  The coding in 6501 is in R.

Fall Term!

Fall term is approaching.  I will be focusing exclusively on my Master’s in Analytics outside of work.  Two courses I will be taking: Data Visual Analytics and Simulations.  The former involves a survey of different data-related technologies: Python, D3.js, Spark/Hadoop, SQL databases.  The latter is theoretical with a focus more on the math.  It involves building a few simulations in Arena and Python.  The Data Visual Analytics course is renowned for being time-intensive and difficult, more so for those with little programming experience.

Other Technologies!

Typically, I don’t talk much about things outside of class!  That said, I did get to pick up Docker and implement it within a CodeBuild environment.  I built a continuous integration and development system.  That was pretty cool.  I’ve also been reading up on new data technologies: Spark, Presto etc.  During my 1-month break I implemented a Spark cluster on an EC2 instance that contains Jupyter and used my “micro” cluster to try out Spark’s official Machine Learning documentation.  I’ve been wanting to try it out.

Overall, I’m personally trying to push myself towards trying new things out, mostly by implementing more things from scratch.  It’s part of stretching some of my newfound skills in Linux administration gained over the last few years, along with a general interest in learning more about DevOps technologies on the side.

Always stay true to yourself and explore this world!  Life is too short to wait for it to pass you by!

Best,

Chris

2020 – Education Overload

Georgia Tech

Education overload: 2019-2020.  That should be a life headline.  Until 2 years ago, I picked up new technologies by reading a few books, scouring the internet for tutorials and then building a bunch of prototypes.  It worked.  But I’ve always wanted to try graduate school and I wanted to do it on a topic I struggle with: Math.

why math?

math on blackboard

This is part past history and part recent struggle.

In college, I rapidly switched between: chemistry, mathematics, international relations and then economics.  I even threw in some art courses: drawing and painting.  The class that “got away” was always math.  It’s my ultimate “what-if”: had I stuck with it, where would I be today?  I’m not sure I would be more successful, but it still interests me.

The second part is recent struggle.  I’ve read a few books on different math topics over the last 10 years: probability theory, statistics, machine learning and game theory.  I’ve even taken an intermediate statistics course, where I skipped the pre-requisites due to cockiness (which quickly resolved itself over the next few classes).  The problem: learning was never consistent and sustained.  There was always the next best opportunity, Computer Science, staring at me when I struggled with a math equation or theory.

Graduate school kills two birds with 1 stone: that nagging what-if question and the natural instinct to pick up an easier topic in computer science when the math gets hard.

stupidity’s name is ambition…

coffee

My coping mechanism to experiencing change is to rapidly increase the pace of change.  I partially blame my coffee addiction: rapid heart-beat, high energy, high tempo.  Love that coffee high.  Love how change generates an adrenaline rush.  Same problem, different skin.

I bought a house December 2019.  I got married the following June.  Once those two events kicked off, I got an adrenaline rush from the change.  I folded, applied for both the Computer Science and Analytics Masters at Georgia Tech.  Meanwhile, I couldn’t stay still and picked up two courses in web development over Fall/Summer 2019 at Harvard Extension School and got Linux certified.

Ambition leads to the: “Education Plan”.  A “5-year” style plan reformatted into a 3-year format.  Now, I can keep myself busy for the next 3 years.  Fun.

the “Plan”

the grand plan

What did the education plan do?  It formalized the Masters in Analytics as a 3-year goal by breaking it down into 2 classes per semester (which, in taskmaster fashion, I’ve kept to).  It converted the 2 Harvard Web Courses into a Front-end web development certificate due Summer 2020.  Finally, it listed a bunch of “optional” certificates to pursue.  You know, just in case I had enough free time.

how’s that been going for you?

I’m finishing up my 3rd course at both Georgia Tech and at Harvard Extension School.  Since May 2019 (it’s currently March 2020), I’ve done:

Georgia Tech Master of Analytics:

  1. Business Fundamentals for Analytics
  2. Computing for Data Analytics
  3. Introduction to Statistical Modeling

Harvard Extension School:

  1. Fundamentals of Website Development
  2. Web Programming in Python and JavaScript
  3. Web Application programming in Node.js

All I can say: it has been busy.  I’ve been studying 10-20 hours every week since May 2019.  My only break has been December for 2-3 weeks.  It’s been a relentless march of progress.  I’ve been tired, stressed and also had fun.  Overall, I feel like all my free time has been sucked out of my life.  I’ve really started cherishing relaxed weekends and time with friends/family more.

how to cope?

how to deal with stress?

Golden rules for studying 2 courses at a time, working full-time and also having time for family/friends (as well as 2 work out sessions a week): do as much as possible as early as possible, schedule things in advance, schedule vacation during school holidays and take frequent breaks.

1.  Do things as early as possible

Your worst enemy is procrastination.  Procrastination leads to late nights studying, finishing up projects or work assignments.  Late nights are not sustainable.  You will sooner or later pay for them by either: resting/recovering for long times (you don’t have time), limiting time with friends/spouse (stress/psychological relief) or experiencing increased amounts of stress (bad for dealing with people).  The trick: be super proactive about everything.

2.  Scheduling things in advance

Try to put things on the books or on your schedule.  This will help motivate proactive behavior.  If you know you will hang out with friends on the weekend, you’ll try to crush that assignment tonight.  You won’t wait until Monday and panic.  It helps keep you more accountable.

3. Schedule vacation during school breaks

Vacations are a nightmare if you are doing part-time school.  A bunch of problems can arise.  1. You can easily lose motivation when a beautiful mountain range is right outside your door waiting to be hiked.  2. You’ll get FOMO: when is the next time you’ll be on this tropical island?  Why bother studying?  3. Your internet will fail you or be blocked by the government (what happens in China stays in China).  4. You’ll want to spend time with family since you only see them every few years (my parent/s live in Germany/China).  All 4 cases are worse when you are actively studying in school.  Let a vacation be a true vacation: do it when you don’t have work or school.

4.  Take frequent breaks

When studying, take a bunch of breaks.  Go study for a few hours, then go watch a movie, take a long walk or exercise.  Long stretches of studying will “compress” your brain and make it harder to study in the future.  Don’t try to cram too much at once, because it will get exhausting.  This kind of goes back to part 1 of this list: do things as early as possible.  The earlier you do something, the more opportunities you have to take a break or spend time with friends/family.

last, but definitely not least

Having supportive and understanding friends, family and spouse is crucial.  Don’t neglect those important people in your life (again step 1).  Recognize how they make you stronger and help you out on your day-to-day journey.

I’m especially thankful to my loving wife for putting up with long study hours, making the occasional dinner and believing in me.  She also frequently checks on me and gives me an excuse to take a break!  I think this experience would be much tougher without her.