
Data Camp: A Week’s Adventure into Python Data Science


Introduction:

Noah, a former TechTarget colleague, mentioned datacamp.com to me while we were discussing an issue with remotely hosted files (probably RDP-related) and pandas DataFrames. On February 10th, I decided to give it a try and today, 6 days later, I finished the Python Programmer Track (10 classes)!

What is Data Camp? Data Camp is an online education company that offers courses specific to data science. They focus mostly on two lines of technology: R and Python. They also offer auxiliary courses on other topics: SQL, Linux and git (as well as a few derivatives). All the courses are hosted on their online platform, which overall is beautiful to interact with. A bit more about the topics:

Data Camp Topics:

R – Statistical Programming Language

A statistical programming language inspired by LISP and developed by professors at the University of Auckland. R has become famous amongst statistics and machine learning researchers, who use it to prototype cutting-edge research and publish the results as packages on CRAN, an online repository for hosting R code.

Python – General Purpose Language

A general purpose programming language developed by Guido van Rossum. It’s a dynamically typed, interpreted language that has been used for rapid prototyping and scripting. As a general purpose language, Python can be used for web development, networking, devops, data analysis and machine learning. Python has gained popularity over the last few years with increased interest in its scientific computing platform. Pandas, a popular data science library, was inspired by R’s data frame.

SQL – Structured Query Language

Structured Query Language, SQL, is a domain-specific language typically associated with relational database management systems (RDBMS, aka SQL databases), but it has been co-opted by other solutions (BI platforms, Hadoop and Spark). SQL is easy to learn and revolves around a few common entities: servers, databases and tables (views etc). A table can be viewed as tabular data where each row is an entity (an employee) and each column is an attribute of that entity (a salary). Most SQL is written to retrieve data stored in tables, with the RDBMS mediating the access.
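
To make the row/column picture concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The employee table, its columns and the sample rows are made up for illustration; they are not from any Data Camp course.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 120000),
        ("Grace", "Engineering", 130000),
        ("Linus", "Support", 70000),
    ],
)

# Retrieval: each row is one employee, each column one attribute.
rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > ? ORDER BY salary DESC",
    (100000,),
).fetchall()
for name, salary in rows:
    print(name, salary)
conn.close()
```

The SELECT statement is the part that carries over to any RDBMS; only the connection setup is specific to SQLite.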

Context

R and Python are used to manipulate data in a procedural way, typically through data frames (R), Pandas (Python) or Numpy (Python). These libraries create tabular, matrix, vector or scalar data and often rely on vectorized operations to do computation. Both R and Python have extensive data visualization frameworks to create graphs. They also have great libraries for statistics, machine learning and artificial intelligence. SQL is primarily associated with data retrieval and is used to get data back from a hard drive (through the database). Most databases are more limited when it comes to fine-grained data manipulation, graphing and machine learning (though extensions do exist). They make up for it by being able to store large quantities of data without relying extensively on memory (volatile and limited). Generally, most software engineers and analysts use both a programming language (R or Python) and SQL (to access data).

Data Camp:

Summary:

Data Camp organizes its Python/R courses into career tracks: Python Programmer, Data Analyst and Data Scientist. Each track is composed of a number of courses: 10, 13 and 20 respectively. Topics covered: basic programming, data manipulation techniques, graphing, statistics, machine learning, network analysis and AI (1 course).

Each course is composed of 3-5 segments. Each segment has a set of lectures followed by exercises. You can expect around 3 lectures and 10 exercises per segment. Most courses build on themselves intuitively, beginning with the basics and gradually increasing in complexity. I was surprised to find lectures on generators, closures and how they relate to data frames within the first 4 classes.
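
For readers who have not met those two ideas yet, here is a small sketch of my own (not taken from the course material). The generator hands out rows a chunk at a time, which is the same idea as reading a large table in pieces, and the closure builds a reusable filter. The function names and the salary figures are invented for the example.

```python
def read_in_chunks(rows, size):
    """Generator: lazily yield lists of at most `size` rows."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # hand out the final, possibly shorter, chunk
        yield chunk


def make_threshold(minimum):
    """Closure: the returned function remembers `minimum`."""
    def is_above(value):
        return value > minimum
    return is_above


above_90k = make_threshold(90000)
salaries = [120000, 130000, 70000, 95000, 88000]

# Count the matching salaries chunk by chunk instead of all at once.
counts = [
    sum(1 for salary in chunk if above_90k(salary))
    for chunk in read_in_chunks(salaries, 2)
]
print(counts)
```

Pandas offers the same pattern for files that do not fit in memory: read_csv accepts a chunksize argument and then returns an iterator of smaller data frames.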

Lectures:


The lecture portion of the website is beautiful. Presentations have nice transitions and the website background is not distracting. Lecturers were clear and easy to follow. You get a sense that a lot of effort was put into the curriculum. There were multiple lecturers in the 10 courses I took (around 5-6). Some of these lecturers work for esteemed companies like Anaconda, have published books on the topic they spoke on or have a background in software engineering/consulting. Overall, the lectures were high quality with very few mistakes.

Practice Sets:


The practice problems are done in what looks like a modified IPython notebook embedded in the website. It has 4 panels: 2 on the left (Exercise and Instructions) and 2 on the right (script.py and IPython Shell).

The Exercise and Instructions panels provide guidance on how to complete the exercise. The Exercise section explains the topic. The Instructions tell you what steps need to be completed before a submission is accepted.

script.py is where you write your code. The run code button lets you execute script.py and see the output in the IPython Shell. It’s pretty interactive. When you submit an answer, it checks whether the solution matches the Instructions section. There were very few cases where code I submitted was marked wrong when it was in fact right. That’s great!

What if you get stuck on a programming problem? There is a button called hint. This provides some extra guidance, typically in the form of small chunks of code. If you press the hint button, a new button called show solution will appear. Clicking the show solution button will overwrite the script.py window with the correct solution, which you can then run and submit.

On rare occasions, you might end up with a multiple choice question. They offer an interactive shell in this case. It always precedes the more free-form question variant.

Gamification:

Data Camp has one really great feature that I liked. Each exercise has an amount of xp that you can collect. Each lecture is worth 50xp, multiple choice is worth 50xp and problem sets are worth 100xp. If you click the hint button, the problem set xp is reduced to 70xp. Show solution gives you 0xp. When logging in, you will see the total xp you got that day as well as how many days in a row you have used Data Camp (your streak). This is a great way to motivate you to do the exercises.

Another layer of gamification comes from certificates you can collect (and post on LinkedIn) as well as career tracks you can complete.

Cost/Summary:

I think Data Camp is very friendly to beginners interested in learning about data analysis in Python and R, both marketable skills. I think the course layouts, lectures and practice problems are well thought out. I would suggest that beginners also read books and online documentation on subjects like Pandas and Numpy. I found Data Camp focused more on practicing skills and less on implementation details (which is a good thing).

Data Camp is currently on sale for $180/year (usually $300/year). You can also buy it as a monthly subscription for $30/month. Data Camp is similar to Udemy. There are 4 advantages Data Camp has over Udemy:

  1. Practice problems make up a larger percent of the curriculum.
  2. For $30/month you have access to any of the 100+ courses. The Udemy equivalent is $15/course. A Udemy course is equivalent to 3 Data Camp courses (in material).
  3. Data Camp specializes in data science and has really thought about how to naturally progress through the data science material.
  4. If you like structured programs, this is better.

Advantages of Udemy:

  1. Overall larger selection of content, variety and topics. If you want to practice Python but also tackle multiple topics such as web programming, networking, penetration testing or a specific subset of machine learning, it might be a better option.
  2. There is no subscription fee. It’s a fixed cost. If you are not sure you want to invest a lot of time into Python programming, this is a better choice.
  3. Lectures are not as uniform.  You might find some lectures that are more theoretical.  Others are more practical.  That means you can try out different lecture styles to see what works.

Best,

Chris

 

 

Interactive Programming Diagrams

Python Tutor

Introduction:

Philip J. Guo is a professor of Cognitive Science at UC San Diego who focuses on teaching programming interactively online. He has produced great material for both programming beginners and those interested in Python internals (a 10-hour lecture).

He impressed me with the Python Tutor website: Python Tutor. It supports Python, Java, JavaScript, TypeScript, Ruby, C and C++. Python Tutor converts the code written on the left into a diagram of its data structures on the right. This provides a deeper view of what Python is actually doing. The diagrams are step-based, which means you can see the execution of each line of code and what happens beneath the covers. Red and green arrows show you the next line to be executed and the line that was just executed. An example is below.

Example (click image to go to interactive page):



Happy Holidays,

Chris

Introduction to Python Courses

Dear Reader,

I am providing free lectures on intro-to-python. Salesforce is supporting me through its voluntary time off program. Under its 1-1-1 program, Salesforce.com gives each employee 7 days to work on voluntary projects.

You can find the lectures here:

Lectures

Code

Application Class

The 4th class will be held at Cambridge Innovation Center, 1 Broadway, Cambridge, MA on Saturday, November 18th 2017. Classes are held every 2nd Saturday afterwards. Possible projects:

  1. A mini Q/A program.  It introduces the concept of regex using re.
  2. Opening a file in Python, reading its lines and analyzing words.
  3. Using SimpleHTTPServer (http.server in Python 3) to host a basic webpage with <p>, <h1> and <div> tags. This is a simple introduction to a one-line web server.
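
As a rough idea of what the first two projects could look like, here is a small sketch. The question patterns, the canned answers and the helper names are all my own placeholders, not the actual class material.

```python
import re
from collections import Counter

# Hypothetical question patterns mapped to canned answers.
ANSWERS = [
    (re.compile(r"\bwhen\b.*\bclass\b", re.IGNORECASE), "Every 2nd Saturday."),
    (re.compile(r"\bwhere\b", re.IGNORECASE), "Cambridge Innovation Center."),
]


def answer(question):
    """Return the reply for the first pattern that matches the question."""
    for pattern, reply in ANSWERS:
        if pattern.search(question):
            return reply
    return "Sorry, I don't know."


def count_words(lines):
    """Count lower-cased words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


print(answer("When is the next class?"))
sample = ["The cat sat.", "The dog sat too."]
print(count_words(sample).most_common(2))
```

Because an open file iterates over its lines, count_words also works on a real file: open it in a with block and pass the file object straight in. For the third project, Python 3 can serve the current directory with python -m http.server.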

The curriculum is free and I encourage people to submit practice problems to the GitHub repository.

Moodle Platform

Moodle is an open-source learning platform. Often, Moodle is used in universities.  I plan on implementing a Moodle instance to host lectures online.

Spammy E-mails, Great!

The Moodle implementation has stalled. I decided to host the e-mail service myself. I did get SMTP and the e-mail server up. The obstacle now is getting Google and other e-mail providers to realize my e-mail isn’t spam.

Why is it considered SPAM?

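Evidently, you can send an e-mail from chriskottmyer.com but claim it originated from johnsmith.com. The industry has developed two mechanisms to prevent this: SPF and DKIM. SPF lets the owner of johnsmith.com publish, in DNS, the list of servers allowed to send mail for that domain, so a receiver can check that a message claiming to come from johnsmith.com really was sent by one of them. DKIM adds a cryptographic signature over selected headers and the body, so a receiver can detect changes made in transit (it signs the message rather than encrypting it).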
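
The SPF side can be sketched in a few lines. This is only an illustration of the idea: the record string and IP addresses below are made-up documentation values, the helper name is hypothetical, and real SPF evaluation also handles include:, a:, mx:, redirect= and more, which this ignores.

```python
import ipaddress


def spf_allows(record, sender_ip):
    """Return True if sender_ip matches an ip4:/ip6: term in the record.

    Simplified on purpose: other SPF mechanisms are skipped.
    """
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term.split(":", 1)[1], strict=False)
            if ip in network:
                return True
    return False


# A record the domain owner would publish as a DNS TXT entry.
record = "v=spf1 ip4:203.0.113.0/24 -all"
print(spf_allows(record, "203.0.113.25"))   # server inside the published range
print(spf_allows(record, "198.51.100.7"))   # server outside it
```

A receiving server does the DNS lookup itself, so the sender cannot simply claim to be on the list.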

Apache and VPL!

After resolving the spam crisis, I will have to deal with an annoying URL issue. Moodle loves my IP address. It loves it so much, it’s bound it to all the URLs. I don’t like! I’ll have to make either an application-level change or a re-route in Apache to resolve it.

Having Moodle is great, but it doesn’t support programming assignments out of the box. Luckily, some wonderful academics created a plugin called VPL (Virtual Programming Lab). It takes code submitted through the web, sends it to a restrictive JVM-based sandbox and runs it. It should prevent any malicious hackers from hijacking the server (crossing fingers). It also supports automated grading of coding exercises (yay!).

Neither issue is blocking the lectures! Hopefully everyone can enjoy those!

Best,

Chris

 

Boston Python Group: Fluent Python and Think Python

Hello Readers,

I tend to have a few books that I really like.  The usual one I recommend for beginners is:

Think Python by Allen Downey.

http://greenteapress.com/thinkpython/thinkpython.pdf

The one I tend to recommend for advanced Python programmers is Fluent Python:

http://shop.oreilly.com/product/0636920032519.do

The latter just gets into so many cool little things about the Python language: dictionary comprehensions, how to develop a card deck in a short class (with cool use of list comprehensions) and lots of interesting technicalities. There is a reason it has 4.9 stars on O’Reilly Media.
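
To give a flavor of that card deck idea, here is a short sketch in the same spirit. It is my own variation written from memory of the concept, not a copy of the book’s listing, and the class and variable names are my choices.

```python
from collections import namedtuple

Card = namedtuple("Card", ["rank", "suit"])


class Deck:
    ranks = [str(n) for n in range(2, 11)] + list("JQKA")
    suits = ["spades", "diamonds", "clubs", "hearts"]

    def __init__(self):
        # One list comprehension builds all 52 cards.
        self._cards = [Card(rank, suit)
                       for suit in self.suits
                       for rank in self.ranks]

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, position):
        # Supporting indexing also gives slicing and iteration for free.
        return self._cards[position]


deck = Deck()
# A dictionary comprehension mapping each rank to a numeric value.
rank_values = {rank: value for value, rank in enumerate(Deck.ranks, start=2)}

print(len(deck), deck[0], deck[-1])
print(rank_values["A"])
```

Two special methods are enough to make the deck behave like a built-in sequence, which is the kind of technicality the book enjoys pointing out.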

Last night I provided some advice to Python beginners and mentioned the above two books.

Chris

Managing Networks – Trial and Error

In my free time I’ve been playing around with automating connections between different AWS instances as a way to learn more about networking. So far, it’s been pretty fun. Last post I mentioned a series of libraries just for networking.

This post talks more about user-friendly interfaces in the form of CLI libraries, as well as some interesting topics regarding asynchronous processing in Python. A bit about the CLI libraries that I really like:

Click – This is a really great library: it feels like a natural way to build simple CLIs quickly and efficiently.

You start by defining a function decorated with @click.group() (call it cli), which acts as a container for all your commands. You then write ordinary Python functions and decorate them with @cli.command(), adding @click.argument(...) for positional arguments and @click.option(..., help=...) for options with help text. The range of argument types is pretty extensive, including a file path that Click checks for existence. The nice thing about this interface is that it generates the help menu for you and, if the commands ever get more complex, provides ways to subdivide commands into smaller groups. This library is great for centralizing a bunch of commands. Create a setup.py file with an entry point to make the CLI available anywhere within Linux under a custom command name (I use something like dbops as a prefix).
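
A minimal sketch of that layout, assuming Click is installed. The check command, its option and the file name are hypothetical examples of mine; only the dbops name comes from the text above.

```python
import os

import click
from click.testing import CliRunner


@click.group()
def cli():
    """dbops: hypothetical database helper commands."""


@cli.command()
@click.argument("config", type=click.Path(exists=True))
@click.option("--verbose", is_flag=True, help="Also print the file size.")
def check(config, verbose):
    """Confirm that CONFIG exists."""
    click.echo(f"found {config}")
    if verbose:
        click.echo(f"{os.path.getsize(config)} bytes")


# CliRunner invokes the group in-process, which is handy for a demo or tests.
runner = CliRunner()
help_result = runner.invoke(cli, ["--help"])
print(help_result.output)  # the help menu Click generated from the docstrings

# A path that does not exist is rejected before check() ever runs.
missing_result = runner.invoke(cli, ["check", "no-such-file.cfg"])
print(missing_result.exit_code)
```

In a real package you would skip the CliRunner lines and point a console_scripts entry point in setup.py at cli, so that the command is available from any directory.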

Cmd – This allows you to create a command line utility using a single class and a few methods. The cmd.Cmd class provides a shell, which takes in user input and then matches it with a set of commands (if they exist). Commands are specified with def do_<command name>(self, line): where line is the input string with the command name removed (parse this to get arguments). To make sure that the enter key doesn’t re-run the previous command, create a method def emptyline(self) that returns a falsy value such as 0 (a falsy return re-prompts the command line for a new command, a truthy one stops the loop). I played around with this command prompt as a front end to a network management utility and thought it was pretty effective (the cmd.Cmd loop only blocks while waiting for input, so the other services in the application can run alongside it in separate processes or threads). I recommend this if you need to get user input and use it within the context of a program.
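
A small sketch of such a shell. The NetShell class and its commands are invented for illustration and are not the utility I actually built.

```python
import cmd


class NetShell(cmd.Cmd):
    """Hypothetical shell for a network management utility."""

    prompt = "(net) "

    def __init__(self):
        super().__init__()
        self.hosts = []

    def do_add(self, line):
        """add <host>: remember a host name."""
        host = line.strip()
        if host:
            self.hosts.append(host)
            print(f"added {host}")

    def do_list(self, line):
        """list: show remembered hosts."""
        for host in self.hosts:
            print(host)

    def do_quit(self, line):
        """quit: leave the shell."""
        return True  # a truthy return value stops cmdloop()

    def emptyline(self):
        # Override the default, which would repeat the last command.
        return False


shell = NetShell()
# onecmd() runs a single line; shell.cmdloop() would start the interactive prompt.
shell.onecmd("add db-1.example.com")
stop_on_empty = shell.onecmd("")
stop_on_quit = shell.onecmd("quit")
print(shell.hosts, stop_on_empty, stop_on_quit)
```

The docstrings double as help text: typing help add at the prompt prints the docstring of do_add.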

Argparse – argparse (and optparse, not covered here) are other options that you can use. It works by declaring a set of rules for the arguments of a specific command; parsing then returns a namespace object whose attributes hold the values. The good part of argparse is that the argument handling is very flexible and you can add things like flags. I think overall it’s a bit harder to implement than the above two cases (but more flexible). I think this is used mostly with a single file.
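
A minimal sketch of those rules, with a made-up host argument and two optional settings (the names and defaults are placeholders):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical connection helper.")
    parser.add_argument("host", help="Host name to connect to.")
    parser.add_argument("--port", type=int, default=5432, help="Port number.")
    parser.add_argument("--verbose", action="store_true", help="Print extra detail.")
    return parser


# Passing a list keeps the example independent of the real command line;
# parse_args() with no arguments would read sys.argv[1:] instead.
args = build_parser().parse_args(["db-1.example.com", "--port", "3306", "--verbose"])
print(args.host, args.port, args.verbose)
```

The type=int rule converts the port for you, and store_true turns --verbose into a simple on/off flag.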

Sys/Os – The sys and os libraries are well worth getting to know. They provide a great way to interact with the operating system. From checking on files, directories to … doing a stat on a file to … One of the great uses of sys and os is the ability to manipulate stdin, arguments and stdout. I’ve used this to write Python scripts that accept piped results. Another interesting library to check out for this specifically is the subprocess module, which allows you to run commands in the background and provides file-like objects for stdin, stdout and stderr (with subprocess.PIPE allowing you to pipe results between subprocesses).
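
Here is a sketch of piping between two subprocesses. To keep it portable, both children are small Python one-liners standing in for shell commands; the second one reads sys.stdin exactly the way a script accepting piped results would.

```python
import subprocess
import sys

# Two child Python processes stand in for shell commands.
producer_code = "print('gamma'); print('alpha'); print('beta')"
sorter_code = "import sys; sys.stdout.writelines(sorted(sys.stdin))"

producer = subprocess.Popen(
    [sys.executable, "-c", producer_code],
    stdout=subprocess.PIPE,
)
sorter = subprocess.Popen(
    [sys.executable, "-c", sorter_code],
    stdin=producer.stdout,  # the producer's output becomes the sorter's input
    stdout=subprocess.PIPE,
    text=True,
)
producer.stdout.close()  # the sorter now owns the read end of the pipe
output, _ = sorter.communicate()
producer.wait()
print(output)
```

This is the programmatic equivalent of chaining two commands with | in a shell.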

The parallel processing part of my project was pretty cool. I worked mostly with the multiprocessing and threading libraries. Multiprocessing allows you to produce new processes via fork; threading allows for shared memory between threads.

multiprocessing – I really like this library. You can create a set of workers and provide them a function to do work in parallel. The join method (similar to wait in bash) blocks until they are all done and then continues. The overall usage is pretty easy to pick up: you create a multiprocessing Process, provide a target function and a set of arguments for the function (typically as a tuple or list). You then just call the start method on the process and it begins to run in the background. Other cool things about multiprocessing are the ability to set up queues, pipes (bi-directional communication) and a Manager that proxies shared dictionaries and lists (I didn’t get this to work, but see the docs). One thing I did run into is working around shared memory issues (my initial fault for not researching threading vs multiprocessing).
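
A small sketch of the start/join pattern with a queue carrying results back. The worker function and the numbers are placeholders, and the sketch assumes a Unix-like system because it asks for the fork start method explicitly.

```python
import multiprocessing as mp


def square(number, results):
    """Worker: compute a value and send it back through the queue."""
    results.put((number, number * number))


def run_workers(numbers):
    # Assumption: the "fork" start method, which is only available on Unix.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    workers = [ctx.Process(target=square, args=(n, results)) for n in numbers]
    for worker in workers:
        worker.start()  # each worker now runs in the background
    # Drain the queue before joining so no worker blocks on a full pipe.
    collected = dict(results.get(timeout=10) for _ in workers)
    for worker in workers:
        worker.join()  # wait until every worker has finished
    return collected


if __name__ == "__main__":
    print(run_workers([1, 2, 3]))
```

Each worker has its own copy of memory, so the queue (or a pipe) is how data gets back to the parent.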

threading – The commands are similar to multiprocessing (in terms of setting up), but it runs things in a thread instead of a process. You’ll see threading used a lot in libraries. TCPServer in the previous post (SocketServer library) uses it in its mixin (ThreadingMixIn).
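
For comparison, a sketch of the same start/join setup with threads. Here the workers really do share memory, so a lock guards the shared counter; the counter and the function name are just examples.

```python
import threading

counter = {"value": 0}
lock = threading.Lock()


def add_many(times):
    """Increment the shared counter, holding the lock for each update."""
    for _ in range(times):
        with lock:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(1000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter["value"])  # 4000: every thread updated the same dictionary
```

With separate processes the same code would leave the parent’s counter untouched, which is the shared memory difference mentioned above.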

Celery – I didn’t get into Celery as much as I’d like to, mostly due to not wanting to set up RabbitMQ or Redis for a tiny application (I used SQLite to keep the footprint small and the setup easy). It’s still a great tool to look more into, as it runs a queue (or set of queues) for you and allows you to execute things asynchronously (it can be used for messaging too). I will probably look more into this library and the associated products in the future.

The application I developed was a tool for managing database connections. It was split into 3 parts: a process that polls AWS for connection information, a database for storing that information (SQLite) and a process that managed SQL connections for me (through port forwarding). This was all controlled via a CLI based on the Cmd library. Messages were sent to the polling process and the SQL connection manager via queues (multiprocessing), with each running within a separate process (multiprocessing). Within the SQL connection manager, I created a TCPServer (SocketServer), which I ran in a different thread and added to a class to manage connections. The threading was done partially to isolate failures due to a computer shutting down or refusing a connection. This prevents the entire application from failing due to the actions of a single TCPServer. Overall, I’ve liked the experiment so far, but don’t intend to do much more with it. It was an experiment to test out a lot of these libraries and get a deeper understanding of things like ssh.
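
The server-in-a-thread piece can be sketched as follows. This is not the tool itself: a trivial handler that upper-cases one line stands in for the port forwarding logic, the names are hypothetical, and the module is spelled socketserver as in Python 3.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    """Stand-in handler: read one line and send it back upper-cased."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line.upper())


# Port 0 asks the operating system for any free port.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
server.daemon_threads = True
host, port = server.server_address

# serve_forever() blocks, so it gets its own thread.
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

with socket.create_connection((host, port), timeout=5) as client:
    client.sendall(b"ping\n")
    reply = client.makefile("rb").readline()

server.shutdown()      # stop serve_forever()
server.server_close()  # release the listening socket
thread.join()
print(reply)
```

Keeping each server in its own thread means one refused or dropped connection ends up as an exception in that thread rather than in the main program.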