CIS 419/519 Python Introduction

Python/Jupyter Installation Instructions

First, make sure you have python3 installed on your local machine (version 3.6 was the latest release when this guide was written). We recommend managing your python installation with Anaconda. The Anaconda website has instructions for how to install python based on your operating system:

Second, make sure that you have Jupyter notebooks installed. It may have already come with your Anaconda installation. If not, the Jupyter website has installation instructions:

If you have never used python or Jupyter notebooks before, we recommend bookmarking these resources:

Be aware that the Google tutorial uses python2, so some of their example code may not directly work in python3:

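  • Print statements now require parentheses: print x now becomes print(x)
  • Text files are decoded to Unicode strings automatically: open('file.txt', 'r') works and uses the platform's default encoding unless you pass an encoding argument
  • The division operator / no longer does integer division: 5 / 2 is 2.5 (use 5 // 2 to get 2)
  • The xrange function no longer exists; use range, which now generates its values lazily like xrange did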

Other differences can be found on the web: https://www.geeksforgeeks.org/important-differences-between-python-2-x-and-python-3-x-with-examples/
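
The differences listed above can be checked with a short python3 sketch. The variable names are only for illustration.

```python
# Python 3 behaviour for the differences listed above
x = 7
print(x)               # print is a function, so parentheses are required

true_div = 5 / 2       # True division always returns a float
floor_div = 5 // 2     # Floor division keeps the old integer behaviour
print(true_div, floor_div)

r = range(3)           # range is lazy, like python2's xrange
as_list = list(r)      # Convert to a list when you need all values at once
print(as_list)
```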

About Python

Python is a programming language that is very popular among the machine learning research community. It is a higher-level programming language than Java, and has an extensive collection of libraries that can be easily installed. We require that you use python for this class, so it is important to be familiar with it.

Python vs Java

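  • Java requires a separate compile step, but Python does not: you run the Python file directly
  • Java is statically typed (you declare the type of each variable and it cannot change). Python is dynamically typed: you do not declare types, and a variable can later hold a value of a different type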
    # This works in python
    x = 3
    x = "python"
  • Instantiating a class in Python does not use new
    # Assume there is a class called `Person`
    p = Person('Emily', 25)
  • Python does not use curly braces like Java. Instead, it uses indentation to define the scope of methods, for loops, etc.
    for i in range(10):
        print(i * 2)      # Indent by 4 spaces
        print(i * 4)
    print('Done')       # No indent, runs after for loop
  • Functions cannot be overloaded; use optional arguments instead
    def f(x, y=0.0):
        ...
    f(8)
    f(8, 2)
  • Boolean values are written True and False (capitalized), not true and false

Python vs Matlab

  • Python uses square brackets [] for indexing arrays and matrices instead of () in Matlab
  • Python is 0-indexed, so the first item in a list is at index 0, not 1
  • Python doesn't come by default with a full development tool, like Matlab
  • The numpy library in Python implements much of the same matrix functionality that Matlab has
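
As a rough illustration of the indexing differences above, here is how a few common Matlab indexing expressions translate to a plain Python list. The list `a` is a made-up example.

```python
a = [10, 20, 30, 40]

first = a[0]      # Matlab: a(1)
last = a[-1]      # Matlab: a(end)
middle = a[1:3]   # Matlab: a(2:3); the Python end index is exclusive

print(first, last, middle)
```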

Development Environments

Jupyter Notebooks

When you start a Jupyter Notebook, you are starting a server on your local machine that hosts the notebooks.

> jupyter notebook

Once the server has started, open up a web browser and go to http://localhost:8888. You should see a file tree, which you can use to navigate to where you want to save your Jupyter Notebook.

Jupyter Notebooks allow you to develop python code interactively. Each cell has a type, either markdown (https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) or python code. You can write as much code in a cell as you need. To execute a cell, press Shift+Enter. Anything your program displays is written below the cell. Variables defined in an executed cell stay in memory and can be used later (similar to the Matlab workspace), as if program execution were paused after each cell. Code in any cell can use the variables and functions defined in cells that have already been executed. If you edit a cell that you have already run, you must rerun it for the changes to take effect.

After you close and reopen a Jupyter Notebook, you need to rerun all of the cells from scratch so that the variables and methods are defined again. Make sure you save before you close the notebook!

Submitting assignments

Your Jupyter Notebook can be downloaded as a runnable python file under File -> Download -> Python .py. You should turn in this python file and not the .ipynb file.

Basics

In [1]:
# Assigning variables
x = 4.2
y = 8
s = 'hello'

# The types of variables can change without a problem
y = 'abc'

# To display variables, use the print function
print(y)

# The "null" type is called `None`
z = None
print(z)

print(z is None)  # Use `is` (not `==`) to compare with None

# Casting
s = '140'
print(int(s))
s = '140.243'
print(float(s))
x = 123
print(str(x))
abc
None
True
140
140.243
123
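
A small sketch of checking types and handling a failed cast, as a supplement to the casting examples above. It only uses built-in functions, and the values are arbitrary.

```python
x = 4.2
print(type(x))               # <class 'float'>
print(isinstance(x, float))  # True

# int() cannot directly parse a string that contains a decimal point
s = '140.243'
try:
    n = int(s)
except ValueError:
    n = int(float(s))   # Convert to float first, then truncate toward zero
print(n)  # 140

# Format numbers into a string without manual str() calls
msg = 'n = {}, x = {:.1f}'.format(n, x)
print(msg)  # n = 140, x = 4.2
```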

String Manipulation

A good reference resource: https://developers.google.com/edu/python/strings

In [2]:
s = "ABCdef"  # Strings can be written with double quotes
s = 'ABCdef'  # or with single quotes; the two forms are equivalent

# Length
print(len(s))  # 6

# Standard Indexing: Notice square brackets and 0-indexing
print(s[0])      # 'A'
print(s[1])      # 'B'
print(s[5])      # 'f'
# print(s[100])  # IndexError: string index out of range

# Range Indexing
print(s[1:4])    # 'BCd'
print(s[2:])     # 'Cdef'
print(s[:2])     # 'AB'
print(s[1:100])  # 'BCdef'

# Negative Indexing
print(s[-1])    # 'f'
print(s[-2])    # 'e'
print(s[:-1])   # 'ABCde'
print(s[:-3])   # 'ABC'
print(s[::-1])  # 'fedCBA'

# Concatenation
print(s + 'xyz')     # 'ABCdefxyz'
# print(s + 123)     # TypeError: a string and an int cannot be concatenated
print(s + str(123))  # 'ABCdef123'

# Splitting
print(s.split('C'))  # ['AB', 'def']
6
A
B
f
BCd
Cdef
AB
BCdef
f
e
ABCde
ABC
fedCBA
ABCdefxyz
ABCdef123
['AB', 'def']
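
Beyond indexing and concatenation, a few built-in string methods come up often when parsing text. The sketch below is illustrative only; the sample string is made up.

```python
s = '  Machine Learning  '

clean = s.strip()                        # 'Machine Learning'
print(clean.lower())                     # 'machine learning'
print(clean.replace('Machine', 'Deep'))  # 'Deep Learning'
print(clean.find('Learning'))            # 8 (index of the first match, -1 if absent)
print(clean.startswith('Mach'))          # True

words = clean.split()                    # Splits on whitespace: ['Machine', 'Learning']
joined = '-'.join(words)                 # 'Machine-Learning'
print(joined)
```

Strings are immutable, so each of these methods returns a new string and leaves the original unchanged.
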

Lists

In [3]:
a = [0, 4, 2, 8, 9]  # Create a list with initial items
b = []  # Create an empty list

# Length
print(len(a))  # 5

# The same indexing operations on strings work for lists
print(a[1])    # 4
print(a[2:4])  # [2, 8]
print(a[-2])   # 8

# Sort the list
print(sorted(a))                # [0, 2, 4, 8, 9] Creates a copy, does not modify the original list
print(sorted(a, reverse=True))  # [9, 8, 4, 2, 0] Also a copy

# Add items to the end of the list
a.append(10)  # Returns nothing, modifies the list in place
print(a)      # [0, 4, 2, 8, 9, 10]
a += [1, 12]  # Extends the list in place with every item of the other list
print(a)      # [0, 4, 2, 8, 9, 10, 1, 12]

# Items in a list don't have to be the same type in python
a.append('machine learning')
print(a)            # [0, 4, 2, 8, 9, 10, 1, 12, 'machine learning']
# print(sorted(a))  # TypeError: integers and strings cannot be compared

# Searching
print(8 in a)    # True
print(100 in a)  # False
5
4
[2, 8]
8
[0, 2, 4, 8, 9]
[9, 8, 4, 2, 0]
[0, 4, 2, 8, 9, 10]
[0, 4, 2, 8, 9, 10, 1, 12]
[0, 4, 2, 8, 9, 10, 1, 12, 'machine learning']
True
False
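
One point that often surprises Java and Matlab users is that assigning a list to a second variable does not copy it. The following sketch, with arbitrary values, shows aliasing versus copying along with a few more in-place list methods.

```python
a = [3, 1, 2]
b = a            # b is another name for the same list, not a copy
c = list(a)      # c is an independent (shallow) copy

a.sort()         # Sorts in place and returns None (unlike sorted)
print(a)         # [1, 2, 3]
print(b)         # [1, 2, 3]  b sees the change
print(c)         # [3, 1, 2]  c does not

a.insert(0, 9)   # Insert 9 at index 0 -> [9, 1, 2, 3]
last = a.pop()   # Remove and return the last item (3)
a.remove(1)      # Remove the first occurrence of the value 1 -> [9, 2]
print(a, last)   # [9, 2] 3
print(a.index(2))  # 1
```
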

Dictionaries

In [4]:
d = {}       # Create an empty dictionary, which is a key-value map
d['a'] = 97  # Add an item to the dictionary
print(d)     # {'a': 97}
print(d['a'])  # 97

d['c'] = 99
d[0] = [0, 1, 2]  # Keys and values don't need to be the same type
print(d)          # {'a': 97, 'c': 99, 0: [0, 1, 2]}

# Retrieve the keys and values of the dictionary.
# These are view objects, not lists. Call list(d.keys()) to convert one to a list
keys = d.keys()      # A view of the keys, in insertion order on Python 3.7+
values = d.values()  # A view of the values, in the same order as the keys
print(keys)          # dict_keys(['a', 'c', 0])
print(values)        # dict_values([97, 99, [0, 1, 2]])

# Check to see if items are in the dictionary (checks keys)
print('a' in d)  # True
print('x' in d)  # False

# Remove an item from the dictionary
del d['a']    # Returns nothing, modifies in place
# del d['y']  # KeyError: the key 'y' does not exist
{'a': 97}
97
{'a': 97, 'c': 99, 0: [0, 1, 2]}
dict_keys(['a', 'c', 0])
dict_values([97, 99, [0, 1, 2]])
True
False
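
When a key might be missing, the built-in `get` and `pop` methods avoid the KeyError that `d[key]` and `del d[key]` raise. A minimal sketch with made-up keys:

```python
d = {'a': 97, 'c': 99}

# get() returns a default instead of raising KeyError for a missing key
print(d.get('a'))        # 97
print(d.get('x'))        # None
print(d.get('x', -1))    # -1

# pop() removes a key and returns its value; the second argument avoids the KeyError
removed = d.pop('a')
missing = d.pop('y', None)
print(removed, missing)  # 97 None

# update() merges another dictionary into this one in place
d.update({'c': 100, 'e': 101})
print(d)                 # {'c': 100, 'e': 101}
```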

If Statements

Covered in the strings section: https://developers.google.com/edu/python/strings

In [5]:
# Python doesn't require parentheses around the condition
a, b = 7, 10

# equality
if a == 7:
    print('yes')
else:
    print('no')
    
# not equal
if a != 8:
    print('neq')
    
# and
if a > 5 and b <= 10:
    print('1')
elif a > 10:
    print('2')
else:
    print('3')

# or
if a < 10 or b > 100:
    print('yes')
    
# Strings also use '=='
s = 'abc'
if s == 'abc':
    print('here')
yes
neq
1
yes
here
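
A few other condition forms are worth knowing. The sketch below uses made-up variables and covers negation, chained comparisons, truthiness of empty containers, `None` checks, and the conditional expression.

```python
a = 7
items = []
z = None

# Chained comparison, equivalent to (5 < a) and (a <= 10)
in_range = 5 < a <= 10

# `not` negates a condition
if not a == 8:
    print('a is not 8')

# Empty containers, 0, '' and None are treated as False in a condition
if not items:
    print('items is empty')

# Use `is` / `is not` to test for None
if z is None:
    print('z is None')

# Conditional expression (like Java's ternary operator ?:)
label = 'big' if a > 5 else 'small'
print(in_range, label)   # True big
```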

For Loops

Covered under lists: https://developers.google.com/edu/python/lists

In [6]:
a = [0, 8, 3, 5, 1]

range(5)     # Dynamically generates 0, 1, 2, 3, 4
range(1, 5)  # 1, 2, 3, 4

# "Standard" for loop from Java
for i in range(len(a)):
    print(a[i])

# Iterate over each element in the list
for x in a:
    print(x)
    
# Iterating through dictionaries
d = {0: 'a', 1: 'b', 2: 'c'}
for key, value in d.items():
    print(str(key) + ' -> ' + value)
    
    
# Loop over items in a list and get the index at the same time
for index, item in enumerate(a):
    print(index, item)
0
8
3
5
1
0
8
3
5
1
0 -> a
1 -> b
2 -> c
0 0
1 8
2 3
3 5
4 1
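
The loops above can be complemented with `zip`, a stepped `range`, and a `while` loop. This is a minimal sketch; the lists and the stopping rule are invented for illustration.

```python
names = ['a', 'b', 'c']
scores = [90, 75, 82]

# zip pairs up items from two lists
pairs = []
for name, score in zip(names, scores):
    pairs.append((name, score))
print(pairs)   # [('a', 90), ('b', 75), ('c', 82)]

# range with a step: start, stop (exclusive), step
evens = list(range(0, 10, 2))   # [0, 2, 4, 6, 8]

# while loop with break and continue
total = 0
i = 0
while True:
    i += 1
    if i % 2 == 0:
        continue     # Skip even numbers
    if i > 7:
        break        # Stop once i passes 7
    total += i       # Adds 1, 3, 5, 7
print(evens, total)  # [0, 2, 4, 6, 8] 16
```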

Methods

In [7]:
# No return type specification necessary
def add(x, y):
    return x + y

def concat(list_1, list_2):
    list_1 += list_2

print(add(3, 4))        # 7
# print(add(4, 8, 10))  # Error

# Arguments can be passed by name
print(add(y=8, x=2))  # 10

a1 = [0, 1]
a2 = [3, 5]
concat(a1, a2)  # Returns nothing
print(a1)       # [0, 1, 3, 5]

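# Methods can return more than one value
def two(a):
    return a + 1, a + 2

# Unpack the returned values into separate variables
a3, a4 = two(1)
print(a3)  # 2
print(a4)  # 3

# The values are actually returned as a single tuple
t = two(1)
print(t[0])  # 2
print(t[1])  # 3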
7
10
[0, 1, 3, 5]
2
3
2
3
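
The `concat` example above works because a function receives a reference to the caller's list. The sketch below contrasts mutating an argument with rebinding it, and shows optional arguments; all three function names are hypothetical.

```python
def append_item(items, item):
    """Mutates the caller's list in place."""
    items.append(item)

def rebind(items):
    """Rebinding the local name does not affect the caller's list."""
    items = items + [99]   # Creates a new list bound only to the local name
    return items

def scale(x, factor=2, offset=0):
    """Optional arguments fall back to their defaults when omitted."""
    return x * factor + offset

a = [0, 1]
append_item(a, 2)
b = rebind(a)
print(a)   # [0, 1, 2]
print(b)   # [0, 1, 2, 99]

print(scale(3))             # 6
print(scale(3, offset=1))   # 7
print(scale(3, 10, 1))      # 31
```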

List and Dictionary Comprehension

In [8]:
# Sometimes code can be simplified using list/dictionary comprehensions
def add_one(x):
    return x + 1

# The standard way to apply a function to every element and save in y
x = [0, 1, 2, 3]
y = []
for i in range(len(x)):
    y.append(add_one(x[i]))

# This is equivalent to the above
y = [add_one(x_i) for x_i in x]

# Add conditionals
y2 = [add_one(x_i) for x_i in x if x_i > 1]

# Dictionary comprehensions can be used to create dictionaries easily.
# This creates a mapping from x_i -> x_i * 4
d = {x_i : x_i * 4 for x_i in x}
print(d)
print(d[2])  # 8
{0: 0, 1: 4, 2: 8, 3: 12}
8
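
Comprehensions also work with `zip`, with dictionaries as input, with sets, and with nested loops. A short sketch using invented data:

```python
names = ['ann', 'bob', 'cy']
ages = [31, 17, 25]

# Build a dictionary from two parallel lists
age_of = {name: age for name, age in zip(names, ages)}

# Filter a dictionary
adults = {name: age for name, age in age_of.items() if age >= 18}

# Set comprehension: the unique name lengths
lengths = {len(name) for name in names}

# Nested comprehension: flatten a list of lists
grid = [[1, 2], [3, 4]]
flat = [value for row in grid for value in row]

print(age_of, adults, lengths, flat)
```
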

File I/O

In [9]:
# Reading from a file
# In the same directory as the Jupyter notebook is a file called 'input.txt'.
# You can use this for loop template to read a file line by line. Generally, you will
# include some parsing logic to extract data from each line
lines = []
with open('input.txt', 'r') as f:   # This is the line which opens the file. `f` is the file handle
                                    # The `with open` will automatically close the file when you are done reading.
                                    # 'r' means you want to read from the file
    for line in f:
        line = line.strip()         # `strip()` removes any whitespace from the beginning and end of the string
                                    # By default, each line will have '\n' at the end.
        upper = line.upper()        # Here is where you would add your custom parsing logic
        lines.append(upper)

print(lines)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']

# Writing to a file
# Without `with`, you have to close the file yourself
out = open('out.txt', 'w')
out.write('hi')
out.close()

# This will write each of the capital letters to a line in an output file
with open('output.txt', 'w') as out:  # 'w' means you want to write to the file. This will overwrite the file if it exists
    for item in lines:
        out.write(str(item) + '\n')   # the `write` method only accepts strings and you have to
                                      # manually take care of the '\n' yourself
['A', 'B', 'C', 'D', 'E', 'F', 'G']
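
The cell above depends on an existing input.txt. For experimenting without one, here is a self-contained sketch that writes, appends to, and reads back a file in a temporary directory. The file name letters.txt is hypothetical.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory, so nothing in your
# working directory is touched
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'letters.txt')

with open(path, 'w') as out:
    for letter in ['a', 'b', 'c']:
        out.write(letter + '\n')

# 'a' appends to the end of the file instead of overwriting it
with open(path, 'a') as out:
    out.write('d\n')

# read() returns the whole file as one string
with open(path, 'r') as f:
    text = f.read()

lines = text.splitlines()   # Splits on line breaks and drops the '\n'
print(lines)  # ['a', 'b', 'c', 'd']
```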

Classes

In [10]:
class Person(object):
    # This is the constructor. All instance methods of a class
    # must accept `self` as the first parameter.
    # You use `self` to access data members and methods of the class,
    # similar to `this` in Java
    def __init__(self, name, age):
        # This will call the super class's constructor; Optional if you inherit from `object`
        super().__init__()
        
        # Assign data members
        self.name = name
        self.age = age
        
    def is_adult(self):
        return self.age >= 18
    
    def is_child(self):
        # Other instance methods are called through `self`.
        return not self.is_adult()
    
    def increment_age(self, amount):
        self.age += amount
        
    # Test to see if two `Person` objects are equal
    def __eq__(self, other):
        return self.name == other.name and self.age == other.age
In [11]:
p = Person('Bob', 19)

# Data members can be accessed directly
print(p.name)  # 'Bob'
print(p.age)   # 19

# Even though all of the methods require a `self` argument, you ignore that argument
# when calling the method
print(p.is_adult())  # True 

p.increment_age(1)  # Returns nothing
print(p.age)        # 20

p2 = Person('Mary', 30)
print(p == p2)  # False

p3 = Person('Mary', 30)
print(p2 == p3)  # True
Bob
19
True
20
False
True
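
Two features not shown above are inheritance and a string representation (the Python counterpart of Java's toString). The sketch below redefines a smaller `Person` so that it runs on its own; the `Student` subclass and its `school` field are hypothetical.

```python
class Person(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def is_adult(self):
        return self.age >= 18

    # Called by str() and print(), similar to Java's toString()
    def __str__(self):
        return 'Person(' + self.name + ', ' + str(self.age) + ')'


# `Student` is a hypothetical subclass used only for illustration
class Student(Person):
    def __init__(self, name, age, school):
        super().__init__(name, age)   # Initialize the `Person` part
        self.school = school

    # Override a method from the parent class
    def __str__(self):
        return 'Student(' + self.name + ', ' + self.school + ')'


s = Student('Ann', 20, 'Penn')
print(s)                       # Student(Ann, Penn)
print(s.is_adult())            # True, inherited from Person
print(isinstance(s, Person))   # True
```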

Exception Handling

In [8]:
# This is the equivalent of Java's try-catch block.
# `Exception` is the generic base class for nearly all errors; catch a more
# specific class such as `IndexError` when you can
a = [0, 1, 2]
try:
    a[5]
except IndexError as e:
    print('Caught error: ' + str(e))
    
# Multiple exceptions can be caught like this
try:
    a[5]
except IndexError as e:
    print('Index Error')
except Exception as e:
    print('Exception')
    
# Or like this
try:
    a[5]
except (IndexError, Exception) as e:
    print('Either one')

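# If you want to ignore exceptions, use `pass`. Prefer `except Exception:` over
# a bare `except:`, which also swallows KeyboardInterrupt (Ctrl+C)
try:
    a[5]
except Exception:
    pass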
Caught error: list index out of range
Index Error
Either one
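
Python also supports `else` and `finally` clauses on a try block, and `raise` for signalling your own errors. A minimal sketch, where `safe_divide` is a hypothetical helper:

```python
def safe_divide(x, y):
    # `safe_divide` is a hypothetical helper used only for illustration
    if not isinstance(y, (int, float)):
        raise TypeError('y must be a number')   # Raise your own exception
    try:
        result = x / y
    except ZeroDivisionError:
        result = None               # Runs only if the division failed
    else:
        result = round(result, 2)   # Runs only if no exception was raised
    finally:
        print('done dividing')      # Always runs, like Java's finally
    return result

print(safe_divide(1, 3))   # 0.33
print(safe_divide(1, 0))   # None

try:
    safe_divide(1, 'a')
except TypeError as e:
    message = str(e)
print(message)             # y must be a number
```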

Useful Packages

There are many packages that come with your python installation, as well as many that can be downloaded easily for free. To use a package, import it with the import statement. Most packages can be installed with the pip command, which is included with the python installation and is run on the command line, not within a python environment.

> pip install numpy
> pip install scikit-learn

os

The os package is useful for interacting with your file system by searching through directories or performing operations on file paths.

In [3]:
# To import a package in python, use the import keyword
import os  # Now `os` is defined


# The `help()` function displays all of the methods and variables that you can
# use in a package. This also works with variables.
# help(os)  # Commented out so it does not display in the notebook

os.listdir()  # Lists the files in the current directory. Accepts an argument to
              # list the files in a specific directory
  
print(os.path.dirname('/path/to/file.txt'))   # '/path/to'
print(os.path.basename('/path/to/file.txt'))  #  'file.txt'
/path/to
file.txt
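
A few more `os` and `os.path` functions that are handy for data files are sketched below. The path components are made up, and the printed results for `exists` and `listdir` depend on your own directory.

```python
import os

path = os.path.join('path', 'to', 'file.txt')   # Joins with the separator for your OS
print(path)

root, ext = os.path.splitext('file.txt')   # ('file', '.txt')
print(root, ext)

print(os.path.exists(path))          # False unless that file really exists
print(os.path.isdir(os.getcwd()))    # True: the current working directory

# List only the entries of the current directory that end in '.txt'
txt_files = [name for name in os.listdir('.') if name.endswith('.txt')]
print(txt_files)
```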

collections

The collections package has many useful data structures that simplify code.

In [13]:
# To import a specific subpackage or class from a package, use `from ... import`
from collections import defaultdict, Counter

# The defaultdict is a wrapper around a dictionary that returns a default value
# if the item queried does not exist. When you instantiate it, you have to tell
# what type the default item should be
d = defaultdict(int)
print(d[0])      # 0, This would raise a KeyError with a normal dictionary
d[1] = 7
print(d[1])      # Otherwise it behaves like a normal dictionary
print('a' in d)  # Still False even though d['a'] would return 0
0
7
False
In [14]:
# The `Counter` class is helpful for keeping count of items. It is another wrapper
# around the dictionary class
c = Counter()

c['a'] += 5  # This would raise a KeyError in a normal dictionary because the key 'a' does not exist
print(c['a'])
# print(c['b'])  # Would print 0: a Counter returns 0 for missing keys instead of raising an error

# You can initialize it with a list of items to count
c = Counter(['a', 'b', 'c', 'c', 'c'])
print(c['a'])  # 1
print(c['b'])  # 1
print(c['c'])  # 3
5
1
1
3
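
Two common patterns with these classes are finding the most frequent items and grouping items by a key. A small sketch with an invented word list:

```python
from collections import Counter, defaultdict

words = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'cat']

counts = Counter(words)
print(counts.most_common(2))   # [('the', 3), ('cat', 2)]
print(counts['bird'])          # 0, missing keys count as zero

# Group words by their first letter; missing keys start as an empty list
by_letter = defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)
print(dict(by_letter))
```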

random

The random package provides tools for generating random numbers.

In [15]:
import random

random.seed(4)

print(random.randint(1, 5))  # A random integer in [1, 5], both endpoints included

print(random.random())  # A random float in [0.0, 1.0)

# Randomly shuffle a list
a = [0, 1, 2, 3, 4, 5]
random.shuffle(a)  # Shuffles in place - does not return a copy
print(a)
2
0.8811128320740301
[2, 3, 0, 1, 5, 4]
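
Seeding matters when you want experiments to be repeatable. The sketch below shows that re-seeding reproduces the same sequence, and adds a few other functions from the `random` package; the list of colors is made up.

```python
import random

random.seed(0)
first = [random.random() for _ in range(3)]
random.seed(0)                       # Re-seeding restarts the same sequence
second = [random.random() for _ in range(3)]
print(first == second)               # True

colors = ['red', 'green', 'blue', 'black']
pick = random.choice(colors)         # One random item
subset = random.sample(colors, 2)    # Two distinct items, original list unchanged
print(pick, subset)

u = random.uniform(2.0, 5.0)         # A random float between 2.0 and 5.0
print(u)
```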

numpy

numpy is a very useful matrix library for Python. It is used very frequently in machine learning, so it is good to be familiar with it. If your program crashes with an import error when you import numpy, you need to install it with

> pip install numpy
In [16]:
# Imports can be aliased to simplify code
import numpy as np  # The numpy library is now referenced with np

# Create a vector of length 5 of all 0s
v = np.zeros(5)
print(v)

print(v.shape)     # Returns a tuple of the dimensions: (5,)
print(v.shape[0])  # Returns the size of the 0th dimension, 5

# You can assign specific entries by their index into the vector
v[0] = 1
v[1] = 4

# You can assign ranges of values
v[2:5] = [5, 3, 6]
print(v)

# Create a random vector of values between 0 and 1
np.random.seed(1)
v1 = np.random.rand(5)
print(v1)

# Compute the dot product
print(np.dot(v, v1))

# Element-wise multiplication
print(v * v1)
print(np.multiply(v, v1))  # equivalent

# Multiply the whole vector by a scalar, add a constant
print(v * 5)
print(v + 3)
[0. 0. 0. 0. 0.]
(5,)
5
[1. 4. 5. 3. 6.]
[0.7460098  0.22458445 0.82822782 0.8972963  0.86152284]
13.646512610540363
[0.7460098  0.8983378  4.14113908 2.6918889  5.16913702]
[0.7460098  0.8983378  4.14113908 2.6918889  5.16913702]
[ 5. 20. 25. 15. 30.]
[4. 7. 8. 6. 9.]
In [17]:
# Matrices work much the same way (vectors are just matrices with 1 dimension)
# `np.ones` creates a matrix of all 1s with the given shape
t = (3, 4)
X = np.ones(t)
print(X)
print(X.shape)     # (3, 4)
print(X.shape[0])  # 3
print(X.shape[1])  # 4

# Assigning specific entries requires 2 indices
# The first is the row index, the second is the column index
X[1, 3] = 8
print(X)

# The `:` selects the entire row or column
row1 = X[1, :]
print(row1)
print(X[:, 2])

# It can also be used to assign values
X[1, :] = [1, 2, 3, 4]
print(X)

# Matrix multiplication
Y = np.random.rand(4, 5)
print(np.matmul(X, Y))  # A matrix of size (3, 5)

# Transpose
print(X.transpose())
print(X.T)  # equivalent

# You can create a matrix from a list (of lists)
Z = np.asarray(
    [
        [0, 1, 2], 
        [3, 4, 5]
    ])
print(Z)
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
(3, 4)
3
4
[[1. 1. 1. 1.]
 [1. 1. 1. 8.]
 [1. 1. 1. 1.]]
[1. 1. 1. 8.]
[1. 1. 1.]
[[1. 1. 1. 1.]
 [1. 2. 3. 4.]
 [1. 1. 1. 1.]]
[[1.9968484  1.29694475 3.50491367 2.12737329 2.64945141]
 [4.24310791 2.12355841 8.43376609 5.6311847  6.23211365]
 [1.9968484  1.29694475 3.50491367 2.12737329 2.64945141]]
[[1. 1. 1.]
 [1. 2. 1.]
 [1. 3. 1.]
 [1. 4. 1.]]
[[1. 1. 1.]
 [1. 2. 1.]
 [1. 3. 1.]
 [1. 4. 1.]]
[[0 1 2]
 [3 4 5]]
In [18]:
# Useful ways to initialize matrices
A = np.zeros((5, 3))  # A 5x3 matrix of all 0s
print(A)

A = np.ones((5, 3))  # A 5x3 matrix of all 1s
print(A)

A = np.eye(5)  # The identity matrix of size 5x5
print(A)

A = np.random.rand(5, 3)  # A random matrix of size 5x3 with numbers between 0 and 1
print(A)

A = np.random.randint(1, 4, (5, 3))  # A random 5x3 matrix of integers in [1, 4), i.e. 1, 2, or 3
print(A)
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
[[0.1521557  0.13008678 0.12585539]
 [0.75692104 0.68909671 0.7047985 ]
 [0.14267653 0.2329341  0.53017252]
 [0.52697286 0.53669603 0.42658442]
 [0.6704286  0.60622843 0.74766516]]
[[2 1 2]
 [2 1 3]
 [1 2 1]
 [1 1 2]
 [2 2 3]]
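
Operations along rows or columns, and selecting entries by a condition, come up constantly in machine learning code. The sketch below assumes numpy is installed and uses a small made-up matrix.

```python
import numpy as np

A = np.asarray([[1, 2, 3],
                [4, 5, 6]])

print(A.sum())          # 21, sum of all entries
print(A.sum(axis=0))    # [5 7 9], one sum per column
print(A.sum(axis=1))    # [ 6 15], one sum per row
print(A.mean(axis=0))   # [2.5 3.5 4.5]

# Boolean mask: compare every entry, then select the entries that are True
mask = A > 3
print(A[mask])          # [4 5 6]

# reshape changes the dimensions without changing the data
B = A.reshape(3, 2)
print(B.shape)          # (3, 2)

# argmax returns the index of the largest entry (per row with axis=1)
print(A.argmax(axis=1)) # [2 2]
```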

sklearn

sklearn is a useful machine learning library. It can be installed with

> pip install scikit-learn
In [19]:
# An example learning problem. You may use this as a template for homework
from sklearn.linear_model import SGDClassifier

# This is the feature matrix. There are 7 rows (7 training examples) with
# 4 columns (4 features).
X = np.asarray(
[
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1]
])

# These are the binary class labels for all 7 instances
y = np.asarray([0, 0, 1, 1, 0, 0, 0])

# Create the classifier
clf = SGDClassifier(loss="hinge", penalty="l2")

# Train the classifier on our labeled data
clf.fit(X, y)

# Use our trained classifier to predict labels for new input data
X_test = np.asarray(
[
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1]
])
y_pred = clf.predict(X_test)

# Let's assume we know the actual labels of `X_test` are this
y_test = np.asarray([0, 1, 1])

# We can compute the total number of correct classifications:
num_correct = sum(y_pred == y_test)
total = y_pred.shape[0]
print('Accuracy: ' + str(num_correct / total * 100))
Accuracy: 33.33333333333333

Conclusion

Google and Stack Overflow will be your best friends for writing python code. Searching for simple things like "python write string to file" or "python equivalent of java toString" works incredibly well.

If you have specific questions or need help troubleshooting your python installation, please come to the recitations this week or to any TA's office hours.