Algorithms and Data Structures

Algorithms and data structures' implementations in Java from the Algorithms 4th edition 📖

What is this course?

Intermediate level survey course
Programming and problem solving with applications

Definitions

Algorithms: Method for solving a problem
Data structures: Method to store information
Program = Algorithms + Data structures

Topics

Part I

Data types: stack, queue, bag, union find, priority queue
Sorting: quicksort, mergesort, heapsort, radix sorts
Searching: BST, red-black BST, hash table

Part II

Graphs: BFS, DFS, Prime, Kruskal, Dijkstra
Strings: KMP, regular expression, TST, Huffman, LZW
Advanced: B-tee, suffix array, maxflow

Why Algorithms is so important?

Algorithms all around us

Internet: web search, packet routing, distribute file sharing, ...
Biology: human genome project, protein folding, ...
Computers: circuit layout, file system, compilers, ...
Computer graphics: Movies, video games, virtual reality, ...
Security: cell phones, e-commerce, voting machines, ...
Multimedia: MP3, JPG, Divx, HDTV, face recognition, ...
Social networks: recommendations, news feeds, advertisements, ...
Physics: N-body simulation, particle collision simulation, ...

Steps for solving the problem

Model the problem
Find an algorithm to solve it
Fast enough? Fits in memory?
If not, figure out why
Find a way to address the problem
Iterate until satisfied

Algorithm Analyze

Reasons to Analyze Algorithms

Predict performance
Compare algorithms
Provide guarantees
Understand theoretical basis

The scientific method

Observe some feature in the natural world
Hypothesize a model that is consistent with the observations
Predict events using the hypothesis
Verify the predictions by making further observations
Validate by repeating until the hypothesis and observations agree

Principles

Experiments must be reproducible.
Hypotheses must be falsifiable.

Empirical Analysis

Manual measurement (benchmarking) with a stopwatch or programmatic timing method.
Measure running time for different input sizes N (e. g. doubling time) and observe the relationship between the running times.

Data Analysis

Plot running time T(N) vs input size N.
Plot as log-log plot, often get a straight line. lg(T(N)) vs lg(N). Plot tells you the exponent of N.
Regression, power law: a x N^b
Once you have the power b from the slope of the log-log plot, solve for a in the equation T(N) = a x N^b

Doubling Hypothesis

Run program, doubling the size of the input and observe ratios. Observe to what it converges, do not take the average!

N	T(N)	Ratio	lg(Ratio)
1000	0. 1	-	-
2000	0. 8	7. 7	2. 9
4000	6. 4	8	3
...	...	...	...

Hypothesis: Running time is about a x N^b, where b = lg(Ratio)
Caveat: Cannot identify logarithmic factors with the doubling hypothesis.
Calculate a by solving T(N) = a x N^b for a with all other variables now known.

Experimental algorithmic

System independent effects (determines constant a and exponent b in power law)
- Algorithm
- Input data
System dependent effects (contribute only to constant a in power law)
- Hardware: CPU, memory, cache
- Software compiler, interpreter, garbage collector
- System: OS, network, other applications

Mathematical Models

Analyze individual operations to determine complexity
Simplification 1

Count only the most expensive ones, i. e. those that take the most time or where time x frequency is highest.
Simplification 2

Ignore lower order terms, e. g. in 5xN³ + 20N + 16, ignore the term with N and the constant 16 (which is 16 x N⁰) because they are less significant in comparison with the highest order term. We use tilde notation ~ to say that 5 x N³ + 20 x N + 16 __~ 5 x N³__. Technical definition is that for f(N) ~ G(N) when N goes towards infinity, the lower order terms become so insignificant that f(N)/g(N) = 1

Order-of-Growth Classifications

A great number of algorithms (most) are described by the following order of growth functions
- 1 (constant)
- log N (logarithmic)
- N (linear)
- N log N (linearithmic)
- N² (quadratic)
- N³ (cubic)
- 2^N (exponential)

Note: lgN means log₂N

We say the algorithm "is proportional to" e. g. constant time

Properties

Reflexive: p is connected to q
Symmetric: if p is connected to q, then q is connected to p
Transitive: if p is connected to q and q is connected to r, then p is connected to r

Dynamic connectivity

Applications based on this:

Pixels in a digital photo
Computers in network
Friends in a social network
Transistors in a computer chip
Elements in a mathematical set
Variables names in Fortran program
Metallic sites in a composite system

Quick find (Eager approach)

Data structure
Integer array id[] of size N
Interpretation: p and q are connected (iff) if and only if they have the same id
Complexity

Initialize	Union	Find
N	N	1

Defect
- Find too expensive (Could be N array accesses)
- If you have N union commands over N objects will be O(N²) quadratic
Note: We can not accept Quadratic in big problems Quadratic algorithms do not scale

Rough standards (for now)

10⁹ operations per second
10⁹ words of main memory
Touch all words in approximately 1 second

E. g Huge problem for quick find

10⁹ union commands of 10⁹ objects
Quick find takes more than 10¹⁸ operations
30+ years of computer time!

Quick Union

Set the first element based on the root of the second element
Complexity

Initialize	Union	Find
N	N'	1

Defect
- Tree can get tall
- Find too expensive (Could be N array accesses)

Quick Union improvement

Weighted Quick union

Modify the Quick union to avoid tall trees
Balance by linking root of smaller tree to root of larger tree
Depth of any node x is at most lg N
Complexity

Initialize	Union	Find
N	lg N'	lg N

Quick union with path compression N + M lg N
Weighted Quick union with path compression N + M lg * N

Note: WQUPC reduce time from 30 years to 6 seconds

Union find Applications

Percolation
Games (Go, Hex)
Dynamic connectivity
Last common ancestor
Hoshen-kopelman algorithm in physics

Percolation

A model for many physical systems

N-by-N grid of sites
Each site is open probability p (or blocked with probability 1-p)
System percolates iff top and bottom are connected by open sites
Applications in real life

Model	System	Vacant site	Occupied site	Percolates
electricity	material	conductor	insulated	conducts
fluid flow	material	empty	blocked	porous
social interaction	population	person	empty	communicates

Data structures Design

Good practice to make an abstraction between the outside world and internal implementation, In java we will use interface
- Benefits
  - Client can't know details of implementation
  - Implementation can't know details of client needs
  - Design: creates modular, reusable libraries
  - Performance: use optimized implementation where it matters
Client: program using operations defined in interface
Implementation: actual code implementing operations
Interface: description of data type, basic operations

Stack

LIFO (last in first out), useful in many applications
Operations: push(Item item), pop(), size(), isEmpty()
There are two implementation of stack using Linkedlist and Array
Stack removes the item most recently added
What are the Differences between LinkedList and Array implementation?
- Linkedlist: Use extra space for dealing with links
- Array: resize/shrink the array takes some time
When should I use Linkedlist or Array implementation?
- if time is important and don't want to lose any input i. e. dealing with internet packet use Linkedlist implementation, But if you take care of memory space use Array implementation
How duplicate/shrinking array?
- Duplicate When reach 100% full the array resize(arr.length * 2)
- Shrink when reach one quarter full to the half resize(arr.length / 2)
Stack applications
- Parsing in a compiler
- Java virtual machine
- Undo in word processor
- Back button in a web browser
- Implementation function calls in a compiler
- Arithmetic expression evaluation
- Reverse objects
LinkedList implementation code

public class LinkedStack<Item> implements Stack<Item> {

    private class NodeList {
        Item item;
        NodeList next;

        public NodeList(Item item) {
            this.item = item;
        }
    }

    private NodeList first = null;
    private int size = 0;

    public void push(Item item) {
        if (first == null) {
            first = new NodeList(item);
            first.next = null;
        } else {
            NodeList oldFirst = first;
            first = new NodeList(item);
            first.next = oldFirst;
        }
        size++;
    }

    public Item pop() {
        if (isEmpty()) {
            throw new NoSuchElementException("stack underflow");
        }
        Item item = first.item;
        first = first.next;
        size--;
        return item;
    }
}

Queue

FIFO (first in first out), useful in many applications
Operations: enqueue(Item item), dequeue(), size(), isEmpty()
Queue removes the item lest recently
There are two implementation of stack using Linkedlist and Array
Queue applications
- CPU scheduling
- Disk scheduling
- Data transfer asynchronously between two processes. Queue is used for synchronization.
- Breadth First search in a Graph
- Call Center phone systems
LinkedList implementation code

public class LinkedQueue<Item> implements Queue<Item> {

    private class NodeList {
        Item item;
        NodeList next;

        public NodeList(Item item) {
            this.item = item;
        }
    }

    private NodeList first;
    private NodeList last;
    private int size;

    public LinkedQueue() {
        first = null;
        last = null;
        size = 0;
    }

    public void enqueue(Item item) {
        NodeList oldLast = last;
        last = new NodeList(item);
        last.next = null;
        if (isEmpty()) {
            first = last;
        } else {
            oldLast.next = last;
        }
        size++;
    }

    public Item dequeue() {
        if (isEmpty()) {
            throw new NoSuchElementException("Queue underflow");
        }
        Item item = first.item;
        first = first.next;
        if (isEmpty()) {
            last = null;
        }
        size--;
        return item;
    }

    public Item peek() {
        if (isEmpty()) {
            throw new NoSuchElementException("Queue underflow");
        }
        return first.item;
    }
}

Elementary sorts

Rearrange array of N times into ascending/descending order based on a key
- Selection sort
- Insertion sort
- Shell sort
- Heap sort
- Quick sort
Implementation in java there are Comparable and Comparator interfaces we will them any of them in the implementation of the sort algorithms

to allow sort any generic data types

There are three return values: 1, 0, -1 and throw Exception if incompatible types or null
- V less than W (return -1)
- V equal to W (return 0)
- V greater than W (return 1)
Total order
- Antisymmetric : if v<=w and w<=v, then v=w
- Transitivity : if v<=w and w<=x, then v<=x
- Totality : either v<=w or w<=v or both

Selection sort

Scan from left to right
Find the index of min of smallest remaining entry, then swap a[i] and a[min] -→ Time Complexity O(N<sup>2</sup>) and doesn't sensitive if the input is sorted

Code

    public static <Item extends Comparable<Item>> void sort(Item[] arr) {
        int N = arr.length; 
        int min; 
        for (int i = 0; i < N; i++) {
            min = i; 
            for (int j = i + 1; j < N; j++) {
                if (less(arr[j], arr[min])) {
                    min = j; 
                }
            }
            swap(arr, i, min); 
        }
    }

Insertion sort

Scan from left to right
Swap a[i] with each larger entry to its left ← Time Complexity O(N<sup>2</sup>) and has good performance over partially sorted arrays
Fast when the array is partially sorted O(N)
Array called partially sorted when number of elements to be changed less than or equal cN

Code

    public static <Item extends Comparable<Item>> void sort(Item[] arr) {
        int N = arr.length; 
        for (int i = 1; i < N; i++) {
            for (int j = i; j > 0 && less(arr[j], arr[j - 1]); j--) {
                    swap(arr, j, j - 1); 
            }
        }
    }

Shell sort

Move entries more than one position at a time by h-sorting the array What's the h value Knuth says 3x+1
Complexity N^3/2

Code

    public static <Item extends Comparable<Item>> void sort(Item[] arr) {
        int N = arr.length; 
        int h = 1; 
        while (h < N / 3)
            h = 3 * h + 1; 

        while (h >= 1) {
            for (int i = h; i < N; i++) {
                for (int j = i; j >= h && less(arr[j], arr[j - h]); j -= h) {
                    swap(arr, j, j - h); 
                }
            }
            h /= 3; 
        }
    }

Why Shell sort uses insertion sort internally?
1. Fast unless array size is huge
2. Tiny used in some embedded systems
3. Hardware sort prototype

Shuffle sort

Generate a random real number for each array entry
Sort array
Knuth shuffle
- Pick integer r between 0 and i uniformly at random
- Swap a[i] and a[r]
- Complexity: O(N)

Code

    public static <Item extends Comparable<Item>> void shuffle(Item[] arr) {
        int N = arr.length; 
        Random random = new Random(); 
        for (int i = 0; i < N; i++) {
            int r = random.nextInt(i + 1); 
            swap(arr, i, r); 
        }
    }

Applications in sorting

Convex hull of a set of N points
- Is the smallest perimeter fence enclosing the points
- Equivalent definitions:
  1. Smallest convex set containing all the points
  2. Smallest area convex polygon enclosing the points
  3. Convex polygon enclosing the points, whose vertices are points in set
- Convex hull output. Sequence of vertices in counterclockwise order
- Mechanical algorithm. Hammer nails perpendicular to plane, search elastic rubber band around points
- Convex hull application
  1. Robot motion planning. Find shortest path in the plan from s to t that avoids a polygonal obstacle
    - Fact. Shortest path is either straight line from s to t or it is one of two polygonai chains of convex hull
  2. Farthest pair problem. Given N points in the plane, find a pair, find a pair of points with the largest Euclidean distance between them
    - Fact. Farthest pair of points are extreme points on convex hull
- Convex hull: geometric properties
  - Fact. Can traverse the convex hull by making only counterclockwise turns
  - Fact. The vertices of convex hull appear in interesting order of polar angle with respect to point p with lowest y-coordinate
- Graham scan, based on above facts |
  - Choose point p with smallest y-coordinate
  - Sort points by polar angle with p
  - Consider points in order, discard unless if create a ccw turn
  - Q. How to find point p with smallest y-coordinate?
    
    A. Define a total order, comparing by y-coordinate
  - Q. How to sort points by polar angle with respect to p?
    
    A. Define a total order for each point p
  - Q. How to determine where p1 → p2 → p3 is counterclockwise turn?
    
    A. Computational geometry
  - Q. How to sort efficiently?
    
    A. Merge sort in N lg N
- Implement CCW
  - CCW. Given three points a, b and c, is a→b→c a counterclockwise turn?
    
    A. Determinant (or cross product) gives 2x signed area of planer triangle
```
  01. if signed area > 0, then a&#8594;b&#8594;c is counterclockwise
  02. if signed area < 0, then a&#8594;b&#8594;c is clockwise
  03. if signed area = 0, then a&#8594;b&#8594;c are colliner
```
  - Running time N lg N for sorting and linear for rest

Merge sort

This sort based on the technique of divide-and-conquer
Java sort for objects
Steps:
- Divide array into two halves
- Recursively sort each half
- Merge two halves
Complexity N lg N

Code

    public static <Item extends Comparable<Item>> void sort(Item[] arr, Item[] aux, int lo, int hi) {
        if (hi <= lo)
            return; 
        int mid = lo + (hi - lo) / 2; 
        sort(arr, aux, lo, mid); 
        sort(arr, aux, mid + 1, hi); 
        merge(arr, aux, lo, mid, hi); 
    }

    private static <Item extends Comparable<Item>> void merge(Item[] arr, Item[] aux, int lo, int mid, int hi) {

        for (int k = lo; k <= hi; k++)
            aux[k] = arr[k];
        
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid)
                arr[k] = aux[j++]; 
            else if (j > hi)
                arr[k] = aux[i++]; 
            else if (less(aux[j], aux[i]))
                arr[k] = aux[j++]; 
            else
                arr[k] = aux[i++]; 
        }
    }

Mergesort improvement
- Stop if array is already sorted: !less(arr[mid+1], arr[mid])
- Cutoff to insertion sort = 7
- Eliminate-the-copy-to-the-auxiliary-array trick
First draft of a Report on the EDVAC by John von Neuman
Compare Running time between Insertionsort and Mergesort
- Laptop executes 10⁸ compares/second
- Supercomputer executes 10¹² compares/second
- In this below table will compare between normal computer in first row and supercomputer in second row
  
  Insertionsort N²

Million	Billion
2. 8 hours	317 years
1 second	1 week

Mergesort N lg N

Million	Billion
1 second	18 min
instant	instant

Note: Good algorithm are better than supercomputers

Bottom-up version of Mergesort

Basic plan
1. Pass through array, merging subarrays of size 1
2. Repeat for subarrays of size 2, 4, 8, 16, ...
3. Slower than Recursive by 10%

Code

    public static <Item extends Comparable<Item>> void sortBottomUp(Item[] arr) {
        int N = arr.length; 
        Item[] aux = (Item[]) new Comparable[N];

        for (int sz = 1; sz < N; sz = sz + sz)
            for (int lo = 0; lo <= N - sz; lo += sz + sz)
                merge(arr, aux, lo, lo + sz - 1, Math.min(lo + sz + sz - 1, N - 1)); 
    }

Sort Stability

Suppose you want to sort BY_NAME then BY_SECTION
You should sort and keep the equal elements that they came as input, don't change equal elements position
Which sorts are stable?
- Insertionsort
- Mergesort
Why Selectionsort and Shellsort not stable?
- Selectionsort keeps keep pointer from past and might move an item some equal item
  - Shellsort makes long distance exchanges

Quick sort

One of the most important algorithm in 20^th century
Java sort for primitive types
Basic plan
1. Shuffle the array
2. Partition
3. Sort
Complexity N lg N
Faster than Mergesort
In place algorithm
Not stable
Worst case in quicksort will not gonna happen
Problems in quick sort

Code

    public static <Item extends Comparable<Item>> void sort(Item[] arr) {
        KnuthShuffleSort.shuffle(arr); 
        sort(arr, 0, arr.length - 1); 
    }

    private static <Item extends Comparable<Item>> void sort(Item[] arr, int lo, int hi) {
        if (hi <= lo)
            return; 
        if (hi <= lo + CUTOFF) {
            InsertionSort.sort(arr, lo, hi); 
            return; 
        }
        int j = partition(arr, lo, hi); 
        sort(arr, lo, j - 1); 
        sort(arr, j + 1, hi); 
    }

    private static <Item extends Comparable<Item>> int partition(Item[] arr, int lo, int hi) {
        int i = lo, j = hi + 1; 
        while (true) {
            while (less(arr[++i], arr[lo]))
                if (i == hi)
                    break; 
            while (less(arr[lo], arr[--j]))
                ; 
            if (i >= j)
                break; 
            swap(arr, i, j); 
        }
        swap(arr, lo, j); 
        return j; 
    }

Improvement
1. Insertion sort for small subarrays
  - Cut OFF ~ 10 items
2. Median of sample
  - Best choice of pivot item = median
  - Median-of-3 random items

Selection

Goal. Given an array of N items, find the k^th largest
- Min(k=0), max(k=N-1), median(k=N/2)
Applications
- Order statistics
- Find the top k
Use theory as a guide
- Easy N lg N upper bound. How? By sorting the array and loop util reach k^th
- Easy N upper bound for k = 1,2,3. How?
- Easy N lower bound. Why?
Which is true?
- N lg N lower bound? ← Is selection as hard as sorting?
- N upper bound? ← Is there a linear algorithm for each k?
Quick-select
- Version of Quick-sort
- Entry a[j] is in
- No larger entry to the left of j
- No smaller entry to the right of j
- Repeat in one sub-array, depending on j; finished when j equals k
- Analysis: Linear time on average
Remark. Quick-select uses ~ 1/2N² compares in the worst case, but (as with quicksort) the random shuffle provides a probabilistic guarantee
- Algorithm

    public static <Item extends Comparable<Item>> Comparable<Item> select(Item[] arr, int k) {
        KnuthShuffleSort.shuffle(arr);
        int lo = 0, hi = arr.length - 1;
        while (hi > lo) {
            int j = partition(arr, lo, hi);
            if (j < k)
                lo = j + 1;
            else if (j > k)
                hi = j - 1;
            else
                return arr[k];
        }
        return arr[k];
    }

Remark. But, constants are too high => not used in practice

Use theory as a guide
- Still in worthwhile to seek practical linear time (worst-case) algorithm
- Until one is discovered, use quick-select if you don't need a full sort

Duplicate keys

Often, purpose of sort is to bring items with equal keys together
- Sort population by age
- Find collinear points
- Remove duplicates from mailing list
- Sort job applicants by college attended
Typical characteristics of such applications
- Huge array
- Small number of key values
Mergesort with duplicate keys. always between 1/2 N lg N and N lg N compares
Quicksort with duplicate keys.
- Algorithms goes quadratic unless partition stop on equal keys
- 1990s C user found this defect in qsort()
- Mistake. Put all items equals to the partitioning item in one side
  - Consequence. ~1/2N² compares when all keys equal
    
    B A A B A B B B C C C A A A A A A A A A A A
- Recommended. Stp scan on item equals to the partitioning item
  - Consequence. ~ N lg N compares when all keys equal
    
    B A A B A B B B C C C A A A A A A A A A A A
- Desirable. Put all items equal to the partitioning item in place
```
  A A A `B B B B B` C C C `A A A A A A A A A A A`
```
3-way partitioning
- Goal. Partition array into 3 parts so that:
  1. Entries between lt and gt equal to partition item v
  2. No larger entries to left of lt
  3. No smaller entries to right of gt
- Dutch national flag problem. [Edsger Dijkstra]
  - Conventional wisdom until mid 1990s: not worth doing
  - New approach discovered when fixing mistake in C library qsort()
  - Now incorporated into qsort() and Java system sort
- Steps
  1. Let v be partitioning item a[lo]
  2. Scan i from left to right
    
    . (a[i] < v): exchange a[lt] with a[i]; increment both it and i . (a[i] > v): exchange a[gt] with a[i]; decrement gt . (a[i] == v): increment i

Code

    private static <Item extends Comparable<Item>> void sort(Item[] arr, int lo, int hi) {
        if (hi <= lo)
            return;
        int lt = lo, gt = hi;
        Item v = arr[lo];
        int i = lo;
        while (i <= gt) {
            int cmp = arr[i].compareTo(v);
            if (cmp < 0)
                swap(arr, lt++, i++);
            else if (cmp > 0)
                swap(arr, i, gt--);
            else
                i++;
        }
        sort(arr, lo, lt - 1);
        sort(arr, gt + 1, hi);
    }

Proposition. [Sedgewick-Bentley, 1997]
- Quicksort with 3-way partition is entropy-optimal
Bottom line. Randomized quicksort with 3-way partitioning reduces running time from linearithmic to linear in broad class of application

System sorts

Sort applications
- obvious applications
  - Sort a list of names
  - Organize an MP3 library
  - Display Google PageRank results
  - List Rss feed in reverse chronological order
- Problems became easy once items are in sorted order
  - Find the median
  - Binary search in a database
  - Identify statistical outliers
  - Find duplicates in a mailing list
- Non-obvious applications
  - Data compression
  - Computer graphics
  - Computational biology
  - Load balancing on a parallel computers
Java System sort
- Arrays.sort()
- Has different method for each primitive type
- Has a method for data types that implement Comparable
- Has a method that uses a Comparator
- Uses tuned quicksort for primitive types; tuned mergesort for objects

System sort: Which algorithm to use?

Many sorting algorithms to choose from
- Internal sorts
  - Insertion sort, selection sort, bubblesort, shaker sort
  - Quicksort, mergesort, heapsort, samplesort, shellsort
  - Solitaire sort, red-black sort, splaysort, Yaroslavskiy sort, psort
- External sorts. Poly-phase mergesort, cascade-merge, oscallating sort
- String/radix sorts. Distribution, MSD, LSD, 3-way string quicksort
- Parallel sorts
  - Bitonic sort, Batcher even-odd sort
  - Smooth sort, cube sort, column sort
  - GPUsort
Applications have diverse attributes
- Stable?
- Parallel?
- Deterministic?
- Key all distinct?
- Multiple key types?
- Linked list or arrays?
- Large or small items?
- Is your array randomly ordered?
- Need guaranteed performance?
Elementary sort may be method of choice for some combination
- Cannot cover all combinations of attributes
Q. Is the system sort good enough?

A. Usually

Sort complexity

Name	In-place	Stable	Best	Average	Worst	Remarks
Selectionsort	Yes	No	1/2 N²	1/2 N²	1/2 N²	N exchanges
Insertionsort	Yes	Yes	N	l/4N²	l/2N²	use for small N or partially ordered
Shellsort	Yes	No	N log₃N	?	cN^3/2	tight code, sub-quadratic
Mergesort	No	Yes	½ N lg N	N lg N	N lg N	N lg N guarantee, stable
Quicksort	Yes	No	N	N lg N	½ N²	N lg N probabilistic guarantee fastest in practice
3-ways Quicksort	Yes	No	N²/2	2 N ln N	½ N²	improves quicksort in presence of duplicate keys
Timesort	No	Yes	N	N lg N	N lg N	-

Priority Queues

A collection is a data types that store group of items

data type	key operations	data structures
stack	Push, Pop	linked list, resizing array
queue	Enqueue, Dequeue	linked list, resizing array
priority queue	Insert, Del-Max/Min	binary heap
symbol table	Put, Get, Delete	BST, hash table
set	Add, Contains, Delete	BST, hash table

Priority Queue (PQ)

Collections. Insert and delete items. which item to delete?
- Stack. Remove the item most recently added
- Queue. Remove the item least recently added
- Randomized queue. Remove a random item
- Priority Queue. Remove the largest or smallest item
Operations: insert(Item item), delMax(), isEmpty(), max(), size()
Applications
- Event-driven simulation: [customers in a line, colliding particles]
- Numerical computation: [reducing roundoff error]
- Data compression: [Huffman codes]
- Graph searching: [Dijkstra's algorithm, Prim's algorithms]
- Number theory: [sum of powers]
- Artificial intelligence: [A* search]
- Statistics: [maintain largest M values in a sequence]
- Operating systems: [load balancing, interrupt handling]
- Discrete optimization: [bin packing, scheduling]
- Spam filtering: [Bayesian spam filters]
Implementation: you can make multiple implementation using unordered and ordered array, and Binary heap
Complexity of unordered and ordered array

Implementation	Insert	Del max	Max
unordered array	1	N	N
ordered array	N	1	1
goal	lg N	lg N	lg N

Binary heaps

Binary tree, Empty or node with links to left and right binary trees
Complete tree, perfectly balanced, except for bottom level
Complexity: lg N
Height of complete tree with N nodes is lg N , why?

Height only increase when N is a power of 2
A complete binary tree in nature

Binary heap representations
- Binary heap, Array representation of a heap-ordered complete binary tree.
- Heap-ordered binary tree
  - Keys in nodes
  - Parent's key no smaller than children's keys
- Array representation
  - Indices start at 1
  - Take nodes in level order
  - No explicit links needed!
Binary heap properties
- Parent of node at k is at k/2
- Children of node at k are at 2k and right 2k+1
Best practice use immutable keys
Underflow and overflow
- Underflow: throw exception if deleting from empty PQ
- Overflow: add no-arg constructor and use resize array
To avoid maxing the operating like delMax(), delMin() there are two implementations MaxPQ and MinPQ
- The difference in comparing less(), greater()
Good practice to use immutable keys, to prevent the client to change the key while processing the PQ
- Advantages
  - Simplifies debugging
  - Safer in presence of hostile code
  - Simplifies concurrent programming
  - Safe to use as key in priority queue or symbol table
- Disadvantages
  - Must create new object for each data type value

Code

    public void insert(Item key) {
        pq[++N] = key;
        swim(N);
    }

    public Item delMax() {
        if (isEmpty())
            throw new IndexOutOfBoundsException();
        Item max = pq[1];
        swap(1, N); 
        N--; // decrement N here, will be used in sink
        sink(1);
        pq[N + 1] = null; // prevent loitering, object no longer needed
        return max;
    }

    public Item max() {
        return pq[1];
    }

    private void swim(int k) {
        while (k > 1 && less(k / 2, k)) { // k less than parent, swap
            swap(k / 2, k);
            k /= 2;
        }
    }

    private void sink(int k) {
        while (2 * k <= N) {
            int j = 2 * k;
            if (j < N && less(j, j + 1))
                j++;
            if (!less(k, j)) // K less than children then break
                break;
            swap(k, j);
            k = j;
        }
    }

Heapsort

Basic plan
- Create max heap with all N keys
- Repeatedly remove the maximum key
Steps
- First pass
  - Build heap using bottom-up method
  - arrange from max parent to min child

for (int k = N/2; k >= 1; k--) sink(arr, k, N);

    + we will loop from N/2 to 1 because in sink operation

- Second pass
    + Remove the maximum, one at a time
    + Leave in array, instead of nulling out
    + replace from min to max in the array

while (N > 1) {
    swap(arr, 1, N--);
    sink(arr, 1, N);
}

In place sorting algorithm with N lg N worst-case
Proposition. Heap construction uses <= 2 N compares and exchanges
Proposition. Heapsort uses <= 2 N lg N compares and exchanges
Significance. In-place sorting algorithm with N lg N worst-case
- Mergesort: no, linear extra space
- Quicksort: no, quadratic time in worst-case
- Heapsort: yes
Bottom line: heapsort is optimal for both time and space, but:
- Inner loop longer than quicksort's
- Make poor usage of cache memory
- Not stable

Code

    public final static <Item extends Comparable<Item>> void sort(Item[] pq) {
        int n = pq.length;

        // heapify phase
        for (int k = n / 2; k >= 1; k--) {
            sink(pq, k, n);
        }

        // sortdown phase
        int k = n;
        while (k > 1) {
            swap(pq, 1, k--);
            sink(pq, 1, k);
        }
    }

    // Get the largest and put as a parent
    private static <Item extends Comparable<Item>> void sink(Item[] pq, int k, int n) {
        while (2 * k <= n) {
            int j = 2 * k;
            if (j < n && less(pq, j, j + 1))
                j++;
            if (!less(pq, k, j)) // K greater or equal than children then break
                break;
            swap(pq, k, j);
            k = j;
        }
    }

Sorting algorithms complexity till heapsort

Name	In-place	Stable	Best	Average	Worst	Remarks
Selectionsort	Yes	No	1/2 N²	1/2 N²	1/2 N²	N exchanges
Insertionsort	Yes	Yes	N	l/4N²	l/2N²	use for small N or partially ordered
Shellsort	Yes	No	N log₃N	?	cN^3/2	tight code, sub-quadratic
Mergesort	No	Yes	½ N lg N	N lg N	N lg N	N lg N guarantee, stable
Quicksort	Yes	No	N	N lg N	½ N²	N lg N probabilistic guarantee fastest in practice
3-ways Quicksort	Yes	No	N²/2	2 N ln N	½ N²	improves quicksort in presence of duplicate keys
Timesort	No	Yes	N	N lg N	N lg N	-
Heapsort	Yes	No	N	2 N lg N	2 N lg N	N lg N
???	Yes	Yes	N lg N	N lg N	N lg N	holy sorting grail

Event driven simulation

Goal. Simulate the motion of N moving particles that behave according to the laws of elastic collision
Idea goes back to Einstein
Application based on priority queue, without PQ you can not do it with large number of particles because would required quadratic time and not affordable
Hard disc model
- Moving particles interact via elastic collisions with each other and walls
- Each particle is a disc with know position, velocity, mass and radius
- No other forces
Significance. Relates macroscopic observables to microscopic dynamics
- Maxwell-Boltzmann: distribution of speeds as function of temperature
- Einstein: explain Brownian motion of pollen grains
Time-driven simulation
- Update the position of each particle every after dt units of time, and check for overlaps
- If overlap, roll back the clock to the time of the collision, update the velocities of the colliding particles, and continue the simulation
- Main drawbacks
  - ~ N²/2 overlap checks per time quantum
  - Simulation is too slow if dt is very small
  - May miss collisions if dt is too large
Event-driven simulation
- Between collisions, particle move in straight-line trajectories
- Change sate only when something happens
- Focus on times when collision occur
- Maintain PQ of collision events, prioritized by time
- Remove the min = get next collision
- Collision prediction. Given position, velocity, and radius of a particle, when will it collide next a wall or another particle?
- Collision resolution. if collision occurs, update colliding particle(s) according to law of elastic collisions
- Steps
  - Initialization
    1. Fill PQ with all potential particle-wall collisions
    2. Fill PQ with all potential particle-particle collisions

Symbol tables

API

Key-value pair abstraction
- Insert a value with specified key
- Give a key, search for the corresponding value
Ex. DNS lookup
- Insert URL with specified IP address
- Give URL, find corresponding IP address
Applications
1. Dictionary
2. Book index
3. File share
4. Compiler
5. Routing table
6. DNS
7. Genomics
8. File system
9. Web search
Operations:


put(Key key, Value val)
get(Key key)
delete(Key key)
contains(Key key)
isEmpty()
size()
keys()

Conventions:
- Values are not null
- Method get() returns null if key not present
- Method put() overwrites old values with new value
Key type: several natural assumption
- Assume keys are Comparable, use compareTo()
- Assume keys are any generic type, use equals() to test equality
- Assume keys are any generic type, use equals() to test equality, use hashCode() to scramble key
Best practices: Use immutable types for symbol data keys
- Immutable in Java: String, Integer, Double, java.io. File, ...
- Mutable in Java:: StringBuilder, java.net. URL, Arrays, ...
Equality test
- All Java classes inherit a method equals()
- Java requirements. For any references x, y and z
  - Reflexive: x.equals(x) is true
  - Symmetric: x.equals(y), iff y.equals(x)
  - Transitive: if x.equals(y) and y.equals(z), then x.equals(z)
  - Non-null: x.equals(null) is false
Equals design
- Standard recipe for user-defined type
  - Optimization for reference equality
  - Check again null
  - Check that two objects are of the same type and cast
  - Compare each significant field
    1. if field is a primitive type, use ==
    2. if field is an object, use equals()()
    3. if field is any array, apply to each entry
- Best practices
  - No need to use calculated fields that depends on other fields
  - Compare fields mostly likely to differ first
  - Make compareTo() consistent with equals() x.equals(y), if and only if (x.compareTo(y) == 0)

Elementary implementations

Sequential search (unordered list)
- Data structure: Maintain an (unordered) linked list of key-value pairs
- Node contains key and value
- Search: Scan through all keys until find a match
- Insert: Scan through all keys until find a match, if no match add to front
- summary

worst-case search	worst-case insert	average-case search	average-case insert	ordered	key interface
N	N	N / 2	N	no	equals()

Binary search in an order array
- Data structure: Maintain an ordered array of key-value pairs
- Rank helper function. how many keys < k?
- summary

worst-case search	worst-case insert	average-case search	average-case insert	ordered	key interface
lg N	N	lg N	N / 2	yes	compareTo()

BST Binary search tree

Ordered operations

void put(Key key, Value value)
Value get(Key key)
void delete(Key key)
boolean contains(Key key)
boolean isEmpty()
int size()
Key min()
Key max()
Key floor(Key key) // largest key less than or equal to key
Key ceiling(Key key) // smallest key greater than or equal to key
int rank(Key key) // number of key less than key
Key select(int k) // Key of rank k
void deleteMin() // delete smallest key
void deleteMax() // delete largest key
int size(Key lo, Key hi)
Iterable<Key> keys(Key lo, Key hi)
Iterable<Key> keys()

Binary search trees

Classic data structure provides efficient implementations of ST algorithm
BST is a binary tree in symmetric order
A binary tree is either
- Empty
- Two disjoint binary tress(left and right)
Symmetric order each node has a key, and every node's key is:
- Larger than all keys in its left subtree
- Smaller than all keys in its right subtree
Representation in java using a Node has key, value and reference to left and right

Code

    public void put(Key key, Value val) {
        root = put(root, key, val);
        size++;
    }

    private Node put(Node node, Key key, Value val) {
        if (node == null)
            return new Node(key, val);
        int cmp = key.compareTo(node.key);
        if (cmp < 0)
            node.left = put(node.left, key, val);
        else if (cmp > 0)
            node.right = put(node.right, key, val);
        else
            node.val = val;
        return node;
    }

    public Value get(Key key) {
        Node node = root;
        while (node != null) {
            int cmp = key.compareTo(node.key);
            if (cmp < 0)
                node = node.left;
            else if (cmp > 0)
                node = node.right;
            else
                return node.val;
        }
        return null;
    }

Tree shape
- Many BSTs correspond to same set of keys
- Number of compares for search/insert is equal to 1 + depth of node
- Remark. Tree shape depends on order of insertion
- Worst case will be in natural order
Mathematical analysis
- Proposition. If N distinct keys are inserted into a BST in random order, the expected number of compares for a search/insert is ~ 2 ln N
- Pf. 1-1 correspondence with quicksort partitioning
- But worst case height is N

worst-case search	worst-case insert	average-case search	average-case insert	ordered	key interface
N	N	1.39 lg N	1.39 lg N	stay tunned	compareTo()

Ordered ST operations

Minimum and maximum
- Minimum. Smallest key in table
- Maximum. Largest key in table
- Q. How to find min/max in BST?
  - For min move to the left from the root until find null key
  - For max move to the right from the root until find null key
Floor and ceiling
- Floor. Largest key <= to given key
- Ceiling. Smallest key >= to given key
Computing the floor
- Case 1. [k equals the key at root]
  - The floor of k is k
- Case 2. [k is less than the key at root]
  - The floor of k in the left subtree
- Case 3. [k is greater than the key at root]
  - The floor of k is in the right subtree
  - if there is any key <= k in right subtree otherwise it is the key in the root

public Key floor(Key key) {
    Node node = floor(root, key);
    if (node == null) return null;
    return node.key;
}

private Node floor(Node node, Key key) {
    if (node == null) return null;

    int cmp = key.compareTo(node.key);
    if (cmp == 0) return node;
    if (cmp < 0) return floor(node.left, key);

    Node n = floor(node.right, key);
    if (n != null) return n;
    else return node;
}

Ceiling

    public Key ceiling(Key key) {
        Node node = ceiling(root, key);
        if (node == null)
            return null;
        return node.key;
    }

    private Node ceiling(Node node, Key key) {
        if (node == null)
            return null;

        int cmp = key.compareTo(node.key);
        if (cmp == 0)
            return node;
        if (cmp < 0) {
            Node n = ceiling(node.left, key);
            if (n != null)
                return n;
            else
                return n;
        }

        return ceiling(node.right, key);
    }

Rank
Rank. How many keys < k?
- Easy recursive algorithm (3 cases!)

public int rank(Key key) {
    return rank(key, root);
}

private int rank(Key key, Node node) {
    if (node == null) return 0;
    int cmp = key.compareTo(node.key);

    if (cmp < 0) return rank(key, node.left);
    else if (cmp > 0) return 1 + size(node.left) + rank(key, node.right);
    else return size(node.left);
}

Inorder traversal
- Traverse left subtree
- Enqueue key
- Traverse right substree

public Iterable<Key> keys() {
    Queue<Key> q = new Queue<>();
    inorder(root, q);
    return q;
}

private void inorder(Node node, Queue<Key> q) {
    if (node == null) return;
    inorder(node.left, q);
    q.enqueue(node.key);
    inoder(node.right, q);
}

Property. inorder traversal of a BST yields keys in ascending order

Deletion in BST

To remove a node with a given key
- Set its value to null
- Leave key in tree to guide searches (but don't consider it equal in search)
- Cost ~ 2 ln N^' per insert, search, and delete (if keys in random order)
- Unsatisfied solution. Tombstone (memory) overload
To delete the minimum
- Go left until finding a node with null left link
- Replace that node by its right link
- Update subtree counts

public void deleteMin() {
    root = deleteMin(root);
}

private Node deleteMin(Node node) {
    if (node.left == null) return node.right;
    node.left = deleteMin(node.left);
    node.count = 1 + size(node.left) + size(node.right);
    return node;
}

Hibbard deletion
- To delete a node with key k: search for node n containing key k
  - Case 0. Delete n by setting parent link to null
  - Case 1. Delete n by replacing parent link
  - Case 2. [2 children]
    1. Find successor x of n
    2. Delete the minimum in n's right subtree
    3. Put x in n's spot

    public void delete(Key key) {
        size--;
        root = delete(root, key);
    }

    private Node delete(Node node, Key key) {
        if (node == null)
            return null;
        int cmp = key.compareTo(node.key);
        if (cmp < 0)
            node.left = delete(node.left, key);
        else if (cmp > 0)
            node.right = delete(node.right, key);
        else {
            if (node.right == null)
                return node.left;
            if (node.left == null)
                return node.right;
            Node t = node;
            node = min(t.right);
            node.right = deleteMin(t.right);
            node.left = t.left;
        }
        node.count = size(node.left) + size(node.right) + 1;
        return node;
    }

Balanced search tree

2-3 search trees

Allow 1 or 2 keys per node
- 2-node: one key, two children
- 3-node: two keys, three children
Perfect balance: every path from the root to null link has same length
Implementation: Red-black BSTs

Red-black BSTs

Represent 2-3 tree as a BST (binary search tree)
Left-leaning red-black BSTs (Guibas-Sedgewick 1979 and 2007)
- Use "internal" left-learning links as "glue" for 3-nodes
- A BST such that:
  - No node has two red links connected to it
  - Every path from root to the null link has the same number of black links
  - Red links lean left

B-Tree

TODO this part

Geometric applications of BSTs

Intersections among geometric objects
Applications. CAD, games, movies, virtual reality, database, ...
Efficient solutions. Binary search trees (and extensions)

1 d range search

Extensions of ordered symbol table
- Same operations or ordered symbol table
- Range search: find all keys between k1 and k2
- Range count: number of keys between k1 and k2
Application. Database queries
- i.e salary between val-1 AND val-2
Geometric interpretation
- Keys are point of a line
- Find/count points in a given 1 d interval
Implementations
- Unordered array. Fast insert, slow range search
- Order array. Slow insert, binary search for k1 and k2 to do range search
- 1 d range count

line segment intersection

TODO this part

kd trees

TODO this part

interval search trees

Instead of points the data is interval
1 d interval search. Data structure to hold set of (overlapping) intervals
Create BST, where each node stores an interval (lo, hi)
- Use left endpoints as BST key
- Store max endpoint in subtree rooted at node
TODO this part

rectangle intersection

TODO this part

Hash tables

Basic plan:
- Save items in a key-indexed table (index is a function of the key)
- Hash function: Method for computing array index from key
Issues:
- Computing the has function
- Equality test: Method for checking whether two keys are equal
- Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index
Classic space-time tradeoff
- No space limitation: trivial hash function with key as index
- No time limitation: trivial collision resolution with sequential search
- Space and time limitations: hashing (the real world)

Hash functions

Efficiently computable
Each table index equally likely for each key
Ex 1. Phone numbers:
- Bad: first three digits
- Better: last three digits
Ex 2. Social security numbers:
- Bad: first three digits
- Better: last three digits
Practical challenge, need different approach for each key type
Java's hash code conventions
- All java classes inherit a method hashCode() , with return 32-bit in
- Requirement, if x.equals(y), then (x.hashCode() == y.hashCode())
- Highly desirable: if !x.equals(y), then (x.hashCode() != y.hashCode())
- Default implementation. Memory address of x
- Legal (but poor) implementation. Always return 17
- Customized implementations. Integer, Double, String, File, URL, Date, ...

      x
      |
    -----
    |   |   
    -----
      |
  x.hashCode()

Implementing hash code: strings
- Cache the hash value in an instance variable
- Return cached value

public final class String {
    private int hash = 0; // cache of hash code
    private final char[] s;

    public int hashCode() {
        int h = hash;
        if (h != 0) return h; // returned cached value
        for (int i = 0; i < length(); i++) {
            h = s[i] + (31 * h);
        }
        hash = h; // store cache of hash code
        return h;
    }
}

Implementing hash code: user-defined types

public final class Transaction implements Comparable<Transaction> {
    private final String who;
    private final Date when;
    private final double amount;

    public int hashCode() {
        int hash = 17; // nonzero constant
        // typical a small prime 31
        // get all fields in the class
        hash = 31 * hash + who.hashCode();
        hash = 31 * hash + when.hashCode();
        hash = 31 * hash + ((Double) amount).hashCode();
        return hash;
    }
 }

Hash code design
- Standard recipe for user-defined types
  - Combine each significant field using the 31x+y rule
  - If field is a primitive type, use wrapper type hashCode()
  - If field is null, return 0
  - If field is a reference type, use hashCode() ← applies rule recursively
  - If field is an array, apply to each entry ← or use Arrays.deepHashCode()
- In practice. Recipe works reasonably well, used in Java libraries
- In theory. Keys are bit string: universal hash functions exits
- Basic rule. Need to use the whole key to compute hash code, consult an expert for state-of-the-art hash codes
Modular hashing
- Hash code. An int between -2³¹ and 2³¹-1
- Hash function. An int between 0 and M-1 (for uses as array index)
  - M typically a prime or power of 2
- Why choose a prime for M?
  - We will use all the bits in the number in that point

There is a problem in below code for detect the negative values

// Bug
private int hash(Key key) {
    return key.hashCode() % M;
}

There is another problem in below code which is will hit -2³¹

// 1-in-a-billion bug
private int hash(Key key) {
    return Math.abs(key.hashCode()) % M;
}

The correct implementation, this is a template for hashCode to get number between 0 and M - 1

private int hash(Key key) {
    return (key.hashCode() & 0x7fffffff) % M;
}

Uniform hashing assumption
- Uniform hashing assumption. Each key is equally likely to hash to an integer between 0 and M - 1
- Bins and balls. Throw balls uniformly at random into M bins
- Classically studies in statistics
  - Birthday problem. Expect two balls in the same bin after ~ sqrt(PI M /2) tosses
  - Coupon collector. Expect every bin hash >= 1 ball after ~ M ln M tosses
  - Load balancing. After M tosses, expect most loaded bin hash Omega (log M / Log Log M) balls

Separate chaining (Collision resolution)

Collision, Two distinct keys hashing to the same index
- Birthday problem → can't avoid collisions unless you have a ridiculous quadratic amount of memory
- Coupon collector + load balancing → collisions will be evenly distributed
- Challenge. Deal with collisions efficiently
Separate chaining symbol table
- Use an array of M < N linked lists [H.P. Luhn, IBM 1953]
  - Hash: map key to integer i between 0 and M-1
  - Insert: put at front of i_th chain (if not already there)
  - Search: need to search only i_th chain

public class SeparateChainingHashST<Key, Value> {
    private int M = 97;
    private Node[] st = new Node[M];

    private static class Node {
        private Key key;
        private Value val;
        private Node next;
    }

    private int hash(Key key) {
        return (key.hashCode() & 0xfffffff) % M;
    }

    public Value get(Key key) {
        int i = hash(key);
        for (Node node = st[i]; node != null; node = node.next) {
            if (key.equals(node.key)) return node.val;
        }
        return null;
    }

    public void put(Key key, Value val) {
        int i = hash(key);
        for (Node node = st[i]; node != null; node = node.next) {
            if (key.equals(node.key)) {
                node.val = val;
                return;
            }
        }
        st[i] = new Node(key, val, st[i]); // put the new node at the beginning of the LinkedList
    }
}

Consequence. Number of probes for search/insert is proportional to N/M
- M too large ==> too many empty chains
- M too small ==> chains too long
- Typical choice M ~ N/5 ==> constant time ops

Linear probing

Open addressing. When a new key collides, find the next empty slot and put it there
Hash. Map key to integer i between 0 and M-1
Insert. Put at table index i if free, if not try i+1, i+2, etc
Search. Search table index i, if occupied but no match, try i+1, i+2, etc

Note: Array size M must be greater than number of key-value pairs N

public class LinearProbingHashST<Key, Value> {
    private int M = 30001;
    private Value[] vals = (Value[]) new Object[M];
    private Key[] keys = (Key[]) new Object[M];

    private int hash(Key key) {
        return (key.hashCode() & 0xfffffff) % M;
    }

    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) break;
        }
        keys[i] = key;
        vals[i] = val;
    }

    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) %  M) {
            if (key.equals(keys[i])) return vals[i];
        }
        return null;
    }
}

Clustering
- Cluster. A contiguous block of items
- Observation. New keys likely to hash into middle of big clusters
Knuth's parking problem
- Model. Cars arrive at one-way street with M parking spaces, each desires a random space i: if space i is taken, try i + 1, i + 2, etc
- Q. What is mean displacement of a car?
  - Half-full. With M / 2 cars, mean displacement is ~ 3 / 2
  - Full. With M cars, mean displacement is ~ sqrt(PI * M / 8)

ST implementations: summary

ST implementation	Worst-case search	Worst-case insert	Worst-case delete	Ordered iteration	key interface
linked list	N	N	N	no	equals()
binary search (ordered array)	log N	N	N	yes	compareTo()
BST	N	N	N	stay tuned	compareTo()
2-3 tree	c lg N	c lg N	?	yes	compareTo()
red-black tree	2 lg N	2 lg N	2 lg N	yes	compareTo()
separate chaining	lg N_*	lg N_*	lg N_*	no	equals()
linear probing	lg N_*	lg N_*	lg N_*	no	equals()

Note: * under uniform hashing assumption

Context

War storey: String hashing in Java
- String hashCode() in Java 1.1
  - For long strings: only examine 8-9 evenly spaced characters
  - Benefit: save time in performing arithmetic
  - Downside: great potential for bad collision patterns

public int hashCode() {
    int hash = 0;
    int skip = Math.max(1, length() / 8);
    for (int i = 0; i < length(); i += skip) {
        hash = st[i] + (37 * hash);
    }
    return hash;
}

- Could end with examine the same spaced character, so you have to examine the all string

Q. Is the uniform hashing assumption important in practice?
- A. Obvious situations: aircraft control, nuclear reactor, pacemaker
- A. Surprising situations. denial-of-service attacks
Algorithmic complexity attach on Java
- Goal. Find family of strings with the same hash code
- Solution. the base 31 hash code is part of Java's string API
Diversion: one-way hash functions
- One-way hash function. "Hard" to find a key that will hash to a desired value (or two keys that hash to the same value)
  - Ex. MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, etc
  - Applications. Digital fingerprint, message digest, storing passwords
  - Caveat. Too expensive for use in ST implementations
Separate chaining vs. Linear probing
- Separate chaining
  - Easier to implement delete
  - Performance degrades gracefully
  - Clustering less sensitive to poorly-designed hash function, if you have a bad function
- Linear probing
  - Less wasted space
  - Better cache performance
- Q. How to delete?
- Q. How to resize?
Hashing: variations on the theme
- Many improved versions have been studied.
- Two-probe hashing (separate-chaining variant)
  - Hash to two positions, insert key in shorter of two chains
  - Reduces expected length of the longest chain to log log N
- Double hashing (linear-probing variant)
  - Use linear probing, but skip a variable amount, not just 1 each time
  - Effectively eliminates clustering
  - Can allow table to become nearly full
  - More difficult to implement delete
- Cuckoo hashing (linear-probing variant)
  - Hash key to two positions, insert key either position, if occupied reinsert displaced key into its alternative position (and recur)
  _ Constant worst case time for search
Applications
- Security One way hash function: MD4, MD5, SHA-0, SHA-1, SHA-2, WHIRLPOOL, RIPEMD-160, ...
  - Digital fingerprint
  - Message digest
  - Storing passwords
- Dictionary lookup
  - DNS lookup
  - Amino acids
  - Class list
Hash tables vs.balanced search trees
- Hash tables
  - Simpler to code
  - No effective alternative for unordered keys
  - Faster for simple keys (a few arithmetic ops versus log N compares)
  - Better system support in Java for strings (e.g. cached hash code)
- Balanced search trees
  - Stronger performance guarantee
  - Support for ordered ST operations
  - Easier to implement compareTo() correctly than equals() and hashCode()
- Java system includes both:
  - Red-black BSTs: java.util. TreeMap, java.util. TreeSet
  - Hash tables: java.util. HashMap, java.util. IdentityHashMap

Applications

Sets
- Mathematical set: a collection of distinct keys
- Operations: add(Key key), contains(Key key), remove(Key key), size(), iterator()
Dictionary lookup
- Command-line arguments
  - A comma-separated value (CVS) file
  - Key field
  - Value field
- Ex 1. DNS lookup
- Ex 2. Amino acids
- Ex 3. Class list
File indexing
- Goal. Index a PC (or the web), given a list of files specified, create an index so that you can efficiently find all files contains a given query string
- Book index
  - Goal. preprocess a text corpus to support concordance queries: given a word, find all occurrences with their immediate contexts
Sparse vectors
- Matrix-vector multiplication
- Problem. sparse matrix-vector multiplication
- Assumptions. Matrix dimension is 10, 000 average nonzeros per row ~ 10
- Vector representations
  - 1D array (standard) representation
    
    . Constant time access to elements
    
    . Space proportional to N
  - Symbol table representation
    
    . Key = index, value = entry
    
    . Efficient iterator
    
    . Space proportional to number of nonzeros
- Matrix representations
  - 2D array (standard) matrix representation. Each row of matrix is an array
  - Space proportional of N²
- Sparse matrix representation: Each row of matrix is a sparse vector
  - Efficient access to elements
  - Space proportional to number of nonzeros (plus N)

public class SparseVector {
    private HashST<Integer, Double> v;

    public SparseVector() {
        v = new HashST<>(); // empty ST represents all 0s vector
    }

    public void put(int i, double x) {
        v.put(i, x); // put all not zero values
    }

    public double get(int i) {
        if (!v.contains(i)) return 0.0;
        return v.get(i);
    }

    public Iterable<Integer> indices() {
        return v.keys;
    }

    public double dot(double[] that) {
        double sum = 0.0;
        fo (int i : indices()) {
            sum += that[i] * this.get(i);
        }
        return sum;
    }
}

Graph

Undirected graphs

Graph.: Set of vertices connected pairwise by edges
Why study graph algorithms?
- Thousand of practical applications
- Hundreds of graph algorithms known
- Interesting and broadly useful abstraction
- Challenging branch of computer science and discrete math
Example of graph
- Protein-protein interaction network
- The internet as mapped by the Opte project
- Map of science click-streams
- Facebook friends
- One week of Enron emails
- Framingham heart study
Graph terminology
- Path: Sequence of vertices connected edges
- Cycle: Path whose first and last vertices are the same
- Two vertices are connected if there is a path between them
Some graph-processing problems
- Path. Is there a path between v and w?
- Shortest path. What is the shortest path between v and w?
- Cycle. Is there a cycle in the graph?
- Euler cycle. Is there a cycle that uses each edge exactly once?
- Hamilton cycle. Is there a cycle that uses each vertex exactly once
- Connectivity. Is there a way to connect all of the vertices?
- MST. What is the best way to connect all of the vertices?
- Bi-connectivity. Is there a vertex whose removal disconnects the graph?
- Planarity. Can you draw the graph in the plane with no crossing edges
- Graph isomorphism. Do two adjacency lists represent the same graph?

Undirected Graph API

Graph drawing provides intuition about the structure of the graph
Caveat intuition can be misleading
Vertex representation
- Will use integers between 0 and V-1
- Applications: convert between names and integers which symbol table
- Implementations
  1. Set-of-edges graph representation
    
    . Maintain a list of the edges (linked list or array)
    
    . In-efficient implementation make unusable in huge graph
  2. Adjacency-matrix graph representation
    
    . Maintain a two-dimensional V-by-V boolean array, for each edge v-w in graph: adj[v][w] = adj[w][v] = true
    
    . Not very widely used because for a huge graph, you would have billion square number of entries
  3. Adjacency-list graph representation
    
    . Maintain vertex-indexed array of lists
    
    . Widely used
Operations: addEdge(int v, int w), adj(int v), V(), E(), toString()

Code

public class Graph {

    private final int vertices; 
    private int edges; 
    private Bag<Integer>[] adj; 

    public Graph(int v) {
        this.vertices = v; 
        this.edges = 0; 
        adj = (Bag<Integer>[]) new Bag[v]; 
        for (int i = 0; i < v; i++) {
            adj[v] = new Bag<Integer>(); 
        }
    }

    public int getVertices() {
        return vertices; 
    }

    public int getEdges() {
        return edges; 
    }

    public void addEdge(int v, int w) {
        adj[v].add(w); 
        adj[w].add(v); 
        edges++; 
    }

    public int degree(int v) {
        return adj[v].size(); 
    }

    public Iterable<Integer> adj(int v) {
        return adj[v]; 
    }
}

In practice, use adjacency-lists representation
- Algorithms based on iterating over vertices adjacent to v
- Real-world graphs tend to be sparse (huge number of vertices small average vertex degree)

representation	space	add edge	edge between v and w?	iterate over vertices adjacent to v?
list of edges	E	1	E	E
adjacency matrix	v²	1^*	1	v
adjacency lists	E + V	1	degree(v)	degree(v)

Depth-first search DFS Undirected search

Classical graphical search algorithm
Once you explored vertex suspend then explore the new vertices
Good example: Maze graph
- Vertex = intersection
- Edge = passage
Algorithm Idea:
- Unroll a ball of string behind you
- Mark each visited intersection and each visited passage
- Retrace steps when no unvisited options
Goal: systematically search through a graph
Idea: Mimic maze exploration
Typically applications:
- Find all vertices connected to a given source vertex
- Find a path between two vertices
Design pattern: We decouple Graph representation from graph-processing routine
- Create a Graph object
- Pass the Graph to a graph-processing routine
- Query the graph-processing routine for information
Depth-first search demo
- To visit a vertex v:
  - Mark v as visited
  - Recursively visit all unmarked vertices w adjacent to v
- Put unvisited vertices on a stack

Code

    public class DepthFirstPath {
        private boolean[] marked; 
        private int[] edgeTo; 
        private int s; 

        public DepthFirstPath(Graph G, int s) {
            dfs(G, s); 
        }

        private void dfs(Graph G, int v) {
            marked[v] = true; 
            for (int w : G.adj(v)) {
                if (!marked[w]) {
                    dfs(G, w); 
                    edgeTo[w] = v; 
                }
            }
        }
    }

Proposition. DFS marks all vertices connected to s in time proportional to the sum of their degrees from single source
- PF. [correctness]
  - If w marked, then w connected to s (why?)
  - If w connected to s, then w marked
  - If w unmarked, then consider last edge on a path from s to w that goes from a marked vertex to an unmarked one
- Pf. [running time]
  - Each vertex connected to s is visited once
Proposition. After DFS, can find vertices connected to s in constant time and can find a path to s (if not exist) in time proportional to its length
- Pf. edgeTo[] is parent-link representation of a tree rooted at s
Challenge. Flood fill (Photoshop magic wand)
- Assumptions. Pictures has millions to billions of pixels
  - Solution. Build a grid graph
    1. Vertex: pixel
    2. Edge: between two adjacent gray pixels
    3. Blob: all pixels connected to given pixel

Breadth-first search BFS Undirected search

Explore vertices and it's adjacency then go next vertices for exploration
Not recursive algorithm, it uses a queue
Put s into a FIFO queue, and mark s as visited, Repeat until the queue is empty
- Remove the least recently added vertex v
- Add each of v's unvisited neighbors to the queue, and mark them as visited
Proposition. BFS computes the shortest paths (fewest number of edges) from s to all other vertices in a graph in time proportional to E + V from single source
- Pf. [correctness] Queue always consists of zero or more vertices of distance k from s, followed by zero more vertices of distance k + 1
- Pf. [running time] Each vertex connected to s is visited once

Code

    public class BreadthFirstPath {
        private boolean[] marked; 
        private int[] edgeTo; 
        private int[] distTo; 

        private void bfs(Graph G, int s) {
            Queue<Integer> q = new Queue<Integer>(); 
            q.enqueue(s); 
            marked[s] = true; 
            distTo[s] = 0; 
            while (!q.isEmpty()) {
                int v = q.dequeue(); 
                for (int w : G.adj(v)) {
                    if (!marked[w]) {
                        q.enqueue(w); 
                        marked[w] = true; 
                        edgeTo[w] = v; 
                        distTo[w] = distTo[v] + 1; 
                    }
                }
            }
        }
    }

Applications
- routing: Fewest number of hops in a communication network

BFS vs. DFS

Depth-first search. Put unvisited vertices on a stack
Breadth-first search. Put unvisited vertices on a queue
Shortest path. Find from s to w that uses fewest number of edges

       1
     /   \
    /     \
   2       3
  / \     / \
 4   5   6   7

Level order, BFS: 1, 2, 3, 4, 5, 6, 7
Pre-order, DFS: 1, 2, 4, 5, 3, 6, 7

Connected components

Connectivity queries
- Def. vertices v and w are connected if there is a path between them
- Goal. preprocess graph to answer queries of the form is v connected to w? in constant time
- Operations: connected(int v, int w), count(), id(int v)
Union-find? not quite
Depth-first search. Yes
The relation "is connected to" is an equivalence relation
- Reflexive: v is connected to v
- Symmetric: if v is connected to w, then w is connected to v
- Transitive: if v connected to w and w connected to x, then v connected to x
Def. A connected component is a maximal set of connected vertices
Goal. Partition vertices into connected components
Demo
- To visit a vertex v:
  - Mark vertex v as visited
  - Recursively visit all unmarked vertices adjacent to v

public class ConnectedComponent {

    private boolean[] marked;
    private int[] id;
    private int count;

    public ConnectedComponent(Graph graph) {
        marked = new boolean[graph.getVertices()];
        id = new int[graph.getVertices()];
        for (int s = 0; s < graph.getVertices(); s++) {
            if (!marked[s]) {
                dfs(graph, s);
                count++;
            }
        }
    }

    private void dfs(Graph graph, int v) {
        marked[v] = true;
        id[v] = count; // all vertices discovered in same call have samd id
        for (int w : graph.adj(v)) {
            if (!marked[w])
                dfs(graph, w);
        }
    }

    /**
     * two vertices are connected if have the same id
     */
    public boolean connected(int v, int w) {
        return id[v] == id[w];
    }

    /**
     * the id of the vert
     */
    public int id(int v) {
        return id[v];
    }

    /**
     * sum of the connected components in the graph
     */
    public int count() {
        return count;
    }
}

Applications
- Particle detection
  - Given grayscale image of particles, identity blobs
    1. Vertex: pixel
    2. Edge: between two adjacent pixels with grayscale value >= 70 "black 0, white = 255"
    3. Blob: connected components of 20-30 pixels
  - Particle tracking. Track moving particle over time

Graph challenges

Graph-processing challenge 1
- Problem. Is a graph bipartite?
  - You can use DFS to solve it
  - Divide the vertices into two subsets with property every edge connects one subset to another
    
    0-1 0-2 0-5 0-6
    
    1-3 2-3 2-4
    
    4-5 4-6
  - Application: is dating graph bipartite?
- Problem. Find a cycle
  - You can use DFS to solve it
  - Ex. 0-5-4-6-0
- Problem. the seven bridges of konigsberg [1736]
  - Euler tour. Is there a (general) cycle that uses each edge exactly once?
    
    . Answer. A connected grapy is Eulerian iff all vertices have even degree
- Problem. Find a (general) cycle that uses every edge exactly once
  - Eulerian tour (classic graph-processing problem)
- Problem. Find a (cycle) that visits every vertex exactly once
  - Sometimes called "Travelling sells person between cities and visit each city once"
    
    . Hamiltonian cycle (classical NP-complete problem)
- Problem. Are two graphs identically except for vertex names?
  - Graph isomorphism problem longstanding open problem
  - No one knows how to classify the problem
- Problem. Lay out a graph in the plane without crossing edges?
  - Classic problem in graph processing
  - Linear-time DFS-based planarity algorithm discovered by Tarjan in 1970s

Directed graphs

Edges now have direction
Digraph: set of vertices connected pairwise by directed edges
Examples
- road network
  - Vertex = intersection
  - Edge = one-way street
- Political blogosphere graph
  - Vertex = political blog
  - Edge = link
- Overnight interbank load graph
  - Vertex = bank
  - Edge = overnight loan
- Implication graph
  - Vertex = variable
  - Edge = logical implication
- Combinatorial circuit
  - Vertex = logical gate
  - Edge = wire
- WordNet graph
  - Vertex = synset
  - Edge = hypernym relationship
Digraph applications

Digraph	Vertex	Directed Edge
web	web page	hyperlink
game	board position	legal move
object graph	object	pointer
cell phone	person	placed call
financial	bank	transaction
control flow	code block	jump
Inheritance hierarchy	class	inherits from

Some digraph problems:
- Is there a directed path from s to t?
- Shortest path, What's the shortest directed path from s to t?
- Topological sort, can you draw a digraph so that all edge point upwards?
- Strong connectivity, Is there a directed path between all pairs of vertices?
- Transitive closure, For which vertices v and w is there a path from v to w?
- PageRank, What is the importance of a web page?

Digraph API

Operations: addEdge(int v, int w), adj(int v), vertices(), edges(), reverse(), toString()

public class Digraph {

    private static final String NEW_LINE = System.lineSeparator();

    private final int vertices;
    private int edges;
    private Bag<Integer>[] adj;

    public Digraph(int v) {
        this.vertices = v;
        this.edges = 0;
        adj = (Bag<Integer>[]) new Bag[v];
        for (int i = 0; i < v; i++) {
            adj[i] = new Bag<Integer>();
        }
    }

    public int getVertices() {
        return vertices;
    }

    public int getEdges() {
        return edges;
    }

    public void addEdge(int v, int w) {
        adj[v].add(w);
        edges++;
    }

    /**
     * get number of adjacency to vertex v
     */
    public int degree(int v) {
        return adj[v].size();
    }

    /**
     * get adjacency to vertex v
     */
    public Iterable<Integer> adj(int v) {
        return adj[v];
    }

    /**
     * Returns the reverse of the digraph
     */
    public Digraph reverse() {
        Digraph reverse = new Digraph(vertices);
        for (int v = 0; v < vertices; v++) {
            for (int w : adj(v)) {
                reverse.addEdge(w, v);
            }
        }
        return reverse;
    }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        for (int v = 0; v < vertices; v++) {
            builder.append(v + ": ");
            for (int w : adj[v]) {
                builder.append(w + " ");
            }
            builder.append(NEW_LINE);
        }
        return builder.toString();
    }
}

Digraph representations
- In practice. Use adjacency-list representation
  - Algorithms based on iterating over vertices adjacent to v
  - Real-world graphs tend to be sparse

representation	space	add edge	edge between v and w?	iterate over vertices adjacent to v?
list of edges	E	1	E	E
adjacency matrix	v²	1⁺	1	v
adjacency lists	E + V	1	outdegree(v)	outdegree(v)

Depth-first search in digraphs

Problem. find all vertices reachable from s along a directed path
Some method as for undirected graphs
- Every undirected graph is digraph (with edges in both directions)
- DFS is a digraph algorithm
Applications:
1. Reachability application: Program control-flow analysis
  - Every program is a digraph
    - Vertex = basic block of instructions (straight-line program)
    - Edge = jump
  - Dead-code elimination
    - find and remove unreachable code
  - Infinite-loop detected
    - Determine whether exit is unreachable
2. Mark-sweep garbage collector
  - Every data structure is a digraph
    - Vertex = object
    - Edge = reference
  - Memory cost. Uses 1 extra mark bit per object (plus DFS stack)
  - Roots. Objects known to be directly accessible by program (e.g. stack)
  - Reachable objects. Objects indirectly accessible by program (starting at a root and following a chain of pointers)
  - Mark-sweep algorithm. [McCarthy, 1960]
    - Mark: mark all reachable objects
    - Sweep: if object is unmarked, it is garbage (so add to free list)
DFS enables direct solution of simple digraph problems
1. Reachability
2. Path finding
3. Topological sort
4. Directed cycle direction
Basis for solving difficult digraph problems
1. 2-satisfiability
2. Directed Euler path
3. Strongly-connected components

Breadth-first search in digraphs

BFS is a digraph algorithms
Proposition. BFS computes shortest paths (fewest number of edges) from s to all other vertices n a digraph in time proportional to E + V
Multiple-source shortest paths
- Given. a digraph and set of source vertices, find shortest path from any vertex in the set to each other vertex
- Q. How to implement multi-source constructor?
  - A. Use BFS, but initialize by enqueuing all source vertices
Web crawler
- Goal. Crawl web, starting from some root web page, say www.princeton.edu
  - Solution. [BFS with implicit digraph]
    
    . Choose root web page as source s
    
    . Maintain a Queue of websites to explore
    
    . Maintain a SET of discovered websites
    
    . Dequeue the next website and enqueue websites to which it links (provided you haven't done so before)
- Q. Why not use DFS?
  - A. You gonna far away for searching web, some web page traps new page

Code

public class BareBonesWebCrawler {

    private Queue<String> queue = new ArrayQueue<>();
    private Set<String> discovered = new HashSet<>();
    private static final String REGEX = "https://(\\w+\\.)*(\\w+)";

    public BareBonesWebCrawler(String root) {
        queue.enqueue(root);
        discovered.add(root);
        bfs();
    }

    private void bfs() {
        while (!queue.isEmpty()) {
            String v = queue.dequeue();
            System.out.println(v);
            String input = readRawHtml(v);
            Pattern pattern = Pattern.compile(REGEX);
            Matcher matcher = pattern.matcher(input);

            while (matcher.find()) {
                String w = matcher.group();
                if (!discovered.contains(w)) {
                    discovered.add(w);
                    queue.enqueue(w);
                }
            }
        }
    }

    private String readRawHtml(String url) {
        URL u;
        try {
            u = new URL(url);
            URLConnection conn = u.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            StringBuffer buffer = new StringBuffer();
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                buffer.append(inputLine);
            }
            in.close();
            // System.out.println(buffer.toString());
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    public static void main(String[] args) {
        new BareBonesWebCrawler("https://github.com/openjdk");
    }
}

Topological sort

Goal. Given a set of tasks to completed with precedence constraints in which order should we schedule with the tasks?
Digraph model
- Vertex = task
- Edge = precedence constraint
DAG
- acyclic Digraph with no cycle
- If you have cycle there is no way to solve the problem
Topological sort. Redraw DAG so all edges points upwards
Solution. DFS
Demo
1. run depth-first search
2. Return vertices in reverse post order

Code

public class DepthFirstOrder {

    private boolean[] marked;
    private Stack<Integer> reverseOrder;

    public DepthFirstOrder(Digraph graph) {
        reverseOrder = new ArrayStack<>();
        marked = new boolean[graph.getVertices()];
        for (int v = 0; v < graph.getVertices(); v++) {
            if (!marked[v]) {
                dfs(graph, v);
            }
        }
    }

    private void dfs(Digraph graph, int v) {
        marked[v] = true;
        for (int w : graph.adj(v)) 
            if (!marked[w]) {
                dfs(graph, w);
        reverseOrder.push(v);
    }

    // return all vertices in reverse DFS post-order
    public Stack<Integer> reversePost() {
        return reverseOrder;
    }
}

, Detect cycle in directed graph

public class DirectedCycleDetector {

    private boolean[] marked;
    private int[] edgeTo;
    private Stack<Integer> cycle; // vertices on a cycle
    private boolean[] onStack; // vertices on recursive call stack

    public DirectedCycleDetector(Digraph graph) {
        onStack = new boolean[graph.getVertices()];
        marked = new boolean[graph.getVertices()];
        edgeTo = new int[graph.getVertices()];

        for (int v = 0; v < graph.getVertices(); v++) {
            if (!marked[v] && cycle == null)
                dfs(graph, v);
        }
    }

    private void dfs(Digraph graph, int v) {
        onStack[v] = true;
        marked[v] = true;
        for (int w : graph.adj(v)) {
            if (cycle != null) {
                // short circuit if directed cycle found
                return;
            } else if (!marked[w]) {
                // found new vertex, so recur
                edgeTo[w] = v;
                dfs(graph, w);
            } else if (onStack[w]) {
                // trace back directed cycle
                cycle = new Stack<>();
                for (int x = v; x != w; x = edgeTo[x]) {
                    cycle.push(x);
                }
                cycle.push(w);
                cycle.push(v);
            }
        }
        onStack[v] = false;
    }

    public boolean hasCycle() {
        return cycle != null;
    }

    public Iterable<Integer> cycle() {
        return cycle;
    }
}

, Will check if no cycle first then get order

public class TopologicalSort {

    private Stack<Integer> order;

    public TopologicalSort(Digraph graph) {
        DirectedCycleDetector finder = new DirectedCycleDetector(graph);
        if (!finder.hasCycle()) {
            DepthFirstOrder dfs = new DepthFirstOrder(graph);
            order = dfs.reversePost();
        }
    }

    public Stack<Integer> order() {
        return order;
    }

    public boolean hasOrder() {
        return order != null;
    }

    public boolean isDag() {
        return hasOrder();
    }
}

Used for package management, like maven, brew, etc
Proposition. Reverse DFS post-order of a DAG is a topological order
- Pf. [correctness] Consider any edge v → w. When dfs(v) is called
  - Case 1: dfs(w) has already been called and returned. Thus, w was done before v
  - Case 2: dfs(w) has not yet been called, dfs(w) will get called directly on indirectly by dfs(v) and will finish before dfs(v). Thus, w will be done before v
  - Case 3: dfs(w) has already been called, but has not yet returned. Can't happen in a DAG: function call stack contains path from w to v, so v→w would complete a cycle
Directed cycle detection
- Proposition. A digraph has a topological order iff no directed cycle
  - Pf.
    1. If directed cycle, topological order impossible
    2. If directed cycle, DFS-based algorithm finds a topological order
- Goal. Given a digraph. find a directed cycle
  - Solution. DFS
- Applications
  1. Scheduling. Given a set of tasks to be completed with precedence constraints, in what order should we schedule the tasks??
    
    . Remark. A directed cycle implies scheduling problem is infeasible
  2. Java compiler does cycle detection (cyclic inheritance)
  3. Microsoft Excel does cycle detection (and has a circular reference toolbar!)

Code

public class DirectedCycleDetector {

    private boolean[] marked;
    private int[] edgeTo;
    private Stack<Integer> cycle; // vertices on a cycle
    private boolean[] onStack; // vertices on recursive call stack

    public DirectedCycleDetector(Digraph graph) {
        onStack = new boolean[graph.getVertices()];
        marked = new boolean[graph.getVertices()];
        edgeTo = new int[graph.getVertices()];

        for (int v = 0; v < graph.getVertices(); v++) {
            if (!marked[v] && cycle == null)
                dfs(graph, v);
        }
    }

    private void dfs(Digraph graph, int v) {
        onStack[v] = true;
        marked[v] = true;
        for (int w : graph.adj(v)) {
            if (cycle != null) {
                // short circuit if directed cycle found
                return;
            } else if (!marked[w]) {
                // found new vertex, so recur
                edgeTo[w] = v;
                dfs(graph, w);
            } else if (onStack[w]) {
                // trace back directed cycle
                cycle = new Stack<>();
                for (int x = v; x != w; x = edgeTo[x]) {
                    cycle.push(x);
                }
                cycle.push(w);
                cycle.push(v);
            }
        }
        onStack[v] = false;
    }

    public boolean hasCycle() {
        return cycle != null;
    }

    public Iterable<Integer> cycle() {
        return cycle;
    }
}

Strong components

Def. Vertices v and w are strongly connected if there is a directed path from v to w and a directed path from w to v
Key property. Strong connectivity is an equivalence relation
- v is strongly connected to v
- If v is strongly connected to w, then w is strongly connected to v
- If v is strongly connected to w and w connected to x, then v is strongly connected x
v and w are connected if there is a path between v and w
Applications
- Food web graph
  - Vertex = species
  - Edge = from producer to consumer
  - Strong component. Subset of species with common energy flow
- Software modules
  - Vertex = software module
  - Edge = from module to dependency
  - Strong component. Subset of mutually interacting modules
    1. Approach 1. Package strong components together
    2. Approach 2. Use to improve design
Kosaraju-Sharir algorithm
- Reverse graph. Strong components in G are same as G^R
- Kernel DAG. Contact each strong components into a single vertex
- Idea
  - Compute topological order (reverse post-order) in kernel DAG
  - Run DFS, considering vertices in reverse topological order
- Demo
  1. Phase 1. Compute reverse postorder in G^R
  2. Phase 2. Run DFS in G, visiting unmarked vertices in reverse postorder of G^R
- Simple (but mysterious) algorithm for computing strong components
- Proposition. Kosaraju-Sharir algorithm computes the strong components of a digraph in time proportional to E + V
  - pf.
    1. Running time: bottleneck is running DFS twice (and computing G^R)
    2. Correctness: tricky
    3. Implementation: easy

Code

public class KosarajuSharirCC {

    private boolean[] marked;
    private int[] id;
    private int count;

    public KosarajuSharirCC(Digraph graph) {
        marked = new boolean[graph.getVertices()];
        id = new int[graph.getVertices()];
        count = 0;
        DepthFirstOrder order = new DepthFirstOrder(graph);
        for (int s : order.reversePost()) {
            if (!marked[s]) {
                dfs(graph, s);
                count++;
            }
        }
    }

    private void dfs(Digraph graph, int v) {
        marked[v] = true;
        id[v] = count;
        for (int w : graph.adj(v)) {
            if (!marked[w]) {
                dfs(graph, w);
            }
        }
    }

    public boolean stronglyConnected(int v, int w) {
        return id[v] == id[w];
    }

    public int id(int v) {
        return id[v];
    }

    public int count() {
        return count;
    }
}

MST (Minimum spanning trees)

Given. Undirected graph G with positive edge weights (connected)
Def. A spanning tree of G is a subgraph T that is both a tree (connected and acyclic) and spanning (includes all of the vertices)
Goal. Find a min weight spanning tree
Applications
- Dithering
- Cluster analysis
- Real-time face verification
- Image registration with Renyi entropy
- Find road networks in satellite and aerial imagery
- Network design (communication, electrical, computer, road)
- Auto config protocol for ether bridging to avoid cycles in a network
- Reducing data storage in sequencing amino acids in a protein

Greedy algorithm

General principle of Algorithm desing
Simplifying assumptions
- Edge weights are distinct
- Graph is connected
Based on these MST exists and unique
Cut property
- Def. A cut in a graph is partition of its vertices into two non-empty set
- Def. A crossing edge connects a vertex in one set with a vertex in the other
- Cut property. Given any cut, the crossing edge of min weight is the MST
- Pf. Suppose min-weight crossing edge e is not in the MST
  - Adding e to the MST creates a cycle
  - Some other edge f in cycle must be a crossing edge
  - Removing f and adding e is also a spanning tree
  - Since weight of e is less than the weight of f that spanning tree is lower weight
Greedy MST algorithm demo
- Start with all edges colored gray
- Find cut with no black crossing edges, color its min-weight edge black
- Repeat until V-1 edges are colored black
Proposition. The greedy algorithm computes the MST
- Pf. [correctness]
  - Any edge colored black is in the MST (via cut property)
  - Fewer than V-1 black edges → cut with no black crossing edges (consider cut whose vertices are one connected component)
Efficient implementations. Choose cut? find min-weight edge?
1. Kruskal's algorithm [stay tuned]
2. Prim's algorithm [stay tuned]
3. Boruvka's algorithm
Q. What if edge weights are not all distinct?
- A. Greedy MST algorithm still correct if equal weights are present!
Q. What iff graph is not connected?
- A. Compute m,minimum spanning forest = MST of each components

Edge-weight graph API

Edge abstraction needed for weighted edges
Idiom for processing an edge e: int v = e.either(), w = e.other(v);

public class Edge implements Comparable<Edge> {

    private final int v, w;
    private final double weight;

    public Edge(int v, int w, double weight) {
        this.v = v;
        this.w = w;
        this.weight = weight;
    }

    /**
     * Either endpoint
     */
    public int either() {
        return v;
    }

    /**
     * Other endpoint
     */
    public int other(int vertex) {
        if (vertex == v)
            return w;
        else
            return v;
    }

    @Override
    public int compareTo(Edge that) {
        if (this.weight < that.weight)
            return -1;
        else if (this.weight > that.weight)
            return 1;
        return 0;
    }
}

Graph API which has weighted edges, we will use EdgeWeightedGraph

public class EdgeWeightedGraph {

    private final int vertices;
    private Bag<Edge>[] adj;

    public EdgeWeightedGraph(int v) {
        this.vertices = v;
        adj = (Bag<Edge>[]) new Bag[v];
        for (int i = 0; i < v; i++) {
            adj[v] = new Bag<Edge>();
        }
    }

    public int getVertices() {
        return vertices;
    }

    public void addEdge(Edge edge) {
        int v = edge.either(), w = edge.other(v);
        adj[v].add(edge);
        adj[w].add(edge);
    }

    public Iterable<Edge> adj(int v) {
        return adj[v];
    }
}

Edge-weighted graph: adjacency-list representation
- Maintain vertex-indexed array of Edge lists
Minimum spanning tree API
- Q. How to represent the MST?

public class MST {
    MST(EdgeWeightedGraph G)
    Iterable<Edge> edges() // edges in MST
    double weight() // weight of MST
}

Kruskal's algorithm

classical algorithm for computing MST
Steps:
- Consider edges in ascending order of weight
- Add next edge to tree T unless doing so would create a cycle
- Ignore edge that create a cycle
Proposition. [Kruskal 1956] Kruskal's algorithm computes the MST
- Pf. Kruskal's algorithm is a special case of the greedy MST algorithm
  - Suppose Kruskal's algorithm colors the edge e=v-w black
  - Cut = set of vertices connected to v in tree T
  - No crossing edge is black
  - No crossing edge has lower weight, why?
Challenge. Would adding edge v-w to tree T create a cycle? if not, add it
- V run DFS from v, check if w is reachable (T has at most V-1 edges)
- log^*V use union-find data structure
- Efficient solution. Use the Union-find data structure
  
  . Maintain a set for each connected component in T
  
  . If v and w are in same set, then add v-w would create a cycle
  
  . To add v-w to T, merge sets containing v and w

Code

public class KruskalMST {

    private double weight;
    private Queue<Edge> mst;

    public KruskalMST(EdgeWeightedGraph graph) {
        MinPQ<Edge> pq = new MinPQ<>(graph.getVertices());
        mst = new ArrayQueue<>();
        for (Edge e : graph.edges())
            pq.insert(e);

        UnionFind unionFind = new WeightedQuickUnionPassCompression(graph.getVertices());

        while (!pq.isEmpty() && mst.size() < graph.getVertices() - 1) {
            Edge e = pq.delMin();
            int v = e.either(), w = e.other(v);
            if (!unionFind.isConnected(v, w)) {
                unionFind.union(v, w);
                mst.enqueue(e);
                weight += e.weight();
            }
        }
    }

    public Iterable<Edge> edges() {
        return mst;
    }

    public double weight() {
        return this.weight;
    }
}

Proposition. Kruskal's algorithm computes MST in time proportional to E log E (in the worst case)

Prim's algorithm

Classical algorithm for computing MST
Steps:
- Start with vertex 0 and greedily grow tree T
- Add to T the min weight edge with exactly one endpoint in T
- Repeat until V-1 edges
Proposition. Prim's algorithm computes the MST
- Pf. Prime's algorithm is a special case of the greedy MST algorithm
  - Suppose edge e = min weight edge connecting a vertex on the tree to a vertex not on the tree
  - Cut = set of vertices connected on tree
  - No crossing edge is black
  - No crossing edge has lower weight
Challenge. Find the min weight edge with exactly one endpoint in T
- E ← try all edges
- lg E ← Priority Queue

Prim's algorithm: lazy implementation

Challenge. Find the min weight edge with exactly one endpoint in T
- Lazy solution. Maintain a PQ of edges with (at least) one endpoint in T
  - Key = edge, priority = weight of edge
  - Delete-min to determine next edge e = v-w to add to T
  - Disregard if both endpoints v and w are in T
    
    . add to PQ any edge incident to w (assuming other endpoint not in T)
    
    . add w to T
Prim's algorithm (lazy) demo
- Start with vertex 0 and greedily grow tree T
- Add to T the min weight edge with exactly one endpoint T
- Repeat until V-1 edges

Code

public class LazyPrimMST {

    private double weight;
    private boolean[] marked;
    private Queue<Edge> mst;
    private MinPQ<Edge> pq;

    public LazyPrimMST(EdgeWeightedGraph graph) {
        pq = new MinPQ<>(graph.getVertices());
        mst = new ArrayQueue<>();
        marked = new boolean[graph.getVertices()];

        for (int v = 0; v < graph.getVertices(); v++) { // run Prim from all vertices to
            if (!marked[v])
                prim(graph, v); // get a minimum spanning forest
        }
    }

    private void prim(EdgeWeightedGraph graph, int s) {
        scan(graph, s);

        while (!pq.isEmpty()) {
            Edge e = pq.delMin();
            int v = e.either(), w = e.other(v);
            if (marked[v] && marked[w])
                continue;
            mst.enqueue(e);
            weight += e.weight();
            if (!marked[v])
                scan(graph, v); // v become part of tree
            if (!marked[w])
                scan(graph, w); // w become part of tree
        }
    }

    // add all edges e incident to v onto pq if the other endpoint has not yet been
    // scanned
    private void scan(EdgeWeightedGraph graph, int v) {
        marked[v] = true;
        for (Edge e : graph.adj(v))
            if (!marked[e.other(v)])
                pq.insert(e);
    }

    public Iterable<Edge> edges() {
        return mst;
    }

    public double weight() {
        return this.weight;
    }
}

Proposition. Lazy Prim's algorithm computes the MST in time proportional to E log E

and extra space proportional to E (in the worst case)

- Pf. 

|operation|frequency|binary heap|
|---------|---------|-----------|
|delete min| E | lg E |
|insert| E | lg E |

Prim's algorithm: eager implementation

Challenge. Find the min weight edge with exactly one endpoint in T
Eager solution. Maintain a PQ of vertices connected by an edge to T, where priority of vertex v = weight of shortest edge connected v to T
- Delete min vertex v and add its associated edge e = v-w to T
- Update PQ by considering all edges e=v-x incident to v
  - ignore if x is already in T
  - add x to PQ if not already on it
  - decrease priority of x if v-x becomes shortest edge connecting x to T
Prim's algorithm (eager) demo
- Start with vertex 0 and greedily grow tree *
- Add to T the min weight edge with exactly one endpoint in T
- Repeat until V-1 edges
Indexed priority queue
- Associate an index between 0 and N-1 with each key in a priority queue
  - Client can insert and delete-the-minimum
  - Client can change the key by specify the index
- Implementation
  - Start with same code as MinPQ
  - Maintain parallel arrays keys[], pq[], and qp[] so that:
    1. keys[i] is the priority of i
    2. pq[i] is the index of the key in heap position i
    3. pq[i] is the index of the key in heap position i
    4. qp[i] is the heap position of the key with index i
  - Use swim(qp[k]) implementation

public class IndexMinPQ<Key extends Comparable<Key>> {
    IndexMinPQ(int N)

    void insert(int i, Key key)
    
    void decreaseKey(int i, Key key)

    boolean contains(int i)

    int delMin()

    boolean isEmpty()

    int size()
}

Prims' (eager) algorithm: running time
- Depends on PQ implementation: v insert, v delete-min, E decrease key

PQ implementation	Insert	delete-min	decrease-key	total
array	1	V	1	V²
binary heap	log V	log V	log V	E log V
d-way heap	log_dV	d log_dV	log_dV	E log_1/vV
Fibonacci heap	1⁺	log V^*	1⁺	E + V log V

Note: ⁺ amortized

MST context

Euclidean MST
- Given N points in the plane, find MST connecting them, where the distances between point paris are their Euclidean distances
Scientific application: clustering
- K-clustering. Divide a set of objects classify into K coherent groups
- Distance function. Numeric value specifying "closeness" of two objects
- Goal. Divide into clusters so that objects in different clusters are far apart
Applications
- Routing in mobile ad hoc networks
- Document categorization for web search
- Similarity searching in medical image databases
- Sky-cat: cluster 10⁹ objects into stars, quasars, galaxies

Shortest path

Given an edge weighted digraph, find the shortest path from s to t
Applications:
- Map routing
- Car navigation
- Network routing protocols (OSPF, BGP, RIP)
- Optimal pipelining of VLSI chip
- Robot navigation
Which vertices?
- Source-sink: from one vertex to another
- Single source: from one vertex to every other
- All pairs: between all pairs of vertices
Restrictions on edge weights
- Nonnegative weights
- Arbitrary weights
- Euclidean weights
Cycles?
- No directed cycles
- No "negative cycles"

Weighted directed edge API

Similar to Edge for undirected graphs, but a bit simpler

Code

public class DirectedEdge {

    private final int v, w;
    private final double weight;

    public DirectedEdge(int v, int w, double weight) {
        this.v = v;
        this.w = w;
        this.weight = weight;
    }

    public int from() {
        return v;
    }

    public int to() {
        return w;
    }

    public double weight() {
        return weight;
    }
}

Edge-weighted digraph API

Same as EdgeWeightedGraph except replace Graph with Digraph

Code

public class EdgeWeightedDigraph {

    private final int vertices;
    private final Bag<DirectedEdge>[] adj;

    public EdgeWeightedDigraph(int vertices) {
        this.vertices = vertices;
        adj = (Bag<DirectedEdge>[]) new Bag[vertices];
        for (int v = 0; v < vertices; v++) {
            adj[v] = new Bag<>();
        }
    }

    public void addEdge(DirectedEdge e) {
        int v = e.from();
        adj[v].add(e);
    }

    public Iterable<DirectedEdge> adj(int v) {
        return adj[v];
    }
}

Single-source shortest paths API

Goal. Find the shortest path from s to every other vertex

API

public class SP {
    SP(EdgeWeightedDigraph graph, int s)

    double distTo(int v)
    
    Iterable<DirectedEdge> pathTo(int v)
    
    boolean hasPathTo(int v)
}

Shortest paths properties

Goal. Find the shortest path from s to every other vertex
Observation. A shortest-paths tree (SPT) solution exists, why?
Consequence. Can represent the SPT with two vertices-indexed arrays:
- distTo[v] is length of shortest path from s to v
- edgeTo[v] is last edge on shortest path from s to v

Code

public double distTo(int v) {
    return distTo[v];
}

public Iterable<DirectedEdge> pathTo(int v) {
    Stack<DirectedEdge> path = new Stack<>();
    for (DirectedEdge e = edgeTo[v]; e != null; e = edgeTo[e.from()]) {
        path.push(e);
    }
    return path;
}

Edge relaxation
- Relax edge e = v → w
  - distTo[v] is length of shortest known path from s to v
  - distTo[v] is length of shortest known path from s to w
  - edgeTo[w] is last edge on shortest known path from s to w
  - If e = v→w gives shorter path to w through v, update both distTo[w] and edgeTo[w]

Code

private void relax(DirectedEdge e) {
    int v = e.from(), w = e.to();
    if (distTo[w] > distTo[v] + e.weight() {
        distTo[w] = distTo[v] + e.weight();
        edgeTo[w] = e;
        if (pq.contains(w)) pq.decreaseKey(w, distTo[w]);
        else pq.insert(w, distTo[w]);
    }
}

Shortest-paths optimality conditions
- Proposition. Let G be an edge-weighted digraph then distTo[] are the shortest path distances from s iff
  - distTo[s] = 0
  - For each vertex v, distTo[v] is the length of some path from s to v
  - For each edge = e = v→w, distTo[w] <= distTo[v] + e.weight()
  - Pf. [necessary]
    
    . Suppose that distTo[w] > distTo[v] + e.weight() for some edge e = v→w
    
    . Then, e gives a path from s to w (through v) of length less than distTo[w]
Generic shortest-paths algorithm
- Initialize distTo[s] = 0 and distTo[v] = ∞ for all other vertices
- Repeat until optimality donstions are satisfied
  - Relax any edge
Efficient implementations. How to choose which edge to relax?
- Dijkstra's algorithm (non-negative weights)
- Topological sort algorithm (no directed cycles)
- Bellman-Ford algorithm (no negative cycles)

Dijkstra's algorithm

Dijkstra's algorithm demo
- Consider vertices in increasing order of distance from s (non-tree vertex with the lowest distTo[] value)
- Add vertex to tree and relax all edges pointing from that vertex

Code

public class DijkstraSP {

    private DirectedEdge[] edgeTo;
    private double[] distTo;
    private IndexMinPQ<Double> pq;

    public DijkstraSP(EdgeWeightedDigraph graph, int s) {
        edgeTo = new DirectedEdge[graph.getVertices()];
        distTo = new double[graph.getVertices()];
        pq = new IndexMinPQ<>(graph.getVertices());

        for (int v = 0; v < graph.getVertices(); v++) {
            distTo[v] = Double.POSITIVE_INFINITY;
        }
        distTo[s] = 0.0;

        pq.insert(s, 0.0);
        while (!pq.isEmpty()) {
            int v = pq.delMin();
            for (DirectedEdge e : graph.adj(v)) {
                relax(e);
            }
        }
    }

    private void relax(DirectedEdge e) {
        int v = e.from(), w = e.to();
        if (distTo[w] > distTo[v] + e.weight(){
            distTo[w] = distTo[v] + e.weight();
            edgeTo[w] = e;
            if (pq.contains(w)) pq.decreaseKey(w, distTo[w]);
            else pq.insert(w, distTo[w]);
        }
    }
}

Proposition. Dijkstra's algorithm computes a SPT in any edge-weighted digraph with nonnegative weights
- Pf.
  - Each edge e = v→ w is relaxed exactly once (when v is relaxed) leaving distTo[w] <= distTo[v] + e.weight()
  - Inequality holds until algorithm terminates because
    
    . distTo[w] cannot increase
    
    . distTo[v] will not change
Computing spanning trees in graphs
- Dijkstra's algorithm seems familiar?
  - Prim's algorithm is essentially the same algorithm
  - Both are in a family of algorithms that computes a graph's spanning tree
- Main distinction. Rule uses to choose next vertex for the tree
  - Prim's: Closest vertex to the tree (via an undirected edge)
  - Dijkstra's: Closest vertex to the source (via a directed path)

Note: DFS and BFS are also in this family of algorithms

Dijkstra's algorithm: which priority queue?
- Bottom line
  - Array implementation optimal for dense graphs
  - Binary heap much faster for sparse graphs
  - 4-way heap worth the trouble in performance critical situations
  - Fibonacci heap best in theory, but not worth implementing

PQ implementation	Insert	delete-min	decrease-key	total
array	1	V	1	V²
binary heap	log V	log V	log V	E log V
d-way heap	log_dV	d log_dV	log_dV	E log_1/vV
Fibonacci heap	1⁺	log V^*	1⁺	E + V log V

Edge-weighted DAGs

Q. Suppose that an edge digraph has no directed cycles is it easier to find shortest paths than in general digraph?
- A. Yes
Acyclic shortest paths demo
- Consider vertices in topological order
- Relax all edges pointing from that vertex
Proposition. Topological sort algorithm computes SPT in any edge weighted DAG in time proportional to E + V
- Pf.
  - Each edge e = v→w relaxed exactly once (when v is relaxed)
  - Inequality holds until algorithm terminates because
    
    . distTo[w] cannot increase
    
    . distTo[v] will not change
  - Thus, upon termination, shortest-paths optimality conditions hold

Code

public class AcyclicSP {

    private DirectedEdge[] edgeTo;
    private double[] distTo;

    public AcyclicSP(EdgeWeightedDigraph graph, int s) {
        edgeTo = new DirectedEdge[graph.getVertices()];
        distTo = new double[graph.getVertices()];

        for (int v = 0; v < graph.getVertices(); v++) {
            distTo[v] = Double.POSITIVE_INFINITY;
        }

        distTo[s] = 0.0;

        TopologicalSort topological = new TopologicalSort(graph);
        for (int v : topological.order()) {
            for (DirectedEdge e : graph.adj(v)) {
                relax(e);
            }
        }
    }

    private void relax(DirectedEdge e) {
        int v = e.from(), w = e.to();
        if (distTo[w] > distTo[v] + e.weight()) {
            distTo[w] = distTo[v] + e.weight();
            edgeTo[w] = e;
        }
    }

    public boolean hasPathTo(int v) {
        return distTo[v] < Double.POSITIVE_INFINITY;
    }

    public double distTo(int v) {
        return distTo[v];
    }

    public Iterable<DirectedEdge> pathTo(int v) {
        if (!hasPathTo(v))
            return null;
        Stack<DirectedEdge> path = new ArrayStack<>();
        for (DirectedEdge e = edgeTo[v]; e != null; e = edgeTo[e.from()]) {
            path.push(e);
        }
        return path;
    }
}

Applications
- Content-aware resizing
  - Seam carving. Resizing an image without distortion for display on cell phones and web browsers
  - Photoshop, GIMP, etc
- To find vertical seam:
  - Grid DAG: vertex = pixel, edge = from pixel to 3 downward neighbors
  - Weight of pixel = energy function of 8 neighboring pixels
  - Seam = shortest path (sum of vertex weights) from top to bottom
- To remove vertical seam:
  - Delete pixels on seam (one in each row)
Longest paths in edge-weighted DAGs
- Formulate as a shortest paths problem in edge-weighted DAGs
  
  [equivalent: reverse senes of equality in relax()]
  - Negate all weights
  - Find shortest paths
  - Negate weights in result
- Parallel job scheduling. Given a set of jobs with durations and precedence constraints, schedule the jobs (by finding a start time for each) so as to achieve the minimum completion time, while respecting the constraints
  - Critical path method (CPM)

Negative weights

Dijkstra. Doesn't work with negative edge weights
Negative cycles
- Def. A negative cycle is a directed cycle whose sum of edge weights is negative
Bellman-Ford algorithm
- Initialize distTo[s] = 0 and distTo[v] = ∞ for all other vertices
- Repeat V times:
  - Relax each edge
- Proposition. Dynamic programming algorithm computes SPT in any edge weighted digraph with no negative cycles in time proportional to E x V
  - Pf idea. After passing i, found shortest path containing at most i edges
- Improvement
  - Observation. If distTo[v] does not change during pass i, no need to relax any edge pointing from v in pass i+1
  - FIFO implementation. Maintain queue of vertices whose distTo[] changed
  - Overall effect
    
    . The running time is still proportional to E x V in worst case
    
    . But much faster than in practice

Single source shortest-path implementation: cost summary

algorithm	restriction	typical case	worst case	extra space
topological sort	no directed cycles	E + V	E + V	V
Dijkstra (binary heap)	no negative weights	E log V	E log V	V
Bellman-ford	no negative cycles	E V	E V	V
Bellman-ford (queue-based)	no negative cycles	E + V	E V	V

Remark 1. Directed cycles make the problem harder
Remark 2. Negative weights make the problem harder
Remark 3. Negative cycles make the problem intractable

Finding a negative cycle

Negative cycle. Add two method to the API for SP

boolean hasNegativeCycle()

Iterable<DirectedEdge> negativeCycle()

Observation. If there is a negative cycle, Bellman-Forest gets stuck in loop, un-updating distTo[] and edgeTo[] entries of vertices in the cycle
Proposition. If any vertex v is updated in phase V, there exist a negative cycle (and can trace back edgeTo[v] entries to find it)
- In practice. Check for negative cycles more frequently
Applications
- Arbitrage detection
  - Problem. Given table of exchange rates, is there an arbitrage opportunity?
- Ex. $1.000 → 741 Euros → 1,012.206 Canadian dollars → $1,007.14497
- Currency exchange graph
  - Vertex = currency
  - Edge = transaction, with weight equal to exchange rate
  - Find a directed cycle whose product of edge weights is > 1

Currency	USD	EUR	GBP	CHF	CAD
USD	1	0.741	0.657	1.061	1.011
EUR	1.350	1	0.888	1.433	1.366
GBP	1.521	1.126	1	1.614	1.538
CHF	0.943	0.698	0.620	1	0.953
CAD	0.995	0.732	0.650	1.049	1

Shortest paths summary

Dijkstra's algorithm
- Nearly linear-time when weights are nonnegative
- Generalization en-compresses DFS, BFS and Prim
Acyclic edge-weighted digraphs
- Arise in applications
- Faster than Dijkstra's algorithm
- Negative weights are no problem
Negative weights and negative cycles
- Arise in applications
- If no negative cycles, can find shortest paths via Bellman-Ford
- If negative cycles, can find one via Bellman-Ford
Shortest-paths is a broadly useful problem-solving model

Maximum flow

General problem solving model

Mincut problem

Input. An edge-weighted digraph, source vertex s and target vertex t
Def. A st-cut (cut) is a partition of the vertices into two disjoint sets with s in one set A and t in the other set B
Def. its capacity is the sum of the capacities of the edges from A to B
Minimum st-cut (mincut) problem. Find a cut of minimum capcity
Applications
- Free world goal. Cut supplies (if cold war turns into real war)
- Government-in-power's goal. Cut off communication of set of people

Maxflow problem

Input. An edge-weighted digraph, source vertex s and target t
Every edge has a positve capcity
Def. An st-flow (flow) is an assignment of values to the edges such that:
- Capacity constraint: 0 <= edge's flow <= edge's capacity
- Local equilibrium: inflow = outflow at every vertex (except s and t)
Maximum st-flow (maxflow) problem. Find a flow of maximum value

Ford-Fulkerson algorithm

Steps
- Initialization. Start with 0 flow
- Augmenting path. Find an undirected path from s to t such that:
  - Can increase flow on forward edges (not full)
  - Can decrease flow on backward edge (not empty)
- Termination. All paths from s to t are blocked by either a:
  - Full forward edge
  - Empty backward edge
Ford-Fulkerson algorithm

Start with 0 flow
While there exists an augmenting path

    - find an augmenting path
    - compute bottleneck capcity
    - increase flow on that path by bottleneck capacity

Questions
- How to compute a mincut? Easy
- How to find an augmenting path? BFS works well
- If FF terminates, does it always compute a maxflow? Yes
- Does FF always terminate? If so, after how many augmentations?

Running time analysis

TODO this part

Java code

TODO this part

Strings

String. Sequence of characters
Important fundamentals abstractions
- Information processing
- Genomic sequences
- Communication systems (e.g email)
- Programming systems (e.g java programs)
C char data type
- Typically an 8-bit integer
- Supports 7-bit ASCII
- Can represent only 256 characters
Java char data type
- Support 16-bit Unicode
- Support 21-bit Unicode 3.0 (awkwardly)
String data type
- String data type in Java. Sequence of characters (immutable)
- Length. Number of characters
- Substring extraction. Get a contiguous subsequence of characters
- String concatenation. Append one character to end of another string

Code

public final class String implements Comparable<String> {
    private char[] value; // characters
    private int offset; // index of first char in array
    private int length; // length of substring
    private int hash; // cache of hashCode()

    public int length() { return length;}

    public char charAt(int i) { return value[i + offset];}

    private String(int offset, int length, char[] value) {
        this.offset = offset;
        this.length = length;
        this.value = value;
    }

    public String substring(int from, int to) {
        return new String(offset + from, to - from, value); // copy of reference to char array
    }
}

String data type performance
- Memory 40 + 2N bytes for a virgin String of length of N
- Remark. StringBuffer data type is similar, but thread safe (and slower)
- String operations

operation	guarantee	extra space
length()	1	1
charAt()	1	1
substring()	1	1
concat()	N	N

StringBuilder data type performance
- Remark StringBuffer data type is similar, but thread safe (and slower)
- StringBuilder operations

operation	guarantee	extra space
length()	1	1
charAt()	1	1
substring()	N	N
concat()	1^*	1^*

String vs. StringBuilder
- Q. How to efficiently reverse a string?

Quadratic time

public static String reverse(String s) {
    String rev = "";
    for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i);

    return rev;
}

Linear time

public static String reverse(String s) {
    StringBuilder rev = new StringBuilder();
    for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i));

    return rev.toString();
}

Note: For concat string use StringBuilder , for substring use String

Alphabets

Radix. Number of digits R in alphabet

name	R()	lgR()	characters
Binary	2	1	01
Octal	8	3	01234567
Decimal	10	4	0123456789
HexaDecimal	16	4	0123456789ABCED
DNA	4	2	ACTG
LowerCase	26	5	a-z
UpperCase	26	5	A-Z
Protein	20	5	A-Y
Base64	64	6	A-Za-z0-9+/
ASCII	128	7	ASCII characters
Extended ASCII	256	8	extended ASCII characters
Unicode16	65536	16	unicode characters

Key-indexed counting

We stopped in sorting algorithm at Heap sort
- Lower bound ~ N lg N compares required by any compared-based algorithm
- Q. Can e do better (despite the lower bound)?
  - A. Yes if e don't depend o key compares
Frequency of operations = key compares
Assumption about keys
- Assumption. Keys are integers between 0 and R-1
- Implication. Can use key as an array index
- Applications
  1. Sort string by first letter
  2. Sort class roster by section
  3. Sort phone numbers by area code
  4. Subroutine in a sorting algorithm [stay tuned]

Remark. keys may have associated data => can't just count up number of keys of each value

Goal. Sort an array a[] of N integers between 0 and R-1
Key-indexed counting demo
- Count frequencies of each letter using key as index
- Compute frequency cumulates which specify destination
- Access cumulates using key as index to remove items
- Copy back into original array
Performance ~ 11 N + 4 R and extra space N + R
- R is Radix from table above

Code

    public final static void sort(char[] a) {
        int N = a.length;
        int R = 256;
        int[] count = new int[R + 1];
        char[] aux = new char[N];

        // count frequencies
        for (int i = 0; i < N; i++)
            count[a[i] + 1]++; // offset by 1

        // compute cumulates
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];

        // move items
        for (int i = 0; i < N; i++)
            aux[count[a[i]]++] = a[i]; 

        // copy back
        for (int i = 0; i < N; i++)
            a[i] = aux[i];
    }

LSD radix sort

Least significant digit first string sort
- Consider characters from right to left
- Stably sort using d^th character as the key (using key-indexed counting)
Proposition. LSD sorts fixed-length strings in ascending order
- Pf. [by induction on i]
  - After pass i, strings are sorted by last i characters
    
    . If two strings differ on sort key, key-indexed sort puts them in proper relative order
    
    . If two strings agree on sort key, stability keeps them in proper relative order

Code

public static void sort(String[] a, int W) { 
        int R = 256;
        int N = a.length;
        String[] aux = new String[N];

        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++)
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)
                count[r + 1] += count[r];
            for (int i = 0; i < N; i++)
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)
                a[i] = aux[i];
        }
    }

Problem. Sort one million 32-bit integers
- Ex. Google interview
- Which sorting method to use?
  1. Insertion sort
  2. Mergesort
  3. Quicksort
  4. Heapsort
  5. LSD string sort
Summary of the performance of sorting algorithms

algorithm	guarantee	random	extra space	stable?	operations on keys
insertion sort	N²/2	N²/4	1	yes	compareTo()
mergesort	N lg N	N lg N	N	yes	compareTo()
quicksort	1.39 N lg N	1.39 N lg N	c lg N	no	compareTo()
heapsort	2 N lg N	2 N lg N	1	no	compareTo()
LSD^*	2 W N	2 W N	N + R	yes	charAt()

MSD radix sort

Most significant digit first string sort
Steps
- Partition array into R pieces according to first character (use key-indexed counting)
- Recursively sort all strings that start ith each character (key-indexed counts delineate subarrays to sort)
Variable-length strings
- Treat strings as if they had an extra char at end (smaller than any char)
- C strings. Have extra char '\0' at end → no extra work to do

Code

  public final static void sort(String[] a) {
        String[] aux = new String[a.length];
        sort(a, aux, 0, a.length - 1, 0);
    }

    private final static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo)
            return;
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++) {
            int c = charAt(a[i], d);
            count[c + 2]++;
        }
        for (int r = 0; r < R + 1; r++)
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++) {
            int c = charAt(a[i], d);
            aux[count[c + 1]++] = a[i];
        }
        for (int i = lo; i <= hi; i++)
            a[i] = aux[i - lo];

        for (int r = 0; r < R; r++)
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1); // sort R subarrays recursively
    }

    private final static int charAt(String s, int d) {
        if (d < s.length())
            return s.charAt(d);
        else
            return -1;
    }

Performance
- Observation 1. Much too slow for small subarrays
  - Each function call needs its own count[] array
  - ASCII (256 counts) 100x slower than copy pass for N=2
  - Unicode (65,536 counts): 32.000x slower for N = 2
- Observation 2. Huge number of small subarrays because of recursion
- Improvement
  1. Cutoff to insertion sort for small subarrays
- Number of characters examined
  - MSD examines just enough characters to sort the keys
  - Number of characters examined depends on keys
  - Can be sublinear in input size
Disadvantages of MSD string sort
- Access memory "randomly" (cache inefficient)
- Inner loop has a lot of instructions
- Extra space for count[]
- Extra space for aux[]
Disadvantages of quicksort
- Linearithmic number of string compares (not linear)
- Has to re-scan many characters in keys with long prefix matches
Complexity

algorithm	guarantee	random	extra space	stable?	operations on keys
insertion sort	N²/2	N²/4	1	yes	compareTo()
mergesort	N lg N	N lg N	N	yes	compareTo()
quicksort	1.39 N lg N	1.39 N lg N	c lg N	no	compareTo()
heapsort	2 N lg N	2 N lg N	1	no	compareTo()
LSD^*	2 W N	2 W N	N + R	yes	charAt()
MSD^*	2 W N	N log₂N	N + D R	yes	charAt()

3-Way radix quicksort

Combines benefits of Quicksort and MSD radix
Overview. Do 3-way partitioning on the d^th character
- Less overhead than R-way partition in MSD string sort
- Does not re-examine characters equal to the partition char (but does re-examine characters not equal to the partition char)

Code

public final class MSDQuickSort {

    public final static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    private final static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo)
            return;
        int lt = lo, gt = hi;
        int v = charAt(a[lo], d);
        int i = lo + 1;
        while (i <= gt) {
            int t = charAt(a[i], d);
            if (t < v)
                swap(a, lt++, i++);
            else if (t > v)
                swap(a, i, gt--);
            else
                i++;
        }
        sort(a, lo, lt - 1, d);
        if (v >= 0)
            sort(a, lt, gt, d + 1);
        sort(a, gt + 1, hi, d);
    }

    private final static int charAt(String s, int d) {
        if (d < s.length())
            return s.charAt(d);
        else
            return -1;
    }

    private static void swap(String[] a, int i, int j) {
        String temp = a[i];
        a[i] = a[j];
        a[j] = swamp;
    }
}

Performance
- Uses ~ 2 N ln N character compares on average for random string

3-way string quicksort vs. MSD string sort

MSD string sort
- Is cache-inefficient
- Too much memory sortig count[]
- Too much overhead reinitialiazing count[] and aux[]
3-way string quicksort
- Has a short inner loop
- Is cache-friendly
- Is in-place
Complexity

algorithm	guarantee	random	extra space	stable?	operations on keys
insertion sort	N²/2	N²/4	1	yes	compareTo()
mergesort	N lg N	N lg N	N	yes	compareTo()
quicksort	1.39 N lg N	1.39 N lg N	c lg N	no	compareTo()
heapsort	2 N lg N	2 N lg N	1	no	compareTo()
LSD⁺	2 W N	2 W N	N + R	yes	charAt()
MSD⁺	2 W N	N log₂N	N + D R	yes	charAt()
3-way string quicksort	1.39 W N lg N	1.39 W N lg N	log N + W	no	charAt()

Suffix arrays

Keyword-in-context search
- Given a text of N characters, preprocess it to enable fast substring search (find all occurrence of query string context)
- Applications
  - Grep
  - Linguistics
  - Databases
  - Web search
  - Word processing
- Suffix sort
  - Take the input string and form suffixes
  - Sort suffixes to bring repeated substrings together
  - You can use Binary search
- Suffix-sort soltion
  - Preprocess: suffix sort the text
  - Query: binary search for query, scan until mismatch
Longest repeated substring
- Given a string of N characters, find the longest repeated substring
- Applications
  - Bioinformatics
  - Cryptanalysis
  - Data compression
  - Visualize repetitions in music
- Brute-force algorithm
  - Try all indices i and j for start of position match
  - Compute longest common prefix (LCP) for each pair
  - The same solution before: Suffix-sort solution

Code

public String lrs(String s) {
    int N = s.length();
    String[] suffixes = new String[N];
    for (int i = 0; i < N; i++) suffix[i] = s.substring(i, N);

    Arrays.sort(suffixes);

    String lrs = "";
    for (int i = 0; i < N-1; i++) {
        int len = lcp(suffix[i], suffix[i+1]);
        if (len > lrs.length()) lrs = suffixes[i].substring(0, len);
    }
    return lrs;
}

Sorting challenge
- Problem. Five scientists A, B, C, D, and E are looking for long repeated substring in a genome with over 1 billion nucleotides
  - A has a grad student do it by hand
  - B uses brute force (check all pairs)
  - C uses suffix sorting solution with insertion sort
  - D uses suffix sorting solution wit mergesort
  - ✔ E uses suffix sorting solutio with 3-way string quicksort (but only if LRS is not long)
- Q. Which one is more likely to lead to a cure for cancer?
- Problem. Suffix sort an arbitrary string of length N
  - Q. What is the worst-case running time of best algorithm for problem?
    
    . ✔ Linearithmic (Manber-Myers algorithm)
    
    . ✔ Linear (suffix trees)

Search in String

Data structure for searching in String
Complexity for previous searching algorithms

algorithm	search	insert	delete	ordered operations	operations on keys
red-black BST	lg N	lg N	lg N	yes	compareAt()
hash table	1⁺	1⁺	1⁺	no	equals() hashCode()

Q. Can we do better?
- A. yes, if we can avoid examining the entire key, as with string sorting
String symbol table API

public class StringST<Value> {
    StringST()

    void put(String key, Value val)

    void get(String key)

    void delete(String key)
}

Goal. Faster than hashing, more flexible than BSTs
Challenge. Efficient performance for string keys

R-way Tries

From retrieval, but pronounced "try", to distribute from binary search tree
- For now store characters in nodes (not key)
- Each node has R children, one for each possible character, where R is the possible of character
- Store values in nodes corresponding to last characters in keys
Search in a trie
- Follow links corresponding to each character in the key
  - Search hit: node where search ends has a non-null value
  - Search miss: reach null link or node where search ends has null value
Insertion into a trie
- Follow links correspoinding to each character in the key
  - Encounter a null link: create new node
  - Encounter the last character of the key: set value in that node
Delete in an R-way tries
- Find the node corresponding to key and set value to null
- If node has null value and all null links, remove that node (and recur)
Trie performance
- Search hit. Need to examine all L characters for equality
- Search miss.
  - Could have mismatch on
- first character
  - Typical case examine only a few characters
- Space. R null links at each leaf. (but sub-linear space possible if many short strings share common prefixes)
- Bottom line. Fast search hit and even faster search miss, but wastes space
Goal. Design a data structure to perform efficient spell checking
- Solution. Build a 26-way trie (key = word, value = bit), 26 English letters

Code

public class TriesST<Value> {

    private final static int R = 256;

    private class Node {
        private Object val;
        private Node[] next = (Node[]) new Object[R];
    }

    private Node root;

    public void put(String key, Value val) {
        root = put(root, key, val, 0);
    }

    private Node put(Node node, String key, Value val, int d) {
        if (node == null)
            node = new Node();
        if (d == key.length()) {
            node.val = val;
            return node;
        }
        char c = key.charAt(d);
        node.next[c] = put(node.next[c], key, val, d + 1);
        return node;
    }

    public boolean contains(String key) {
        return get(key) != null;
    }

    public Value get(String key) {
        Node node = get(root, key, 0);
        if (node == null)
            return null;
        return (Value) node.val;
    }

    private Node get(Node node, String key, int d) {
        if (node == null)
            return null;
        if (d == key.length())
            return node;
        char c = key.charAt(d);
        return get(node.next[c], key, d + 1);
    }
}

Comparison

algorithm	search	search miss	insert	space
red-black BST	L + c lg² N	c lg²N	c lg²N	4N
hash table (linear probing)	L	L	L	4N to 16N
R-way tries	L	log_*N	L	(R+1)N

R-way tries
- Method of choice for small R
- Too much memory for large R
Challenge. Use less memory, e.g. 65,536-way trie for Unicode!

Ternary search tries

Store characters and values in nodes (not keys)
Each node has 3 children: small(left), equal(middle), larger(right)
Search in a TST
- Follow links corresponding to each character in the key
  - If less take left link, if greater take right link
  - If equal take the middle link and move to the next key character
- Search hit. Node where search ends has a non-nul value
- Search miss. Reach a null link or node where search node ends has null value
26-way trie vs. TST
- 26-way trie. 26 null links in each leaf
- TST. 3 null links in each leaf

Code

public class TernaryST<Value> {

    private class Node {
        Value val;
        char c;
        Node left, mid, right;
    }

    private Node root;

    public void put(String key, Value val) {
        root = put(root, key, val, 0);
    }

    private Node put(Node node, String key, Value val, int d) {
        char c = key.charAt(d);
        if (node == null) {
            node = new Node();
            node.c = c;
        }
        if (c < node.c)
            node.left = put(node.left, key, val, d);
        else if (c > node.c)
            node.right = put(node.right, key, val, d);
        else if (d < key.length() - 1)
            node.mid = put(node.mid, key, val, d + 1);
        else
            node.val = val;
        return node;
    }

    public boolean contains(String key) {
        return get(key) != null;
    }

    public Value get(String key) {
        Node node = get(root, key, 0);
        if (node == null)
            return null;
        return node.val;
    }

    private Node get(Node node, String key, int d) {
        if (node == null)
            return null;
        char c = key.charAt(d);
        if (c < node.c)
            return get(node.left, key, d);
        else if (c > node.c)
            return get(node.right, key, d);
        else if (d < key.length() - 1)
            return get(node.mid, key, d + 1);
        else
            return node;
    }
}

Comparison

algorithm	search	search miss	insert	space
red-black BST	L + c lg² N	c lg²N	c lg²N	4N
hash table (linear probing)	L	L	L	4N to 16N
R-way tries	L	log_RN	L	(R+1)N
TST	L + ln N	ln N	L + ln N	4 N

Remark. can build balanced TST via rotations to achieve L + Log N worst-case guarantees

String symbol table implementation cost summary

algorithm	search	search miss	insert	space
red-black BST	L + c lg² N	c lg²N	c lg²N	4N
hash table (linear probing)	L	L	L	4N to 16N
R-way tries	L	log_*N	L	(R+1)N
TST	L + ln N	ln N	L + ln N	4 N
TST with R²	L + ln N	ln N	L + ln N	4 N + R²

Bottom line. TST is as fast as hashing (for string keys), space efficient
TST vs. hashing
- Hashing
  1. Need to examine entire key
  2. Search hits and misses cost about the same
  3. Performance relies on hash function
  4. Does not support ordered symbol table operations
- TSTs
  1. works only for strings (or digital keys)
  2. Only examines just enough key characters
  3. Search miss may involve only a fe characters
  4. Supports ordered symbol table operations (plus others)
Bottom line. TSTs are:
- Faster than hashing (especially for search misses)
- More flexible than red-black BSTs [stay tuned]

Character based operations

The string symbol table API supports several useful character-based operations
Prefix match. keys with prefix "sh": "she", "shells" and "shore"
wildcard match. keys that match ".he": "she" and "the"
Longest prefix. key that is the longest prefix of "shellsort": "shells"
Prefix matches
- Find all keys in a symbol table starting with a given prefix
- Ex. Autocomplete in a cell phone, search bar (Google search), text editor or shell
  1. User types characters one at a time
  2. System reports all matching strings
T9 texting
- Goal. type text messages on a phone keypad
- Multi-tap input. Enter a letter by repeatedly pressing a key until the desired
- T9 text input
  1. Find all words that correspond to given sequence of numbers
  2. Press 0 to see all completion options
Patricia trie
- Remove one-way branching
- Each node represents a sequence of characters
- Implementation: one step beyond this course
- Applications
  1. Database search
  2. P2P network search
  3. IP routing table: find longest prefix match
  4. Compressed quad-tree for N-body simulation
  5. Efficiently storing and query XML documents

String symbol tables summary

A success stroy in algorithm design and analysis
Red-black BST
- Performance guarantee: log N key compares
- Supports ordered symbol table API
Hash tables
- Performance guarantee: constant number of probes
- Requires good hash function for key type
Tries. R-way TST
- Performance guarantee: log N characters accessed
- Support character-based operations

Bottom line. You can get at anything by examining 50-100 bits!!!!!

Substring search algorithms

Goal. find pattern of length M in a text of length N (typically N > M)
Applications
- Find and replace
- computer forensics. Search memory or disk for signatures, e.g. al URLs or RSA keys that the user has entered
- Identify patterns indicative of spam
  1. Profits
  2. Lose weight
  3. herbal viagra
  4. there is no catch
  5. this is a one-time mailing
  6. this message is sent in compliance with spam regulations
- Screen scraping. Extract relevant data from web page
  1. Ex. find string delimited by and after first occurrence of pattern last trade

Brute-force substring search

Check for pattern starting at each text position
Worst case ~ M N char compares
Brute-force algorithm can be slow if text and pattern are repetitive

Code

    public final static int indexOf(String text, String sub) {
        int N = text.length();
        int M = sub.length();

        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                if (sub.charAt(i + j) != text.charAt(j))
                    break;
                if (j == M)
                    return i; // index in text where patterns starts
            }
        }
        return N; // not found
    }

Bad performance for search in repeated e.g. "AB" in "AAAAAAAAAB"
- The pattern could be so long
Backup
- In many applications, we want to avoid backup in text stream, we will see in the next algorithms
  1. Treat input as stream of data
  2. Abstract model: standard input
- Approach 1. Maintain buffer of lat M characters
- Approach 2. Stay tuned

    /**
     * Alternative impl brute-force with explicit backup
     * 
     * @param text String
     * @param sub  String
     * @return int
     */
    public final static int indexOfEnhanced(String text, String sub) {
        int i, N = text.length();
        int j, M = sub.length();

        for (i = 0, j = 0; i < N && j < M; i++) {
            if (text.charAt(i) == sub.charAt(j))
                j++;
            else { // backup
                i -= j;
                j = 0;
            }
        }
        if (j == M)
            return i - M;
        else
            return -1;
    }

Brute-force is not always good enough
Theoretical challenge. Linear-time guarantee
Practical challenge. Avoid backup in text stream

Knuth-Morris-Pratt (KMP)

Coolest algorithm :)
This is the algorithm when theoretical meets practices, this discovered by two theoreticians and a hacker
Intuition. Suppose e are search in text for pattern BAAAAAAAAA
KMP. Clever method to avoid backup in brute-force substring
Deterministic finite state automaton (DFA)
- Finite number of states (includes start and halt)
- Exactly one transition for each char in alphabet
- Accept if sequence of transitions leads to halt state
Q. what is interpretation of DFA state after reading in txt[i]?
- A. State = number of characters in pattern that have been matched
Key difference from brute-force implementation
- Need to pre-compute dfa[][] from pattern
- Text pointer i never decrements
Running time
- Simulate DFA on text: at most N character accesses
- Build DFA: ho to do efficiently? [warning: tricky algorithm ahead!]
How to build DFA from pattern?
- If in state j and next char c != pat.charAt(j) , then the last j-1 characters of input are pat[1..j-1] followed by c
Proposition. KMP substring search access no more than M + N chars to search for a pattern of length M in a text of length N
- Pf. each pattern char accessed once when constructing DFA, each text char accessed once (in the worst case) when simulating of the DFA
Proposition. KMP constructs dfa[][] in time and space proportional to R M
- Because we have all of those mismatches

Code

public class KMP {

    private static final int R = 256;

    private int M;
    private int[][] dfa;

    public KMP(String pattern) {
        M = pattern.length();
        dfa = new int[R][M];
        dfa[pattern.charAt(0)][0] = 1;
        for (int i = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][i]; // copy mis-match cases
            }
            dfa[pattern.charAt(j)][j] = j + 1; // set match case
            i = dfa[pattern.charAt(j)][i]; // update restart state
        }
    }

    public int search(String text) {
        int i, j, N = text.length();
        for (i = 0, j = 0; i < N && j < M; i++) {
            j = dfa[text.charAt(i)][j]; // no backup
        }
        if (j == M)
            return i - M;
        else
            return N;
    }
}

Boyer-Moore

KMP in linear time can we improve over that? yes : D
Rather than scan elements from left to right will do scan from right to left

, to know fast if then end pattern matches or not to avoid from 1 to M-1

Key here: find the mismatch quickly
Steps
- Scan characters in pattern from right to left
- Can skip as many as M text chars when finding one not in the pattern
How much to skip?
- Pre-compute index of rightmost occurrence of character c in pattern (-1 if character not in the pattern)
Property. substring search with the Boyer-Moore mismatched character heuristic takes about ~ N / M character compares to search for a pattern of length M in a text of length N
worse-case can be as bad as ~ M N
Boyer-Moore variant. Can improve worst case to ~ 3 N by adding a KMP-like rule to guard against repetitive patterns

Code

public class BoyerMoore {

    private static final int R = 256;
    private int[] right;
    private String pat;

    public BoyerMoore(String pattern) {
        this.pat = pat;
        right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;
    }

    public int indexOf(String text) {
        int N = text.length();
        int M = pat.length();
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {
                if (pat.charAt(j) != text.charAt(i + j)) {
                    skip = Math.max(1, j - right[text.charAt(i + j)]);
                    break;
                }
            }

            if (skip == 0)
                return i;
        }
        return -1; // not found
    }
}

Rabin-Karp

Basic idea = modular hashing
- Computer a hash of pattern characters 0 to M-1
- For each i, compute a hash of text characters i to M+i-1
- If pattern hash = text substring hash check for a match
Efficiently computing the hash function
- Modular hash function. Using the notation t_i for txt.charAt(i)
  - We wish to compute x_i = t_i R^M-1* + t_i+1 R^M-2+...+t_i+M-1 R⁰ (mod Q)
- Horner's method. Linear-time method to evaluate degree-M polynomial

Code

    private long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

Complexity 7 N
Rabin-Karp search
- Advantages
  - Extends to 2d patterns
  - Extends to finding multiple patterns
- disdvantages
  - Arithmetic ops slower than char compares
  - Las Vegas version require backup
  - Poor worst-case guarantee

Code

/**
 * Rabin-Karp substring search for operation Index of Which return the first
 * occupance of substring in text
 */
public class RabinKarp {

    private static final int R = 256;

    private final String pattern;
    private final long patternHash;
    private final int M; // pattern length
    private final long Q; // modulus
    private long RM;

    public RabinKarp(String pattern) {
        this.pattern = pattern;
        M = pattern.length();
        Q = longRandomPrime();

        // R^(M-1) % Q
        RM = 1;
        for (int i = 1; i <= M - 1; i++) {
            RM = (R * RM) % Q;
        }
        patternHash = hash(pattern, M);
    }

    // a random 31-bit prime
    private static long longRandomPrime() {
        BigInteger prime = BigInteger.probablePrime(31, new Random());
        return prime.longValue();
    }

    private long hash(String key, int M) {
        long h = 0;
        for (int i = 0; i < M; i++) {
            h = (R * h + key.charAt(i)) % Q;
        }
        return h;
    }

    public int search(String text) {
        int N = text.length();
        if (N < M)
            return -1;

        // hash of first M characters in text
        long h = hash(text, M);

        if ((patternHash == h) && check(text, 0))
            return 0;

        // check for hash match; if hash match, check for exact match
        for (int i = M; i < N; i++) {
            // Remove leading digit
            h = (h + Q - RM * text.charAt(i - M) % Q) % Q;

            // Add trailing digit
            h = (R * h + text.charAt(i)) % Q;

            // Check match
            int offset = i - M + 1;
            if ((patternHash == h) && check(text, offset))
                return offset;
        }
        return -1;
    }

    // Las Vegas version: does pattern[] match text[i..i-m+1] ?
    private boolean check(String text, int i) {
        for (int j = 0; j < M; j++) {
            if (pattern.charAt(j) != text.charAt(i + j)) {
                return false;
            }
        }
        return true;
    }

    // Monte Carlo version: always return true
    // private boolean check(String text, int i) {
    // return true;
    // }
}

Compare the substring algorithms

algorithm	version	guarantee	typical	backup in input	correct
brute force	M N	1.1 N	yes	yes	1
knuth-Morris-Pratt (full DFA)	2 N	1.1 N	no	yes	M R
knuth-Morris-Pratt (mismatch transitions only)	3 N	1.1 N	no	yes	M
Boyer-Moore (full algorithm)	3 N	N / M	yes	yes	R
Boyer-Moore (mismatched char)	M N	N / M	yes	yes	R
Robin-Karp (monte carlo)	7 N	7 N	no	yes^*	1
Robin-Karp (las vegas)	7 N^*	7 N	yes	yes	1

* probability guarantee with uniform hash function

Regular expression

Pattern matching
- Substring search. Find a string iin text
- Pattern matching. Find one of a specified set of strings in text
- Ex. [genomics]
  - Fragile X syndrome is a common cause of metntal retardation
  - Human genome contains triplet repeats of CGG or AGG bracketed by GCC at the beginning and CTG at the end
  - Pattern GCG(GGC|AGG)*CTG
Applicatons
- Process natural language
- Scan for virus signature
- Specify a programming language
- Access information in digital libraries
- Search genome using PROSITE patterns
- Filter text (spam, NetNanny, Carnivore, malware)
- Validate data-entry fields (dates, email, URL, credit card)
- Parse text files
  - Compile a Java program
  - Crawl and index the web
  - Read in data stored in ad hoc input file format
  - Create Java docmentation from Javadoc comments
Regular expressions
- Is a notation to specify a set of string possible to infinite

operation	order	example RE	matches	does not match
concatenation	3	AABAAB	AABAAB	every other string
or	4	AA	BAAB	AA BAAB
closure	2	AB*A	AA ABBBBBBBA	AB ABABA
parentheses	1	(AB)*A	A ABABABABA	AA ABBA

NFA

Duality between REs and DFAs
- RE. Concise way to describe a set of strings
- DFA. Machine to recognize whether a given string is in a given set
- Kleene's theorem
  - For any DFA, there exists a RE that describes the same set of strings
  - For any RE, there exists a DFA that recognizes the same set of strings
Pattern matching implementation: basic plan (first attempt)
- Overview is the same as for KMP
  - No backup in text input stream
  - Linear-time guarantee
- Underlying abstraction. Deterministic finite state automata (DFA)
- Basic plan. [apply Kleen's theorem]
  - Build DFA from RE
  - Simulate DFA with text as input
- Bad news. Basic plan is infeasible (DFA may have exponential # of states)
Pattern matching implementation: basic plan (revised)
- Overview is the same as for KMP
  - No backup in text input stream
  - Quadratic-time guarantee (linear-time typical)
- Underlying abstraction. Nondeterministic finite state automata (NFA)
- Basic plan. [apply Kleen's theorem]
  - Build NFA from RE
  - Simulate NFA with text as input
Nondeterministic finite-state automata
- Regular expression matching NFA
  - RE enclosed in parantheses
  - One state per RE character (start = 0, accept = M)
  - Red e-transition (change state, but don't scan text)
  - Black mtach transition (change state and scan to next text char)
  - Accept if any sequence of transitions ends in accept state

NFA simulation

NFA representation
- State names. Integer from 0 to M
- Match-transitions. Keep regular expression in array re[]
- e-transitions. Store in digraph G
Q. How to efficiently simulte an NFA?
- A. Maintain set of all possible states that NFA could be in after reading the first i text characters

NFA constuction

Sates. Incluse a state for each symbol in the RE, plus an accept state
Proposition. Building the NFA corresponding to an M-character RE takes time and space proportional to M
- Pf. For each of the M characters in the RE, we add at most three e-transitions and execute at most two stack operations

Code

public class NFA {

    private char[] re;
    private Digraph graph;
    private int M;
    private String regexp;

    public NFA(String regexp) {
        this.regexp = regexp;
        this.re = regexp.toCharArray();
        this.M = regexp.length();
        this.graph = buildEpsilonTransitionDigraph();
    }

    private Digraph buildEpsilonTransitionDigraph() {
        Digraph g = new Digraph(M + 1);
        Stack<Integer> ops = new ArrayStack<>();
        for (int i = 0; i < M; i++) {
            int lp = i;
            if (regexp.charAt(i) == '(' || regexp.charAt(i) == '|')
                ops.push(i);
            else if (regexp.charAt(i) == ')') {
                int or = ops.pop();
                if (regexp.charAt(or) == '|') {
                    lp = ops.pop();
                    g.addEdge(lp, or + 1);
                    g.addEdge(or, i);
                } else
                    lp = or;
            }

            if (i < M - 1 && regexp.charAt(i + 1) == '*') {
                g.addEdge(lp, i + 1);
                g.addEdge(i + 1, lp);
            }
            if (regexp.charAt(i) == '(' || regexp.charAt(i) == '*' || regexp.charAt(i) == ')')
                g.addEdge(i, i + 1);
        }
        return g;
    }

    public boolean recognize(String txt) {
        Bag<Integer> pc = new Bag<>();
        DirectedDFS dfs = new DirectedDFS(graph, 0);
        for (int v = 0; v < graph.getVertices(); v++) {
            if (dfs.visited(v))
                pc.add(v);
        }

        for (int i = 0; i < txt.length(); i++) {
            Bag<Integer> match = new Bag<>();
            for (int v : pc) {
                if (v == M)
                    continue;
                if ((re[v] == txt.charAt(i)) || re[v] == '.')
                    match.add(v + 1);
            }

            dfs = new DirectedDFS(graph, match);
            pc = new Bag<>();
            for (int v = 0; v < graph.getVertices(); v++) {
                if (dfs.visited(v))
                    pc.add(v);
            }
        }

        for (int v : pc)
            if (v == M)
                return true;
        return false;
    }
}

Applications

Grep. Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE
- Bottom-line. Worst-case for grep (proportional to M N) is the same as for brute-force substring search
- Typical grep application: Crossword puzzles

Code

public class Grep {
    public static void main(String[] args) throws FileNotFoundException {
        String regexp = "(.*" + args[1] + ".*)";
        NFA nfa = new NFA(regexp);
        Scanner scanner = new Scanner(new File(args[0]));
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (nfa.recognize(line))
                System.out.println(line);
        }
    }
}

To complete the implementation
- Add wildcard
- Add multiway or
- Handle metacharacters
- Support character classes
- Add capturing capabilities
- Extend the closure operator
- Error checking and recovery
- Greedy vs. reluctant matching
Broadly applicables programmer's tool
- Originated in Unix in the 1970s
- Many languages support extended regular expression
- Built into grep, awk, emacs, Perl, PHP, Python, Javascript

Context

Abstract machines, languages, and nondeterminism
- Basis of the theory of computation
- Intensively studied since the 1930s
- Basis of programming language
Compiler. A program that translates a program to machine code
- KMP string → DFA
- grep RE → NFA
- javac Java language → Java byte code

operation	KMP	grep	Java
Pattern	string	RE	program
compiler output	DFA	NFA	byte code
simulator	DFA simulator	NFA-simulator	JVM

Data compression

Introduction

Compression reduces size of a file
- To save space when sorting it
- To save time when transmitting it
- Most files have lots of redundancy
Who needs compressions?
- Moore's law: # transistors on a chip doubles every 18-24 months
- Parkinsons' law: data expands to fill space available
- Text, image, sound, video, ....
Applications
- Generic file compression
  1. Files: GZIP, BZIP, 7z
  2. Archives: RKZIP
  3. File systems: NTFS, HFS+, ZFS
- Multimedia
  1. Images: GIF, JPEG
  2. Sound: MP3
  3. Video: MPED, DivX, HDTV
- Communication
  1. ITU-T T4 Group 3 fax
  2. Skype
- Databases
  1. Google
  2. Facebook
Lossless compression and expansion
- Message. Binary data B we want to compress
- Compress. Generates a "compressed" representation C(B)
- Expand. Reconstructs original bit-stream B
- Compression ration. Bits in C(B) / bits in B
  - Ex. 50-75% or better compression ration for natural language
Food for thought
- Data compression has been omnipresent since antiquilty
  - Number systems
  - Natural languages
  - Mathematical notation
- has played a central role in communication technology
  - Grade 2 Braille
  - Morse code
  - Telephone system
- And is part of modern life
  - MP3
  - MPEG
Another application. Data representation: genomic code
- Genome. String over the alphabet { A, C, T, G }
  - Goal. Encode an N-character genome: ATAGATGCATِG...
  - Standard ASCII encoding
    
    . 8 bits per chars
    
    . 8 N its
    
    . because they are 4 characters you can use 2 bits to store them
- Fixed-length code. A-bit code support alphabet of size 2^k
- Amazing but true. Initial genomic databases in 1990s used ASCII

Run length coding

Simple type of redundancy in a bitstream. Long runs of repeated bits
- This string

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1

Representation. 4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then7 1s, then 7 0s, then 11 1a
Q. How many bits to store the counts?
- A. We'll use 8 (but 4 in the example above)
Q. What to do when run length exceeds max count?
- A. if longer than 55, intersperse runs on length -
Applications. JPEG, ITU-T T$ Group 3 fax, ....

Huffman compression

Variable-length codes
Use different number of bits to encode different chars
Ex. Morse code: ... - - - . . .
- Issue. Ambiguity
  - SOS ?
  - V7 ?
  - IAMIE ?
  - EEWNI ?
- In practice. Use a medium gap to separate codewords
Q. How do we avoid ambiguity?
- A. Ensure that no codeword is a prefix of another
  - Ex 1. Fixed-length code
  - Ex 2. General prefix-free code
  - Ex 3. General prefix-free code
Q. How to represent the prefix-free code?
- A. binary trie!
  - Chars in leaves
  - Codeword is path from root to leaf
Compression
- Method 1: start at leaf, follow path up to the root, print bits in reverse
- Method 2: creates ST of key-value pairs
Expansion
- Start at root
- Go left if bit is 0, go right if 1
- If leaf node, print char and return to root
Q. How to write the trie?
- A. Write pre-order traversal of trie, mark leaf and internal nodes with a bit
How to find best prefix-free code?
- Shannon-Fano algorithm
  - Partition symbols S into two subsets S0 and S1 of (roughly) equal freq
  - Codewords for symbols in S0 start with 0, for symbols in S1 start with 1
  - Recur in S0 and S1
- Problem 1. How to divide up symbols?
- Problem 2. Not optimal!
Huffman algorithm demo
- Start with one node corresponding to each character wit weight equal to frequency
- Select two tries with min weight
- Merge into single trie with cumulative weight
Applications
- JPEG
- PDF adobe
- MP3
- Divx MPEG4
- Gzip

Huffman codes

Q. How to find best prefix-free code?

. Huffman algorithm:

  01. Count frequency freq\[i\] for each char i in input
  02. Start with one node corresponding to each char i \(with weight freq\[i\]\)
  03. Repeat until single trie formed

      Select two tries with min weight freq[i] and freq[j]  
      Merge into single trie with freq[i] + freq[j]

Running time. Use a binary heap N + R log R

LZW compression

Static model. Same model for all texts
- Fast
- Not optimal: different texts have different statistical properties
- Ex: ASCII, Morse code
Dynamic model. Generate model based on text
- Preliminary pass needed to generate model
- Must transmit the model
- Ex: Huffman code
Adaptive model. Progressively learn and update model as you read text
- More accurate modeling produces better compression
- Decoding must start from beginning
- EX: LZW
Steps
1. Create ST associating W-bit codewords with string keys
2. Initialize ST with codewords for single-char keys
3. Find longest string s in ST that is a prefix of un-scanned part of input
4. Write the w-bit codeword associated with s
5. Add s+c to ST, where c is next char in the input
Q. How to represent LZW compression code table?
- A. A trie to support longest prefix match
How big to make ST?
- How long is message?
- While message similar model
- [many variations have been developed]
What to do when ST fills up?
- Throw away and start over [GIF]
- Throw away when not effective [Unix compression]
- [many other variations]
LZW in the real world
- Lempel-Ziv and friends
  
  . LZ77 . LZ78 . LZW . Deflate / zlib = LZ77 variant + huffman

Data compression summary

Lossless compression
- Represent fixed-length symbols with variable-length codes [huffman]
- Represent variable-length symbols with fixed-length codes [LZW]
Lossy compression [not covered in this course]
- JPEG, MPEG, MP3
- FFT, wavelets, fractals

Overview: introduction to advanced topics

Reduction: design algorithms, establish lower bounds, classify problems
Linear programming: the ultimate practical problem-solving model
Intractability: problems beyond our reach
Shifting gears
- From individual problems to problem-solving models
- From linear/quadratic to polynomial/exponential scale
- From details of implementation to conceptual framework
Goal
- Place algorithms we've studied in a larger context
- Introduce you to important and essential ideas
- Inspire you to learn more about algorithms 😄 !

Reductions

Classify problems according to computational requirements

complexity	order of growth	examples
linear	N	min, max, median, burrows-wheeler transform
linearithmic	N log N	sorting, convex hull, closest pair, farthest pair
quadratic	N2	?
.....	.....	.....
exponential	c^N	?

Frustrating news. Huge number of problems have defined classification
Suppose we could (could not) solve problem X efficiently

what else could (could not) we solve efficiently?

Definition
- Problem X reduces to problem Y if you can use an algorithm that solves Y to help solve X
- Cost of solving X = total cost of solving Y + cost of reduction
- Ex 1. [finding the median reduces to sorting]
  
  . To find median of N items
```
  01. Sort N items
  02. Return item in the middle
```
  . Cost of solving finding the median N log N + 1
- Ex 2. [element distinctness reduces to sorting]
  
  . To solve elements distinctness on N times:
```
  01. Sort N items
  02. Check adjacent paris for equality
```
  . Cost of solving finding the median N log N + 1

Design algorithms

Def. Problem X reduces to problem Y if you can use an algorithm that solves Y to help solve X
Design algorithm. Given algorithm for Y, can also solve X
Ex.
- 3-collinear reduces to sorting
- Finding the median reduces to sorting
- Element distinctness reduces to sorting
- CPM reduces to topological sort
- Arbitrage reduces to shortest paths
- Burrows-wheeler transform reduces to suffix sort
Mentality. Since I know how to solve Y, can I use that algorithm to solve X

Establish lower bound

Goal. prove that a problem require a certain number of steps
Ex. In decision tree model, any compare-based sorting algorithm requires theta N log N in the worst case
BAd news. very difficult to establish lower bounds from scratch
Good news. spreed theta N log N lower bound to Y by reducing sorting to Y
Linear-time reductions
- Def. problem X linear-time reduces to problem Y if x can be solved with:
  
  . Linear number of standard computational steps . Constant number of calls to Y
- Ex. almost all of the reductions we've seen so far
- Establish lower bound
  
  . If x takes theta N log N steps, then so does Y . If x takes theta N², then so does Y
- Mentality
  
  . If I could easily solve Y, then I could easily solve X . I can't easily solve X . Therefore, I can't easily solve Y

Classify problems

Desiderata. Problem with algorithm that matches lower bound
- Ex. sorting and convex hull have complexity of N log N
Desiderata. prove that two problems X and Y have the same complexity
- First, show that problem X linear-time reduces to Y
- Second, show that Y linear-time reduces to X
- Concludes that X and Y have same complexity
Integer arithmetic reductions
- Integer multiplication. given two N-bit integers, compute their product
- Brute force. N² bit operations
- Q. Is brute-force algorithm optimal?

Name		Name	Last commit message	Last commit date
Latest commit History 250 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
README.md		README.md
Readme-tips		Readme-tips
Todos.md		Todos.md
pom.xml		pom.xml
renovate.json		renovate.json

0xh3xa/algorithms

Folders and files

Latest commit

History

Repository files navigation

Algorithms and Data Structures

What is this course?

Definitions

Topics

Why Algorithms is so important?

Steps for solving the problem

Algorithm Analyze

Reasons to Analyze Algorithms

The scientific method

Principles

Empirical Analysis

Data Analysis

Doubling Hypothesis

Experimental algorithmic

Mathematical Models

Order-of-Growth Classifications

Properties

Dynamic connectivity

Quick find (Eager approach)

Rough standards (for now)

E. g Huge problem for quick find

Quick Union

Quick Union improvement

Union find Applications

Percolation

Data structures Design

Stack

Queue

Elementary sorts

Selection sort

Insertion sort

Shell sort

Shuffle sort

Applications in sorting

Merge sort

Bottom-up version of Mergesort

Sort Stability

Quick sort

Selection

Duplicate keys

System sorts

System sort: Which algorithm to use?

Sort complexity

Priority Queues

Priority Queue (PQ)

Binary heaps

Heapsort

Event driven simulation

Symbol tables

API

Elementary implementations

Ordered operations

Binary search trees

Ordered ST operations

Deletion in BST

Balanced search tree

2-3 search trees

Red-black BSTs

B-Tree

Geometric applications of BSTs

1 d range search

line segment intersection

kd trees

interval search trees

rectangle intersection

Hash tables

Hash functions

Separate chaining (Collision resolution)

Linear probing

ST implementations: summary

Context

Applications

Graph

Undirected graphs

Undirected Graph API