Data Loading

For a custom dataset, one needs to implement the Dataset class, even if it's the most basic dataset.
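
A minimal sketch of what that involves, assuming the note refers to a PyTorch-style map-style dataset (where one would subclass `torch.utils.data.Dataset`). The class name `PairDataset` and its fields are hypothetical, and the sketch leaves out the torch import so that it runs on its own; it only shows the two methods such a dataset is expected to provide.

```python
class PairDataset:
    """Hypothetical map-style dataset holding (feature, label) pairs.

    Assumption: with PyTorch installed this would subclass
    torch.utils.data.Dataset; the two methods stay the same.
    """

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.features)

    def __getitem__(self, idx):
        # Return one sample as a (feature, label) pair.
        return self.features[idx], self.labels[idx]


dataset = PairDataset([[0.0, 1.0], [2.0, 3.0]], [0, 1])
```

Indexing with `dataset[1]` returns the second pair, and `len(dataset)` reports the number of samples.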

Derivatives

Regression

November 21, 2023 · One min read

Regression tries to find the parameters of a function that represents the relationship between input and output variables with the least amount of error.
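
As a small illustration, assuming the function is a straight line and "error" means the sum of squared errors, the parameters can be computed in closed form. The function name `fit_line` and the data points are made up for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution that minimizes the sum of squared errors.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For these points the fit recovers a slope of 2 and an intercept of 1, with zero error.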

Search

November 21, 2023 · 2 min read

Implementation of different search algorithms in Python.

Sample Array:

x = [3,1,4,5,9,6,2]

Goal:

Search for one specific number. If it is in the array, return its index; otherwise return -1.

Linear Search

Just look through every entry from left to right and check if the entry is equal to the target.

def linear(input, target):
    for idx, entry in enumerate(input):
        if entry == target:
            return idx
    return -1

Complexity:
$O(n)$
$\Omega(1)$

Binary Search

Only works with a sorted list!
Look at the middle of the list first and check if that entry is the target. If it isn't the target, compare that number with the target. If the target is higher, repeat from the first step with the right half of the list, otherwise with the left half.

def binary(input, target, offset = 0):
    # offset is the index of input[0] in the original list

    length = len(input)

    if length == 0:
        return -1

    middle = length//2

    if input[middle]==target:
        return offset + middle

    if input[middle]>target:
        # left half starts at the same position as the current list
        return binary(input[:middle], target, offset)
    else:
        # right half starts one position after the middle
        return binary(input[middle+1:], target, offset + middle + 1)

Complexity:
$O(\log n)$
$\Omega(1)$
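
One caveat of the recursive version: slicing a Python list copies it, so the total work is more than $O(\log n)$. A sketch of an iterative variant that only moves two index bounds and never copies (the names `binary_iterative`, `low` and `high` are illustrative):

```python
def binary_iterative(items, target):
    """Return the index of target in the sorted list items, or -1."""
    low, high = 0, len(items) - 1
    while low <= high:
        middle = (low + high) // 2
        if items[middle] == target:
            return middle
        if items[middle] > target:
            high = middle - 1   # continue in the left half
        else:
            low = middle + 1    # continue in the right half
    return -1


x_sorted = [1, 2, 3, 4, 5, 6, 9]  # the sample array, sorted
```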

Sorting

Now the question is:
Is it better to just do linear search, or to sort the array first and then do binary search? For a single search, linear search makes more sense, because sorting alone already costs more than one pass over the array. In practice, however, the same array often gets searched multiple times, so it pays off to sort it once and then run binary search on the sorted array many times.
Some sorting algorithms are covered in the Sort post.
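
A sketch of the "sort once, search many times" pattern, using `sorted` and `bisect.bisect_left` from the standard library. The helper name `find_sorted` and the list of queries are made up; note that the returned indices refer to the sorted copy, not to the original array.

```python
from bisect import bisect_left


def find_sorted(sorted_items, target):
    """Binary search via the standard library; returns -1 if absent."""
    pos = bisect_left(sorted_items, target)
    if pos < len(sorted_items) and sorted_items[pos] == target:
        return pos
    return -1


x = [3, 1, 4, 5, 9, 6, 2]
x_sorted = sorted(x)            # pay the sorting cost once
queries = [9, 7, 1]             # hypothetical lookups
results = [find_sorted(x_sorted, q) for q in queries]
```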

Sort

November 21, 2023 · 2 min read

Implementation of different sorting algorithms in Python.

Sample Array:

x = [3,1,4,5,9,6,2]

Goal:

Sort array from lowest to highest entry and return it.

Selection sort

Go through the whole list and find the lowest number. Swap that number with the first number in the list. Then start one position further to the right and repeat.

def selection(input):

    for i in range(len(input)):

        min_idx = i
        for j in range(i,len(input)):
            if input[j] < input[min_idx]:
                min_idx = j
        
        input[i], input[min_idx] = input[min_idx], input[i]

    return input

Complexity:
$O(n^2)$
$\Omega(n^2)$
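
To see the passes on the sample array, here is a sketch that records the list after every pass. The function name `selection_trace` is illustrative; it works on a copy so the input stays unchanged.

```python
def selection_trace(items):
    """Selection sort on a copy; records the list after every pass."""
    items = list(items)
    states = []
    for i in range(len(items)):
        min_idx = i
        for j in range(i + 1, len(items)):
            if items[j] < items[min_idx]:
                min_idx = j
        items[i], items[min_idx] = items[min_idx], items[i]
        states.append(list(items))
    return states


for state in selection_trace([3, 1, 4, 5, 9, 6, 2]):
    print(state)
```

Each printed line has one more entry fixed at the front; the remaining entries are still unsorted.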

Bubble sort

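Go through the list and check whether each number is higher than the following number. If yes, swap the two numbers; if no, move on to the next number. Repeat from the first step, but stop one position further to the left each time. If a full pass makes no swaps, the list is already sorted and the algorithm can stop early.

def bubble(input):

    for i in range(len(input)):
        swapped = False
        for j in range(len(input)-i-1):
            if input[j] > input[j+1]:
                input[j], input[j+1] = input[j+1], input[j]
                swapped = True

        # no swaps in a full pass: the list is sorted, stop early
        if not swapped:
            break

    return input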

Complexity:
$O(n^2)$
$\Omega(n)$
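
The best case of $\Omega(n)$ only holds if the algorithm stops as soon as a pass makes no swaps. A sketch that counts comparisons to make this visible (the function name `bubble_count` and the two test lists are illustrative):

```python
def bubble_count(items):
    """Bubble sort with early exit on a copy; returns (sorted, comparisons)."""
    items = list(items)
    comparisons = 0
    for i in range(len(items)):
        swapped = False
        for j in range(len(items) - i - 1):
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break
    return items, comparisons


_, best = bubble_count([1, 2, 3, 4, 5, 6, 9])    # already sorted
_, worst = bubble_count([9, 6, 5, 4, 3, 2, 1])   # reversed
```

For seven entries, the sorted input needs a single pass of 6 comparisons, while the reversed input needs 21.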

Merge sort

Divide the list in the middle and recursively repeat for the left and right halves. A list with only one number is returned as is, since it is already sorted. When two of these lists are returned, each of them is sorted, and they are combined by repeatedly looking at the first remaining entry of each list and appending the lower one to the result. Repeat until both left and right are "empty".

def merge(input):

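    # lists with 0 or 1 entries are already sorted
    # (<= 1 also stops the recursion for an empty list)
    if len(input) <= 1:
        return input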

    middle = len(input)//2

    left, right = input[:middle], input[middle:]

    left = merge(left)
    right = merge(right)

    result=[]

    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal entries in their original order
            result.append(left[i])
            i+=1
        else:
            result.append(right[j])
            j+=1
    
    if i < len(left):
        result += left[i:]

    if j < len(right):
        result += right[j:]

    return result

Complexity:
$O(n \log n)$
$\Omega(n \log n)$
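
The combine step can also be tried on its own. For inputs that are already sorted, Python's standard library offers `heapq.merge`; the two lists below are made up for illustration.

```python
from heapq import merge as merge_sorted

left = [1, 3, 4]
right = [2, 5, 9]
# heapq.merge lazily yields the entries of already sorted inputs in order.
combined = list(merge_sorted(left, right))
```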

Visualization

Here.

Tidy data

November 21, 2023 · One min read

Mostly a summary of the paper "Tidy Data" by Hadley Wickham.

Example of badly formatted data:

	treatmenta	treatmentb
John Smith	-	2
Jane Doe	16	11
Mary Johnson	3	1

Better formatted version of that data:

name	trt	result
Jane Doe	a	16
Jane Doe	b	11
John Smith	a	-
John Smith	b	2
Mary Johnson	a	3
Mary Johnson	b	1

Important guidelines:

Rows: observations
Columns: variables
Values: variable values at specific observations

Order:

variables: fixed variables (descriptions of the experiment) first, then measured variables, with related variables next to each other
observations: order by first variable, then break ties with the following variables

This leads to a standard, which is important because programs then know how their input is structured. They can take the data, transform it, and return tidy data again.
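
The ordering guideline for observations can be sketched with a tuple sort key. The rows are those of the better formatted table above, deliberately shuffled; representing the missing value "-" as `None` is an assumption of this sketch.

```python
# Rows of the tidy example table, in arbitrary order.
rows = [
    {"name": "Mary Johnson", "trt": "b", "result": 1},
    {"name": "John Smith", "trt": "a", "result": None},  # "-" = missing (assumed)
    {"name": "Jane Doe", "trt": "b", "result": 11},
    {"name": "Mary Johnson", "trt": "a", "result": 3},
    {"name": "Jane Doe", "trt": "a", "result": 16},
    {"name": "John Smith", "trt": "b", "result": 2},
]

# Order by the first variable, break ties with the following one.
ordered = sorted(rows, key=lambda r: (r["name"], r["trt"]))
```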

From messy to tidy

To get a dataset from messy to tidy one can employ three operations:

Melting

Turns multiple columns whose headers are actually values into two columns: one holding the former column names and one holding the values.

Messy:

row	a	b	c
A	1	4	7
B	2	5	8
C	3	6	9

Molten:

row	column	value
A	a	1
A	b	4
A	c	7
B	a	2
B	b	5
B	c	8
C	a	3
C	b	6
C	c	9
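
A plain-Python sketch of melting, assuming the table is stored as a list of dicts. The function `melt` and its parameter names are hypothetical; libraries such as pandas offer a comparable `DataFrame.melt`.

```python
def melt(rows, id_var, var_name="column", value_name="value"):
    """Turn every non-id column into one (name, value) row per cell."""
    molten = []
    for row in rows:
        for key, value in row.items():
            if key == id_var:
                continue
            molten.append({id_var: row[id_var], var_name: key, value_name: value})
    return molten


# Cell values as in the molten table.
messy = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]
molten = melt(messy, id_var="row")
```

Each of the three messy rows becomes three molten rows, one per former column.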

String splitting

Casting

Knowing the unknown

November 21, 2023 · One min read

Why are people often so bad at making decisions when they don't have all the information, and why can't they deal with probabilities?
