· One min read

Data Loading

For a custom dataset, one needs to implement the Dataset class, even if it's the most basic dataset.
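As a minimal sketch, the class below implements the two methods a map-style dataset needs, `__len__` and `__getitem__`. It assumes the text refers to PyTorch's `torch.utils.data.Dataset`; to stay runnable without PyTorch installed it is a plain Python class, and the name `PairDataset` and its toy data are made up for illustration. With PyTorch, one would normally subclass `torch.utils.data.Dataset` instead.

```python
class PairDataset:
    """Hypothetical minimal dataset holding (feature, label) pairs."""

    def __init__(self, features, labels):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.features)

    def __getitem__(self, idx):
        # Return one sample as a (feature, label) tuple.
        return self.features[idx], self.labels[idx]


# Toy data, invented for this example.
dataset = PairDataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [0, 1, 0])
```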

Derivatives

· One min read

Regression tries to find the parameters of a function that represents the relationship between input and output variables with the least amount of error.
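A small sketch of this idea for the simplest case, a straight line fitted to one input variable. It assumes that "error" means the mean squared error, which the text does not specify; the function names `fit_line` and `mean_squared_error` and the toy data are chosen for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution that minimizes the sum of squared errors.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between predictions and targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1, no noise
slope, intercept = fit_line(xs, ys)
error = mean_squared_error(xs, ys, slope, intercept)
```

Here the parameters are the slope and the intercept; because the toy data lies exactly on a line, the fitted line reproduces it with an error of zero.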

· 2 min read

Implementation of different search algorithms in Python.

Sample Array:

x = [3,1,4,5,9,6,2]

Goal:

Search for one specific number. If it is in the array, return its index; if not, return -1.

Linear search: look through every entry from left to right and check whether it equals the target.

    def linear(input, target):
        for idx, entry in enumerate(input):
            if entry == target:
                return idx
        return -1

Complexity:
\(O(n)\)
\(\Omega(1)\)
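To make the two bounds concrete, the sketch below counts how many entries get compared. The helper `linear_with_count` is a hypothetical variant written for this illustration: the best case finds the target at the first entry, the worst case looks at all n entries.

```python
def linear_with_count(values, target):
    """Linear search that also reports how many entries were compared."""
    comparisons = 0
    for idx, entry in enumerate(values):
        comparisons += 1
        if entry == target:
            return idx, comparisons
    return -1, comparisons


x = [3, 1, 4, 5, 9, 6, 2]
best = linear_with_count(x, 3)   # target is the first entry
worst = linear_with_count(x, 7)  # target is not in the list
```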

Binary search only works with a sorted list!
Look at the middle of the list first and check whether that entry is the target. If it isn't, compare it with the target: if the target is higher, repeat from the first step with the right half of the list, otherwise with the left half. If the remaining list is empty, the target is not in the list.

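    def binary(input, target, offset=0):
        # offset is the index in the original list where this slice starts

        length = len(input)

        if length == 0:
            return -1

        middle = length // 2

        if input[middle] == target:
            return offset + middle

        if input[middle] > target:
            # target can only be in the left half, which starts at the same offset
            return binary(input[:middle], target, offset)
        else:
            # target can only be in the right half, which starts right after middle
            return binary(input[middle + 1:], target, offset + middle + 1)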

Complexity:
\(O(\log n)\)
\(\Omega(1)\)
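Slicing a list copies it, so a recursive version built on slices does extra work on top of the comparisons. A possible alternative, sketched here under the name `binary_iterative` (not from the original post), keeps two index bounds instead and never copies the list:

```python
def binary_iterative(values, target):
    """Binary search on a sorted list using index bounds, no slicing."""
    low, high = 0, len(values) - 1
    while low <= high:
        middle = (low + high) // 2
        if values[middle] == target:
            return middle
        if values[middle] > target:
            high = middle - 1  # continue in the left half
        else:
            low = middle + 1   # continue in the right half
    return -1


sorted_x = sorted([3, 1, 4, 5, 9, 6, 2])  # [1, 2, 3, 4, 5, 6, 9]
```

Each loop iteration halves the range between `low` and `high`, which is where the logarithmic number of steps comes from.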

Sorting

Now the question is:
Is it better to just do a linear search, or to sort the array and then do a binary search? For a single search, linear search makes more sense, because sorting alone costs more than one pass through the array. In practice, however, the same array often gets searched multiple times, so it is better to sort it once and then run binary search on the sorted array each time.
Some Sort Algorithms.
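A rough sketch of the "sort once, search many times" pattern, using `bisect_left` from Python's standard library for the binary search. The helper `contains_sorted` and the list of queries are made up for this example. Since sorting changes the positions of the entries, the sketch only checks membership rather than returning indices into the original array.

```python
from bisect import bisect_left


def contains_sorted(sorted_values, target):
    """Binary search via the standard library; True if target is present."""
    pos = bisect_left(sorted_values, target)
    return pos < len(sorted_values) and sorted_values[pos] == target


x = [3, 1, 4, 5, 9, 6, 2]
sorted_x = sorted(x)      # pay the sorting cost once
queries = [5, 7, 2, 10]   # hypothetical lookups
found = [contains_sorted(sorted_x, q) for q in queries]
```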

· 2 min read

Implementation of different sorting algorithms in Python.

Sample Array:

x = [3,1,4,5,9,6,2]

Goal:

Sort array from lowest to highest entry and return it.

Selection sort

Go through the whole list and find the lowest number. Swap that number with the first number in the list. Then start one position further to the right and repeat.

    def selection(input):

        for i in range(len(input)):

            min_idx = i
            for j in range(i, len(input)):
                if input[j] < input[min_idx]:
                    min_idx = j

            input[i], input[min_idx] = input[min_idx], input[i]

        return input

Complexity:
\(O(n^2)\)
\(\Omega(n^2)\)
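The best and worst case are the same because selection sort always scans the whole unsorted remainder, no matter how the input is ordered. The sketch below counts comparisons to show this; `selection_comparisons` is a hypothetical helper that works on a copy and starts the inner loop at `i + 1`, so for n entries it makes n(n-1)/2 comparisons.

```python
def selection_comparisons(values):
    """Selection sort on a copy; returns (sorted list, comparison count)."""
    data = list(values)
    comparisons = 0
    for i in range(len(data)):
        min_idx = i
        for j in range(i + 1, len(data)):
            comparisons += 1
            if data[j] < data[min_idx]:
                min_idx = j
        data[i], data[min_idx] = data[min_idx], data[i]
    return data, comparisons


already_sorted = selection_comparisons([1, 2, 3, 4, 5, 6, 9])
shuffled = selection_comparisons([3, 1, 4, 5, 9, 6, 2])
```

Both calls need the same number of comparisons, even though the first input is already sorted.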

Bubble sort

Go through the list and check whether each number is higher than the one following it. If yes, swap the two numbers; if no, go on to the next number. Repeat from the first step, but stop one position further to the left each time.

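    def bubble(input):

        for i in range(len(input)):
            swapped = False
            for j in range(len(input) - i - 1):
                if input[j] > input[j + 1]:
                    input[j], input[j + 1] = input[j + 1], input[j]
                    swapped = True

            # a pass without swaps means the list is sorted; stopping here
            # is what makes the best case linear
            if not swapped:
                break

        return input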

Complexity:
\(O(n^2)\)
\(\Omega(n)\)
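The linear best case only holds if the sort stops as soon as a full pass makes no swap. The sketch below assumes that early-exit variant and counts passes; `bubble_passes` is a hypothetical helper that works on a copy of its input.

```python
def bubble_passes(values):
    """Bubble sort with early exit on a copy; returns (sorted list, passes)."""
    data = list(values)
    passes = 0
    for i in range(len(data)):
        passes += 1
        swapped = False
        for j in range(len(data) - i - 1):
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:
            break  # no swaps in this pass, so the list is sorted
    return data, passes


sorted_case = bubble_passes([1, 2, 3, 4, 5, 6, 9])
reversed_case = bubble_passes([9, 6, 5, 4, 3, 2, 1])
```

An already sorted list is confirmed in a single pass, while a reversed list needs a pass for every position.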

Merge sort

Divide the list in the middle and recursively repeat for the left and right halves. When a list contains only one number, return it. The two lists returned from the recursive calls are each sorted. They are then combined by repeatedly looking at the first remaining entry of each list and appending the lower one to the result, until one of the lists is used up; whatever remains in the other list is appended at the end.

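    def merge(input):

        # a list with zero or one entries is already sorted
        if len(input) <= 1:
            return input

        middle = len(input) // 2

        left, right = input[:middle], input[middle:]

        left = merge(left)
        right = merge(right)

        result = []

        # repeatedly take the smaller first entry of the two sorted halves;
        # <= keeps equal entries in their original order
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                result.append(left[i])
                i += 1
            else:
                result.append(right[j])
                j += 1

        # one half is used up; append whatever remains in the other
        if i < len(left):
            result += left[i:]

        if j < len(right):
            result += right[j:]

        return result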

Complexity:
\(O(n \log n)\)
\(\Omega(n \log n)\)
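Where the \(n \log n\) comes from can be sketched with the usual recurrence. This assumes that splitting and merging a list of length \(n\) costs at most \(c\,n\) steps for some constant \(c\), and that \(n\) is a power of two; both are simplifying assumptions for the sketch.

\[
T(n) = 2\,T(n/2) + c\,n, \qquad T(1) = c
\]

Unrolling the recurrence \(k\) times gives

\[
T(n) = 2^k\,T(n/2^k) + k\,c\,n ,
\]

and with \(k = \log_2 n\) the sublists have length one:

\[
T(n) = n\,T(1) + c\,n \log_2 n = c\,n + c\,n \log_2 n = O(n \log n).
\]

The list is always split down to single entries and every level is merged in full, whatever the input order, which is why the lower bound has the same form.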

Visualization

Here.

· One min read

Mostly a summary of the paper "Tidy Data" by Hadley Wickham.

Example of badly formatted data:

| | treatmenta | treatmentb |
|---|---|---|
| John Smith | - | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |

Better formatted version of that data:

| name | trt | result |
|---|---|---|
| Jane Doe | a | 16 |
| Jane Doe | b | 11 |
| John Smith | a | - |
| John Smith | b | 2 |
| Mary Johnson | a | 3 |
| Mary Johnson | b | 1 |

Important guidelines:

  • Rows: observations
  • Columns: variables
  • Values: variable values at specific observations

Order:

  • variables: fixed variables (those describing the experiment) first, then measured variables, with related variables next to each other
  • observations: order by the first variable, then break ties with the following variables
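The ordering rule for observations can be sketched with Python's built-in `sorted` and a tuple key. The rows below are the tidy treatment data in scrambled order; representing each row as a dictionary and the missing result as `None` are assumptions made for this example.

```python
rows = [
    {"name": "John Smith", "trt": "b", "result": 2},
    {"name": "Mary Johnson", "trt": "a", "result": 3},
    {"name": "Jane Doe", "trt": "b", "result": 11},
    {"name": "John Smith", "trt": "a", "result": None},  # missing value
    {"name": "Jane Doe", "trt": "a", "result": 16},
    {"name": "Mary Johnson", "trt": "b", "result": 1},
]

# Sort by the first variable, then break ties with the next one.
ordered = sorted(rows, key=lambda row: (row["name"], row["trt"]))
names_and_trt = [(row["name"], row["trt"]) for row in ordered]
```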

This leads to a standard, which is important because programs then know how their input is structured. They can take tidy data, transform it, and return tidy data again.

From messy to tidy

To get a dataset from messy to tidy one can employ three operations:

Melting

Turns multiple columns whose headers are really values of a variable into two columns: one holding the former column names and one holding the values.

Messy:

| row | a | b | c |
|---|---|---|---|
| A | 1 | 4 | 7 |
| B | 2 | 5 | 8 |
| C | 3 | 6 | 9 |

Molten:

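| row | column | value |
|---|---|---|
| A | a | 1 |
| A | b | 4 |
| A | c | 7 |
| B | a | 2 |
| B | b | 5 |
| B | c | 8 |
| C | a | 3 |
| C | b | 6 |
| C | c | 9 |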
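A minimal pure-Python sketch of melting, using the values from the molten table. The function `melt` and its parameter names are chosen for this illustration and are not a library API; each input row is assumed to be a dictionary.

```python
def melt(rows, id_column, var_name="column", value_name="value"):
    """Turn every non-id column into one (id, column name, value) row."""
    molten = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is kept as-is, not melted
            molten.append({id_column: row[id_column],
                           var_name: column,
                           value_name: value})
    return molten


messy = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]
molten = melt(messy, id_column="row")
```

Three rows with three melted columns each give nine molten rows, one per original cell.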

String splitting

Casting

· One min read

Why do people often do so badly when they don't have all the information, and why can't they deal with probabilities?