Data Loading
For a custom dataset, one needs to implement the Dataset class, even if it's the most basic dataset.
Regression tries to find the parameters of a function that represents the relationship between input and output variables with the least amount of error.
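As a minimal sketch of this idea, assume the function is a straight line y = a*x + b and that "error" means the sum of squared differences. Both are assumptions for illustration; the text does not fix a model or an error measure. Under those assumptions the best parameters have a closed form, and the helper name `fit_line` and the data are hypothetical:

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((a*x + b - y)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y divided by the variation of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data lying exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With noisy data the same function returns the line with the smallest squared error rather than an exact fit.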
Implementation of different search algorithms in Python.
Sample Array:
x = [3,1,4,5,9,6,2]
Goal:
Search for one specific number. If it is not in the array, return -1. If it is in the array, return the index of the number.
Linear search: just look through every entry from left to right and check whether the entry is equal to the target.
def linear(input, target):
    for idx, entry in enumerate(input):
        if entry == target:
            return idx
    return -1
Complexity:
\(O(n)\)
\(\Omega(1)\)
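To make the cost of linear search concrete, here is a small sketch that counts how many entries get compared. The function name `linear_with_count` is made up for this illustration; the logic is the same left-to-right scan as above.

```python
def linear_with_count(values, target):
    """Linear search that also reports how many entries were compared."""
    comparisons = 0
    for idx, entry in enumerate(values):
        comparisons += 1
        if entry == target:
            return idx, comparisons
    return -1, comparisons


x = [3, 1, 4, 5, 9, 6, 2]
best = linear_with_count(x, 3)   # target is the first entry
worst = linear_with_count(x, 7)  # target is not in the array
```

In the best case a single comparison is enough; in the worst case every one of the n entries is checked.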
Binary search only works with a sorted list!
Look at the middle of the list first and check whether that entry is the target. If it isn't, compare it with the target. If the target is higher, repeat from the first step with the right half of the list; otherwise, repeat with the left half.
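def binary(input, target, low=0, high=None):
    # Search the sorted list between the indices low and high (inclusive).
    if high is None:
        high = len(input) - 1
    if low > high:
        return -1  # empty range: the target is not in the list
    middle = (low + high)//2
    if input[middle] == target:
        return middle
    if input[middle] > target:
        # The target can only be in the left half.
        return binary(input, target, low, middle - 1)
    else:
        # The target can only be in the right half.
        return binary(input, target, middle + 1, high)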
Complexity:
\(O(\log n)\)
\(\Omega(1)\)
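The halving idea can also be written as a loop instead of a recursion. The following is a sketch of that variant, not the version used in these notes; the name `binary_iterative` and the step counter are additions for illustration.

```python
def binary_iterative(values, target):
    """Iterative binary search on a sorted list; returns (index, steps)."""
    low, high = 0, len(values) - 1
    steps = 0
    while low <= high:
        steps += 1
        middle = (low + high) // 2
        if values[middle] == target:
            return middle, steps
        if values[middle] > target:
            high = middle - 1  # continue in the left half
        else:
            low = middle + 1   # continue in the right half
    return -1, steps


sorted_x = sorted([3, 1, 4, 5, 9, 6, 2])  # [1, 2, 3, 4, 5, 6, 9]
found = binary_iterative(sorted_x, 9)
missing = binary_iterative(sorted_x, 7)
```

Because the search range is halved in every step, even a list of 1024 entries needs at most 11 steps.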
Now the question is: is it better to just do a linear search, or to sort the array and then do a binary search? For a single search, linear search makes more sense, because sorting alone already costs more than one pass over the array. In practice, however, the same array is often searched many times, so it is better to sort it once and then run binary search on the sorted array for every lookup.
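A sketch of the "sort once, search many times" pattern, using `bisect_left` from Python's standard `bisect` module instead of a hand-written binary search. The helper name `make_searcher` is hypothetical. Note that the returned index refers to the sorted copy, not to the original array.

```python
from bisect import bisect_left


def make_searcher(values):
    """Sort once, then return a function that does binary search lookups."""
    ordered = sorted(values)  # one-time O(n log n) cost

    def search(target):
        pos = bisect_left(ordered, target)  # O(log n) per lookup
        if pos < len(ordered) and ordered[pos] == target:
            return pos  # index in the sorted copy
        return -1

    return search


search = make_searcher([3, 1, 4, 5, 9, 6, 2])
results = [search(t) for t in (9, 1, 7)]
```

If the original positions matter, one option is to sort (value, original index) pairs instead of the bare values.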
Some Sort Algorithms.
Implementation of different sorting algorithms in Python.
Sample Array:
x = [3,1,4,5,9,6,2]
Goal:
Sort array from lowest to highest entry and return it.
Selection sort: go through the whole list and find the lowest number. Swap that number with the first number in the list. Then start one position further to the right and repeat.
def selection(input):
    for i in range(len(input)):
        min_idx = i
        for j in range(i, len(input)):
            if input[j] < input[min_idx]:
                min_idx = j
        input[i], input[min_idx] = input[min_idx], input[i]
    return input
Complexity:
\(O(n^2)\)
\(\Omega(n^2)\)
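To see how the sorted part grows from the left, this sketch records the list after every pass of selection sort. The name `selection_trace` is made up for this illustration, and it works on a copy so the input stays unchanged.

```python
def selection_trace(values):
    """Selection sort on a copy; returns the list state after every pass."""
    data = list(values)
    snapshots = []
    for i in range(len(data)):
        # Index of the smallest entry in the unsorted part data[i:].
        min_idx = min(range(i, len(data)), key=data.__getitem__)
        data[i], data[min_idx] = data[min_idx], data[i]
        snapshots.append(list(data))
    return snapshots


for state in selection_trace([3, 1, 4, 5, 9, 6, 2]):
    print(state)
```

After pass i, the first i + 1 entries are in their final position.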
Bubble sort: go through the list and check whether each number is higher than the number that follows it. If yes, swap the two numbers. If no, go on to the next number. Then repeat from the first step, but stop one position further to the left each time.
def bubble(input):
    for i in range(len(input)):
        for j in range(len(input)-i-1):
            if input[j] > input[j+1]:
                input[j], input[j+1] = input[j+1], input[j]
    return input
Complexity:
\(O(n^2)\)
\(\Omega(n^2)\) for the version above, which always runs every pass
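A common variant, not part of the version above, stops as soon as one full pass makes no swap. The sketch below assumes that variant; the name `bubble_early_exit` and the pass counter are additions for illustration.

```python
def bubble_early_exit(values):
    """Bubble sort on a copy that stops once a pass makes no swap."""
    data = list(values)
    passes = 0
    for i in range(len(data)):
        swapped = False
        passes += 1
        for j in range(len(data) - i - 1):
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:
            break  # no swap means the list is already sorted
    return data, passes


already_sorted = bubble_early_exit([1, 2, 3, 4, 5])
unsorted = bubble_early_exit([3, 1, 4, 5, 9, 6, 2])
```

On an already sorted list this needs only one pass, so the best case drops to linear time, while the worst case stays quadratic.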
Merge sort: divide the list in the middle and recursively repeat for the left and right halves. A list with only one number is returned as it is. Once both halves have been returned, each of them is sorted. They are then combined by looking at the first entry of each half and appending the lower number to the result. Repeat until both halves are used up.
def merge(input):
    # A list with zero or one entries is already sorted.
    if len(input) <= 1:
        return input
    middle = len(input)//2
    left, right = input[:middle], input[middle:]
    left = merge(left)
    right = merge(right)
    result = []
    i = j = 0
    # Repeatedly append the smaller of the two front entries.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # At most one half still has entries left; append them.
    result += left[i:]
    result += right[j:]
    return result
Complexity:
\(O(n \log n)\)
\(\Omega(n \log n)\)
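The combine step of merge sort, taking the lower of the two front entries until both lists are used up, is also available in Python's standard library as `heapq.merge`. A short sketch with two hypothetical, already sorted halves:

```python
from heapq import merge as merge_sorted

left = [1, 3, 4]
right = [2, 5, 6, 9]
# heapq.merge lazily yields the smallest remaining front entry
# of its already sorted inputs.
combined = list(merge_sorted(left, right))
```

The import is renamed to `merge_sorted` here only to avoid a clash with the `merge` function defined in these notes.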
Mostly a summary of the paper "Tidy Data".
Example of badly formatted data:
| | treatmenta | treatmentb |
---|---|---|
John Smith | - | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
Better formatted version of that data:
name | trt | result |
---|---|---|
Jane Doe | a | 16 |
Jane Doe | b | 11 |
John Smith | a | - |
John Smith | b | 2 |
Mary Johnson | a | 3 |
Mary Johnson | b | 1 |
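Important guidelines:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Order:
- Fixed variables come first, followed by measured variables.
- Rows are ordered by the first variable, with ties broken by the following fixed variables.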
This leads to a standard, which is important because programs then know how their input is structured. They can take the data, transform it, and return tidy data again.
To get a dataset from messy to tidy one can employ three operations:
Melting turns multiple columns whose headers are really values into two columns: one holding the names of the original columns and one holding the values.
row | a | b | c |
---|---|---|---|
A | 1 | 4 | 7 |
B | 2 | 5 | 8 |
C | 3 | 6 | 9 |
The same data after this transformation:
row | column | value |
---|---|---|
A | a | 1 |
A | b | 4 |
A | c | 7 |
B | a | 2 |
B | b | 5 |
B | c | 8 |
C | a | 3 |
C | b | 6 |
C | c | 9 |
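The transformation from the wide table to the long table can be sketched in plain Python, with each table row stored as a dictionary. The function name `melt` and its parameters `id_key`, `column_name` and `value_name` are choices made for this illustration.

```python
def melt(rows, id_key, column_name="column", value_name="value"):
    """Turn every non-id column of each row into its own (name, value) row."""
    melted = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the id column is kept, not melted
            melted.append(
                {id_key: row[id_key], column_name: key, value_name: value}
            )
    return melted


wide = [
    {"row": "A", "a": 1, "b": 4, "c": 7},
    {"row": "B", "a": 2, "b": 5, "c": 8},
    {"row": "C", "a": 3, "b": 6, "c": 9},
]
long = melt(wide, id_key="row")
```

For real datasets, pandas offers the same operation as `DataFrame.melt`, where `id_vars`, `var_name` and `value_name` play the roles of the parameters used here.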
Why are people so often bad at dealing with incomplete information and with probabilities?