Useless Hack a Week.

Web(cam) Development

Alvin

Why are we making coding so easy?

Every day it seems like a new framework or layer of abstraction comes along to make us have to think less hard about our problems.

Now, while that sounds like a good thing, the problem is the phenomenon of Generational Amnesia.

we don't know how to make it meme

To put it simply, if you don't use it... you lose it.

Like a game of telephone, human knowledge is passed down by the previous generation teaching the current generation what they've accumulated over their lifetime. The great majority of that knowledge was built up from the very beginnings of human civilization, such as how to speak, write, and throw an object.

One for All meme (boku no hero) "The first person cultivates the power, and then passes it on to another. The next refines it and passes it on again. In this way, those crying out to be saved and those with brave and true hearts link to form a crystalline network of power!" - Izuku Midoriya

The wisdom of our ancestors doesn't just stop there though. A great deal has changed in the last 100 years of civilization. The advent of the Information Age has led to an explosion of innovation in the digital world.

Protocols and algorithms developed 35+ years ago form the foundation of the internet, and current human society is incredibly dependent on these technologies. What would happen if we forgot how they worked, or how to build them again?

Lost Knowledge

Douglas Adams once said, "We no longer think of chairs as technology; we just think of them as chairs. But there was a time when we hadn't worked out how many legs chairs should have, how tall they should be, and they would often 'crash' when we tried to use them."

If we all forgot what a chair was or how to build one, we would have to rediscover it all over again.

Likewise with software, countless contributions from researchers and enthusiasts in the space have managed to abstract away the tedious task of moving bits of memory to different places to the point where the code practically writes itself.

"PSHHH, so why bother worrying about how it works? We're above that now!" You might be saying.

ChatGPT Heyyyy there...

To modify the words of Ronald Reagan, Knowledge is a fragile thing and it's never more than one generation away from extinction. It is not ours by way of inheritance; it must be maintained constantly by each generation.

We have to make sure that the basics are never forgotten. In the age of AI, where we're constantly trending towards systems that can automate away tasks for us, it's especially important to keep the struggle of learning human.

Difficulty is good

  • If you could have any answer to any question you desired with the push of a button, would you press the button?
  • What if on top of that, there was something that knew what you wanted and fulfilled it before you even had to begin to get frustrated?

This is the dilemma that humanity will face next. We have to be vigilant to make sure the next generation of people and every subsequent generation understands that learning is about the journey NOT the destination...as cliche as it sounds.

When you suffer through a task but have the willpower to finish what you started, that's the mentality it takes to create great things.

Charles Goodyear spent 5 years straight doing experiments on rubber, putting his family in debt to fund said experiments, in order to figure out a way to prevent rubber from melting in warm temperatures. As a result, we now can have rubber boots, belts for machinery, and tires for every vehicle on Earth.

picture of charles goodyear/tires?

Struggle aids in memory and understanding. No amount of AI-generated answers compares to a lived experience. It's like describing a fruit without ever having tasted it.

People naturally trend towards laziness, whether it's a biological urge for the path of least resistance or something else; the fact of the matter is:

With enough generations, enough automation to take care of our problems, who will solve problems anymore?

picture of lazy humans and robots

Did you know many animals bred in captivity can't simply be released back into the wild? As technology progresses to the point where we no longer have to worry about our supply chains for food, we'll be no different.

In that moment, humanity, despite looking like it has finally made it, will be the most fragile it has ever been.

If a single solar flare disrupted the machines in this dystopia from functioning for just a month, would we have the knowledge to survive on our own?

Software is hard

Now that we've established some lemmas about the importance of difficulty, let's look back at coding.

Software is built on top of a very large foundation of discrete math, information theory, graph theory, and most recently, linear algebra.

picture of discrete math

None of that is easy.

Too many people try to take shortcuts when they pick up programming.

How many people out there now think that all software development is simple? HURR DURR just go to a 30 day web bootcamp and you too can get into Google!

That's what coding means to a lot of people. Especially with AI tools like ChatGPT, it seems like anyone can code...right?

Software engineering is more than just creating a working webpage though. It requires a solid understanding of writing readable code, design patterns, designing systems that scale, algorithms, and especially working with others.

I mean

Did you know that we put a man on the freaking moon with assembly?

picture of apollo code That's ALL assembly

There were no IDEs, no internet, just incredible grit, an understanding of the fundamentals, and coordination to write the programs that flew the Apollo 11 mission.

In the face of challenge, it's amazing what people can accomplish. As a society, it's important to cultivate a desire to go beyond the tip of the iceberg, so honestly, the least programmers can do is understand the foundations of computer science.

If society abstracts away too much of the process of software development (say, an Artificial General Intelligence starts generating all the software, UI and backend, that we use on a daily basis) without making sure we remember how to develop such software in the first place, we'll be subject to the whims of our own creation.

Now I'm not saying we shouldn't do some abstraction.

Abstracting away certain duties or menial tasks is how we freed up the time to focus on larger and larger problems.

However.

Humanity is now at an inflection point where we need to be especially cautious of how much we abstract away. As technology accelerates faster than legislation or the general public can keep up with, it will become easier and easier to be caught with our pants down.

This Week's Hack

Much like how the gym is an artificial construct to keep ourselves fit in a world where we no longer have to run from saber-toothed tigers or hunt for prey, we need a system to artificially impose struggle in the digital world too.

early humans vs sabretooth tiger Man, look at their chiseled abs

That's why I propose Web(cam) Development, where you have to write your code using your webcam!

To sum up how it works:

  • I collected a bunch of images of myself doing various contortions and gestures with my hands and labelled them with different programming statements and keyboard characters (like if, for, {}, (), etc).
  • I trained a custom model compatible with the TensorFlow library for Python using the images I collected.
  • I loaded the model into a script that uses the cvzone library with my webcam to recognize the gestures and type the appropriate characters/statements on my PC.

Step 1: Collecting Data

I collected the images of myself with the data_collect.py file.

data_collect.py

import cv2
from cvzone.HandTrackingModule import HandDetector
import math
import numpy as np
import time

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=2)

offset = 20
img_size = 300
counter = 0
start_countdown = False

# Change this to whatever label you're currently collecting images for
img_folder = "data/Statements/PRINT"


def normalize_image(img_crop, img_white, h, w):
	if img_crop is not None and img_crop.shape[0] > 0 and img_crop.shape[1] > 0:
		aspect_ratio = h/w 
		#This just makes the images centered and constrained to image_size box (300 x 300)
		if aspect_ratio > 1:
			k = img_size/h
			w_cal = math.ceil(k*w)
			if w_cal > 0:
				img_resize = cv2.resize(img_crop, (w_cal, img_size))
				img_resize_shape = img_resize.shape
				w_gap = math.ceil((img_size - w_cal) / 2)
				img_white[:, w_gap:w_cal + w_gap] = img_resize
		else:
			k = img_size/w
			h_cal = math.ceil(k*h)
			if h_cal > 0:
				img_resize = cv2.resize(img_crop, (img_size, h_cal))
				img_resize_shape = img_resize.shape
				h_gap = math.ceil((img_size - h_cal) / 2)
				img_white[h_gap:h_cal + h_gap, :] = img_resize
	return img_white

def show_countdown(img, count):
	font = cv2.FONT_HERSHEY_SIMPLEX 
	cv2.putText(img, str(count), (img.shape[1]//2, img.shape[0]//2), font, 3, (0, 255, 0), 2, cv2.LINE_AA)
	cv2.imshow("Image", img)
	time.sleep(1)

while True:
	success, img = cap.read()
	hands, img = detector.findHands(img)
	
	if hands:
		hand = hands[0]
		x,y,w,h = hand['bbox']
		img_crop = None
		img_white = np.ones((img_size, img_size, 3), np.uint8)*255
		
		if len(hands) == 2:
			hand2 = hands[1]
			x2,y2,w2,h2 = hand2['bbox']
			
			# Bounding box covering both hands, clamped to the image borders
			x_min = max(0, min(x, x2) - offset)
			y_min = max(0, min(y, y2) - offset)
			x_max = min(img.shape[1], max(x+w, x2+w2) + offset)
			y_max = min(img.shape[0], max(y+h, y2+h2) + offset)
			img_crop = img[y_min:y_max, x_min:x_max]
			normalize_image(img_crop, img_white, y_max - y_min, x_max - x_min)
		else:
			img_crop = img[y-offset:y + h+offset, x-offset:x + w+offset]
			normalize_image(img_crop, img_white, h, w)
		
		#This displays the cropped/normalized image
		if img_crop is not None and img_crop.shape[0] > 0 and img_crop.shape[1] > 0:
			cv2.imshow("ImageCrop", img_crop)
			cv2.imshow("ImageWhite", img_white)
	
	# Check if we should start the countdown
	if start_countdown:
		if countdown_time > 0:
			show_countdown(img, countdown_time)
			countdown_time -= 1
		else:
			counter += 1
			cv2.imwrite(f'{img_folder}/image_{time.time()}.jpg', img_white)
			print(f'saved image: {counter}')
			if counter >= 135:
				start_countdown = False
				counter = 0
	else:
		cv2.imshow("Image", img)
	key = cv2.waitKey(1)
	if key == ord(" "):
		start_countdown = True
		countdown_time = 5

Usually for an image dataset, you need hundreds of images for each labeled item, and it can be pretty tedious to take each picture one at a time. There's a trick though: by running this script, I can capture a rapid burst of shots for each gesture, which gives me hundreds of images per label without breaking a sweat.

This script works by taking 135 images of you in rapid succession, like a photobooth on crack, starting 5 seconds after you press the spacebar (it even gives you a countdown!). This is particularly useful for gestures that need both hands.

It also normalizes the images; that just means it makes sure the images produced are always 300x300 pixels. This standardizes all of our images for training so we minimize outliers.

I knew I really wanted this program to be good for programming (in Java too, because I hate myself). So here are some of the gestures I devised:

for pose for loop: The number 4 with my fingers seemed to be the easiest solution! Although now I have to figure out what to do for the actual number 4....

print pose print statement: I tried to pretend to hold a pencil. That's why my hand looks so fucked up.

backspace pose backspace: An X with my arms seemed like the perfect way to indicate that there was a mistake and I need to back up.

Step 2: Training the Model

With all of the images in place, we can train the model in several ways. The easiest, if you're not familiar with machine learning, is to use something like Google's Teachable Machine.

Teachable machine interface

  • Simply add the images for each gesture into their own "Class" boxes and rename "Class 1", "Class 2", etc. to the labels you want. In our case, the labels would be things like letter characters and statements like "for", "if", etc.
  • Click "Train Model".
  • Click "Export your Model", Click the "Tensorflow" tab and select "Keras" to download your model. It's literally that simple.

Step 3: Interpret Gestures

Armed with the custom image model, we save it within our project folder under a model_program folder and reference it in our detector.py file.

detector.py

import cv2
from cvzone.HandTrackingModule import HandDetector
from cvzone.ClassificationModule import Classifier
import math
import numpy as np
import time
import pyautogui

cap = cv2.VideoCapture(0)
detector = HandDetector(maxHands=2)
classifier = Classifier("model_program/keras_model.h5", "model_program/labels.txt")

labels_dict = {}
with open("model_program/labels.txt", "r") as f:
	for line in f:
		index, label = line.strip().split()
		labels_dict[int(index)] = label

offset = 20
img_size = 300

label_list = []
counter = 0


def normalize_image(img_crop, img_white, h, w):
	if img_crop is not None and img_crop.shape[0] > 0 and img_crop.shape[1] > 0:
		aspect_ratio = h/w 
		#This just makes the images centered and constrained to image_size box (300 x 300)
		if aspect_ratio > 1:
			k = img_size/h
			w_cal = math.ceil(k*w)
			if w_cal > 0:
				img_resize = cv2.resize(img_crop, (w_cal, img_size))
				img_resize_shape = img_resize.shape
				w_gap = math.ceil((img_size - w_cal) / 2)
				img_white[:, w_gap:w_cal + w_gap] = img_resize
		else:
			k = img_size/w
			h_cal = math.ceil(k*h)
			if h_cal > 0:
				img_resize = cv2.resize(img_crop, (img_size, h_cal))
				img_resize_shape = img_resize.shape
				h_gap = math.ceil((img_size - h_cal) / 2)
				img_white[h_gap:h_cal + h_gap, :] = img_resize
	return img_white

while True:
	success, img = cap.read()
	hands, img = detector.findHands(img)
	
	if hands:
		hand = hands[0]
		x,y,w,h = hand['bbox']
		img_crop = None
		img_white = np.ones((img_size, img_size, 3), np.uint8)*255
		
		if len(hands) == 2:
			hand2 = hands[1]
			x2,y2,w2,h2 = hand2['bbox']
			
			# Bounding box covering both hands, clamped to the image borders
			x_min = max(0, min(x, x2) - offset)
			y_min = max(0, min(y, y2) - offset)
			x_max = min(img.shape[1], max(x+w, x2+w2) + offset)
			y_max = min(img.shape[0], max(y+h, y2+h2) + offset)
			img_crop = img[y_min:y_max, x_min:x_max]
			normalize_image(img_crop, img_white, y_max - y_min, x_max - x_min)
		else:
			img_crop = img[y-offset:y + h+offset, x-offset:x + w+offset]
			normalize_image(img_crop, img_white, h, w)
		
		#This displays the cropped/normalized image
		if img_crop is not None and img_crop.shape[0] > 0 and img_crop.shape[1] > 0:
			cv2.imshow("ImageCrop", img_crop)
			cv2.imshow("ImageWhite", img_white)
			prediction, index = classifier.getPrediction(img_white)
			print(labels_dict[index])
			label_list.append(labels_dict[index])
			counter += 1
			
			if counter == 30:
				most_common_label = max(set(label_list), key=label_list.count)
				print("TYPING: " + most_common_label)
				if most_common_label == 'BACKSPACE':
					pyautogui.press('backspace')
				elif most_common_label == 'SPACE':
					pyautogui.press('space')
				elif most_common_label == 'NEWLINE':
					pyautogui.press('enter')
				elif most_common_label == 'DOUBLEQUOTE':
					pyautogui.press('"')
				elif most_common_label == 'EQUALS':
					pyautogui.press('=')
				elif most_common_label == 'SEMICOLON':
					pyautogui.press(';')
				elif most_common_label == 'PLUS':
					pyautogui.press('+')
				elif most_common_label == 'LESSTHAN':
					pyautogui.press('<')
				elif most_common_label == 'PRINT':
					pyautogui.write("System.out.println(")
				else:
					pyautogui.write(most_common_label.lower())
				label_list.clear()
				counter = 0
	
	cv2.imshow("Image", img)

This script is responsible for plugging in the model and interfacing with the webcam and keyboard to turn our gestures into actual typing.

The way it works is very similar to how we collect the images in the first place. But the program detects whatever pose I'm making in every single frame, and it isn't guaranteed to detect the pose I want 100% of the time. So to prevent it from typing the same character over and over again, or firing on random false positives, I devised an algorithm.

The way it works is this:

  • I make a pose.
  • The script keeps track of the frequency of each pose within a 30 image window.
  • The highest-frequency pose within that window gets typed with the keyboard.
  • Everything resets for the next 30 image window.

That way, I just need to hold a pose for a certain amount of time, and as long as the pose is recognized by my model for the majority of those 30 frames, the right thing gets typed.
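
To make that concrete, here's the voting idea pulled out on its own as a tiny sketch, using collections.Counter (just a tidier way of doing the max(set(...), key=...) trick in the script above):

from collections import Counter

WINDOW_SIZE = 30  # number of consecutive per-frame predictions to vote over

def majority_vote(predictions):
	"""Return the label that appeared most often within one window of frames."""
	return Counter(predictions).most_common(1)[0][0]

# Toy example: 30 per-frame predictions, mostly FOR with a few misfires
window = ["FOR"] * 24 + ["PRINT"] * 4 + ["IF"] * 2
assert len(window) == WINDOW_SIZE
print(majority_vote(window))  # FOR, so "for" is what ends up getting typed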

With this innovative technology, we put the feeling of frustration (and a little physical exercise) back into programming, just like the good ol' days.

If you have any questions on the implementation feel free to DM @uhaw_blog on X (or just ask ChatGPT lmao). Otherwise feel free to download the repo to try for yourself, or just enjoy a video of me trying to write code using this....thing.

Demo

I mean, it ONLY took me 10 minutes to write one for loop (that didn't even compile at first). It's a great exercise in mental patience! I WASN'T FRUSTRATED AT ALL!!! I FEEL LIKE A PRO ALREADY

preview of hack i'm in.


Code for this project can be found here!

Do you like unhinged content? Follow me on X! Or Buy me a coffee!