The goal of this project is to predict emotion from audio clips using machine learning techniques.
This project uses both neural networks and random forest classifiers. In a neural network, a pattern is presented as input to a network composed of units with various connections and weights, and the goal is to predict the output. During training, both inputs and outputs are known. The error between the expected output and the actual output is calculated, and the weights on the units are updated with a function based on this error in order to reduce the error in the system. Then, in testing, the weights of the units are locked and the network attempts to predict the correct answer from the inputs alone.
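The train-then-lock loop described above can be sketched for a single sigmoid unit. This is a minimal illustration on a made-up toy problem (learning OR), not the project's actual network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data (hypothetical): four 2-feature inputs with known binary outputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])  # the OR function

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # connection weights
b = 0.0                  # bias
lr = 1.0                 # learning rate

# Training: compute the error between expected and actual output,
# then update the weights with a function of that error (gradient descent).
for _ in range(1000):
    pred = sigmoid(X @ w + b)
    error = pred - y
    w -= lr * X.T @ error / len(y)
    b -= lr * error.mean()

# Testing: the weights are now "locked"; the unit predicts from inputs alone.
print((sigmoid(X @ w + b) > 0.5).astype(int))  # → [0 1 1 1]
```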
Random forest classifiers are collections of uncorrelated decision trees. A decision tree takes inputs and attempts to split them into groups, where each group is as different from the others as possible while the members of each group are similar. The power of a random forest comes from combining many independent computations: each tree makes a prediction about the output's classification, and the forest's output is simply the majority vote.
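The majority-vote step is simple enough to show directly. Here is a small sketch with hypothetical per-tree emotion predictions for one clip (the helper name and labels are illustrative):

```python
from collections import Counter

def forest_predict(tree_predictions):
    """The forest's output is the majority vote of its trees."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical predictions from five trees for a single audio clip
trees = ["happy", "neutral", "happy", "sad", "happy"]
print(forest_predict(trees))  # → happy
```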
The general idea is that for each audio file, the Mel Frequency Cepstral Coefficients (MFCCs) are generated and used as input features for the neural network or random forest classifier, and the output is one of a few emotions such as neutral, happy, surprised, or sad. More detail on how MFCCs are computed is given in How to apply machine learning and deep learning methods to audio analysis. Another way to extract features is to simply use the spectrogram of the audio, which shows how the frequency content of the audio varies over time; this mel spectrogram approach was used in audioCNN2.py. Calculating MFCCs requires several steps, including computing the power spectrum and applying various transformations. Luckily, the Python library Librosa handles all of this for us.
MFCCs can be displayed and saved as a png file. Here is wav/id10001/1zcIwhmdeo4/00001.wav
Project Flow:
VoxCeleb1 dataset download page
0) Download files from the VoxCeleb1 dataset and senet50-ferplus-logits.mat from the Emotion Recognition paper (links below)
1) Run getImagesAndTags.m
2) Choose a sample and run makefile.m (or the alternative makefileCSV.m)
Where to find senet50-ferplus-logits.mat if link below is broken
If makefile.m was run:
3.1.1) Run pickler.py to extract features and save the object to a file
3.1.2) Run either classifytest.py for the CNN or treeclassify.py for the RandomForestClassifier
One important measure in rowing is the number of strokes the rower takes per minute. On the indoor rowers (ergs), this value is fairly accurate and is calculated every stroke. On the water, however, it typically comes from a magnet in the boat used in tandem with an expensive piece of equipment. There are also standalone devices, but they aren't always consistent. I have heard of phone apps that aren't always reliable, and I wondered how feasible this concept would be using only the phone's accelerometer.
Of course, when on an erg, the erg itself is stationary, unlike a boat that will have a net direction along with the surges associated with a stroke. This video shows the erg by itself, the screen where the stroke rating is in the bottom right corner, the erg on slides to simulate boat movement, and actual movement of boats.
For this test, I put my phone in my pocket and rowed for 30 seconds at 20 and 30 strokes per minute (spm).
20 spm and 30 spm raw data
Then I uploaded the data to my computer and brought it into MATLAB. Since the data was very noisy, I used a function to smooth it. I then found the number of peaks in the data and multiplied that by sixty divided by the length of the interval in seconds. This gave me approximately the number of strokes per minute.
20.0128 spm and 30.0192 spm smoothed data
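The peak-counting arithmetic from the experiment can be sketched in Python with scipy. The smooth sine below stands in for the (already smoothed) accelerometer trace, and the 100 Hz sample rate is an assumption:

```python
import numpy as np
from scipy.signal import find_peaks

fs = 100          # assumed accelerometer sample rate (Hz)
duration = 30     # seconds of data collected
spm_true = 20     # strokes per minute baked into the synthetic signal

# One smooth "surge" per stroke: 20 spm means one peak every 3 seconds
t = np.arange(fs * duration) / fs
accel = np.sin(2 * np.pi * (spm_true / 60) * t)

# Count the peaks, then scale by sixty over the interval length
peaks, _ = find_peaks(accel)
spm = len(peaks) * 60 / duration
print(spm)  # → 20.0
```

On real data the raw trace must be smoothed first, or noise will be counted as extra peaks.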
Step 2: Creating the App
I just wanted to share some of my misadventures in the actual construction of the app. I didn't know much about iOS app development. (I've done some Android, but since I have an iPhone and needed to test with the accelerometers, I had to go with iOS.) Looking at the developer tools for CoreMotion, it didn't seem too bad.
However, my problems began when I got to Xcode. Because I was new to this, I was hoping to use most of the default project and just add a bit of logic. I was very wrong. The newest version of Xcode uses SwiftUI, which requires iOS 13 and above, and my phone was on iOS 12.4.4. Ok, I'll just update. It turns out that Apple does not support iOS 13 on the iPhone 5S. It was really hard to find UI code supporting iOS 12 online, so I chose the nuclear option and just downloaded an earlier version of Xcode. That was about three hours of work (mostly downloading the two versions of Xcode) to get a label on a blank background. Just wanted to share this pain.
Anyway, the next challenge I faced was getting the functionality of the MATLAB code into the app. I wanted to keep the function that smooths the data and did not want to implement it myself, but I could not find anything in Swift that could do this. After a bit of research, the best option seemed to be converting the MATLAB code with MATLAB Coder, which turns the function into a C function plus the related files needed to use it. These files can then be added to the Xcode project, which can use Objective-C (essentially C) files. Unfortunately, MATLAB's smooth function cannot be converted with MATLAB Coder. A bit more research turned up a compatible function called envelope, which gave a similar but not quite as good result. This is one of a few "problem areas" in the program: changes like this could reduce accuracy and produce wrong results. More about these problem areas is detailed in the "Further Work" section.
After figuring out a bit of Swift and how the C code can be integrated with the project, I was ready to do calculations with the data to receive and display my output.
Step 3: Methodology for Calculation
It turns out that there are a few ways to calculate stroke rate, and I looked at two of them for this project.
num_peaks * 2 (for a 30 second window)
First, I tried matching what my experiment did: collect a 30-second window of erging data, find the peaks, and multiply to get strokes per minute. Sampling at 100 Hz, the window updates with each reading, discarding the oldest entry and inserting the newest. Then, as in the MATLAB code, the number of peaks in the 30-second window is counted and multiplied by 2 to get strokes per minute. I realized quite early on that this method gives the average rating over the 30-second window; it is not sensitive to quick changes, and I wanted a value more responsive to change.
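This first method can be sketched in Python; the app itself is in Swift, so this is only an illustration of the rolling-window arithmetic, with an assumed 100 Hz sample rate and a pre-smoothed synthetic trace:

```python
from collections import deque

import numpy as np
from scipy.signal import find_peaks

fs = 100                         # assumed accelerometer sample rate (Hz)
window = deque(maxlen=30 * fs)   # rolling 30-second window

def on_reading(sample):
    """Handle one accelerometer sample; return spm once the window fills."""
    window.append(sample)        # the oldest entry falls off automatically
    if len(window) < window.maxlen:
        return None              # still collecting the first 30 seconds
    peaks, _ = find_peaks(np.asarray(window))
    return len(peaks) * 2        # peaks in 30 s, times 2, is strokes/minute

# Feed in 35 seconds of a synthetic 22 spm trace
t = np.arange(35 * fs) / fs
for s in np.sin(2 * np.pi * (22 / 60) * t):
    spm = on_reading(s)
print(spm)  # → 22
```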
Another way of finding strokes per minute is to collect a window of data and find the distance between the last two peaks. This is the number of milliseconds between strokes; converting to seconds gives how long a single stroke takes, and dividing sixty by that number gives strokes per minute. This method is much more sensitive to errors in the accelerometers: a bit of noise can be picked up as a stroke and send the rate skyrocketing. I also added some code to detect when the user has stopped, by checking whether the last stroke was within 5 seconds of the end of the window.
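The second method, including the stopped-rower check, can be sketched the same way. Again this is a Python illustration of the arithmetic, not the app's Swift code, with an assumed 100 Hz sample rate:

```python
import numpy as np
from scipy.signal import find_peaks

fs = 100  # assumed accelerometer sample rate (Hz)

def stroke_rate(window):
    """Rate from the gap between the last two peaks in the window."""
    peaks, _ = find_peaks(window)
    if len(peaks) < 2:
        return None
    # Treat the rower as stopped if the last stroke happened more than
    # 5 seconds before the end of the window
    if len(window) - peaks[-1] > 5 * fs:
        return 0.0
    stroke_seconds = (peaks[-1] - peaks[-2]) / fs  # samples → seconds
    return 60.0 / stroke_seconds                   # seconds/stroke → spm

# Synthetic 30 spm trace: one stroke every 2 seconds
t = np.arange(30 * fs) / fs
rate = stroke_rate(np.sin(2 * np.pi * (30 / 60) * t))
print(rate)  # → 30.0
```

Because only the last two peaks matter, a single noise spike near the end of the window can swing this value wildly, which is why the smoothing parameters below matter so much.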
Another important part of this calculation was choosing the parameters for the smoothing function. When testing with the phone on my leg, the accelerometers picked up much more noise, since my leg moves differently than a phone attached to an erg on slides. When switching to the envelope function in MATLAB with the data from the leg experiment, I needed an np parameter. According to the documentation, "the envelopes are determined using spline interpolation over local maxima separated by at least np samples." Based on some informal experiments, 60 seemed like a good value for that data. However, when I moved to the erg, this value smoothed the data so completely that I kept seeing 0s and single-digit numbers for the rating. Experimentally, I decreased the value to 5, which produced some extreme outliers but tracked the rating on the erg screen (the de facto "ground truth" for my experiments) more consistently.
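The behavior the documentation describes (spline interpolation over local maxima at least np samples apart) can be approximated in Python. This is a rough sketch of the idea, not MATLAB's exact envelope algorithm, and the function name and noise model are my own:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import find_peaks

def peak_envelope(x, np_samples):
    """Upper envelope: spline through local maxima separated by at
    least np_samples (roughly MATLAB's envelope(x, np, 'peak'))."""
    peaks, _ = find_peaks(x, distance=np_samples)
    if len(peaks) < 2:
        return np.full_like(x, x.max())
    spline = CubicSpline(peaks, x[peaks])
    return spline(np.arange(len(x)))

# Noisy 100 Hz trace: a larger np smooths harder, a smaller np tracks detail
rng = np.random.default_rng(1)
t = np.arange(3000) / 100
noisy = np.sin(2 * np.pi * t / 3) + 0.2 * rng.normal(size=t.size)
smooth5 = peak_envelope(noisy, 5)    # np = 5: follows the strokes
smooth60 = peak_envelope(noisy, 60)  # np = 60: can flatten them away
```

With np = 60 the spline passes through far fewer maxima, which matches the over-smoothed, near-zero ratings seen on the erg.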
Here is a video with the second method and the best np value I could find for this strategy.
I think the second method is better for the needs of this application, but the methodology could still be improved. Minor (or major) tweaks to the math could potentially make the system more accurate.
Step 4: Further Work
For further work, there are a few problem areas in the program that could be improved for better performance.
One was the problem mentioned earlier: the smoothing function envelope might not be the best for this data. I found outliers that were both too high and too low, possibly because the smoothing function is not powerful enough to capture the overall trend of the data. Comparing the envelope function to the smooth function, it was pretty clear the final results were not quite the same.
Another possible cause of problems is the sensor data itself. It may be the case that the accelerometers are not consistent enough to get the data I need to perform this function. It is totally possible that other phones may have better sensors and could do this task better as well.
A third area to improve is the 30-second window of data collection. While this was necessary for the first method, and I kept it to perhaps give the smoothing function more consistent input, it is very possible the 30 seconds of waiting at the beginning is unnecessary. Tweaks could also be made to some of the edge cases, like when the rower stops, and more logic could be added to get a more consistent rate and avoid some of the outliers.
Finally, I think my code is a bit of a mess. I'm new to Swift and iOS development, so I'm pretty sure I did some things wrong with the design and structure of the code. The main logic probably shouldn't be in the ViewController. In general, I think changes could be made to improve readability and perhaps the speed of the program.
The current version of the code can be found here. Please be warned that I had some trouble with renaming things so the files are not as clearly named as I would like for a final project.