Okay, the title changed a bit; that was intentional. The goal of this episode is to explore how flash drives can corrupt documents. An idea has been brewing in my mind: use machine learning techniques to predict whether a flash drive is corrupt.
I did not come up with this problem. A lady who ran into it contacted me, and I offered to look for a solution in exchange for credit in the work. She agreed, so my challenge begins. To get started, I will first generate a dataset that I can use to explore the problem.
This is less of a plan and more a record of the progress we have made so far. The first step was to generate 10k Microsoft Word documents. I didn't know how to do this, so I got help from my friend John Oke, who completed the task, like a bad guy, in under 3 hours. I was really impressed. Now I have 10k documents containing random text on my laptop, which sets up the next stage of understanding the problem.
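For anyone who wants to try the document generation step at home, here is a minimal sketch of one way to do it. It is not the code John wrote and not what lives in the repository; it assumes the documents are .docx files and builds the smallest possible one by hand using only the Python standard library. The names `random_paragraph`, `write_docx`, and `generate_corpus` are made up for this illustration.

```python
import random
import string
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape

# A .docx file is a zip archive holding a few XML parts.
# These three parts are the minimum for a plain-text document.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def random_paragraph(rng, n_words=40):
    """Return n_words random lowercase 'words' joined by spaces."""
    words = (
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(2, 10)))
        for _ in range(n_words)
    )
    return " ".join(words)


def write_docx(path, paragraphs):
    """Write a minimal .docx containing the given paragraphs of plain text."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", document)


def generate_corpus(out_dir, count, seed=0):
    """Create `count` random documents in out_dir and return their paths."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out / f"doc_{i:05d}.docx"
        paragraphs = [random_paragraph(rng) for _ in range(rng.randint(3, 8))]
        write_docx(path, paragraphs)
        paths.append(path)
    return paths
```

Calling `generate_corpus("clean_docs", 10_000)` would produce a folder of 10k files. Documents built this way are much smaller than ones saved from Word itself, because they carry no styles, fonts, or metadata.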
The next thing I will do is get these files into the HDF5 format using Python, which data scientists have been preaching about for a while. Each file I generated is around 12 KB, which makes the set manageable to work with. These clean working documents will serve as my positive set. I will also copy them to the corrupting flash drive so that they go bad. I have a theory that not all of them will come back unreadable, but hey, that's how things work in real life: data does not usually come back super clean.
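Since I expect a mix of outcomes once the files come back from the flash drive, it helps to have a way to label each copy. The sketch below is only an assumption about how that could be done, not a finished pipeline: it compares a SHA-256 hash of the clean file with the copy and, when they differ, checks whether the copy still opens as a .docx (which is a zip archive). The function names and the four labels are hypothetical, and it uses only the standard library.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_readable_docx(path):
    """True if the file opens as a zip, passes its CRC checks,
    and still contains the main document part."""
    try:
        with zipfile.ZipFile(path) as zf:
            return (
                zf.testzip() is None
                and "word/document.xml" in zf.namelist()
            )
    except Exception:
        # Any failure to parse the archive counts as unreadable here.
        return False


def label_copy(original, copy):
    """Return a hypothetical label for a copy that came back from the drive."""
    copy = Path(copy)
    if not copy.exists():
        return "missing"
    try:
        identical = sha256_of(original) == sha256_of(copy)
    except OSError:
        # The drive refused to hand over the bytes at all.
        return "unreadable"
    if identical:
        return "intact"
    return "altered_but_readable" if is_readable_docx(copy) else "unreadable"
```

Running `label_copy` over every pair of clean file and flash drive copy would give one label per document, which is the kind of target a classifier could later be trained on.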
Is this code on GitHub? Well, yes, it is here: https://github.com/e911miri/word-generator
Who is the mystery lady behind this project? Well, that's classified :p
Do you intend to complete this? I don't know. It sounds like fun, so I am taking it up.
Where can I learn more about machine learning? There is an amazing course on Udacity.