Enlarge / Turning the lens on ourselves, as it were. Is Our Machine Learning?View more stories
There's a moment in any foray into new technological territory when you realize you may have embarked on a Sisyphean task. Staring at the multitude of options available to take on the project, you research your options, read the documentation, and start to workonly to find that actually just defining the problem may be more work than finding the actual solution.
Reader, this is where I found myself two weeks into this adventure in machine learning. I familiarized myself with the data, the tools, and the known approaches to problems with this kind of data, and I tried several approaches to solving what on the surface seemed to be a simple machine-learning problem: based on past performance, could we predict whether any given Ars headline will be a winner in an A/B test?
Things have not been going particularly well. In fact, as I finished this piece, my most recent attempt showed that our algorithm was about as accurate as a coin flip.
But at least that was a start. And in the process of getting there, I learned a great deal about the data cleansing and pre-processing that goes into any machine-learning project.
Our data source is a log of the outcomes from 5,500-plus headline A/B tests over the past five yearsthat's about as long as Ars has been doing this sort of headline shootout for each story that gets posted. Since we have labels for all this data (that is, we know whether it won or lost its A/B test), this would appear to be a supervised learning problem. All I really needed to do to prepare the data was to make sure it was properly formatted for the model I chose to use to create our algorithm.
I am not a data scientist, so I wasn't going to be building my own model any time this decade. Luckily, AWS provides a number of pre-built models suitable to the task of processing text and designed specifically to work within the confines of the Amazon cloud. There are also third-party models, such as Hugging Face, that can be used within the SageMaker universe. Each model seems to need data fed to it in a particular way.
The choice of the model in this case comes down largely to the approach we'll take to the problem. Initially, I saw two possible approaches to training an algorithm to get a probability of any given headline's success:
The second approach is much more difficult, and there's one overarching concern with either of these methods that makes the second even less tenable: 5,500 tests, with 11,000 headlines, is not a lot of data to work with in the grand AI/ML scheme of things.
So I opted for binary classification for my first attempt, because it seemed the most likely to succeed. It also meant the only data point I needed for each headline (besides the headline itself) is whether it won or lost the A/B test. I took my source data and reformatted it into a comma-separated value file with two columns: titles in one, and "yes" or "no" in the other. I also used a script to remove all the HTML markup from headlines (mostlya few HTML tags for italics). With the data cut down almost all the way to essentials, I uploaded it into SageMaker Studio so I could use Python tools for the rest of the preparation.
Next, I needed to choose the model type and prepare the data. Again, much of data preparation depends on the model type the data will be fed into. Different types of natural language processing models (and problems) require different levels of data preparation.
After that comes tokenization. AWS tech evangelist Julien Simon explains it thusly: Data processing first needs to replace words with tokens, individual tokens. A token is a machine-readable number that stands in for a string of characters. So 'ransomware would be word one, he said, crooks would be word two, setup would be word three so a sentence then becomes a sequence of tokens, and you can feed that to a deep-learning model and let it learn which ones are the good ones, which ones are the bad ones.
Depending on the particular problem, you may want to jettison some of the data. For example, if we were trying to do something like sentiment analysis (that is, determining if a given Ars headline was positive or negative in tone) or grouping headlines by what they were about, I would probably want to trim down the data to the most relevant content by removing "stop words"common words that are important for grammatical structurebut don't tell you what the text is actually saying (like most articles).
However, in this case, the stop words were potentially important parts of the dataafter all, we're looking for structures of headlines that attract attention. So I opted to keep all the words. And in my first attempt at training, I decided to use BlazingText, a text processing model that AWS demonstrates in a similar classification problem to the one we're attempting. BlazingText requires the "label" datathe data that calls out a particular bit of text's classificationto be prefaced with "__label__". And instead of a comma-delimited file, the label data and the text to be processed are put in a single line in a text file, like so:
Another part of data preprocessing for supervised training ML is splitting the data into two sets: one for training the algorithm, and one for validation of its results. The training data set is usually the larger set. Validation data generally is created from around 10 to 20 percent of the total data.
There has been a great deal of research into what is actually the right amount of validation datasome of that research suggests that the sweet spot relates more to the number of parameters in the model being used to create the algorithm rather than the overall size of the data. In this case, given that there was relatively little data to be processed by the model, I figured my validation data would be 10 percent.
In some cases, you might want to hold back another small pool of data to test the algorithm after it's validated. But our plan here is to eventually use live Ars headlines to test, so I skipped that step.
To do my final data preparation, I used a Jupyter notebookan interactive web interface to a Python instanceto turn my two-column CSV into a data structure and process it. Python has some decent data manipulation and data science-specific toolkits that make these tasks fairly straightforward, and I used two in particular here:
Heres a chunk of the code in the notebook that I used to create my training and validation sets from our CSV data:
I started by using pandas to import the data structure from the CSV created from the initially cleaned and formatted data, calling the resulting object "dataset." Using the dataset.head() command gave me a look at the headers for each column that had been brought in from the CSV, along with a peek at some of the data.
The pandas module allowed me to bulk-add the string "__label__" to all the values in the label column as required by BlazingText, and I used a lambda function to process the headlines and force all the words to lower case. Finally, I used the sklearn module to split the data into the two files I would feed to BlazingText.
Follow this link:
Feeding the machine: We give an AI some headlines and see what it does - Ars Technica
- AI File Extension - Open . AI Files - FileInfo [Last Updated On: June 14th, 2016] [Originally Added On: June 14th, 2016]
- Ai | Define Ai at Dictionary.com [Last Updated On: June 16th, 2016] [Originally Added On: June 16th, 2016]
- ai - Wiktionary [Last Updated On: June 22nd, 2016] [Originally Added On: June 22nd, 2016]
- Adobe Illustrator Artwork - Wikipedia, the free encyclopedia [Last Updated On: June 25th, 2016] [Originally Added On: June 25th, 2016]
- AI File - What is it and how do I open it? [Last Updated On: June 29th, 2016] [Originally Added On: June 29th, 2016]
- Ai - Definition and Meaning, Bible Dictionary [Last Updated On: July 25th, 2016] [Originally Added On: July 25th, 2016]
- ai - Dizionario italiano-inglese WordReference [Last Updated On: July 25th, 2016] [Originally Added On: July 25th, 2016]
- Bible Map: Ai [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- Ai dictionary definition | ai defined - YourDictionary [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- Ai (poet) - Wikipedia, the free encyclopedia [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- AI file extension - Open, view and convert .ai files [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- History of artificial intelligence - Wikipedia, the free ... [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- Artificial intelligence (video games) - Wikipedia, the free ... [Last Updated On: August 30th, 2016] [Originally Added On: August 30th, 2016]
- North Carolina Chapter of the Appraisal Institute [Last Updated On: September 8th, 2016] [Originally Added On: September 8th, 2016]
- Ai Weiwei - Wikipedia, the free encyclopedia [Last Updated On: September 11th, 2016] [Originally Added On: September 11th, 2016]
- Adobe Illustrator Artwork - Wikipedia [Last Updated On: November 17th, 2016] [Originally Added On: November 17th, 2016]
- 5 everyday products and services ripe for AI domination - VentureBeat [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- Realdoll builds artificially intelligent sex robots with programmable personalities - Fox News [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- ZeroStack Launches AI Suite for Self-Driving Clouds - Yahoo Finance [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- AI and the Ghost in the Machine - Hackaday [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- Why Google, Ideo, And IBM Are Betting On AI To Make Us Better Storytellers - Fast Company [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- Roses are red, violets are blue. Thanks to this AI, someone'll fuck you. - The Next Web [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- Wearable AI Detects Tone Of Conversation To Make It Navigable (And Nicer) For All - Forbes [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- Who Leads On AI: The CIO Or The CDO? - Forbes [Last Updated On: February 6th, 2017] [Originally Added On: February 6th, 2017]
- AI For Matching Images With Spoken Word Gets A Boost From MIT - Fast Company [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- Teach undergrads ethics to ensure future AI is safe compsci boffins - The Register [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- AI is here to save your career, not destroy it - VentureBeat [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- A Heroic AI Will Let You Spy on Your Lawmakers' Every Word - WIRED [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- With a $16M Series A, Chorus.ai listens to your sales calls to help your team close deals - TechCrunch [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- Microsoft AI's next leap forward: Helping you play video games - CNET [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- Samsung Galaxy S8's Bixby AI could beat Google Assistant on this front - CNET [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- 3 common jobs AI will augment or displace - VentureBeat [Last Updated On: February 7th, 2017] [Originally Added On: February 7th, 2017]
- Stephen Hawking and Elon Musk endorse new AI code - Irish Times [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- SumUp co-founders are back with bookkeeping AI startup Zeitgold - TechCrunch [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Five Trends Business-Oriented AI Will Inspire - Forbes [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- AI Systems Are Learning to Communicate With Humans - Futurism [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Pinterest uses AI and your camera to recommend pins - Engadget [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Chinese Firms Racing to the Front of the AI Revolution - TOP500 News [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Real life CSI: Google's new AI system unscrambles pixelated faces - The Guardian [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- AI could transform the way governments deliver public services - The Guardian [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Amazon Is Humiliating Google & Apple In The AI Wars - Forbes [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- What's Still Missing From The AI Revolution - Co.Design (blog) [Last Updated On: February 9th, 2017] [Originally Added On: February 9th, 2017]
- Legaltech 2017: Announcements, AI, And The Future Of Law - Above the Law [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- Can AI make Facebook more inclusive? - Christian Science Monitor [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- How a poker-playing AI could help prevent your next bout of the flu - ExtremeTech [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- Dynatrace Drives Digital Innovation With AI Virtual Assistant - Forbes [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- AI and the end of truth - VentureBeat [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- Taser bought two computer vision AI companies - Engadget [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- Google's DeepMind pits AI against AI to see if they fight or cooperate - The Verge [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- The Coming AI Wars - Huffington Post [Last Updated On: February 10th, 2017] [Originally Added On: February 10th, 2017]
- Is President Trump a model for AI? - CIO [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- Who will have the AI edge? - Bulletin of the Atomic Scientists [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- How an AI took down four world-class poker pros - Engadget [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- We Need a Plan for When AI Becomes Smarter Than Us - Futurism [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- See how old Amazon's AI thinks you are - The Verge [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- Ford to invest $1 billion in autonomous vehicle tech firm Argo AI - Reuters [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- Zero One: Are You Ready for AI? - MSPmentor [Last Updated On: February 11th, 2017] [Originally Added On: February 11th, 2017]
- Ford bets $1B on Argo AI: Why Silicon Valley and Detroit are teaming up - Christian Science Monitor [Last Updated On: February 12th, 2017] [Originally Added On: February 12th, 2017]
- Google Test Of AI's Killer Instinct Shows We Should Be Very Careful - Gizmodo [Last Updated On: February 12th, 2017] [Originally Added On: February 12th, 2017]
- Google's New AI Has Learned to Become "Highly Aggressive" in Stressful Situations - ScienceAlert [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- An artificially intelligent pathologist bags India's biggest funding in healthcare AI - Tech in Asia [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- Ford pledges $1bn for AI start-up - BBC News [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- Dyson opens new Singapore tech center with focus on R&D in AI and software - TechCrunch [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- How to Keep Your AI From Turning Into a Racist Monster - WIRED [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- How Chinese Internet Giant Baidu Uses AI And Machine Learning - Forbes [Last Updated On: February 13th, 2017] [Originally Added On: February 13th, 2017]
- Humans engage AI in translation competition - The Stack [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Watch Drive.ai's self-driving car handle California city streets on a ... - TechCrunch [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Cryptographers Dismiss AI, Quantum Computing Threats - Threatpost [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Is AI making credit scores better, or more confusing? - American Banker [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- AI and Robotics Trends: Experts Predict - Datamation [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- IoT And AI: Improving Customer Satisfaction - Forbes [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- AI's Factions Get Feisty. But Really, They're All on the Same Team - WIRED [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Elon Musk: Humans must become cyborgs to avoid AI domination - The Independent [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Facebook Push Into Video Allows Time To Catch Up On AI Applications - Investor's Business Daily [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Defining AI, Machine Learning, and Deep Learning - insideHPC [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- AI Predicts Autism From Infant Brain Scans - IEEE Spectrum [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- The Rise of AI Makes Emotional Intelligence More Important - Harvard Business Review [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- Google's AI Learns Betrayal and "Aggressive" Actions Pay Off - Big Think [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- AI faces hype, skepticism at RSA cybersecurity show - PCWorld [Last Updated On: February 15th, 2017] [Originally Added On: February 15th, 2017]
- New AI Can Write and Rewrite Its Own Code to Increase Its Intelligence - Futurism [Last Updated On: February 17th, 2017] [Originally Added On: February 17th, 2017]