How to make the most of your AI/ML investments: Start with your data infrastructure

The era of Big Data has helped democratize information, creating a wealth of data and growing revenues at technology-based companies. But for all this intelligence, we're not getting the level of insight from machine learning that one might expect, as many companies struggle to make machine learning (ML) projects actionable and useful. A successful AI/ML program doesn't start with a big team of data scientists. It starts with strong data infrastructure. Data needs to be accessible across systems and ready for analysis so data scientists can quickly draw comparisons and deliver business results. It also needs to be reliable, and that is the challenge many companies face when starting a data science program.

The problem is that many companies jump feet first into data science, hire expensive data scientists, and then discover they don't have the tools or infrastructure data scientists need to succeed. Highly paid researchers end up spending time categorizing, validating and preparing data instead of searching for insights. This infrastructure work is important, but it keeps data scientists from applying their most useful skills where they add the most value.

When leaders evaluate the reasons for success or failure of a data science project (and 87% of projects never make it to production), they often discover their company tried to jump ahead to the results without building a foundation of reliable data. If they don't have that solid foundation, data engineers can spend up to 44% of their time maintaining data pipelines as APIs or data structures change. Automating data integration can give engineers that time back and ensure companies have all the data they need for accurate machine learning. This also helps cut costs and maximize efficiency as companies build their data science capabilities.

Machine learning is finicky: if there are gaps in the data, or it isn't formatted properly, machine learning either fails to function or, worse, gives inaccurate results.
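As a rough illustration of catching gaps and formatting problems before they reach a model, here is a minimal Python sketch. The field names, the date format, and the helpers `find_problems` and `split_clean_and_rejected` are all hypothetical, chosen only for this example; a real pipeline would use its own schema and tooling.

```python
from datetime import datetime

# Hypothetical schema: field name -> parser that raises ValueError on bad input.
REQUIRED_FIELDS = {
    "customer_id": str,
    "signup_date": lambda v: datetime.strptime(v, "%Y-%m-%d"),
    "monthly_spend": float,
}


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, parse in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None or value == "":
            # A gap in the data: the field is absent or empty.
            problems.append(f"{field}: missing")
            continue
        try:
            parse(value)
        except (ValueError, TypeError):
            # The value is present but not in the expected format.
            problems.append(f"{field}: bad format ({value!r})")
    return problems


def split_clean_and_rejected(records):
    """Separate records that pass every check from those that do not."""
    clean, rejected = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


# Made-up sample rows for illustration only.
records = [
    {"customer_id": "c1", "signup_date": "2021-03-04", "monthly_spend": "120.50"},
    {"customer_id": "c2", "signup_date": "03/04/2021", "monthly_spend": "80"},
    {"customer_id": "c3", "signup_date": "2021-05-01", "monthly_spend": ""},
]
clean, rejected = split_clean_and_rejected(records)
for record, problems in rejected:
    print(record["customer_id"], problems)
```

Keeping the rejected rows together with the reasons they failed, rather than silently dropping them, makes it easier to see whether the problem lies in the source system or in the pipeline.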

When companies are uncertain about their data, most ask the data science team to manually label the data set as part of supervised machine learning, but this is a time-intensive process that brings additional risks to the project. Worse, when the training examples are trimmed too far because of data issues, there's the chance that the narrow scope will mean the ML model can only tell us what we already know.

The solution is to ensure the team can draw from a comprehensive, central store of data, encompassing a wide variety of sources and providing a shared understanding of the data. This improves the potential ROI from the ML models by providing more consistent data to work with. A data science program can only evolve if it's based on reliable, consistent data, and an understanding of the confidence bar for results.

One of the biggest challenges to a successful data science program is balancing the volume and value of the data when making a prediction. A social media company that analyzes billions of interactions each day can use the large volume of relatively low-value actions (e.g., someone swiping up or sharing an article) to make reliable predictions. If an organization is trying to identify which customers are likely to renew a contract at the end of the year, then it's likely working with smaller data sets with large consequences. Since it could take a year to find out whether the recommended actions resulted in success, this creates massive limitations for a data science program.

In these situations, companies need to break down internal data silos and combine all the data they have to drive the best recommendations. This may include zero-party information captured with gated content, first-party website data, and data from customer interactions with the product, along with successful outcomes, support tickets, customer satisfaction surveys, and even unstructured data like user feedback. All of these sources contain clues as to whether a customer will renew their contract. By combining data silos across business groups, metrics can be standardized, and there's enough depth and breadth to create confident predictions.
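To make the idea of combining silos concrete, the following Python sketch joins three made-up extracts into one record per customer. The source names, fields, and the `combine_sources` helper are assumptions for illustration; in practice this join would more likely happen in a data warehouse than in application code.

```python
# Hypothetical extracts from three separate systems, all keyed by customer_id.
crm = [
    {"customer_id": "c1", "plan": "enterprise"},
    {"customer_id": "c2", "plan": "starter"},
]
support_tickets = [
    {"customer_id": "c1", "severity": "high"},
    {"customer_id": "c1", "severity": "low"},
]
survey = [
    {"customer_id": "c2", "csat": 4},
]


def combine_sources(crm, support_tickets, survey):
    """Build one standardized record per customer from the three sources."""
    combined = {}
    for row in crm:
        # The CRM extract is treated as the list of known customers.
        combined[row["customer_id"]] = {
            "plan": row["plan"],
            "ticket_count": 0,
            "csat": None,  # None means no survey response on file.
        }
    for ticket in support_tickets:
        customer = combined.get(ticket["customer_id"])
        if customer is not None:
            customer["ticket_count"] += 1
    for response in survey:
        customer = combined.get(response["customer_id"])
        if customer is not None:
            customer["csat"] = response["csat"]
    return combined


combined = combine_sources(crm, support_tickets, survey)
for customer_id, features in combined.items():
    print(customer_id, features)
```

Each customer ends up with the same set of fields regardless of which systems had data on them, which is the kind of standardized view a renewal model would need as input.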

To avoid the trap of diminishing confidence and returns from an ML/AI program, companies can take the following steps.

By building the right infrastructure for data science, companies can see what's important for the business, and where the blind spots are. Doing the groundwork first can deliver solid ROI, but more importantly, it will set the data science team up for significant impact. Getting a budget for a flashy data science program is relatively easy, but remember, the majority of such projects fail. It's not as easy to get budget for the boring infrastructure tasks, but data management creates the foundation for data scientists to deliver the most meaningful impact on the business.

Alexander Lovell is head of product at Fivetran.
