How Open Source is Driving the Future of Data Science – RTInsights

With its reliance on a community of physically dispersed individuals and flexibility of adoption, open-source data science is becoming an even more attractive choice among cash-strapped governments, non-profits, and businesses.

Over the past decade, data science and machine learning have made their way from an obscure academic discipline to widespread corporate adoption. The academic community has a natural preference towards open source. Science is a collaborative effort, and its advancement is best served by enabling as large a community as possible to build upon existing research.

Private companies, on the other hand, have a much stronger incentive for proprietary technology. Developing software systems is an expensive endeavor. Naturally, a business wants to make a return on this investment. Making the results of your work freely available to competitors doesn't seem like the smartest choice if you are a business owner.

Still, in data science, several powerful incentives pull corporate interests in the direction of favoring open-source implementations.

Open-source tools offer a lower barrier to entry than licensed software. Companies can experiment more easily and with fewer constraints. They are also more likely to find talent for programming languages and data science tools that are freely available to everyone.

A case in point is Python, the dominant programming language for data science, which happens to be open source. It has the most versatile and extensive capabilities for manipulating data and building machine learning models. Python has even superseded commercial tools like MATLAB in terms of capabilities for data science applications.

Most data science and machine learning frameworks, such as TensorFlow, scikit-learn, or PyTorch, build directly on Python and are also open source.

Often, their creators are large companies that are already dominant in their respective markets. Evidently, the benefits of making a library like TensorFlow open source outweigh the costs for its creator, Google.

While Google gave potential competitors a powerful deep learning tool, it probably benefits more from the massively expanded talent pool, the sprawling deep learning innovation, and the widespread adoption of the framework by other companies that open-sourcing TensorFlow entailed.

Other machine learning libraries, such as XGBoost, originated as research projects in universities. For these institutions, the benefits of open-source software are overwhelming for the reasons discussed above.

Most machine learning models require large amounts of data to train. Modern machine learning models, especially deep neural networks used in computer vision and natural language processing, require vast amounts of computational resources to train. This would present an almost insurmountable challenge for smaller organizations and individuals, who simply do not have this amount of data internally, nor the budget to run expensive model training experiments. If it weren't for open-source data, machine learning would be almost exclusively the domain of large corporations. This may be in the interest of the shareholders of said corporations, but certainly not of society at large, which benefits from the innovations produced by startups and individuals.

Even for large corporations, the widespread availability of open-source data and pre-trained machine learning models has benefits.

Many of the cutting-edge models developed by researchers at companies like Google and Facebook have been open-sourced. Anyone can download these models from GitHub and use them in their custom data science projects.

But why are these corporations so generous in sharing their models and their data?

From the perspective of an established corporation, it makes sense to avoid risky ventures and instead aim to expand market share through more traditional strategies.

Startups tend to be better suited for engaging in novel high-risk ventures because they are smaller, more agile, and have far less to lose.

If a large company wants to enter a novel market or obtain new technology, acquiring a successful startup in the desired field may be a smarter move than trying to do everything from scratch in-house.

For example, Google acquired DeepMind in 2014 for the potential it saw in DeepMind's research in reinforcement learning and general-purpose AI.

To maximize the potential for the emergence of innovative data science and artificial intelligence startups, it makes sense to give ambitious new upstarts the tools and data they need.

Furthermore, many of the researchers working on commercial projects come from academic settings. They bring with them a culture of collaboration based on open source.

Researchers and developers are naturally inclined to showcase their work. Therefore, a commitment to open source and the opportunity for employees to participate in open-source projects can go a long way toward making a company a more attractive employer for highly coveted data science talent.

The foundational knowledge for data science includes advanced skills in mathematics, statistics, and programming. Until a few years ago, this knowledge was deeply buried in academic textbooks and usually acquired by obtaining a technical university degree.

Today, an ambitious self-starter can learn all of these things via resources that are freely available on the web. An army of YouTube educators and bloggers has emerged that makes previously dry and highly academic topics accessible in a fun and easy-to-digest way.

These new educational resources grow the talent pool by making data science more accessible for a larger group of people, which also benefits companies.

Without open-source software and open-source data, offering this type of education for free would be much more difficult.

Online education platforms offer academic curricula that often match or exceed traditional university courses in terms of quality. In many cases, these courses are accompanied by GitHub repositories full of open-source code.

Developing and maintaining a custom data science solution from scratch in-house presents a major challenge to most companies. The larger a software system grows, the more susceptible it is to bugs and the more difficult it is to find problems in the source code and deploy the system into production.

Building on open-source software and models can significantly alleviate these burdens and speed up time to market. Bugs in widely used open-source libraries are likely to have been discovered by previous users. If bugs do occur, developers are free to go into the code and fix them without having to worry about violating licensing agreements. If the open-source tool turns out not to be a good fit, no money has been sunk into a failed trial.

Even for private businesses that have a commercial interest in protecting their software, there are strong incentives for using and building open-source data science solutions.

More recently, the Covid-19 pandemic has put many organizations under enormous pressure to digitize data-heavy processes as quickly as possible while physically scattering technical talent. With its reliance on a community of physically dispersed individuals and flexibility of adoption, open-source data science is becoming an even more attractive choice among cash-strapped governments, non-profits, and businesses.
