Between 2008 and 2014, LinkedIn's member base grew from around 16 million members to around 330 million members globally. This exponential growth within a relatively short period of time placed strains on the organization's infrastructure. As our member base grew, the need increased for more insight into how our members were best utilizing our platform and what product lines we could introduce to best serve the needs of professionals worldwide.
As such, numerous internal, web-based applications were produced that engineers, product managers, executives, and operations teams used to perform a variety of crucial tasks, including A/B testing, application deployment and lifecycle management, reporting, analytics, and more. As these new applications were rapidly fleshed out, new approaches were likewise taken to solve technology problems by introducing and vetting a number of different languages, frameworks, and tools. Such growth and experimentation resulted in a lack of uniformity among technologies and solutions between groups, which created strains as more engineers were hired to fill roles within emerging teams.
Languages such as Python, Ruby, Java, Scala, JavaScript, and others emerged in various organic efforts or acquisitions that peppered the ecosystem with incompatible but useful solutions. Mind you, this effort of exploration was a healthy and intentional endeavor to seek out the best, long-term solutions for problems we were seeking to solve. Teams were encouraged to seek out technologies that they felt might provide benefit to the organization and this exploratory process was instrumental in helping us define what we would lean on, long-term, as a scaling and notable organization in Silicon Valley and across the globe.
By mid-2015, several dozen active projects had surfaced with various implementations, frameworks, and libraries. Because of the varying approaches, teams struggled to share code and, though repositories and artifacts were in place, the implementations themselves lacked uniformity. In the case of JavaScript, some teams used a composite of libraries and micro-libraries like jQuery and Backbone.js, some used popular frameworks, while others rolled their own. A growing uncertainty loomed around how we would approach building front-ends for applications, how developers could share common logic across teams, and how we could streamline best practices for developer ergonomics and satisfaction.
As you can imagine, this growing number of varying technologies also introduced a security debt over time. Having a large number of frameworks, languages, and components made it increasingly difficult to assess the security posture of the applications built on top of them. Additionally, this undermined the efforts to move towards releasing a common framework-level security solution to eliminate certain classes of vulnerabilities.
Around this same time, our complex infrastructure for the LinkedIn.com website alone surpassed 3,000 submodules across over 6 million lines of code, all within a single repository trunk. This trunk approach was governed by a tedious monthly release cycle that encompassed hundreds of engineers across various teams. Prior to each release, a release candidate (RC) was selected and handed off to our testing teams for four days of manual regression testing. If any bugs were found, hotfixes were made to the RC to avoid disrupting the deployment. Engineers rushed to get their code checked in before the deadline to avoid having to wait an entire month to deliver their features or bug fixes to members.
Figure 1 - Our former deployment process
Such a serial and time-sensitive process had to be meticulously timed with product managers and marketing partners to coincide with their plans for new feature releases. It wasn't easy to iterate on member feedback because of this infrequent cadence of twelve releases per year.
Further, the need for prevention and remediation of potential security vulnerabilities had a strong dependency upon the deployment and release process. It was imperative that, once a fix was identified and in place, it reach the production environment as quickly as possible. This often meant that security fixes were hotfixed against the release cycle instead of incorporated within it. It is generally a good practice to deploy a security patch in isolation, i.e. not to add other non-security-related bug fixes along with a security update. This is primarily to reduce the chances of re-introducing a security vulnerability if the patch is rolled back due to a non-security-related update breaking functionality on the site.
A less obvious side-effect of hyper-growth, a long release cadence, and a mixture of technologies was an emerging blocky and inconsistent User Experience (UX). As LinkedIn began to employ user research in its product development process, we found that many users felt the site was disconnected and that one page would look different from another. Because teams were releasing so far apart in cycles, the feedback loop for UI changes was delayed, impacting the quality of changes over time.
In 2014, LinkedIn's mobile engineering teams began experimenting with what would become our current release model. Dubbed "Train Release", this approach shifted from our monthly release cadence to one called "3x3" that would release three times per day, with no more than three hours between when code was committed and when that code became available to members.
The idea behind this was not just web-specific. The ultimate goal was to have all platforms running on this cadence including iOS, Android, API, and other back-end services necessary to run the LinkedIn.com experience and extended product lines and services.
Transitioning to such a release model was very challenging and required involvement from all areas of engineering, especially from our tooling and infrastructure teams. This would mean that our existing internal applications and tooling would require revisiting to ensure that developers were receiving timely information about the status of their changes in the deployment process, that they had the proper scripting and systems in place to automate much of that process, and that proper end-to-end testing could take place to ensure adequate code coverage.
Because of the short window between release cycles in this newly proposed approach, not all changes could be tested manually in such a short amount of time, placing a strong and necessary emphasis on testing and automation. This, among other challenges, resulted in the need for a client-side technology that placed a natural emphasis on testing in its development lifecycle, as testing has not, historically, been in the purview of client-side engineering in the industry, particularly on the web. This, combined with the many pain points listed above, became the catalyst for change in not only the LinkedIn.com experience, but in the structure of our infrastructure and application-level technology stack.
While this meant that a security issue identified on the platform could now be fixed in a short period of time, it also meant that security issues could slip in just as quickly at this frequency of code deployment. And when such a model is adopted by the 100+ applications that comprise the LinkedIn ecosystem, the need for security automation is amplified.
A simple but powerful approach towards securing web applications is to rely upon frameworks that offer inherent security features. Of course, the choice of framework depends not only on security, but also on performance, ease of use, and multiple other factors.
Our infrastructure team, in collaboration with other partner teams, began extensively researching and vetting many languages, frameworks, and libraries, creating gap analyses for technologies that spanned the server-client relationship and the applications that would serve as tooling for future spin-ups in new and existing product lines and internal platforms.
Alongside this effort, our User Experience Research team established extensive focus groups and feedback efforts from members to gain a better sense of what the ideal experience for LinkedIn.com would be.
These joint efforts of product, design, and engineering resulted in project Voyager - a mobile-first, consistent User Experience (UX) across mobile platforms (iOS, Android, and Web). Focusing on mobile first gave us the opportunity to later extend to a future desktop application that would embrace the same UI patterns and theme, providing a consistent experience across all platforms.
Figure 2 - Project Voyager
As a result of this effort, two frameworks were chosen as the de facto standards for building our API and web client: the Play Framework for Java and the Ember framework for web. LinkedIn had been contributing back to the Play Framework for some time prior to this new project, and our security team performed an extensive gap analysis comparing the security features currently available in these frameworks against those required for our stack.
Our analysis concluded that the Play Framework provided a secure-by-default approach along with a responsive security team, an active core developer community, extensive documentation, and a stable release cycle.
Ember shared all of these traits as well. As a Single-Page Application (SPA) framework, Ember also provided:
By moving the web to a client-side application, we were able to establish a uniform internal API for all clients (iOS, Android, Web), better aligning our infrastructure across various platforms and reducing the number of applications needed to provide data.
Ember's focus on tests took us one step further towards automating deployment and was instrumental in our efforts towards embracing 3x3. The framework provides three different types of tests: integration, acceptance, and unit tests. Integration tests allowed us to test data flows and interactions of components in the application, acceptance tests gave us user interaction testing, and unit tests provided us with ways to test application logic. As new components were generated by developers, the framework would produce the test files as well, allowing the developer to focus more on writing code and verifying its functionality in the tests, as opposed to just in the browser.
At LinkedIn, our security team performs in-depth design reviews, including hands-on penetration tests, for all member-facing products, features, and functionalities. We are also heavily invested in security automation; however, with the 3x3 deployment architecture, we couldn't possibly scale manual penetration tests for all builds, resulting in a decision to double down on security automation. Once we find a class of vulnerability that can be detected with a high level of confidence, we build automation checks for that class. Our partner product security engineering teams have helped in building, maintaining, and scaling such automation. This allows us to focus on more interesting areas of the application and underlying framework and provides us more time to research in-depth vulnerabilities in those areas.
As API endpoints were added to the application, a security analysis would need to occur to prevent vulnerabilities from emerging. Previously, this process was operationally cumbersome given not only the number of routes (paths to resources or URLs) per application but also the number of such applications that existed in the system. Our security team built tooling to detect and report any new changes made to an application, broken down by the nature of the change (addition or deletion of an external API route, modification to key code of the application, etc.) to assist in the evaluation of each such commit to the application. Thus, we were in a position to determine the state of an application since the last review. This allowed for targeted reviews, ensuring broader coverage and faster turnaround time for the assessment of our applications.
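To make the route-change detection idea concrete, here is a minimal sketch in Python. It is not LinkedIn's tooling: it simply assumes each application exposes a routes file with one "METHOD /path handler" entry per line (the general shape of a Play routes file), and the names Route, parse_routes, and diff_routes, along with the sample routes, are hypothetical.

```python
from typing import NamedTuple


class Route(NamedTuple):
    method: str
    path: str
    handler: str


def parse_routes(text: str) -> set[Route]:
    """Parse 'METHOD /path handler' lines, skipping blank lines and # comments."""
    routes = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)  # method, path, rest of line as handler
        if len(parts) == 3:
            routes.add(Route(*parts))
    return routes


def diff_routes(reviewed: str, current: str) -> dict[str, list[Route]]:
    """Report routes added or removed since the last reviewed snapshot."""
    old, new = parse_routes(reviewed), parse_routes(current)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


if __name__ == "__main__":
    # Hypothetical snapshots: routes at the last security review vs. today.
    reviewed = """
    GET   /profile/:id   controllers.Profile.view(id)
    POST  /messages      controllers.Messages.send()
    """
    current = """
    GET   /profile/:id   controllers.Profile.view(id)
    POST  /profile/:id   controllers.Profile.update(id)
    """
    for kind, routes in diff_routes(reviewed, current).items():
        for route in routes:
            print(kind, route.method, route.path, route.handler)
```

A reviewer would then only need to look at the "added" and "removed" entries rather than re-auditing every route in every application.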
An established approach to security assessment is through the adoption of a security in depth principle. We do not want to be in a situation where the failure of a particular security control results in the failure of the entire chain. We love to give developers the tooling they need to avoid introducing security vulnerabilities. Our product security team built tooling to scan the code changes for potential vulnerabilities and, if any anti-patterns or discouraged practices were revealed, the developer would be notified even before committing the code, where they would have the opportunity to properly address the problem with the tool providing suggested code fixes. Once past the code review process, the changes were again analyzed and if a potential vulnerability was found, the code was rejected at the pre-commit stage of the deployment pipeline, through the adoption of custom pre-commit hooks.
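As an illustration of what such a scan might look like, the following Python sketch flags two well-known XSS-prone patterns and returns a suggested fix for each. The rule set, the names Finding, scan, and hook_exit_code, and the exit-code convention are assumptions made for this example; they are not the actual rules or interface of the tooling described above.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    line_no: int
    rule: str
    suggestion: str


# Hypothetical rules: (name, pattern, suggested fix).
RULES = [
    (
        "unescaped-template-output",
        re.compile(r"\{\{\{.*?\}\}\}"),  # Handlebars triple-stash emits raw HTML
        "use {{value}} so the template engine escapes the output",
    ),
    (
        "raw-innerHTML-assignment",
        re.compile(r"\.innerHTML\s*="),
        "assign to textContent or sanitize the markup first",
    ),
]


def scan(source: str) -> list[Finding]:
    """Return one finding per rule match, in line order."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule, pattern, suggestion in RULES:
            if pattern.search(line):
                findings.append(Finding(line_no, rule, suggestion))
    return findings


def hook_exit_code(source: str) -> int:
    """Return non-zero when the change should be rejected by the hook."""
    return 1 if scan(source) else 0


if __name__ == "__main__":
    # Hypothetical change under review.
    change = "<p>{{{bio}}}</p>\nel.innerHTML = data;"
    for finding in scan(change):
        print(f"line {finding.line_no}: {finding.rule}: {finding.suggestion}")
```

The same scan function could serve both stages: as an advisory check on the developer's machine and as a hard gate in the pipeline via the non-zero exit code.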
Once a security issue has been identified on the platform through any channel, our goal is to prevent the same issue from surfacing back on the platform in the future. Thus, to help avoid regression issues, we build tooling and test cases that run continuously against the deployed services, detect the reintroduction of a particular instance of a security vulnerability, and send an alert to the security team for investigation.
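A regression check of this kind can be sketched as a list of previously fixed issues, each paired with the input that originally triggered it, replayed against the service on a schedule. The Python below is only an illustration under stated assumptions: the RegressionCase fields, the issue identifiers, and the reflected-payload heuristic are hypothetical, and the fetch function is injected so the example runs without a network.

```python
import html
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    issue_id: str  # hypothetical tracker identifier
    path: str      # endpoint that was once vulnerable
    payload: str   # input that triggered the original issue


def find_regressions(cases, fetch: Callable[[str, str], str]) -> list[str]:
    """Return the ids of issues whose payload is echoed back unescaped."""
    regressed = []
    for case in cases:
        body = fetch(case.path, case.payload)
        if case.payload in body:  # raw payload in the response: escaping failed
            regressed.append(case.issue_id)
    return regressed


def fake_fetch(path: str, payload: str) -> str:
    """Stand-in for an HTTP call: /echo has regressed, everything else escapes."""
    return payload if path == "/echo" else html.escape(payload)


if __name__ == "__main__":
    cases = [
        RegressionCase("SEC-1", "/search", "<script>x</script>"),
        RegressionCase("SEC-2", "/echo", "<img src=x>"),
    ]
    for issue_id in find_regressions(cases, fake_fetch):
        print("alert security team:", issue_id)
```

In a real deployment the fetch argument would issue requests against the running service, and the returned ids would feed whatever alerting channel the security team uses.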
Figure 3 - Pre-commit hook tool for scanning for potential XSS
In January of 2014, prior to the introduction of this system, we identified over 5,000 potential XSS vulnerabilities through code scans; by January of 2016, that number was less than 500. Further, the number of pre-commit failures caused by offending commits steadily declined during this same timeframe. Through our methodical approach to security automation, we saw a reduction of close to 90% in the presence and introduction of potential vulnerabilities.
Figure 4 - Our current (3x3) release cycle
To date, in our web client, we average around 50-75 commits per day from a developer body of over 100 UI engineers in a single repository. Each of these commits goes through a code review system and requires several separate approvals from developers on Access Control Lists (ACLs) to ensure code is of the highest quality. Code is evaluated against a style guide of best practices and is also scanned with linters for various languages to ensure developers are writing code similar to that of their peers. As mentioned above, these code changes also undergo an automated custom security scan to check for known classes of vulnerabilities including, but not limited to, XSS, CSRF, and access control issues. Once developers receive the approvals they need and have addressed any open issues, their code is submitted through a set of systems to ensure its health before it makes it into the set of commits bound for the next deployment.
These systems perform a variety of tasks and run all of the application's tests. If these tests do not pass, the commit is rejected and the developer is notified. Once the commit passes, it enters the trunk of the application, and a separate set of tests is run to ensure the code did not introduce performance regressions or exceed other thresholds established by the application owners. Assuming all of these pass, the commit ends up in a production environment at the next available deployment. If no commits are introduced between deployment times, a deployment is still performed against the same version; this is part of the 3x3 methodology and ensures that code is rigorously tested.
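The gating logic described here can be summarized in a few lines of Python. This is a simplified sketch, not the real pipeline: the Commit fields, the single page_load_ms metric, and the HELD outcome (what happens when a post-trunk check fails is not specified above) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REJECTED = "rejected before trunk, developer notified"
    HELD = "in trunk, held back from deployment"  # assumed behavior
    QUEUED = "queued for the next deployment"


@dataclass
class Commit:
    sha: str
    tests_pass: bool      # result of the application's test suite
    page_load_ms: float   # hypothetical post-trunk performance metric


def evaluate(commit: Commit, max_page_load_ms: float) -> Outcome:
    """Apply the two gates in order: test suite first, then thresholds."""
    if not commit.tests_pass:
        return Outcome.REJECTED
    if commit.page_load_ms > max_page_load_ms:
        return Outcome.HELD
    return Outcome.QUEUED


if __name__ == "__main__":
    # Hypothetical commits and threshold set by the application owners.
    commits = [
        Commit("a1b2c3", tests_pass=True, page_load_ms=820.0),
        Commit("d4e5f6", tests_pass=False, page_load_ms=640.0),
        Commit("0a9b8c", tests_pass=True, page_load_ms=1450.0),
    ]
    for commit in commits:
        print(commit.sha, "->", evaluate(commit, max_page_load_ms=1000.0).value)
```

Ordering the gates this way keeps failing commits out of the trunk entirely, so the threshold checks only ever run against code that already passes its own tests.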
The resulting shift towards a new release model has been instrumental in our ability to scale as an organization and has ushered in increased code quality, security, productivity, and member satisfaction. We are now more capable of providing our members with a safer, faster, modern experience, quickly resolving issues or bugs that are found, and innovating more rapidly.
Source: Developing a Secure and Scalable Web Ecosystem at LinkedIn - InfoQ.com