Challenges of security engineering at scale

As companies grow and work with massive amounts of data, the challenges faced by their security teams grow manifold. This topic matters not only to existing security practitioners and experts but also to newcomers to the field who may not have experienced these challenges themselves. In this post, I will highlight a few major challenges, sane practices and pointers on how to engineer security at scale. These are not specific to any one company; they are generally applicable concepts that come into play at scale. To put everyone on the same page: when I say companies at scale, I am referring to large tech companies that serve millions of users every day. The challenges I discuss are therefore technical in nature. Lastly, this is not an exhaustive list but a starting point for a much broader and more complex discussion.

Asset inventory: It is rightly said that you cannot protect what you don’t know about. Tracking a few hundred devices is very different from tracking over 100,000 machines. An inventory means knowing not only that a device exists but also its configuration, which user(s) it is assigned to, which operating system and patch level it is running, when it was last online, and so on. More importantly, the inventory must be kept up to date at all times rather than lagging behind reality, because it is one of the sources of truth that the security team (as well as other teams) relies upon.
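
To make the fields above concrete, here is a minimal sketch of an inventory record together with a check for records that have stopped updating. The field names, the 24-hour threshold and the sample hosts are all assumptions for illustration, not taken from any particular inventory product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Asset:
    """One inventory record; the field names are hypothetical."""
    hostname: str
    assigned_users: List[str]
    os_name: str
    patch_level: str
    last_seen: datetime  # last time the host reported in (UTC)


def stale_assets(assets: List[Asset], now: datetime,
                 max_age: timedelta = timedelta(hours=24)) -> List[Asset]:
    """Return records whose last report is older than max_age."""
    return [a for a in assets if now - a.last_seen > max_age]


# Sample data, made up for this example.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
fleet = [
    Asset("laptop-001", ["alice"], "macOS", "14.2", now - timedelta(hours=2)),
    Asset("server-042", [], "Linux", "6.1.0", now - timedelta(days=9)),
]
print([a.hostname for a in stale_assets(fleet, now)])  # ['server-042']
```

A stale record does not necessarily mean the host is gone; it only means the inventory can no longer vouch for it, which is exactly the situation a source of truth should surface.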

Fleet management: This goes hand in hand with asset inventory, and sometimes both are integrated into a single software solution. Once you know your assets, fleet management is responsible for functionality such as updating hosts with the latest updates, deploying patches and gathering system information. This capability can be critical, for instance, to an application security team that is in a rush to deploy a patch to secure all hosts against the latest SSL bug. In an ideal world, the fleet management software should be able to reach every host in the asset inventory and provide an easy way to manage them remotely.
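
As a rough illustration of the patching scenario, the sketch below picks out the hosts that still run a vulnerable package version. It assumes the inventory can be exported as a mapping of host name to installed package versions and that versions are purely numeric and dotted; both are simplifying assumptions, and the host names and version numbers are made up.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '3.0.7' into (3, 0, 7)."""
    return tuple(int(part) for part in version.split("."))


def hosts_needing_patch(inventory: Dict[str, Dict[str, str]],
                        package: str, fixed_version: str) -> List[str]:
    """Return hosts that have `package` installed below `fixed_version`."""
    fixed = parse_version(fixed_version)
    return sorted(
        host for host, packages in inventory.items()
        if package in packages and parse_version(packages[package]) < fixed
    )


# Hypothetical inventory export: host -> {package: installed version}.
inventory = {
    "web-01": {"openssl": "3.0.2"},
    "web-02": {"openssl": "3.0.7"},
    "db-01": {"postgres": "15.1"},
}
print(hosts_needing_patch(inventory, "openssl", "3.0.7"))  # ['web-01']
```

Real version strings often carry letters or distribution suffixes, so a production tool would need a proper version comparison rather than this tuple trick.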

Scalable Evidence Collection: When a company is warding off an external or internal attacker, a sane, tamper-proof evidence collection and storage mechanism is a lifesaver for a detection and response team. Evidence can be anything: a file, the output of a command, or specific data from the system such as a memory dump or a copy of the disk image. Evidence can also serve as an indicator of compromise. At scale, the challenge with evidence collection arises because your hosts can be in different states, such as the following:

  • Offline
  • Stuck during the boot process
  • In the process of rebooting
  • Installing system updates before allowing the user to log in

The evidence collection mechanism should be able to issue commands and gather artifacts from systems even when they are not fully available (due to any of the above states). Since the collection is done remotely, end-to-end encryption would be ideal. A fantastic piece of software that does this is GRR and, best of all, it is open source. When hosts are unavailable, it can keep the request pending and collect the desired artifacts once a machine comes back online. I highly recommend checking it out.
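
The idea of holding requests until a host is reachable can be sketched in a few lines. This is not GRR's API or design; the class, its method names and the `collect` callback are hypothetical stand-ins meant only to show the queue-then-collect pattern, with a SHA-256 digest recorded per artifact so that later changes to stored evidence can be detected.

```python
import hashlib
from collections import deque
from typing import Callable, Deque, Dict, List


class CollectionQueue:
    """Holds artifact requests per host until that host checks in."""

    def __init__(self) -> None:
        self._pending: Dict[str, Deque[str]] = {}

    def request(self, host_id: str, artifact: str) -> None:
        """Queue a request; the host may be offline or rebooting right now."""
        self._pending.setdefault(host_id, deque()).append(artifact)

    def pending(self, host_id: str) -> List[str]:
        """Requests still waiting for this host."""
        return list(self._pending.get(host_id, ()))

    def check_in(self, host_id: str,
                 collect: Callable[[str], bytes]) -> List[dict]:
        """Run queued requests once the host is reachable again.

        `collect` stands in for whatever actually fetches the artifact
        bytes from the host (an assumption of this sketch).
        """
        records = []
        queue = self._pending.pop(host_id, deque())
        while queue:
            artifact = queue.popleft()
            data = collect(artifact)
            records.append({
                "host": host_id,
                "artifact": artifact,
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            })
        return records


queue = CollectionQueue()
queue.request("laptop-001", "/etc/hosts")  # host is offline at this point
records = queue.check_in("laptop-001", lambda path: b"127.0.0.1 localhost\n")
print(records[0]["artifact"], records[0]["size"])  # /etc/hosts 20
```

A digest only makes tampering detectable if the digest itself is stored somewhere the attacker cannot rewrite, so in practice the records would go to separate, append-only storage.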

Speed vs storage: This is more of a meta point, but you will face decisions between a faster solution that takes up more space (in memory or on disk) and a slower solution with a smaller storage footprint. This is a typical software engineering challenge; I am including it here because it is extremely common when writing your own security software that handles petabyte-scale data. It is ideal to foresee these key decision points BEFORE writing software and to brainstorm them. The ideal place to record these decisions is the design document. Once the decisions are documented, well thought out and agreed upon, building and maintaining software becomes much more effective.
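
A toy version of this trade-off: looking up all log messages for a source IP either by scanning every event on each query (no extra space, slow lookups) or by building an index once (extra memory, fast lookups). The event shape and sample data are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (source_ip, message); a hypothetical log shape


def scan(events: List[Event], ip: str) -> List[str]:
    """No extra memory, but every lookup walks all events: O(n)."""
    return [msg for src, msg in events if src == ip]


def build_index(events: List[Event]) -> Dict[str, List[str]]:
    """Extra memory proportional to the data, paid once up front."""
    index: Dict[str, List[str]] = defaultdict(list)
    for src, msg in events:
        index[src].append(msg)
    return index


events = [
    ("10.0.0.5", "login failed"),
    ("10.0.0.9", "login ok"),
    ("10.0.0.5", "login ok"),
]
index = build_index(events)

# Both approaches give the same answer; only the cost profile differs.
print(scan(events, "10.0.0.5") == index.get("10.0.0.5", []))  # True
```

With three events the difference is invisible; at petabyte scale the index may not fit on one machine and the scan may take hours, which is why the choice belongs in the design document rather than in a late refactor.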

Risk management: This is a high-level point, as everything in security revolves around managing risk. The important thing to note, however, is that not all risk can be mitigated, especially when dealing with scale. This is where leadership and technical expertise come together to decide which risks are accepted and which ones the team should address by prevention, detection or transferring them out. For example, a company might deploy a patch for a vulnerability (prevention), have its security team detect and respond to an attack when it happens (detection), outsource the risk to a vendor (transfer), or accept the risk by making no change. This example is very simplistic; risk management decisions are business decisions, and different priorities and resources can result in different choices being made. At scale, these challenges just become more complex with competing needs and priorities.

Maintainable and readable code: This might sound like a generic best practice, but it matters especially when your company owns and maintains internally written software. When code is complex and hard to maintain, either engineers spend a lot of time on it (i.e. engineering resources) or it goes unmaintained and turns into technical debt. Neither situation is ideal. When the size of your code base increases dramatically, these simple best practices become key.

Compliance: Given the large swath of country-specific (or even state-specific) laws that exist today for users’ security and privacy, it is a challenge for the security compliance team to effectively make changes to products and services that keep the company in compliance. At scale, tackling this challenge can involve both technical changes (e.g. data retention policies) and non-technical changes (e.g. advance notice to users, seeking consent etc). The second aspect of compliance is ensuring and proving that the company is compliant at all times. If you think about it, making a technical change may involve altering software on a certain set of hosts, which is much easier to do if the existing code is easy to maintain and there is good inventory and fleet management to know where to deploy the changes. In this example, everything comes together.
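
To illustrate what a technical change such as a data retention policy might look like in code, here is a small sketch that flags records older than the retention period for their category. The categories and retention periods are invented for the example and do not reflect any specific law or regulation.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical policy: data category -> how long records may be kept.
RETENTION: Dict[str, timedelta] = {
    "access_logs": timedelta(days=90),
    "support_tickets": timedelta(days=365),
}


def expired(records: List[dict], now: datetime,
            retention: Dict[str, timedelta]) -> List[dict]:
    """Return records older than the retention period for their category.

    Categories without a policy are left alone rather than deleted.
    """
    return [
        r for r in records
        if r["category"] in retention
        and now - r["created"] > retention[r["category"]]
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)  # 152 days before `now`
records = [
    {"id": 1, "category": "access_logs", "created": created},
    {"id": 2, "category": "support_tickets", "created": created},
    {"id": 3, "category": "uncategorized", "created": created},
]
print([r["id"] for r in expired(records, now, RETENTION)])  # [1]
```

Keeping the policy in one table like `RETENTION` also helps with the second aspect of compliance, since the same table can be shown to an auditor as evidence of what is enforced.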

Employee awareness and training: It is commonly said that security is everyone’s responsibility, and this is most definitely true at scale. We can have the best defense systems and security team, but sometimes all it takes is one click in a phishing email to do serious damage. It is critical that employees are trained to follow best practices and adhere to internal security policies so that protection is distributed across the whole company.