Challenges of security engineering at scale

As companies grow in size and work with massive amounts of data, the challenges faced by their security teams grow manifold. This topic matters not only to the existing security community of practitioners and experts but also to newcomers to the field who may not have experienced these challenges themselves. In this post, I will highlight a few major challenges, sane practices and pointers on how to engineer security at scale. Note that these are not specific to any company; they are generally applicable concepts that come into play automatically at scale. To bring everyone onto the same page, when I say companies at scale, I am referring to large tech companies that serve millions of users every day. The challenges I will be discussing are therefore technical in nature. Lastly, this is not an exhaustive list but rather a starting point for a much broader and more complex discussion.

Asset inventory: It is rightly said that you cannot protect what you don’t know about. Tracking a few hundred devices is very different from tracking over 100,000 machines. Inventory means knowing not only that a device exists but also its configuration, which user(s) it is assigned to, which operating system and patch level it is running, when it was last online, etc. More importantly, it is critical that this inventory is kept up to date at ALL times rather than updated with a delay, because it is one of the sources of truth that the security team (as well as other teams) relies upon.
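As a rough sketch of what one inventory record and a freshness check might look like: the field names, the sample hosts and the 24-hour threshold below are assumptions made for illustration, not taken from any particular inventory product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Asset:
    """One inventory record; the field names are illustrative."""
    hostname: str
    assigned_user: str
    os_name: str
    patch_level: str
    last_seen: datetime  # last time the host checked in (UTC)


def stale_assets(assets, now, max_age=timedelta(hours=24)):
    """Return assets whose last check-in is older than max_age."""
    return [a for a in assets if now - a.last_seen > max_age]


# Fixed timestamp so the example is deterministic.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
inventory = [
    Asset("laptop-001", "alice", "macOS", "14.2", now - timedelta(hours=2)),
    Asset("server-042", "infra", "Linux", "6.1.0", now - timedelta(days=3)),
]

for asset in stale_assets(inventory, now):
    print(f"{asset.hostname} last seen {asset.last_seen.isoformat()}")
```

Records that show up as stale are the ones where the inventory may no longer reflect reality, which is exactly where a source of truth starts to drift.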

Fleet management: This goes hand in hand with asset inventory, and sometimes both are integrated into a single software solution. Once you know your assets, fleet management is responsible for functionality like updating hosts with the latest updates, deploying patches, gathering system information, etc. Such capabilities can be critical, for instance, to an application security team that is in a rush to deploy a patch to secure all hosts against the latest SSL bug. In an ideal world, the fleet management software should be able to reach the entire asset inventory and provide an easy way to manage it remotely.
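To make the patching scenario concrete, here is a minimal sketch of picking out the hosts that still run a vulnerable version of a package. The package name `tls-lib`, the hostnames and the version numbers are all hypothetical, and the version parsing assumes plain dotted integers, which real version strings often are not.

```python
def parse_version(text):
    """Turn a dotted version such as '3.0.12' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def hosts_needing_patch(fleet, package, fixed_version):
    """Return hostnames whose installed version of `package` is below fixed_version."""
    fixed = parse_version(fixed_version)
    return sorted(
        host
        for host, packages in fleet.items()
        if package in packages and parse_version(packages[package]) < fixed
    )


# Hypothetical fleet data: hostname -> {package name: installed version}.
fleet = {
    "web-01": {"tls-lib": "3.0.7"},
    "web-02": {"tls-lib": "3.0.12"},
    "db-01": {"tls-lib": "1.1.1"},
    "build-01": {},  # package not installed, so nothing to patch
}

print(hosts_needing_patch(fleet, "tls-lib", "3.0.8"))
```

The answer is only as good as the package data the fleet management system can gather, which is why it depends so heavily on an accurate inventory.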

Scalable Evidence Collection: When a company is warding off an external or an internal attacker, having a sane, tamper-proof evidence collection and storage mechanism is a lifesaver for a detection and response team. Evidence can be anything: for example a file, the output of a command, or specific data from the system such as a memory dump or a copy of the disk image. Evidence can also serve as an indicator of compromise. At scale, the challenge with evidence collection arises because your hosts can be in different states such as the following:

  • Offline
  • Stuck during the boot process
  • In the process of rebooting
  • Installing system updates before allowing the user to log in

The evidence collection mechanism should be able to handle this by issuing commands that gather artifacts from systems even when they are not fully available (due to any of the above states). Needless to say, since the collection is done remotely, end-to-end encryption would be ideal. A fantastic piece of software that does this is GRR and, best of all, it is open source. When hosts are unavailable, it can poll them and collect the desired artifacts when a machine comes back online. I highly recommend checking it out.
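The idea of holding a request until a host becomes reachable can be sketched in a few lines. This is a toy illustration only: it is not GRR's API, and the class name, method names and artifact labels are hypothetical.

```python
from collections import defaultdict, deque


class CollectionQueue:
    """Holds evidence requests until the target host next checks in."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def request(self, hostname, artifact):
        """Queue an artifact to collect from a host that may be offline."""
        self._pending[hostname].append(artifact)

    def check_in(self, hostname):
        """Called when a host comes online; returns and clears its queued requests."""
        return list(self._pending.pop(hostname, deque()))

    def pending_hosts(self):
        """Hosts that still have uncollected requests."""
        return sorted(self._pending)


queue = CollectionQueue()
queue.request("laptop-001", "memory_dump")
queue.request("laptop-001", "/var/log/auth.log")
queue.request("server-042", "disk_image")

# laptop-001 finishes rebooting and checks in; server-042 is still offline.
delivered = queue.check_in("laptop-001")
print(delivered)
print(queue.pending_hosts())
```

A real system would also need durable storage for the queue, authentication of the host that checks in, and encrypted transport for the collected artifacts.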

Speed vs storage: This is more of a meta point, but there will be challenges involving the decision between a faster solution that takes up more space (in memory or on disk) and a slower solution with a smaller storage footprint. This is a typical software engineering challenge; however, I am including it here because it is extremely common when writing your own security software that handles petabyte-scale data. It is ideal to foresee these key decision points BEFORE writing software and to brainstorm them. The ideal place to record these decisions is the design document. Once the decisions are well thought out, documented and agreed upon, building and maintaining software becomes much more effective.
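A tiny illustration of the trade-off, assuming the task is checking whether a value appears in a collection of indicators. The indicator strings are generated placeholders, and the sizes printed cover only the container itself, not the strings it holds.

```python
import sys

# Hypothetical indicator values, generated only for this illustration.
indicators = [f"hash-{i:06d}" for i in range(100_000)]

as_list = list(indicators)  # compact container; a membership test scans the elements
as_set = set(indicators)    # larger hash table; a membership test is a hash lookup


def in_list(value):
    """Slower lookup with the smaller container."""
    return value in as_list


def in_set(value):
    """Faster lookup with the larger container."""
    return value in as_set


probe = "hash-099999"
print("found via list:", in_list(probe))
print("found via set:", in_set(probe))
print("list container bytes:", sys.getsizeof(as_list))
print("set container bytes:", sys.getsizeof(as_set))
```

Both lookups return the same answer; the set buys speed with extra memory. At petabyte scale the same choice shows up as indexes, caches and precomputed tables, which is why it is worth settling in the design document.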

Risk management: This is a high-level point, as everything in security revolves around managing risk; however, the important thing to note is that not all risk can be mitigated, especially when dealing with scale. This is where leadership and technical expertise come together to decide which risks are accepted and which ones the team should address through prevention, detection or transfer. For example, a company might deploy a patch for a vulnerability (prevention), have its security team detect and respond to an attack when it happens (detection), outsource the risk to a vendor (transfer) or accept the risk by making no change. This example is a very simplistic case; risk management decisions are business decisions, and different priorities and resources can result in different choices being made. At scale, these challenges just become more complex with competing needs and priorities.

Maintainable and readable code: This might sound like a generic best practice, but it matters even more when a company writes and owns its internal software. When code is complex and hard to maintain, either engineers spend a lot of time on it (i.e. engineering resources) or it goes unmaintained and turns into technical debt. Neither situation is ideal. When the size of your code base increases dramatically, these simple best practices become key.

Compliance: Given the large swath of country (or even state) specific laws that exist today for users’ security and privacy, it is a challenge for the security compliance team to effectively make changes to products and services to keep the company in compliance. At scale, tackling this challenge can involve making both technical changes (e.g. data retention policies) and non-technical changes (e.g. advance notice to users, seeking consent, etc.). The second aspect of compliance is ensuring and proving that the company is compliant at all times. If you think about it, making a technical change may involve altering software on a certain set of hosts, which is much easier to do if the existing code is easy to maintain and we have good inventory and fleet management to know where to deploy the changes. In this example, everything comes together in a way.
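As one example of a technical compliance change, a data retention policy can be expressed as a simple check over stored records. The categories, retention periods and records below are hypothetical and not drawn from any specific law or company policy.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category.
RETENTION = {
    "access_logs": timedelta(days=90),
    "support_tickets": timedelta(days=365),
}


def expired_records(records, today):
    """Return ids of records kept longer than their category's retention period."""
    expired = []
    for record_id, category, created_on in records:
        limit = RETENTION.get(category)
        # Categories without a defined policy are left alone in this sketch.
        if limit is not None and today - created_on > limit:
            expired.append(record_id)
    return expired


records = [
    ("r1", "access_logs", date(2024, 1, 1)),
    ("r2", "access_logs", date(2024, 5, 1)),
    ("r3", "support_tickets", date(2023, 1, 15)),
]

print(expired_records(records, today=date(2024, 6, 1)))
```

Running a check like this on a schedule, and keeping its output, also helps with the second aspect: showing that the policy is actually being enforced.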

Employee awareness and training: It is commonly said that security is everyone’s responsibility and this is most definitely true at scale. We can have the best defense systems and security team but all it takes sometimes is one click in a phishing email to do some serious damage. It is critical that employees are well educated to follow the best practices and adhere to internal security policies to ensure distributed protection.

Security Engineering Soft Skills: Communication

Full Disclosure: Karan and Raaghav are security engineers at Google and Dropbox respectively and the following are their own opinions.

The previous posts discussed the technical aspects of the security engineer interview. In this post, I collaborated with my close friend Raaghav to discuss different aspects of an important soft skill – communication.

It is a well-known fact that communication is a key skill for professionals in any domain, and not just within security. Furthermore, with the current pandemic situation and the shift to working from home, there has been a significant increase in the usage of specific modes of communication such as chat and video calls, which brings along a unique set of challenges. It is important to evolve your communication style to adapt to these changes. This served as an impetus for us to revisit the key aspects of effective communication.

Identifying Stakeholders

The first step is to identify who you have to communicate with and as easy as that sounds, it is important enough to state as a major area. This is best explained with some examples:

  • You are handling a security incident and you need to communicate its status to relevant parties. Generally, the team/points of contact responsible for fixing the root cause of the incident are obvious stakeholders. However, it works in your favor to also think about others who potentially need to be involved or are affected by the incident, e.g. executives who will make decisions based on the outcomes of the incident, the relevant legal teams if it concerns them, public relations, customer support, etc.
  • You have to communicate your project updates. Obvious stakeholders include your team members and your management. Some other non-obvious stakeholders to think about may be extended teams who are potential customers of the project and will benefit from the updates, executives who are sponsors of the project, etc.

Choosing the mode of communication

Once you have identified the stakeholders, the next thing to decide is the mode of communication. Here are some well-known modes:

  • Chat
  • Emails
  • Video Call/Conference meetings
  • Documents, Spreadsheets etc.

The key takeaway for the mode of communication is CHOOSING the right one. This depends on the situation. For example, if you have an urgent decision to be made that involves multiple stakeholders, relying solely on an asynchronous mode of communication, like email, may not be the best idea. When urgency is involved, relying on a more real-time communication mode (and potentially a combination) may work in your favor. So in this case, it would be better to set up a group chat to get every stakeholder’s attention and then set up a follow-up meeting (if needed) to get the decision made on time.

Conversely, if you require stakeholder inputs on a project proposal on a non-urgent issue, writing up a document that informs them of the proposal and sharing it with them ahead of time (or a follow-up meeting to discuss it) would work better.

Creating Content

Now that you have identified your stakeholders and the mode of communication, let’s focus on the content. The content being communicated varies from situation to situation. Having said that, the following guiding principles are key for effectiveness:

  • Transparency: Ensure that the full knowledge, visibility, and context of the situation is evident in your content. The lack of context raises questions in the audience’s minds and can create confusion. In a remote situation, it is very easy for people to feel siloed, and adequate transparency is a way to eliminate the disconnectedness. Long term, a lack of transparency can negatively impact the morale of your team and will further disconnect them from the mission.
  • Clarity: Ensure that the intent of the message is clear and any asks are well spelled out. In a remote situation, people no longer have the opportunity to simply swing by someone’s desk or talk in person to get more clarity. So it is absolutely important to invest time to be extremely clear about what you are trying to communicate and what specific ask you might have of the recipient.
  • Brevity vs Verbosity: Ensure that the content has the right balance of conciseness and verbosity. In some situations, it is important to convey the point quickly and succinctly whereas in others it helps to be elaborate to explain the issue at hand.

Following up

Once you have communicated, it is important to decide if it requires a follow-up. Most communications are one-off; however, there will be several instances that require multiple follow-ups. The key aspect is to decide when to follow up and how often. Defining these two parameters ensures that you can achieve your deliverables on time and also adapt in the event of unplanned work or delay. Communicating early and often is key.

Overall, everyone is putting in a lot of effort to adapt to the remote-first work environment and as such, you should strive to demonstrate empathy at every step of your communication while keeping the discussion above in mind.

Go forth and communicate!

Installing python scapy on MAC OSX Sierra

Python ships by default with OSX and resides in the path /usr/bin/python. Having run into a lot of problems with installing and using scapy and other packages with the correct version of python, pip and homebrew in the mix, I decided to write steps that anyone can follow to set up a clean and working python environment in OSX without running it in a virtual environment or fiddling with the default python installation.

Step 0: Check which python is currently in use with the command `which python`. This should yield /usr/bin/python:
$ which python
/usr/bin/python

Step 1: Install homebrew
$ /usr/bin/ruby -e “$(curl -fsSL”

Step 2: Install python and pip using brew and pypcap using pip:
brew install python
pip install pypcap

Step 3: Set up your PATH variable in ~/.bashrc so that the python installed by the above command in /usr/local/bin is used:
export PATH="/usr/local/bin:$PATH"
(add the above line at the end of your ~/.bashrc or ~/.zshrc, whichever shell you are using)

Step 4: Force updates to PATH variable by using source:
$ source ~/.bashrc

Step 5: At this point, if everything worked correctly, the output of `which python` and `which pip` should be /usr/local/bin/python and /usr/local/bin/pip respectively
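The same check can be done from inside python, which also confirms which interpreter actually runs your scripts. This is a small sketch using only the standard library; the function name is made up for illustration, and on some systems `python` or `pip` may not be on the PATH at all, in which case `None` is reported.

```python
import shutil
import sys


def describe_python_environment():
    """Report the running interpreter and what `python` and `pip` resolve to on PATH."""
    return {
        "interpreter": sys.executable,
        "python_on_path": shutil.which("python"),
        "pip_on_path": shutil.which("pip"),
    }


for name, path in describe_python_environment().items():
    print(f"{name}: {path}")
```

If the interpreter path and the `python` found on the PATH disagree, packages installed with pip may end up in a different installation than the one running scapy.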

Step 6: Install scapy
$ pip install scapy

Step 7: Run scapy
$ scapy
If you get errors like “INFO: Can’t import packageXYZ. …”, see step 8 below to fix dependencies.

Step 8: Fix Dependencies
i) pyx
(If using python 2.7): $ pip install pyx==0.12.1 -I --no-cache-dir
(If using python 3): $ pip install pyx

ii) pycrypto
$ pip install pycrypto

iii) ecdsa lib
$ pip install ecdsa
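To see which of these dependencies are still missing before launching scapy again, a quick check with the standard library can help. This is a sketch: the function name is made up, and the import names listed (`pyx` for pyx, `Crypto` for pycrypto, `ecdsa` for ecdsa) are the top-level modules those packages provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for scapy and the optional dependencies from step 8.
print(missing_modules(["scapy", "pyx", "Crypto", "ecdsa"]))
```

An empty list means every module was found by the interpreter that ran the check, so make sure it is the same interpreter that scapy uses.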

Step 9: Test scapy again
$ scapy
Welcome to Scapy (2.3.3)

The above shows a successful run of scapy with all dependencies correctly installed.

Wordlocks, really?

Recently one of my friends purchased a bike from Walmart. At work, he was showing me the new beauty when I noticed that it was locked using a wordlock. “Wordlocks” are fancy locks in which the combination to unlock is formed by letters instead of traditional numbers or a lock/key mechanism. They are designed to be “easy to remember” for the bike rider. As a sample, see the image of a wordlock below

(Image courtesy: wikimedia)


These cost anywhere from $9 to as high as $20. These locks advertise a total of 10,000 possible combinations. In terms of their “security rating” which is between 1 and 5, 5 being the safest lock, they go up to level 2.

Curious to ride my friend’s new bike, I tried my luck at guessing his word combo. Smiling at my vain efforts, he gave me 24 hours to get the right password (24 hours is waaayyyy too much, but who says no to extra time?). I went back home and picked up a sample wordlock that had 4 dials with 10 “random” letters on each, making a total of 10^4 = 10,000 possible combinations. A quick python script that tries out all combinations and checks whether they are words in the US English dictionary using the enchant library is as follows:

# Python 3 version of the script
import itertools
import sys

import enchant  # PyEnchant spell-checking library

# Read the letters printed on each of the four dials.
dials = [
    input("Enter characters on first dial:"),
    input("Enter characters on second dial:"),
    input("Enter characters on third dial:"),
    input("Enter characters on fourth dial:"),
]

# Every dial must offer the same number of letters.
if len({len(dial) for dial in dials}) != 1:
    sys.exit("dials should have same number of characters")

d = enchant.Dict("en_US")
count = 0

# Try every combination of one letter per dial and keep the dictionary words.
for letters in itertools.product(*dials):
    tempstr = "".join(letters)
    if d.check(tempstr):
        count += 1
        print(tempstr)
print("Total combinations:" + str(count))

I ran this script with the characters on a sample wordlock as shown below


This gives only 836 possible combinations. To try the combinations in reality, I set the first letter to a fixed character, say ‘l’, the second to ‘e’, the third to ‘e’, and then manually go over every dictionary word that my program spit out. Let’s assume I take about 3 seconds to try each valid combination; that yields 836*3 seconds = 2508 seconds / 60 = 41.8 minutes (say 45 minutes approx). We know that brute force is not the smart way of breaking things but hey, it just works here. The whole idea of “wordlocks” making it easier to remember combinations just reduced the security (the number of combinations an attacker has to try) by (1 – 836/10,000)*100 ≈ 91.6%.
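The arithmetic is easy to re-run for other locks or other guessing speeds. The variable names below are just for illustration; the numbers are the ones from this post.

```python
total_combinations = 10 ** 4   # 4 dials with 10 letters each
dictionary_words = 836         # combinations that are real words
seconds_per_try = 3            # assumed time to try one combination by hand

worst_case_minutes = dictionary_words * seconds_per_try / 60
reduction_percent = (1 - dictionary_words / total_combinations) * 100

print(f"worst case: {worst_case_minutes:.1f} minutes")
print(f"search space reduced by {reduction_percent:.1f}%")
```

Changing `seconds_per_try` shows how quickly the worst case shrinks for a faster pair of hands.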

I do not feel safe leaving my bike anywhere for more than 45 minutes with a wordlock. As they say in information security, you are only as secure as the weakest link in the security chain, and wordlocks are “literally” the weakest chain in bike security. Pwned.