SRE Organization, the Zumba at Guernsey


Context

PeopleDoc is a fast-growing company. We now have an innovation lab and a good number of feature teams. The number of services running in production is growing, and so are the dependencies between services and projects.

The mission of the SRE team is to make sure our projects meet their SLA.

The SRE team has been growing for a couple of months, reaching 5 members as of today, 18th April 2018, and is still hiring.

To succeed in this critical mission, the SRE team has a central role at PeopleDoc, communicating with support, ops, the feature teams…

In this article, we won't talk about the technical aspects of the day-to-day job of an SRE. Instead, we focus on the organization we have created at PeopleDoc, which makes the SRE team so efficient even with most of our members working remotely.

We call this organization the "Zumba at Guernsey". If you are curious about the name, read this article carefully to learn its origin.

Break the legacy to build your own organization

The main strength of our organization has been to observe our weak points, change accordingly, and never carry too much legacy. We could call this continuous organization change, aimed at removing organizational debt.

The SRE team was created more than a year ago with a simple organization, following most of the rules of the feature teams and inspired by the Spotify Kanban organization.

After months of organizational changes, we want to share our current organization, which works very well.

Our organization processes fall into one of two categories:

  • Synchronization inside the team
  • Synchronization with the other teams

Synchronization inside the team

The ambiance and atmosphere of your day-to-day work are very important. If you feel comfortable and rewarded, you will be more efficient. Not only will you most likely want to stay at the company, you will also get involved by contributing back to its processes.

A significant amount of time is spent synchronizing the work of each team member. As a result, this is the most important facet of the organization to pay attention to.

As soon as the SRE team was created at PeopleDoc, we created our documentation. All of the following points in this article relate to the documentation, directly or indirectly. And since we are talking about organizing ourselves, the documentation itself needs a structure. We follow the one explained in a talk given by Daniele Procida, which describes four categories of documentation:

  • Reference (information-oriented)
  • Discussions (understanding-oriented)
  • How-to guides (problem-oriented)
  • Tutorials (learning-oriented)

Now let's dive into the different meetings we attend.

We start the week with a ticket triage meeting, followed by a kick-off. The triage meeting aims to review forgotten tickets, assign them, and update tickets that are blocked by an external party. The kick-off meeting is there to prioritize the work and balance the load across the team for one "sprint", which is one week long. One of us shares a screen as a visual support during these two meetings. On Monday afternoon, if some change proposals have not been reviewed yet, we stop our work and review everything. This cleans up or unblocks our colleagues' work in progress.

Every day of the week, we do a quick 15-minute stand-up, explaining our work in progress, what we have done, and what we'll be working on.

To finish the week, we hold a retrospective at the end of Friday. The retrospective has a unique format. We take all the tickets completed during the week and build a newsletter from them, which we send to the R&D. This is a good communication tool to keep the R&D informed of our work. We also do a round table to collect feedback and gauge the mood of each team member.

To build fellowship among the members of the SRE team, we added other meetings and communication media.

We have a weekly video conference meeting, somewhat similar to a virtual open space, where we talk about anything except ongoing work. This builds a genuine bond between everybody. It solves the problem of not knowing your colleagues' personalities, and consequently not knowing how to communicate with them.

We created a private Slack channel and a private mailing list to communicate privately among SRE team members. They are a very good medium for getting honest reactions and feedback from each member of the team.

These last two points create a collaborative environment.

To keep creativity and motivation among team members at their highest, we introduced different meetings on Fridays. As you know, Fridays are a no-go for deployments and infrastructure changes, so we took advantage of that to hold either a POC Friday or a Sharing Knowledge Session Friday. POC Friday is about testing new technologies, then sharing and discussing your discoveries with your teammates. A Sharing Knowledge Session is about sharing knowledge that is not yet in the documentation, to avoid the SPOF of a single member holding that knowledge in his or her head.

All of these meetings and processes work very well in our team of 5 SREs, mostly remote and very heterogeneous. As we say in French, "C'est une affaire qui roule" (things are running smoothly).

Synchronization with the other teams

Now, let's look at our SRE team in the context of the R&D and PeopleDoc, always keeping in mind the team's mission, which is to meet our SLA in terms of the availability and performance of our products.

Let's start with the same point as the previous chapter: the documentation. The SRE documentation is also a reference for all the other teams. It answers questions such as "how does the infrastructure work?". It also gives the other teams plenty of information, so that they do not need to interrupt us during working hours. If they cannot find the information, or have a request, we provide GitHub request templates to communicate with the other teams almost instantly. If they really need instant communication to synchronize, we have a Slack channel. As you can see, our documentation is key, and we put a lot of effort into maintaining it. The energy put into the documentation is the most valuable of all.

The documentation also contains post-mortems and meeting reports. Post-mortems are reports of a downtime or production incident. The team shares each post-mortem with the whole company, which can then decide whether or not to communicate about it to clients. Even though incidents are never fun to handle, writing post-mortem reports is a very professional process that helps us get better.

As a software engineer, you know that naming matters a lot when it comes to remembering things. To remember our post-mortems, we give each of them a nickname made of a sport name followed by a country. I ran our generator to give our organization a nickname: "Zumba at Guernsey".
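The naming scheme is simple enough to sketch in a few lines of Python. This is only an illustration, not our actual generator: the word lists, the `postmortem_name` function, and the "at" separator (taken from the "Zumba at Guernsey" example) are assumptions made for the example.

```python
import random

# Hypothetical word lists, for illustration only.
SPORTS = ["Zumba", "Curling", "Judo", "Polo", "Rugby"]
PLACES = ["Guernsey", "Peru", "Iceland", "Kenya", "Japan"]


def postmortem_name(rng=random):
    """Return a nickname of the form '<Sport> at <Place>'.

    `rng` can be the `random` module or a seeded `random.Random`
    instance when reproducible names are needed.
    """
    return f"{rng.choice(SPORTS)} at {rng.choice(PLACES)}"


if __name__ == "__main__":
    print(postmortem_name())
```

With lists this short, nicknames will collide quickly; a real generator would need longer lists or a check against names already used.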

The post-mortems are then used once a month to summarize our incidents and identify possible improvements to our infrastructure.

The R&D at PeopleDoc is made of multiple support teams (Data, Security, SRE, Software Engineering Test) and feature teams. We synchronize the support teams through a one-hour bi-weekly meeting to present our ongoing work, and a monthly meeting with representatives of the different teams.

As a personal note, I have noticed that meetings should last either 20 to 30 minutes or 2 hours. The one-hour format is, most of the time, a waste of time. Sam Altman has made the same observation.

As in the previous chapter, we also set up meetings to create fellowship among R&D members, maintain motivation, and keep creativity at its highest.

The first meeting to synchronize all of our R&D teams (70 members as of today, 18th April 2018) is an R&D Release Planning Day. We gather all the R&D members in our Paris offices and agree on a roadmap for the next quarter. I'm not an expert on the "agile" methodology, so I prefer not to explain this meeting any further.

The second meeting to synchronize our teams is the "Hub Meetup". Remote workers gather by location and work together for one day. This is very interesting, as you work with members of different teams and have time to chat about different projects inside the company. The last West France Hub gathered people from data, SRE, and feature teams.

Last but not least, to synchronize the whole company, we have an off-site once a year, over three days. Every team of the company gives presentations during these days. This is my favorite one for understanding clients' needs and getting a global perspective on the company you work in. Our next off-site is very soon, and I have a plane to catch to meet my colleagues! So let's wrap up this article.

Thank you for reading carefully. We hope this article gives you some ideas for changing your organization according to your own problems, in order to be more efficient and keep a good working atmosphere in your company. We are also very eager to read your feedback so we can improve our processes. Let's do continuous organization change everywhere. And have fun at work.

— Florian G., SRE @PeopleDoc

