Creating a Sigma Backend for Fun (and no Profit)
A few months ago I decided to check whether there was a Sigma backend for InsightIDR, the cloud-based SIEM from Rapid7. Imagine my shock (😱) when I discovered that none existed! There are Sigma backends for more than a dozen SIEM platforms, including plenty I'd never heard of, but sadly none that would help me. I decided to investigate whether I could create one myself. I had a strong handle on Log Entry Query Language (LEQL), the query language used by InsightIDR, and a background in Python, the language Sigma is written in. I had also read that Sigma had a brand-new code base, pySigma, that the co-creators wanted to move towards.
What follows is a description of the process of writing the InsightIDR backend for pySigma. I'll start with a brief primer on Sigma for those who are new to it. Next, I'll talk about the skills required to create the backend, the level of effort, how I overcame some challenges along the way, and what the experience was like. This isn't a detailed set of step-by-step instructions for writing a pySigma backend, but I'll try to include some of the helpful tips and hints I received during the project, as well as some I wish I had.
What is Sigma?
Since starting work in security a little over a year ago, one of the most interesting and useful open source tools I've encountered is Sigma, a generic signature format for detecting bad stuff (intrusions, misconfigurations, anomalies, malware, etc.) in your network and endpoints. Security folks can create Sigma rules and convert them into SIEM-specific query languages for analysis, alerts, threat hunting, dashboards, and all the other fun things our SIEM platforms let us do. It's an ambitious project with an interesting backstory, and I encourage anyone interested to check out this interview with Florian Roth, one of Sigma's co-creators: https://socprime.com/blog/interview-with-developer-florian-roth/
For anyone unfamiliar with Sigma, the graphics and descriptions on the Sigma GitHub page do a great job of explaining what it does and how it works at a high level. Basically, it is a Swiss army knife for converting detection logic from the mind of the analyst into whatever platform a defender wants to use. Sigma performs these conversions using backends, which contain all of the translation logic and functionality needed to go from the generic Sigma format to a platform-specific one. SOC Prime, a company that produces premium Sigma-based detections, has an excellent timeline depicting the evolution of the Sigma project, concluding with the initial release of pySigma in September 2021:
https://my.socprime.com/sigma/
Prerequisites
As I mentioned previously, I've worked as a security analyst for a little over a year. So if you worry that years of experience are a prerequisite, rest assured that they are not! You will need good familiarity with different types of security event logs, detection logic, and an understanding of how analysts write queries in a SIEM. Something like Boss of the SOC would be a great place to start. I also recommend reading The DFIR Report to gain an understanding of how intrusion investigations can uncover patterned behavior among attackers that can be analyzed and identified using signatures. Many DFIR Report posts also include links to relevant Sigma rules!
Speaking of Sigma rules, you'll need to understand their basic structure, although it's not necessary to be an expert. I had minimal exposure to the Sigma rule specification when I set out to create the backend; by the end of the task I had developed a stronger (but still not perfect) understanding. I've written simple Sigma rules before, but only in a tutorial context (I am not a badass detection engineer or anything like that).
I consider myself a "good" Python programmer, based on the fact that I can usually accomplish what I set out to do with the language. I do not write perfect code, and I am not an expert in object-oriented programming. I have been writing Python professionally for about ten years (I worked in GIS prior to security, where Python is the most popular language for creating custom tools and analysis workflows). This was very helpful for getting the backend conversion logic correct, which I'll go into a bit later.
In terms of time, the whole process took me about four weeks of intense work, plus a good amount of extra time spent researching, brushing up on my Python skills in certain areas, and corresponding with knowledgeable professionals who helped me along the way.
Lastly, you will need to be confident in the platform for which you're writing the backend. Query syntax, logical operators, and how the platform handles case sensitivity, null values, keyword searches, and things like regular expressions are all key to a successful backend. So get comfortable with the SIEM platform before diving into writing a Sigma backend for it!
Getting Started
To begin, I wanted to understand how the different components of pySigma fit together, and how the core pySigma codebase interacts with platform-specific backends. I read the pySigma documentation, then forked the pySigma project and explored the various modules it contains. I also downloaded the existing backends and pipelines, including the backend and pipeline for Splunk and the pipelines for CrowdStrike and Sysmon.
At a high level, processing pipelines get Sigma rules ready for conversion, while backends perform the conversion itself. A big difference between pySigma and the legacy Sigma tool is that backends and pipelines in pySigma are maintained separately from the main repository. This decentralized approach to the management of backend code means that vendors or interested community members (hey, that's me!) can maintain their own pipeline and backend independently or in collaboration with the good people at SigmaHQ.
Putting Together the Pipeline
In my case, I had identified a subset of log sources that I wanted to be "in scope" for my InsightIDR backend. I focused on process start, DNS query, and web proxy log sources, since they were the most relevant for me and seemed like a manageable scope for the first step. I wanted to show the user a useful error message if they tried to convert a rule based on a log source other than these. I also wanted to inform the user if the rule contains a selection or filter condition using a field not present in the applicable log. Knowing that process start events in InsightIDR are based on Windows event ID 4688, I knew that some Sysmon-specific fields common throughout the body of Sigma rules would not be available to me. I also knew that the field names in InsightIDR differ quite a bit from the way they are written in the Sigma rule specification.
Below, I've included the ProcessingItem object definition that I used to perform field mapping from the Sigma rule specification to InsightIDR syntax:
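In outline, it looks like this (a sketch assuming current pySigma APIs; the InsightIDR field names shown are illustrative examples rather than a complete or authoritative mapping):

```python
from sigma.processing.pipeline import ProcessingItem
from sigma.processing.conditions import LogsourceCondition
from sigma.processing.transformations import FieldMappingTransformation

# Map generic Sigma field names to InsightIDR's names for process start
# events. Only rules with a matching logsource receive this transformation.
process_start_field_mapping = ProcessingItem(
    identifier="insight_idr_process_start_fieldmapping",
    transformation=FieldMappingTransformation({
        "Image": "process.exe_path",            # illustrative mapping
        "CommandLine": "process.cmd_line",      # illustrative mapping
        "ParentImage": "parent_process.exe_path",
    }),
    rule_conditions=[LogsourceCondition(category="process_creation")],
)
```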
For each Sigma rule processed by the pipeline, I also wanted to translate the logsource properties (category, service, and product) to values consistent with the way they are described in LEQL-based detection rules. So, where a Sigma rule might have a logsource definition of category: proxy, I wanted the pipeline to output a logsource property for the rule matching what is shown in the LEQL detection rule logic:
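A processing item along these lines can perform that logsource rewrite (again assuming current pySigma APIs; the target service value here is an illustrative placeholder, not InsightIDR's actual wording):

```python
from sigma.processing.pipeline import ProcessingItem
from sigma.processing.conditions import LogsourceCondition
from sigma.processing.transformations import ChangeLogsourceTransformation

# Rewrite the rule's logsource so it matches the way InsightIDR describes
# the source in its LEQL detection rule logic (placeholder value below).
proxy_logsource_item = ProcessingItem(
    identifier="insight_idr_proxy_logsource",
    transformation=ChangeLogsourceTransformation(service="web_proxy"),
    rule_conditions=[LogsourceCondition(category="proxy")],
)
```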
Whenever I work on a technical project, I try to gain a common-language understanding of what I am trying to do. So, to summarize, I wanted my pipeline to:
- Perform field mapping between the generic Sigma field names and InsightIDR field names
- Trigger error messages if unsupported fields (fields that don't show up in InsightIDR event logs) are present in the rule as selections/filters
- Change the rule's logsource properties to match InsightIDR detection rule logic
Finally, I needed the pipeline to trigger an error message for the user whenever they attempt to convert a rule for an unsupported logsource.
Pipelines in pySigma are essentially a list of sequentially applied processing items (like the one in the code block above) that apply to each Sigma rule being converted. These processing items need an identifier, a transformation (the change or action actually applied to the rule), and conditions used to determine which rules receive or trigger the transformation. There are a number of useful properties that can customize the behavior of each processing item to suit the author's needs. With this understanding, I was able to put together a set of three processing items for each of my logsources to accomplish the three items listed above.
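Stripped of pySigma's machinery, the sequential-application idea can be sketched in plain Python (the names here are illustrative, not the real ProcessingPipeline API):

```python
# Plain-Python sketch of sequential pipeline application. pySigma's
# ProcessingPipeline does this with real transformation/condition objects;
# here each item is just a (identifier, condition, transformation) tuple.
def apply_pipeline(rule, items):
    applied = []
    for identifier, condition, transformation in items:
        if condition(rule):           # does this item apply to the rule?
            transformation(rule)      # mutate the rule in place
            applied.append(identifier)
    return applied                    # record which items fired

# Toy rule and a single field-mapping item for proxy logs
rule = {"logsource": {"category": "proxy"}, "fields": {"c-uri": "/index.html"}}
items = [(
    "proxy_field_mapping",
    lambda r: r["logsource"]["category"] == "proxy",
    lambda r: r["fields"].update({"url": r["fields"].pop("c-uri")}),
)]
```

Tracking which items fired matters later: the final "unsupported rule" check is built on exactly that record.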
The most interesting challenge was the final pipeline requirement: triggering an error for unsupported rule types. Because there are dozens of different Sigma rule types and logsource definitions, I initially enumerated these by parsing the logsource properties of the entire directory of Sigma rules, then added a processing item with an individual RuleFailureTransformation for each combination of unsupported logsource properties. Not only was this time-consuming, it was also inefficient and cumbersome, because each time a new Sigma rule was added with a novel category, service, or product property, my pipeline would instantly be out of date!
The solution was to create a single processing item, applied last in the pipeline, that used a group of RuleProcessingItemAppliedCondition conditions to track which processing items had been applied thus far. By setting the processing item properties rule_condition_linking=any and rule_condition_negation=True, I was able to identify rules to which none of my previous logsource-based processing items had been applied. Basically, this flags any rule that my logsource-specific processing items missed, because it didn't meet any of the applied conditions. This way I can trigger the "rule unsupported" error without enumerating every single logsource type, which would have left my pipeline cluttered and unwieldy.
Building the Backend
With the pipeline in place, it was time to develop the InsightIDR backend that would perform the actual conversion of Sigma detection items into the components of a complete, properly formatted LEQL query. The starting point for backends that produce text-based queries (which I imagine is most, if not all, backends) in pySigma is the TextQueryBackend class found in the base.py module. This class contains properties that define, at a granular level, how queries are formed in the target language. For instance, in Splunk's Search Processing Language there is an implied "and" operator between expressions, so a valid Splunk query might be field1=foo field2=bar, meaning: show me log entries where field1 equals "foo" and field2 equals "bar". The "and token" in the pySigma backend for Splunk is therefore just a space: and_token: ClassVar[str] = " ". In InsightIDR, the and operator is represented by an explicit AND, which is expressed in the backend's class variables.
The existing properties and functionality of the TextQueryBackend class can be borrowed and customized by your new backend class using class inheritance: declare TextQueryBackend as the base class of your new backend class, then override anything you want. For me, this looked like the following:
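A minimal sketch of the idea (using a stand-in base class so the snippet runs on its own; the real TextQueryBackend lives in sigma.conversion.base and defines many more class variables):

```python
from typing import ClassVar

class TextQueryBackend:                  # stand-in for sigma.conversion.base.TextQueryBackend
    and_token: ClassVar[str] = " "       # e.g. Splunk's implicit "and"
    or_token: ClassVar[str] = "OR"
    eq_token: ClassVar[str] = "="        # token between a field and its value

class InsightIDRBackend(TextQueryBackend):
    """Inherit everything, then override only what LEQL does differently."""
    name: ClassVar[str] = "InsightIDR LEQL backend"
    and_token: ClassVar[str] = "AND"     # LEQL spells out its boolean operators
    or_token: ClassVar[str] = "OR"
```

Anything not overridden (like eq_token above) is simply inherited from the base class.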
The real challenge of writing the backend came from what I call the "group-based" comparison operators used in LEQL. For instance, selecting logs where a certain field contains one or more of a group of substrings would look like greeting ICONTAINS-ANY ["hello!", "hi there!", "hiya"]. This could be represented by a Sigma detection like:
```yaml
detection:
  selection:
    greeting|contains:
      - 'hello!'
      - 'hi there!'
      - 'hiya'
  condition: selection
```
Out of the box, the TextQueryBackend class would return a query looking something like greeting="*hello!*" OR greeting="*hi there!*" OR greeting="*hiya*". Wildcard characters (*) are not really a thing in LEQL, and I prefer queries that use grouping operators for readability, brevity, and performance. To implement this custom grouping behavior, I followed the advice of Sigma co-creator Thomas Patzke and overrode the convert_condition_or() method. What followed wasn't pretty, but in the end I got it working the way I wanted!
The key was understanding how Sigma detection logic comes out of the pipeline and is processed by the backend. Basically, the pipeline process breaks Sigma detections into condition objects with types like ConditionOR, ConditionAND, or ConditionFieldEqualsValueExpression. These conditions are then processed sequentially by the backend. By understanding the different properties of these conditions, I was able to create the branched logic and grouping of values that allowed for the fairly concise, readable queries I prefer to work with.
In the case of the ICONTAINS-ANY problem above, I first had to determine whether the participating arg properties of the OR condition contain actual log field values, like "hello!", "hi there!", and "hiya", as opposed to additional nested AND or OR conditions. With the ability to add multiple selections and set how they are evaluated in Sigma, the nesting options can get pretty complex. I then tested whether the fields operated on by the child arguments are all the same; basically, that all of the values apply to the field "greeting", rather than a mix like greeting = "hi there!" OR salutation = "hello!". With these conditions met, I could use the presence of wildcard characters (which are themselves an artifact of pipeline processing) to determine whether the output query should use ICONTAINS-ANY, ISTARTS-WITH-ANY, or an "ends with any" construction, which requires a regular expression in LEQL, because LEQL doesn't yet have an IENDS-WITH-ANY operator!
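The branching can be sketched in plain Python. This is an illustration of the decision logic, not the backend's actual convert_condition_or() code, and the regex fallback syntax shown is likewise illustrative:

```python
import re

def group_or_values(field, values):
    """Collapse a `field=*val*` OR-chain into one LEQL group operator.

    `values` carries the wildcard artifacts left by pipeline processing,
    e.g. "*hello!*" for a contains match or "hello!*" for starts-with.
    """
    stripped = [v.strip("*") for v in values]
    group = "[" + ", ".join(f'"{s}"' for s in stripped) + "]"
    if all(v.startswith("*") and v.endswith("*") for v in values):
        return f"{field} ICONTAINS-ANY {group}"
    if all(v.endswith("*") for v in values):
        return f"{field} ISTARTS-WITH-ANY {group}"
    if all(v.startswith("*") for v in values):
        # No IENDS-WITH-ANY in LEQL, so fall back to a regex alternation
        alternation = "|".join(re.escape(s) for s in stripped)
        return f"{field}=/.*({alternation})$/i"
    return " OR ".join(f'{field}="{v}"' for v in values)

# e.g. group_or_values("greeting", ["*hello!*", "*hi there!*", "*hiya*"])
#      -> 'greeting ICONTAINS-ANY ["hello!", "hi there!", "hiya"]'
```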
Of course, the above "grouping operator" challenge is just one example of the types of problems pySigma backend authors will have to overcome. But my hope is that by stepping through my thought process, other security specialists will have a slightly easier time building a backend and getting Sigma rules to generate queries to their liking. By the way, I tweaked and tested my code throughout using a test script that is essentially the same as the script in the Usage section of the GitHub repo for the InsightIDR backend. Having a simple, consistent, and repeatable test script pointed at my in-development source code was very helpful throughout the authoring process.
Wrapping it Up
Once the pipeline and backend provide the processing needed to convert conditions into the right query components, it is a good idea to think about any special output options your backend may require. By default, the TextQueryBackend class provides a finalize_query() method that outputs the various query components for the user. However, many SIEMs may require, or offer, different query formats for different uses. InsightIDR offers an "Advanced" LEQL query mode that many analysts (including myself) prefer, but the custom alert form requires queries in the Simple format.
pySigma offers backend authors this level of flexibility with fairly minimal effort. Just use the out-of-the-box finalize_query() method (optionally customized to your liking) for the simplest output type needed, and add additional finalize-query methods with names formatted like finalize_query_<option>() to implement the other types of output.
For the InsightIDR backend, I wanted a Simple output option, an Advanced output option, and an option that mimicked the rule logic listed in InsightIDR's built-in detection definitions. Besides the basic finalize_query() method, I added a finalize_query_leql_advanced_search() method and a finalize_query_leql_detection_definition() method to achieve the output formatting I wanted.
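As a sketch of the mechanics (using a stand-in base class, and assuming pySigma's finalize_query_<option>(self, rule, query, index, state) method signature), the Advanced-mode variant boils down to wrapping the finished query in LEQL's where():

```python
class _TextQueryBackendStandIn:
    """Stand-in for sigma.conversion.base.TextQueryBackend."""
    def finalize_query_default(self, rule, query, index, state):
        return query                     # Simple mode: the bare query string

class InsightIDRSketch(_TextQueryBackendStandIn):
    # Extra output format, selected by name via the conversion call
    def finalize_query_leql_advanced_search(self, rule, query, index, state):
        return f"where({query})"         # Advanced-mode LEQL wraps the query
```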
Conclusion
Creating a new backend for the pySigma codebase is a significant challenge. However, with patience, persistence, and a willingness to ask for help, it can be very rewarding and illuminating. I benefited tremendously from helpful advice and encouragement from Thomas Patzke, and would happily offer the same to anyone looking to create their own pySigma backend, so don't hesitate to reach out! I had never contributed to an open source security project prior to this, and it was overall an extremely rewarding and fun process, even if I now have quite a few more grey hairs to show for it!
As always, happy analyzing! 🧐