When to use a new index in Graylog (Elasticsearch)?
Is an index set in Graylog just the settings about the retention strategy?
Don't forget that an Index Set has a direct impact on the indices in the underlying ElasticSearch infrastructure. Take that into account, because ElasticSearch is all about indices and their shards (data distribution, replicas, ...).
Data types and fields matter too: you can't (or at least shouldn't) have the same field with mixed data types in the same Index Set. For example, if the field device exists as Integer because System1 uses a device number, but System2 requires the type Text for this field because its device identifier is a string, then you should either store everything as a string or create a separate Index Set, so that both data types and their respective benefits are kept under the same field name.
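To spot this kind of conflict before deciding on Index Sets, you could compare sample messages from each source. The following is only a rough sketch: `infer_type` is a simplified stand-in for ElasticSearch's dynamic mapping (not its real logic), and the sample messages are invented for illustration.

```python
from collections import defaultdict


def infer_type(value):
    """Rough, hypothetical approximation of a dynamically mapped type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "keyword"


def find_type_conflicts(messages_by_source):
    """Return {field: {type: [sources]}} for fields seen with more than one type."""
    seen = defaultdict(lambda: defaultdict(set))
    for source, messages in messages_by_source.items():
        for message in messages:
            for field, value in message.items():
                seen[field][infer_type(value)].add(source)
    return {
        field: {type_name: sorted(sources) for type_name, sources in types.items()}
        for field, types in seen.items()
        if len(types) > 1
    }


# Invented sample data: System1 sends a device number, System2 a device name.
samples = {
    "System1": [{"device": 42, "message": "link up"}],
    "System2": [{"device": "fw-edge-01", "message": "link down"}],
}

conflicts = find_type_conflicts(samples)
print(conflicts)
```

Any field that shows up in the result is a candidate for either normalizing to a string or moving one of the sources to its own Index Set.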
That's typically why you probably don't want to store Windows logs in the same Index Set as anything else (apply this to your use case; it may also be true for your ERP/WMS data sources): they can easily produce hundreds of different fields, and it's recommended to avoid exceeding the limit of 1000 fields per index.
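As a back-of-the-envelope check, you can count the distinct field names each candidate Index Set would end up with. This is a sketch under assumptions: the field names and counts are made up, and the real limit (the `index.mapping.total_fields.limit` setting, 1000 by default) also counts things such as object and multi-field mappings, which this ignores.

```python
FIELD_LIMIT = 1000  # ElasticSearch default for index.mapping.total_fields.limit


def fields_per_index_set(plan, fields_by_source):
    """Count distinct field names per index set for a {index_set: [sources]} plan."""
    counts = {}
    for index_set, sources in plan.items():
        fields = set()
        for source in sources:
            fields |= fields_by_source[source]
        counts[index_set] = len(fields)
    return counts


# Hypothetical field inventories; every source shares three common fields.
common = {"message", "source", "timestamp"}
fields_by_source = {
    "windows": common | {f"winlog_{i}" for i in range(900)},
    "linux": common | {f"linux_{i}" for i in range(150)},
    "firewall": common | {f"fw_{i}" for i in range(80)},
}

single_plan = {"everything": ["windows", "linux", "firewall"]}
split_plan = {"windows": ["windows"], "linux": ["linux"], "firewall": ["firewall"]}

single_counts = fields_per_index_set(single_plan, fields_by_source)
split_counts = fields_per_index_set(split_plan, fields_by_source)

for name, count in {**single_counts, **split_counts}.items():
    status = "over limit" if count > FIELD_LIMIT else "ok"
    print(f"{name}: {count} fields ({status})")
```

With these invented numbers, the single shared Index Set goes over the limit while each dedicated Index Set stays under it.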
So, no, it's not just about the retention strategy. As a starting point for your reflection, I recommend that you consider grouping the various data source types in their own Index Sets (an Index Set for Windows logs, another one for Linux servers, another one for the firewalls, for example), because it makes sense from a data type point of view.
What impact does it have if I have a lot or a few indices?
It depends on your ElasticSearch infrastructure, and "a lot" is undefined... Take a look at Sizing ElasticSearch and Size your shards. Keeping in mind what kind of queries you'll perform, and over which time range, may help you find the right balance between index size and the number of indices ElasticSearch will have to query to fulfil your request.
Unfortunately, there is no one-size-fits-all sharding strategy. A strategy that works in one environment may not scale in another. A good sharding strategy must account for your infrastructure, use case, and performance expectations.[...]
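To get a feel for what "a lot" means in your own setup, it can help to work out how many shards a given plan produces. The numbers below are purely hypothetical; the only fixed part is the arithmetic (retained indices x primary shards x (1 + replicas)).

```python
def shards_for(index_set):
    """Shards held by one index set: indices * primaries * (1 + replicas)."""
    return index_set["indices"] * index_set["shards"] * (1 + index_set["replicas"])


def total_shards(index_sets):
    """Total shards the cluster has to hold for all index sets."""
    return sum(shards_for(index_set) for index_set in index_sets.values())


# Hypothetical plan: "indices" is the maximum number of indices kept by retention.
plan = {
    "windows": {"indices": 20, "shards": 4, "replicas": 1},
    "linux": {"indices": 10, "shards": 2, "replicas": 1},
    "firewall": {"indices": 30, "shards": 1, "replicas": 0},
}

for name, index_set in plan.items():
    print(f"{name}: {shards_for(index_set)} shards")
print(f"total: {total_shards(plan)} shards")
```

Comparing that total against the number and size of your data nodes gives a first indication of whether the plan is reasonable for your cluster.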
How many index sets should I have for my use case?
A stream is configured with exactly one index set; you can't set multiple index sets for one stream. As for the other points, I already answered them above.
However, note that you can configure multiple Streams on the same Index Set. This is very useful if you want these streams to use the same underlying data and just want to restrict access to a subset of logs for certain users: you can route messages between various streams based on whatever conditions you want, and if those streams all share the same Index Set, the messages won't be duplicated.
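The effect on storage can be illustrated with a tiny model. It is not Graylog code, just a sketch that assumes a message is written once per distinct Index Set among the streams it is routed to; the stream and Index Set names are invented.

```python
def stored_copies(matched_streams, index_set_of):
    """Number of copies written: one per distinct index set among the matched streams."""
    return len({index_set_of[stream] for stream in matched_streams})


# A firewall message that matches both a general stream and a restricted one.
matched = ["all-firewalls", "dmz-team-only"]

# Both streams share one index set: the message is stored once.
shared_mapping = {"all-firewalls": "firewall", "dmz-team-only": "firewall"}

# The restricted stream has its own index set: the message is stored twice.
split_mapping = {"all-firewalls": "firewall", "dmz-team-only": "firewall-dmz"}

shared_copies = stored_copies(matched, shared_mapping)
split_copies = stored_copies(matched, split_mapping)
print(shared_copies, split_copies)
```

So using extra streams purely for access control costs no additional storage as long as they point at the same Index Set.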