
Setup centralised logging for services
Closed, Resolved (Public)

Description

Currently, access/audit and error logs for all of our services are stored locally, which makes exploring logs harder. Miraheze should aim for a centralised (probably ELK, though there are other options) logging stack providing role-based access control.

Rollout (last update: 2022-02-21 14:46 UTC) status:

  • bacula2.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @John to verify all bacula logs are sent to syslog instead of local files
  • cloud3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @Paladox to verify all proxmox logs are sent to syslog instead of local files
  • cloud4.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @Paladox to verify all proxmox logs are sent to syslog instead of local files
  • cloud5.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @Paladox to verify all proxmox logs are sent to syslog instead of local files
  • cp20.miraheze.org
  • cp21.miraheze.org
  • cp30.miraheze.org
  • cp31.miraheze.org
  • db101.miraheze.org: syslog daemon present and logging, MariaDB logging NOT DONE yet
  • db111.miraheze.org: syslog daemon present and logging, MariaDB logging NOT DONE yet
  • db112.miraheze.org: syslog daemon present and logging, MariaDB logging NOT DONE yet
  • gluster3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @Paladox to verify all gluster logs are sent to syslog instead of local files
  • gluster4.miraheze.org: syslog daemon present and remote logging (graylog) enabled, needs @Paladox to verify all gluster logs are sent to syslog instead of local files
  • graylog121.miraheze.org: syslog daemon present and remote logging (graylog) enabled, graylog internal logging not checked yet
  • ldap2.miraheze.org: syslog daemon present and remote logging (graylog) enabled, slapd logs seem to be fine
  • mail2.miraheze.org: syslog daemon present and remote logging (graylog) enabled, postfix/dovecot/roundcubemail logs are sent to syslog
  • mon2.miraheze.org: NOT done yet; needs a dependency check for icinga logs (i.e. are local logs needed for the icinga-miraheze IRC bot?)
  • mw8.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, php-fpm logging DONE
  • mw9.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, php-fpm logging DONE
  • mw10.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, php-fpm logging DONE
  • mw11.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, php-fpm logging DONE
  • ns1.miraheze.org: syslog daemon present and remote logging (graylog) enabled, GDNSD logging done, confirmed working
  • ns2.miraheze.org: syslog daemon present and remote logging (graylog) enabled, GDNSD logging done, confirmed working
  • phab2.miraheze.org: syslog daemon present and remote logging (graylog) enabled, phd logs go to syslog
  • puppet3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, probably still a lot of puppet daemons logging to local files (@Paladox )
  • rdb3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, redis logging done, confirmed working
  • rdb4.miraheze.org: syslog daemon present and remote logging (graylog) enabled, redis logging done, confirmed working
  • services3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, citoid/proton/restbase/electron logging DONE
  • services4.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, citoid/proton/restbase/electron logging DONE
  • test3.miraheze.org: syslog daemon present and remote logging (graylog) enabled, nginx logging DONE, php-fpm logging DONE

Event Timeline

Paladox triaged this task as Normal priority. Dec 31 2019, 16:02
Southparkfan renamed this task from Setup logstash and send the logs to it to Setup centralised logging for services. Jan 29 2020, 23:34
Southparkfan updated the task description.

Conversation from the staff channel is:

@Zppix brought up that the WMF uses logstash; @Southparkfan brought up graylog. Based on reading up, graylog takes less to install than the logstash setup.

The logstash setup is made for big data, whereas graylog is made for logs.

Here's one source: https://medium.com/@logicify/advantages-of-graylog-grafana-compared-to-elk-stack-a7c86d58bc2c

Some questions we can use for logstash vs graylog:

<+SPF|Cloud> what are the differences between solutions X and Y? what are the pros and cons of both? does either solution lack something we consider to be critical? what are the recommendations of people on internet in discussions comparing both solutions?

<+SPF|Cloud> is one solution more secure than the other (support for transit and at-rest encryption? security track record)? are there performance differences (requires less resources to do the same thing)? how easy is it to setup new log sources? is it to be expected that one of the solutions requires less maintenance than the other?

<+SPF|Cloud> this list is not exhaustive, you may find some questions to be irrelevant or missing (use your own judgement) - but they allow you to make a good comparison between these solutions

Note from staff channel: ELK stack vs graylog (though both use Elasticsearch).

I've deployed the new logging infrastructure to jobrunner[12] and mw[45].

Sent an email to the team notifying of the deployment to mw[45].

I've deployed the new logging infrastructure to mw[67]

@Paladox are you able to give a look over the ones that SPF has marked for you to review please?

mon1 marked as done. Icinga logs need to be local for the IRC bots; however, I set up icinga logs to go to graylog separately under T6798.

Quite a few actions are blocked on you.

Puppet-agent logs to syslog in addition to logging to a file.

See application_name:"puppet-agent"

Also cron seems to log application_name:"CRON".

So we only need to parse puppetdb/puppetserver files.

We can do https://puppet.com/docs/puppet/7.4/server/config_logging_advanced.html for puppetserver (including its access logs). We can also probably do the same for puppetdb (as it also uses logback).

We switched off syslog-ng logging on the cloud servers. Not sure if we want to switch it back on @John @Southparkfan ?

So I've created and merged this pull request: https://github.com/miraheze/puppet/pull/1695. Essentially, logs for puppetserver/puppetdb are now read and sent to graylog.

I've created 3 new streams:

https://graylog.miraheze.org/streams/6046c9f7259fd27d6737dc89/search

https://graylog.miraheze.org/streams/6046c9c4259fd27d6737dc3d/search

https://graylog.miraheze.org/streams/6046c9de259fd27d6737dc63/search

But we have an issue now. Syslog-ng seems to have an issue when the file is rotated. So we have some choices:

  • We have a cron which restarts syslog-ng at midnight.
  • We use rsyslog, which doesn't appear to have this issue?
  • We don't rotate the logs on the puppetserver/puppetdb side, but then we'll likely have this issue with other services when we try to load their logs (if they rotate). Gluster rotates its logs, and that cannot be disabled.

@Southparkfan

I was reading up on this and it says you have to restart: https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/86. Granted, that's if you are using the software's built-in log rotation, but it seems similar to the issue I'm having, where I have to restart it to get it to read logs.

We switched off syslog-ng logging on the cloud servers. Not sure if we want to switch it back on @John @Southparkfan ?

Yes, let's see if we can receive proxmox logs without further tweaking.

@Southparkfan

I was reading up on this and it says you have to restart: https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/86. Granted, that's if you are using the software's built-in log rotation, but it seems similar to the issue I'm having, where I have to restart it to get it to read logs.

You can invoke postrotate scripts in logrotate. For example, for ufw.log we do the following on puppet3:

{
        rotate 4
        weekly
        missingok
        notifempty
        compress
        delaycompress
        sharedscripts
        postrotate
                invoke-rc.d rsyslog rotate >/dev/null 2>&1 || true
        endscript
}

After each file rotation, a SIGHUP signal is sent to rsyslog in order to reopen the log file; otherwise rsyslog won't deal nicely with the rotation. In the past, we have had issues with mariadb keeping old slow log file descriptors open, so deleted files were still 'present' and their disk space was still reported as used.
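To illustrate why the reopen matters, here is a minimal Python sketch (not anything we run in production; the file names are made up for the demo and it assumes a POSIX filesystem). It mimics a rename-based rotation and shows that a process holding the old descriptor keeps writing to the rotated file until it reopens the path, which is what the SIGHUP triggers in the daemon.

```python
import os
import tempfile


def demo_rotation(directory):
    """Show where writes land before and after reopening a rotated log."""
    log_path = os.path.join(directory, "app.log")  # hypothetical log file
    rotated_path = log_path + ".1"

    handle = open(log_path, "a")
    handle.write("before rotation\n")
    handle.flush()

    # A rename-based rotation: the open descriptor follows the inode,
    # not the path, so it now points at app.log.1.
    os.rename(log_path, rotated_path)
    handle.write("after rotation, no reopen\n")
    handle.flush()

    # Roughly what a daemon does on SIGHUP/reload: close and reopen by path.
    handle.close()
    handle = open(log_path, "a")
    handle.write("after reopen\n")
    handle.close()

    with open(rotated_path) as f:
        rotated = f.read().splitlines()
    with open(log_path) as f:
        current = f.read().splitlines()
    return rotated, current


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rotated, current = demo_rotation(tmp)
        print("rotated:", rotated)
        print("current:", current)
```

The same inode logic applies on the reading side: a process tailing the old descriptor keeps reading the rotated file and never sees the new one until it reopens the path.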

The reload command of the syslog-ng systemd unit ensures a SIGHUP signal is sent to syslog-ng, which will reopen the source file. Changing the postrotate script to:

postrotate
        invoke-rc.d syslog-ng reload >/dev/null 2>&1 || true
endscript

should fix your issue, but you'll need to add it to every logrotate configuration file.
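Since the same postrotate block has to be repeated for every file, one way to avoid copy-paste mistakes is to render the stanza from a list of paths. The snippet below is only a sketch of that idea in Python: the function name, the example path and the default values are assumptions for illustration, and in practice this would more likely be a Puppet template.

```python
def logrotate_stanza(paths, daemon="syslog-ng", rotate=4, frequency="weekly"):
    """Render one logrotate stanza that reloads `daemon` after rotation.

    `paths` is a list of log files sharing the stanza. The directives
    mirror the ufw.log example above; only the postrotate command differs.
    """
    if not paths:
        raise ValueError("at least one log path is required")
    lines = [
        " ".join(paths) + " {",
        f"        rotate {rotate}",
        f"        {frequency}",
        "        missingok",
        "        notifempty",
        "        compress",
        "        delaycompress",
        "        sharedscripts",
        "        postrotate",
        f"                invoke-rc.d {daemon} reload >/dev/null 2>&1 || true",
        "        endscript",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical path, used only to show the output shape.
    print(logrotate_stanza(["/var/log/example/service.log"]), end="")
```

With sharedscripts set, the reload runs once per stanza rather than once per rotated file, so grouping several paths under one stanza keeps the number of reloads down.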

I will try and finish this now (for cloud*)

Added pve* logging via https://github.com/miraheze/puppet/pull/1713

Needs to be fixed to parse the timestamp, but other than that this is done (cloud*).

There's one other log I didn't think we needed to send for proxmox (there wasn't really any info in it that we needed, I think).

I could look into taking this over from @Paladox. Is there anything not on this task that I should be aware of if I do?

Moving over to the new goal period. Feel free to remove it if it shouldn't be carried over.

@Paladox has raised concerns with centralised-only logging. We should explore these concerns before pushing for things like nginx access logs as these are critical for debugging some traffic influx/DoS attacks.

I agree with that. At least for some logs it's definitely useful to have logs stored locally in case something goes wrong and the logs don't get transmitted to graylog.

In T5044#156437, @John wrote:

@Paladox has raised concerns with centralised-only logging. We should explore these concerns before pushing for things like nginx access logs as these are critical for debugging some traffic influx/DoS attacks.

+1

T7740 is likely to be influenced by work done on this task.

Plan for resolving this task:

  • All services will have their logs ingested into Graylog; this isn't negotiable.
  • Where logs are ingested, we will maintain 24-48 hours of *local* logs on the server. This will be supported by log rotation.
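
As a rough check on the 24-48 hour figure, the arithmetic can be sketched as below. This assumes purely time-based rotation at a fixed interval (for example logrotate's daily with rotate 1, which keeps the live file plus one rotated file); the function name is made up, and size-based rotation would change the numbers.

```python
from datetime import timedelta


def local_retention_window(rotated_files, period=timedelta(days=1)):
    """Return (minimum, maximum) history available locally.

    Assumes rotation happens every `period` and that `rotated_files`
    rotated files are kept in addition to the live file.
    """
    if rotated_files < 0:
        raise ValueError("rotated_files must be non-negative")
    # Just after a rotation the live file is empty, so only rotated files count.
    minimum = rotated_files * period
    # Just before the next rotation the live file holds a full period as well.
    maximum = (rotated_files + 1) * period
    return minimum, maximum


if __name__ == "__main__":
    low, high = local_retention_window(1)
    print(low, high)
```

Under those assumptions, one rotated file with daily rotation keeps between 24 and 48 hours of history on the server, depending on how long ago the last rotation ran.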

Longer term work (not this task and not this half):

  • Split ES out from graylog - (likely will be done as part of T7740 if we proceed with that)
  • Introduce a graylog instance on each cloud server
  • Distribute log collection amongst the collection of graylog servers.

New server list for checking the above plan against:

  • bacula2.miraheze.org
  • cloud3.miraheze.org
  • cloud4.miraheze.org
  • cloud5.miraheze.org
  • cp20.miraheze.org
  • cp21.miraheze.org
  • cp30.miraheze.org
  • cp31.miraheze.org
  • db11.miraheze.org
  • db12.miraheze.org
  • db13.miraheze.org
  • gluster3.miraheze.org
  • gluster4.miraheze.org
  • graylog2.miraheze.org
  • jobchron1.miraheze.org
  • ldap2.miraheze.org
  • mail2.miraheze.org
  • mem1.miraheze.org
  • mem2.miraheze.org
  • mon2.miraheze.org
  • mw8.miraheze.org
  • mw9.miraheze.org
  • mw10.miraheze.org
  • mw11.miraheze.org
  • mw12.miraheze.org
  • mw13.miraheze.org
  • mwtask1.miraheze.org
  • ns1.miraheze.org
  • ns2.miraheze.org
  • phab2.miraheze.org
  • puppet3.miraheze.org
  • test3.miraheze.org

I am going to start progress on this task, firstly by cleaning up how we define all of this in puppet. I'll introduce simple logging stanzas that we can define over and over again for each log file, and that handle all of the syslog-ng logic + logrotate configuration for the new system.
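
To give an idea of what one of these per-file stanzas could expand to on the syslog-ng side, here is a small Python sketch. It is not the Puppet code: the function name, the s_file_ prefix and the d_graylog destination name are all assumptions for illustration, and the options used (follow-freq, program-override, flags(no-parse)) are just one plausible choice for a file source.

```python
import re


def syslog_ng_file_source(name, path, destination="d_graylog"):
    """Render a syslog-ng source and log path for a single log file.

    `destination` must name a destination defined elsewhere in the
    syslog-ng configuration; d_graylog is a hypothetical default.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("name may only contain letters, digits and underscores")
    return (
        f"source s_file_{name} {{\n"
        f'    file("{path}" follow-freq(1) program-override("{name}") flags(no-parse));\n'
        f"}};\n"
        f"log {{\n"
        f"    source(s_file_{name});\n"
        f"    destination({destination});\n"
        f"}};\n"
    )


if __name__ == "__main__":
    # Hypothetical service name and path, used only to show the output shape.
    print(syslog_ng_file_source("example", "/var/log/example/service.log"), end="")
```

Setting program-override here means each file shows up in graylog under its own application_name, in the same way puppet-agent and CRON already do.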

After this, services will be moved over slowly, one at a time, to ensure the system works effectively and that issues identified early on are fixed before wider deployment.

This task has taken a back seat to other work that currently has higher priority, such as T8469 and T8350.

@Paladox less than a week until end of goal period - do we have an update on this?

Paladox updated the task description.

Resolved