Total Pageviews

Sunday, December 25, 2016

Golden Gate Latency monitoring with PowerShell, influxDB and Grafana – PowerShell to InfluxDB

The SQL Server which is installed on the Windows Server is receiving data from “Tandem Non-Stop System” by using Oracle Golden Gate replication. This downstream data feed is critical to smooth business functionality. So, it needs to be monitored proactively.

Although monitoring this environment is critical, however, by executing various Golden Gate commands (STATUS ALL, INFO ALL, LAG) in the GGSCI shell manually to understand Latency is very annoying and tedious. Therefore, the automated process comes into the picture to monitor performance in real-time for our die-hard operational team.

Two REPLICATs we are interested in:
We are interested in monitoring two of the REPLICATs’ performance constantly as they are critical to our healthcare business activities along with some SQL Server and Windows performance counters.

The following dashboard has been developed with the open source technology to visualize performance data in real-time so one can glue his/her eyes 24/7 on the dashboard. The dashboard process (the Grafana) can also trigger alerts/notifications if a particular monitoring object exceeds the threshold for a substantial period of time.

Performance metrics dashboard in Real-time:


Technology Stack Used:
The PowerShell script streams data in every 10 seconds interval and sends it to influxDB using the REST API method. The script runs by the windows Task scheduler continuously.

InfluxDB is a time-series based database technology which is high performance and can be massively scaled up. It is written in GO language and is open source.

Grafana is an open source metric analytics & visualization suite. It is most commonly used for visualizing time series data for infrastructure and application analytics.

PowerShell Script and Dashboard download:

PowerShell Script:
Following is the complete PowerShell Script which is feeding data into influxDB. This script collects two Golden Gate REPLICATs’ latency, as well as couples SQL Server and Windows Performance metrics every 10 seconds.  Where, in the script, the server represents the following:

SRV0009 – the Windows server where the Golden Gate process is running. The PowerShell Script is also hosted here.
STV1075 – where the InfluxDB and the Grafana have been installed and are running.

<#####################################################################################
       .NOTES
       ===========================================================================
        Created with:      PowerShell ISE
        Created on:        12/25/2016 9:39 AM
        Created by:        Sarjen Haque
       ===========================================================================
       .DESCRIPTION
              function to collect and parse Golden Gate status data from
              a local or remote server installed on Windows
######################################################################################>

function Get-RealTimeStatus()
{
       $result = winrs -r:SRV0009 "CMD /c echo status all | H:\OracleGoldenGate\ggsci"
       $raw = $result -match 'REPLICAT'
      
       [StringSplitOptions]$Options = "RemoveEmptyEntries"
      
       # loop through each line and break
       foreach ($line in $raw)
       {
              $wrd = $line.Split(" ", $Options)
              $lg = $wrd[3].Split(":")
              $tm = $wrd[4].Split(":")
              $result = @{
                     "Program" = $wrd[0];
                     "Status" = $wrd[1];
                     "Name" = $wrd[2];
                     "Lag" = $wrd[3];
                     "LagSec" = [int]$lg[0] *(60 * 60) + [int]$lg[1] *60 + [int]$lg[2];
                     "ChkPt" = $wrd[4];
                     "ChkPtSec" = [int]$tm[0]*(60 * 60) + [int]$tm[1] *60 + [int]$tm[2];
              }
             
              $obj = New-Object -TypeName PSObject -Property $result
              Write-Output $obj
       }
}

#######################################################################################
# FUnction to collect and format Golden gate and Windows/SQL Server PerfMon data
#######################################################################################

Function Get-ggPerf
{
       # Windows and SQL Server Instance
       $WinServer = 'SRV0009'
       $SQLServer = 'SQLServer' # use for default instance
       # $SQLServer = 'MSSQL$SQL2014' # use for a named instance. Replace SQL2014 with the isntance name
      
       # Pull Windows and SQL stats
       [System.Collections.ArrayList]$counters = @()
      
       # Windows reelated metrics
       $counters.Add("\\$WinServer\logicaldisk(d:)\avg. disk sec/transfer") | out-null #0
       $counters.Add("\\$WinServer\logicaldisk(e:)\avg. disk sec/transfer") | out-null #1
       $counters.Add("\\$WinServer\logicaldisk(h:)\avg. disk sec/transfer") | out-null #2
       $counters.Add("\\$WinServer\logicaldisk(l:)\avg. disk sec/transfer") | out-null #3
       $counters.Add("\\$WinServer\logicaldisk(_total)\avg. disk sec/transfer") | out-null #4
       $counters.Add("\\$WinServer\Memory\Available MBytes") | out-null #5
       $counters.Add("\\$WinServer\Processor(_total)\% Privileged Time") | out-null #6
       $counters.Add("\\$WinServer\Processor(_total)\% User Time") | out-null #7
       $counters.Add("\\$WinServer\Processor(_total)\% Processor Time") | out-null #8
      
       #SQL Server related metrics      
       $counters.Add("\\$WinServer\" + $SQLServer + ":Resource Pool Stats(default)\CPU usage %") | out-null #9
       $counters.Add("\\$WinServer\" + $SQLServer + ":General Statistics\Processes blocked") | out-null #10
       $counters.Add("\\$WinServer\" + $SQLServer + ":Buffer Manager\page life expectancy") | out-null #11
       $counters.Add("\\$WinServer\" + $SQLServer + ":General Statistics\User Connections") | out-null #12
       $counters.Add("\\$WinServer\" + $SQLServer + ":SQL Statistics\Batch Requests/Sec") | out-null #13
       $counters.Add("\\$WinServer\" + $SQLServer + ":Memory Manager\total server memory (kb)") | out-null #14
       $countersall = (Get-Counter -Counter $Counters -SampleInterval 1).countersamples.CookedValue
      
       # Format SQL data into Influxdb Line Protocol
       $postParams = "sql_perf,host=" + $WinServer + " avgtransfer_d=" + $countersall[0] + ",avgtransfer_e=" + $countersall[1] `
                     + ",avgtransfer_h=" + $countersall[2] + ",avgtransfer_l=" + $countersall[3] `
                     + ",avgtransfer_total=" + $countersall[4] + ",available_mbytes=" + $countersall[5] `
                     + ",pct_priviledge_time=" + $countersall[6] + ",pct_user_time=" + $countersall[7] `
                     + ",pct_processor_time=" + $countersall[8] + ",sql_cpu_time=" + $countersall[9] `
                     + ",processess_blocked=" + $countersall[10] + ",page_life_expectancy=" + $countersall[11] `
                     + ",user_connections=" + $countersall[12] + ",batch_resc_per_sec=" + $countersall[13] `
                     + ",total_server_memory_kb=" + $countersall[14]
      
       # get the output from the Golden Gate function
       $result = Get-RealTimeStatus | select Program, Status, Name, Lag, LagSec, ChkPt, ChkptSec
       $r1 = $result | ?{ @("RP9CRIT2") -contains $_.Name } | select -First 1
       $r2 = $result | ?{ @("RP9MAIN2") -contains $_.Name } | select -First 1
      
       if ($r1.Status -eq 'RUNNING') { $r1status = 1 }
              else { $r1status = 0 }
      
       if ($r2.Status -eq 'RUNNING') { $r2status = 1 }
              else { $r2status = 0 }
      
       # Format GG data into influxdb Line Protocol
       $postParams_r1 = "gg_RP9CRIT2,host=" + $WinServer + " LagSec=" + $r1.LagSec + ",ChkptSec=" + $r1.ChkptSec + ",Status=" + $r1status
       $postParams_r2 = "gg_RP9MAIN2,host=" + $WinServer + " LagSec=" + $r2.LagSec + ",ChkptSec=" + $r2.ChkptSec + ",Status=" + $r2Status
      
       # Post to influxdb API
       $uri = 'http://SRV0175:8086/write?db=ggperfdb'
      
       # Create web connection with influxdb URL
       $ServicePoint = [System.Net.ServicePointManager]::FindServicePoint($uri)
       $ServicePoint.ConnectionLimit = 3
      
       # Use influxdb authentication to logon
       $authheader = "Basic " + ([Convert]::ToBase64String([System.Text.encoding]::ASCII.GetBytes("admin:admin")))
      
       # Post data to influxDB
       try
       {
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams -DisableKeepAlive
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams_r1 -DisableKeepAlive
              Invoke-RestMethod -Headers @{ Authorization = $authheader } -Uri $uri -Method POST -Body $postParams_r2 -DisableKeepAlive
             
              #Close web connection
              $ServicePoint.CloseConnectionGroup("") | Out-Null          
       }
      
       catch
       {     
              throw 'Could not POST to InfluxDB API endpoint'            
       }
}

#######################################################################################
# Run the collector function in a loop, continuously 
#######################################################################################

while ($true)
{
       # execute the collector
       Get-ggPerf
       # pause for X seconds
       start-sleep -Seconds 8
}

# end of script 

References:
PowerShell Way: Automating Golden Gate Replication Monitoring on Windows
http://sqltouch.blogspot.ca/2014/08/powershell-way-automating-golden-gate.html


Grafana: http://grafana.org/



No comments:

Post a Comment