AWS EC2 Best Practices
This topic does not apply to MemSQL Helios.
This document summarizes findings from technical assistance provided by MemSQL engineers to customers operating production MemSQL environments on Amazon EC2. This guide is intended for readers familiar with MemSQL fundamentals as well as the technical basics, terminology, and economics of AWS operations.
MemSQL Aggregator Node - What is a MemSQL Aggregator?
MemSQL Leaf Node - What is a MemSQL Leaf?
MemSQL Cluster Provisioning Considerations
MemSQL is a shared-nothing MPP cluster of servers. In the context of EC2, a MemSQL server is an EC2 instance. As a general rule, a MemSQL cluster should be operated within a single EC2 VPC/Region/Availability Zone and should use identically configured instances of the same type. The amount of EBS storage attached to the MemSQL EC2 instances, however, may differ depending on the type of node(s) hosted by the instance: aggregators require only a minimal amount of EBS storage (they may store reference table data), while EBS storage for leaves must be provisioned according to user data capacity requirements.
The recommended decision-making method is a two-step process:
- Select the proper instance type for the MemSQL server. This will be a MemSQL cluster’s “building block”.
- Determine the required number of instances to scale out the cluster capacity horizontally to meet storage, response time, concurrency, and service availability requirements.
The basic principle when provisioning virtualized resources for a MemSQL server is to allocate CPU, RAM, storage I/O, and networking in a balanced manner so that no particular resource becomes a bottleneck, leaving other resources underutilized.
EC2 users should keep in mind that “Amazon EC2 dedicates some resources of the host computer, such as CPU, memory, and instance storage, to a particular instance. Amazon EC2 shares other resources of the host computer, such as the network and the disk subsystem, among instances.” AWS Instance Types
Selecting Instance Type (CPU, RAM)
MemSQL recommends provisioning a minimum of 8GB of RAM per physical core or virtual processor. For faster CPUs, 16-24GB per core is commonly selected.
When selecting an instance type as a building block of a MemSQL cluster in a production environment, users should only consider instances with 8GB or more RAM per vCPU.
Several available instance types meet the above guideline. The best instance type for a particular application can only be selected per specific performance and availability requirements.
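As a quick back-of-the-envelope check, the RAM-per-vCPU guideline can be scripted. This is a minimal sketch; the vCPU and RAM figures are illustrative examples (look up real values with `aws ec2 describe-instance-types`), and the 8 GB/vCPU floor is the guideline stated above.

```shell
#!/usr/bin/env bash
# Sketch: verify an instance type meets the 8 GB RAM per vCPU guideline.
# Instance figures below are illustrative assumptions, not authoritative.
check_ratio() {
  local name=$1 vcpus=$2 ram_gb=$3
  local ratio=$(( ram_gb / vcpus ))
  if [ "$ratio" -ge 8 ]; then
    echo "$name: ${ratio} GB/vCPU - OK"
  else
    echo "$name: ${ratio} GB/vCPU - below guideline"
  fi
}

check_ratio r4.2xlarge 8 64   # memory-optimized: meets the guideline
check_ratio c5.2xlarge 8 16   # compute-optimized: too little RAM per vCPU
```

Running this prints one line per instance type, flagging any type below the 8 GB/vCPU floor.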
FAQ: Is it better to have a smaller number of bigger machines or a larger number of smaller machines?
Answer: Though there is no “one size fits all” answer, a small number of “medium” size machines is a good baseline. There is rarely a good reason to select an instance backed by a server larger than 2 sockets in an EC2 environment (e.g., 32-vCPU instances), while instances with 4 vCPUs or fewer are only suitable for low-concurrency database workloads. An 8-vCPU “medium” size instance may be a good starting point. We at MemSQL have experimented extensively with r4.2xlarge instances (8 vCPUs, 64GB RAM - meets MemSQL guidance) and often find them a potent option with reasonable commercial terms. For heavy columnstore or highly concurrent workloads, machines with 16 or more cores are recommended.
Note: If you select a multi-socket EC2 instance, you must enable NUMA and deploy one leaf node per socket. See Configuring MemSQL for NUMA.
We recommend 10 Gbps networking and deploying MemSQL clusters in a single availability zone within a VPC for latency and bandwidth reasons.
You can set up replication to a secondary cluster for a DR configuration in a different availability zone in the same region, or in a different region. When secondary clusters are deployed in a different VPC from the primary, you can leverage AWS Transit Gateway or VPC Peering so that replication traffic never traverses the public internet.
Nitro-based instance types support network bandwidth of up to 100 Gbps, depending on the instance type selected. MemSQL recommends matching the instance configuration to the workload requirements. The key decision points are:
- Instance type selection based on the memory requirements of the aggregator and leaf nodes.
- EBS volume IOPS, which must also be supported by the selected EC2 instance type.
As a general rule, all EC2 instances of a MemSQL cluster should be configured on a single subnet. This means all MemSQL nodes will be within a single VPC/Region/Availability Zone.
All AWS customers get a VPC allocated to their account. Each EC2 instance must be assigned to a subnet in a VPC. Each subnet is bound to a single AZ, and customers place instances in subnets. An instance can only exist in one subnet and therefore in one AZ.
For DR configurations or geographically distributed data services, customers can provision two or more clusters, each in a dedicated VPC, typically in separate regions, with VPC peering between VPCs (regions).
The following examples illustrate use cases leveraging VPC peering:
- A MemSQL environment includes a primary cluster in one region and a secondary cluster in a different geography (region) for DR. Connectivity between the primary and secondary sites is provided by VPC peering.
- A MemSQL cluster ingests data by subscribing to a Kafka feed. A customer would typically set up the Kafka cluster in one VPC and the MemSQL cluster in a different VPC, with VPC peering connecting the MemSQL database to Kafka.
For VPC peering setup, scenarios, and configuration guidance see the VPC Peering Guide.
Partition Placement Groups
Partition placement groups help reduce the likelihood of correlated hardware failures for your application. When using partition placement groups, Amazon EC2 divides each group into logical segments called partitions. Amazon EC2 ensures that each partition within a placement group has its own set of racks. Each rack has its own network and power source. No two partitions within a placement group share the same racks, allowing you to isolate the impact of hardware failure within your application.
MemSQL recommends using Partition Placement Groups with MemSQL Availability Groups to implement HA solutions.
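As a sketch of the recommendation above, a partition placement group can be created and used at launch time with the AWS CLI. The group name, partition count, AMI ID, and instance counts below are illustrative assumptions, not values from this document:

```shell
# Create a partition placement group with one partition per MemSQL
# availability group (name and partition count are illustrative).
aws ec2 create-placement-group \
    --group-name memsql-pg \
    --strategy partition \
    --partition-count 2

# Launch leaf instances into a specific partition; e.g., map availability
# group 1 to partition 0 and availability group 2 to partition 1.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type r4.2xlarge \
    --count 4 \
    --placement "GroupName=memsql-pg,PartitionNumber=0"
```

Because no two partitions share racks, losing one rack affects at most one MemSQL availability group.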
Aligning AWS Availability Zones with MemSQL Availability Groups
If you are considering using AWS Availability Zones and MemSQL Availability Groups to add “AZ level robustness” to operating environments, note the following:
MemSQL operates most efficiently when all nodes of a cluster are within a single subnet. Separate AWS Availability Zones require separate subnets and as such are not optimal for MemSQL performance. The current guidance from MemSQL is to deploy all AWS nodes of a cluster into the same AWS Availability Zone.
When provisioning EBS volume capacity per application data retention requirements, MemSQL EC2 administrators need to include a safety margin and ensure that no production environment is operated with less than 40% free storage space on the data volume. Free disk space is required for temporary data materializations during database operations and to support continuous data growth. Users should also be aware that EBS performance generally scales with EBS volume size.
For rowstore MemSQL deployments, provision a storage system for each node with at least 3 times the capacity of main memory; for columnstore workloads, provision SSD-based storage volumes.
Please note that under-configured EC2 storage is a common root cause of inconsistent MemSQL EC2 cluster performance.
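The two sizing rules above (at least 3x main memory for rowstore, and never less than 40% free space on the data volume) can be combined into a quick capacity estimate. This is a sketch; the RAM and data figures are arbitrary examples:

```shell
#!/usr/bin/env bash
# Sketch: estimate the minimum EBS data-volume size for a rowstore leaf.
# ram_gb and data_gb are illustrative examples, not recommendations.
ram_gb=256                       # main memory of the leaf instance (example)
min_rowstore=$(( ram_gb * 3 ))   # rowstore rule: volume >= 3x main memory
data_gb=500                      # expected user data on the volume (example)
# 40%-free rule: volume must be at least data / 0.6, rounded up.
min_free_rule=$(( (data_gb * 10 + 5) / 6 ))
echo "rowstore minimum: ${min_rowstore} GB"
echo "40%-free minimum: ${min_free_rule} GB"
```

Provision at least the larger of the two numbers; for these example figures, 768 GB satisfies both rules.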
MemSQL is a shared-nothing MPP system, i.e. each MemSQL server (EC2 instance) manages its own “internal” storage.
To ensure durable MemSQL server storage, users need to provision EBS volumes and attach them to the MemSQL servers (EC2 instances).
EBS is a disk subsystem that is shared among instances, which means that MemSQL EC2 users may encounter somewhat unpredictable variations in I/O performance across MemSQL servers and even performance variability of the same EBS volume over time. EBS performance characteristics may be affected by activities of co-tenants (a “noisy neighbor” problem), by file replication for availability, by EBS rebalancing, etc.
The elastic nature of EBS means that the system is designed to monitor utilization of underlying hardware assets and automatically rebalance itself to avoid hotspots. This has both a positive and a negative impact on end-user operations. Users are assured that EBS will reasonably promptly resolve severe contention for I/O; on the other hand, relocation of files to new storage nodes during rebalancing adversely affects EBS volume performance.
To maximize the consistency and performance characteristics of EBS, MemSQL encourages users to follow the general AWS recommendation to attach multiple EBS volumes to an instance and stripe across the volumes. This technique is widely employed by EC2 users. Since AWS charges by EBS volume capacity, there should be no economic penalty for using multiple smaller EBS volumes versus one large EBS volume of the same total size.
Users can consider attaching 3-4 EBS volumes to each leaf server (instance) in the cluster and presenting this storage to the host as a software RAID0 device.
Studies show that there is an appreciable increase in RAID0 performance up to 4 EBS volumes in a stripe, with flattening after 6 volumes.
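A common way to implement the striping described above is Linux software RAID via mdadm. The following is a sketch only; the device names (/dev/xvdf through /dev/xvdi), filesystem, and mount point are illustrative assumptions:

```shell
# Sketch: stripe four attached EBS volumes into one RAID0 device.
sudo mdadm --create /dev/md0 \
    --level=0 \
    --raid-devices=4 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi

# Create a filesystem and mount it as the MemSQL data directory.
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /var/lib/memsql
sudo mount /dev/md0 /var/lib/memsql

# Persist the array definition and the mount across reboots.
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
echo '/dev/md0 /var/lib/memsql ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```

Note that RAID0 provides no redundancy; as discussed below, fault tolerance comes from MemSQL cluster-level redundancy, not from the storage layer.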
For more information, see the AWS documentation section “Amazon EBS Volume Performance on Linux Instances”.
MemSQL EC2 customers with extremely demanding database performance requirements may consider provisioning enhanced EBS types such as Provisioned IOPS SSD (io1), which delivers very high IOPS rates.
The General Purpose SSD (gp2) option provides a good balance of performance and cost for most deployments. It delivers single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods. It provides a consistent baseline performance of 3 IOPS/GiB; for example, an EBS gp2 volume of 1,000 GiB has a baseline of 3,000 IOPS (at a 16 KiB I/O size). You can consider joining multiple EBS volumes together in a RAID 0 configuration to increase available throughput.
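The gp2 baseline calculation can be sketched as below. The 3 IOPS/GiB figure and the 1,000 GiB example come from the text above; the 100 IOPS floor and 16,000 IOPS cap are AWS-documented gp2 limits added here for completeness:

```shell
#!/usr/bin/env bash
# Sketch: gp2 baseline IOPS = 3 IOPS/GiB, floored at 100 and capped at
# 16,000 (floor and cap are AWS-documented gp2 limits, not from this doc).
gp2_baseline_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))
  [ "$iops" -lt 100 ] && iops=100
  [ "$iops" -gt 16000 ] && iops=16000
  echo "$iops"
}

gp2_baseline_iops 1000   # 3000, matching the example above
```
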
You do not have to over-provision your EBS volumes based on expected future workloads. You can benefit from the EBS Elastic Volumes feature, which allows changes to volume type, size, and IOPS with no downtime.
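For example, an online volume resize with Elastic Volumes can be sketched as follows (the volume ID, target size, and device name are illustrative assumptions):

```shell
# Sketch: grow an EBS volume online with Elastic Volumes.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 1000

# Track progress of the modification.
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0

# After the modification completes, extend the filesystem on the instance;
# e.g., for ext4:
sudo resize2fs /dev/xvdf
```
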
Storage Level Redundancy
As a reminder, MemSQL provides native out-of-the-box fault tolerance.
In MemSQL database environments running on physical hardware, MemSQL recommends supplementing a cluster’s fault tolerance with storage-level redundancy supported by hardware RAID controllers. It’s a cost-effective approach that diminishes the impact of a single drive failure on cluster operations.
However, in EC2 environments storage-level redundancy provisions are not applicable because:
- EBS volumes are not statistically independent (they may share the same physical network and storage infrastructure).
- Studies and customer experience show that the performance of software RAID in a redundant configuration, in particular RAID5 over EBS volumes, is below acceptable levels.
For fault tolerance, MemSQL EC2 users can rely on MemSQL cluster level redundancy and under-the-cover mirroring of EBS volumes provided by AWS.
Instance (Ephemeral) Storage
EC2 instance types that meet recommendations for a MemSQL server typically come with preconfigured temporary block storage referred to as instance store or ephemeral store. Since ephemeral storage is physically attached to the host computer, it delivers superior I/O performance compared to network-attached EBS.
However, due to instance storage’s ephemeral nature, proper care must be taken (configure HA, understand the limitations and potential risks) when deploying it as persistent data storage in a production environment.
The use of instance storage for MemSQL data is typically limited to scenarios where the database can be reloaded entirely from persistent backups or custom save points. For example, as a “development sandbox”, or for one-time data mining/ad hoc analytics, or when data files loaded since the last save point are preserved and may be used to restore the latest content, etc.
Encryption of Data at Rest
MemSQL recommends enabling EBS encryption. If an NVMe instance store is used, the data is encrypted at rest by default.
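EBS encryption can be enabled account-wide per region, or per volume at creation time. The region, AZ, and volume size below are illustrative:

```shell
# Sketch: turn on EBS encryption by default for all new volumes in a region.
aws ec2 enable-ebs-encryption-by-default --region us-east-1

# Verify the setting.
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Individual volumes can also be encrypted explicitly at creation time.
aws ec2 create-volume \
    --availability-zone us-east-1a \
    --size 500 \
    --volume-type gp2 \
    --encrypted
```
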
Backup and Restore
MemSQL recommends backing up to an S3 bucket with cross-region replication enabled to protect against region failure and to meet disaster recovery requirements.
Load Balancing of Client Connections
Application clients access a MemSQL database cluster by connecting to aggregator nodes. Normally multiple aggregator nodes are provisioned for fault tolerance and performance considerations. A good practice is to spread client connections evenly across all aggregator nodes of a MemSQL cluster. This can be achieved with either or both the following methods:
- Application side connection pool. Sophisticated connection pool implementations offer load balancing, failover and failback, and even multi-pool failover and failback.
- NLB, the AWS Network Load Balancer service.
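Fronting the aggregators with an NLB can be sketched with the AWS CLI as follows. The names, subnet/VPC IDs, instance IDs, and ARN placeholders are illustrative; port 3306 is MemSQL’s default MySQL-protocol port:

```shell
# Sketch: create an internal NLB for the aggregator nodes.
aws elbv2 create-load-balancer \
    --name memsql-aggregators \
    --type network \
    --scheme internal \
    --subnets subnet-0123456789abcdef0

# TCP target group on the default MemSQL port.
aws elbv2 create-target-group \
    --name memsql-agg-tg \
    --protocol TCP \
    --port 3306 \
    --vpc-id vpc-0123456789abcdef0

# Register each aggregator instance, then wire up the listener.
aws elbv2 register-targets \
    --target-group-arn <target-group-arn> \
    --targets Id=i-aaaa Id=i-bbbb

aws elbv2 create-listener \
    --load-balancer-arn <load-balancer-arn> \
    --protocol TCP \
    --port 3306 \
    --default-actions Type=forward,TargetGroupArn=<target-group-arn>
```

The NLB’s health checks will automatically stop routing connections to an aggregator that becomes unreachable.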
Health Check Considerations
Expiring security certificates can be a security risk. AWS Certificate Manager (ACM) helps manage renewal of your Amazon-issued SSL/TLS certificates, and we recommend using it to mitigate this risk.
Enable AWS CloudTrail Logs to Extend Log Retention
AWS CloudTrail logs are enabled by default for management events and provide visibility into the past 90 days of account activity. Configure AWS CloudTrail logs for the MemSQL hosts if you wish to store AWS management event logs for more than 90 days.
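Extending retention beyond 90 days amounts to writing the trail to an S3 bucket. A minimal sketch (trail and bucket names are illustrative, and the bucket must carry the standard CloudTrail bucket policy):

```shell
# Sketch: persist management-event logs to S3 across all regions.
aws cloudtrail create-trail \
    --name memsql-audit-trail \
    --s3-bucket-name my-cloudtrail-bucket \
    --is-multi-region-trail

aws cloudtrail start-logging --name memsql-audit-trail
```
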
Create an IAM Role
AWS users should not use the AWS root user to create or manage MemSQL clusters. Use an IAM user with the MemSQL Cluster Management Role to deploy and manage MemSQL clusters on AWS.
The following is the minimum-privilege IAM policy required for the MemSQL Cluster Management Role:
"aws-marketplace:ListBuilds", "aws-marketplace:StartBuild", "aws-marketplace:Subscribe", "aws-marketplace:ViewSubscriptions", "cloudformation:CreateChangeSet", "cloudformation:CreateStack", "cloudformation:CreateStackInstances", "cloudformation:CreateStackSet", "cloudformation:CreateUploadBucket", "cloudformation:DescribeChangeSet", "cloudformation:DescribeStackEvents", "cloudformation:DescribeStackInstance", "cloudformation:DescribeStackResource", "cloudformation:DescribeStackResources", "cloudformation:DescribeStacks", "cloudformation:DescribeStackSet", "cloudformation:GetStackPolicy", "cloudformation:GetTemplate", "cloudformation:GetTemplateSummary", "cloudformation:ListStackInstances", "cloudformation:ListStackResources", "cloudformation:ListStacks", "cloudformation:ListStackSetOperationResults", "cloudformation:ListStackSetOperations", "cloudformation:ListStackSets", "cloudformation:SetStackPolicy", "cloudformation:UpdateStack", "cloudformation:UpdateStackInstances", "cloudformation:UpdateStackSet", "cloudformation:UpdateTerminationProtection", "ec2:AssociateRouteTable", "ec2:AttachInternetGateway", "ec2:AttachVolume", "ec2:AuthorizeSecurityGroupIngress", "ec2:CreateInternetGateway", "ec2:CreateKeyPair", "ec2:CreateRoute", "ec2:CreateRouteTable", "ec2:CreateSecurityGroup", "ec2:CreateSubnet", "ec2:CreateTags", "ec2:CreateVolume", "ec2:CreateVpc", "ec2:DescribeAccountAttributes", "ec2:DescribeAvailabilityZones", "ec2:DescribeInstanceAttribute", "ec2:DescribeInstances", "ec2:DescribeInstanceStatus", "ec2:DescribeInternetGateways", "ec2:DescribeKeyPairs", "ec2:DescribeRegions", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeTags", "ec2:DescribeVolumeAttribute", "ec2:DescribeVolumes", "ec2:DescribeVpcAttribute", "ec2:DescribeVpcs", "ec2:DetachInternetGateway", "ec2:ImportKeyPair", "ec2:ModifyVpcAttribute", "ec2:RebootInstances", "ec2:RunInstances", "ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances", 
"elasticloadbalancing:AddTags", "elasticloadbalancing:CreateListener", "elasticloadbalancing:CreateLoadBalancer", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeTargetGroups", "elasticloadbalancing:DescribeTargetHealth", "iam:ListRoles", "sns:CreateTopic", "sns:ListTopics", "sns:TagResource"
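The action list above goes into the Action array of a standard IAM policy document. This is a sketch with only a few actions shown for brevity; the policy and file names are illustrative, and the full list from this section should be pasted into the Action array:

```shell
# Sketch: wrap the action list in a policy document and create the policy.
cat > memsql-cluster-mgmt-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aws-marketplace:Subscribe",
        "cloudformation:CreateStack",
        "ec2:RunInstances"
      ],
      "Resource": "*"
    }
  ]
}
EOF

aws iam create-policy \
    --policy-name MemSQLClusterManagement \
    --policy-document file://memsql-cluster-mgmt-policy.json
```

The resulting policy can then be attached to the IAM user or role used to deploy and manage the cluster.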
When deploying MemSQL via a CloudFormation script, MemSQL recommends using the MemSQL Cluster Management Role to deploy and manage the MemSQL cluster.
AWS administrators should rotate IAM user access keys and secret keys periodically.
The MemSQL AWS team is actively soliciting customer feedback and would appreciate hearing from you. Please send your comments and suggestions to firstname.lastname@example.org.