This is just a dump from some OneNote pages I took over the last few years. It's completely unstructured and here for my own use, and for anyone else who might find some value in it.
- Big data: Hadoop
- Database: SQL Azure
- Storage: Tables, Blobs, Files, DocumentDB
- Traffic: Traffic Manager, Networking
- Caching: Redis, AppFabric
- Messaging: Service Bus, Queues
- Identity: AD, ?
- Media: Media Services and streaming
- Hosting: CDN, Websites, Cloud Service Worker Roles, WebJobs
- Starts with a Console Application.
- Install NuGet: Microsoft.Azure.WebJobs
- Depends on Azure Storage, so brings in dependencies.
- Add connection strings to storage and dashboard, use two storage accounts.
- Need to set up the accounts in the Azure portal; empty is fine.
- You can run this locally.
- Set up the queue.
- New queue, "helloworld"
- Added a new message via context menu on queue in Azure Management Studio.
- Hit F5 and Job host starts and runs the block.
- The host peeks the queue, triggers, pulls the message and passes it into the HelloWorld method; no queue-reading code is needed!
- That's it.
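
To make the "no queue-reading code" point concrete, here is a minimal sketch of the pattern in plain Python. It is not the WebJobs SDK: `JobHost`, `queue_trigger`, `add_message` and `run_once` are hypothetical names, and an in-memory queue stands in for the "helloworld" storage queue.

```python
import queue


class JobHost:
    """Illustrative host: owns the polling so handlers never read the queue."""

    def __init__(self):
        self._triggers = {}  # queue name -> handler function
        self._queues = {}    # queue name -> in-memory stand-in for a storage queue

    def queue_trigger(self, queue_name):
        """Register the decorated function as the handler for a queue."""
        def register(handler):
            self._triggers[queue_name] = handler
            self._queues.setdefault(queue_name, queue.Queue())
            return handler
        return register

    def add_message(self, queue_name, body):
        """Stand-in for adding a message via a storage tool."""
        self._queues.setdefault(queue_name, queue.Queue()).put(body)

    def run_once(self):
        """Drain every registered queue, passing each message to its handler."""
        handled = 0
        for name, handler in self._triggers.items():
            pending = self._queues[name]
            while not pending.empty():
                handler(pending.get())
                handled += 1
        return handled


host = JobHost()
received = []


@host.queue_trigger("helloworld")
def hello_world(message):
    # The handler only sees the message body, never the queue itself.
    received.append(f"Hello, {message}")


host.add_message("helloworld", "world")
host.run_once()
```

The handler is the only application code; everything about finding and pulling messages lives in the host.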
- "Always On" needs a paid website.
- Console Application, right click it, Publish as WebJob…
- Run modes: Continuously, Scheduled, On Demand.
- Continuously is for queue trigger.
- Create a new website, or attach it to existing.
- If it's a new site, it hosts only the WebJob.
- Can also add a WebJob as a zip file via portal.
- Can use other languages, like a PowerShell script.
- With C# we can use diagnostics, logging etc.
- We used a CloudService, a worker role. Still can.
- A cloud service is a way to package up applications for Azure, in Roles.
- Inherits from RoleEntryPoint, needs OnStart, Run, OnStop, lots of manual coding for reading config file settings, connecting to cloud storage account, queue account, creating it if not exist, also connecting to blob storage, container, etc.
- Running in a perpetual loop with own sleep/delay logic.
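
For contrast with the WebJob approach, this is roughly what the hand-written loop with its own sleep/delay logic looks like. A minimal sketch, assuming an in-memory `deque` in place of the cloud queue; `run_worker` and its parameters are hypothetical, and the idle limit exists only so the example terminates (a real role would loop forever).

```python
import time
from collections import deque


def run_worker(messages, handle, min_delay=0.01, max_delay=0.08,
               max_idle_polls=3, sleep=time.sleep):
    """Poll `messages`; back off while idle, reset the delay when work arrives."""
    delay = min_delay
    idle_polls = 0
    delays = []  # recorded only so the back-off is visible to the caller
    while idle_polls < max_idle_polls:
        if messages:
            handle(messages.popleft())
            delay = min_delay   # work found: poll eagerly again
            idle_polls = 0
        else:
            idle_polls += 1
            delays.append(delay)
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    return delays


processed = []
backoff = run_worker(deque(["a", "b"]), processed.append, sleep=lambda s: None)
```

Doubling the delay while the queue is empty keeps an idle worker from hammering the queue with polls.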
- VM with manually installed cache like Couchbase.
- New! In the store, there's Add-ons, Memcachier: New > Store > Memcachier, select plan, failover etc.
- Install NuGet package, Enyim Memcached.
- Redis Cache Solutions for Azure: available as a service, has grouping and cached lists addressable via own item index, pub/sub messaging, batch transactions.
- Redis Service has tiers for throughput and failover, 250MB-53GB, master-slave basic option for replication + auto-failover, scale size up instantly. For dev, use smallest cache that has IOPS limit.
- Can use it from anywhere, is public.
- Portal: New > Redis Cache
- VM with manually installed cache like Couchbase.
- Azure Automation is a service on Azure, a hosted run space, backed by PowerShell Workflow.
- System Center Orchestrator is very similar.
- Centralized Store: values, credentials, PS modules.
- API for Management coming.
- Reporting and history.
- Automation accounts are tied to regions.
- Might want to split accounts for security/access to production credentials.
- Stored asset items are shared across the account.
- Runbook: a set of scripts to execute, becomes a 'job'.
- Schedules: daily, hourly, once.
- Priced on job run time and number of runbooks, per account.
- For a call to be made to an external HTTPS service, need to download and install the certificate, since there is no access to a root cert store.
- By default you get the PS module for Azure.
- Assets: click import module and browse to a zip file containing the modules.
- Can now pass a PSCredential object to authenticate with Azure AD using an "organizational account".
- Can edit the runbook within the portal on the Author 'tab'.
- Add-AzureAccount is what's used to authenticate, must be stored statically in the PS runspace/runscope/process.
- Membership, Leaderboards, Achievements, Downloadable Content, Game Statistics, Game Presence, Cheating & Banning, Multi-Player Game Stats.
- Tricky to choose the right services, especially as they're changing all the time and getting new capabilities.
- Telemetry, Inquiries, Commands, Notifications
- Uses Service Bus pub/sub topics to get telemetry into worker roles for processing.
- Uses Relay Service to get notifications to other drivers (presumably hosting WCF in game) but says that it's not actually very scalable, use something else.
- Storage, uses tables for lap times and telemetry, blobs for binary lap replay.
- Uses ASP.NET MVC website(s) for lap time display, telemetry API for inquiries and website itself.
- PartitionKey, partitions have SLA, and order by RowKey
- Telemetry: sending per sector, once per 10 seconds, sampling data 100ms and batching to send 1/second (interpolation on receive to smooth jumps).
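
The sample/batch/interpolate idea can be sketched as follows. This is an illustration only: the `(time_ms, value)` sample shape and the function names are assumptions, and linear interpolation is one plausible choice, since the notes do not say which method was used.

```python
def batch(samples, batch_size=10):
    """Group 100 ms samples into batches of 10, i.e. one send per second."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def interpolate(p0, p1, t):
    """Linearly interpolate between two (time_ms, value) samples at time t."""
    (t0, v0), (t1, v1) = p0, p1
    if t1 == t0:
        return v0
    fraction = (t - t0) / (t1 - t0)
    return v0 + (v1 - v0) * fraction


# Two seconds of telemetry sampled every 100 ms.
samples = [(i * 100, float(i)) for i in range(20)]
batches = batch(samples)

# Receiver side: estimate the value halfway between two received samples.
midpoint = interpolate((0, 0.0), (100, 10.0), 50)
```

Batching trades up to a second of latency for a tenth of the message count; interpolation hides the resulting jumps on the receiving side.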
- Service Bus and a worker process to digest statistics, user stats.
- Authentication, XBox sits on secure network tunneling over the internet.
- Security gateways, XBox secure Protocol, UDP based, have to use SG to talk to public internet.
- "Title infrastructure"
- XBox has limited local storage slots and RAM, need to offload temp data, partial statistics.
- They keep session state, with sessions marked as complete, so crashed servers can reload and resume session state.
- Massive scale testing, Azure Service Bus team had to ask Halo team to stop!
- Scale testing had to be seriously invested in, no standard tools, record, mutate and playback. Record the real traffic. Hard to fake generate certain data types.
- Like a server side version of fiddler.
- Use scheduled service bus messages, dump millions of messages but scheduled for delivery.
- James Hamilton, lessons learned from building Windows Live:
- Partitioning your application.
- Optimising for density.
- Caching
- Millions of users, 200,000++ ops per second, 1000s of cores, 100s of databases.
- Redundancy and Fault Recovery
- Commodity hardware slice.
- Single version software.
- Multi-tenancy.
- Support geo-distribution.
- Automatic provisioning and installation.
- Configuration and code as a unit.
- Manage roles, not servers.
- Deal with multi-system failures.
- Recover at the service level.
- Stateless is the goal.
- Small code optimizations can have massive impact on your cloud bill.
- Typical Workloads
- Content Delivery: websites and services, session state, transient state, shopping cart.
- Content Exploration: Per-user content view, per user-stateful progress, doesn't touch other user data, fairly simple to scale.
- Social Graph and Content: comments, likes, global reach between users, loosely consistent, async updates to n customers, I must see my comment immediately but its okay for it to take a short time for others to see it.
- Interactive Gaming: n user content view, game actions, session, global reach, state updates shared to n players.
- Capacity, adding for demand, partitioning scheme.
- Optimize, resource usage, efficiency
- Shift, trade durability, queryability, consistency for throughput, latency.
- Play to strengths of components available.
- Azure compute, fairly easy to scale up and out
- Azure storage, 100TB, 5000iops per partition, 3Gbps, normally hit iops limit first, more partitions or more accounts.
- Azure SQL Database, 150GB, 305 threads, 400 concurrent reqs, hard to partition because the query semantic doesn't account for partitions/cost of operation.
- Horizontal partitioning, shards, split by rows, needs balanced part key.
- Vertical partitioning, split by columns, can be done across storage types easily on the cloud.
- Hybrid, shard + dimension data on other storage mediums.
- Select part value, Last Name, must consider field that won't change.
- Convert to part key, like hash it, speed vs. collisions vs. distro, mod by bucket count
- Map key to logical partition.
- Map logical partition to physical partition.
- End up with a connection string.
- Range Based, ranges adjusted to even out the parts.
- Logical Buckets, assign to logical bucket and assign to physical store, can have more than one logical per physical.
- Lookup assignment, lookup table to physical resource.
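
The mapping chain above (partition value, hashed key, logical bucket, physical store, connection string) can be sketched like this. All names, the bucket count and the connection strings are made up for illustration; SHA-256 is just one stable hash choice.

```python
import hashlib

BUCKET_COUNT = 8  # logical buckets (hypothetical)

# More than one logical bucket per physical store, so buckets can be moved
# to new stores later without rehashing every key.
LOGICAL_TO_PHYSICAL = {bucket: bucket % 2 for bucket in range(BUCKET_COUNT)}

# Placeholder connection strings, not real servers.
CONNECTION_STRINGS = {
    0: "Server=shard0.example;Database=app",
    1: "Server=shard1.example;Database=app",
}


def logical_bucket(partition_value):
    """Hash the partition value, then mod by the bucket count."""
    # A stable hash is required: Python's built-in hash() for strings is
    # randomized per process, so it cannot be used for persistent mapping.
    digest = hashlib.sha256(partition_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKET_COUNT


def connection_string_for(partition_value):
    """Partition value -> logical bucket -> physical store -> connection string."""
    physical = LOGICAL_TO_PHYSICAL[logical_bucket(partition_value)]
    return CONNECTION_STRINGS[physical]


target = connection_string_for("Smith")
```

Rebalancing then means editing `LOGICAL_TO_PHYSICAL` (the lookup table) and moving one bucket's data, rather than changing the hash.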
- Twitter, two tiers of people, normal people with 300 followers, celebrities.
- Querying over shards, gather and query, query is done in data tier.
- Eventual consistency can be done, geo scale with local write that the writer customer can see, then background task write elsewhere, or pop on queue.
- Submit queries to all nodes manually, gather results.
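
A minimal sketch of "submit to all nodes, gather results", assuming three in-memory shards of `(driver, lap_time)` rows; the data and function names are invented. Each shard does its own filtering and sorting (the query runs in the data tier), and the caller only merges the partial results.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each holding (driver, lap_time_seconds) rows.
SHARDS = [
    [("alice", 71.2), ("bob", 69.8)],
    [("carol", 70.1)],
    [("dave", 68.9), ("erin", 72.4)],
]


def query_shard(shard, limit):
    """Runs 'on the shard': return its own fastest laps, already sorted."""
    return sorted(shard, key=lambda row: row[1])[:limit]


def fastest_laps(limit):
    """Scatter the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, limit), SHARDS))
    # Each partial list is sorted, so a k-way merge keeps the global order.
    merged = heapq.merge(*partials, key=lambda row: row[1])
    return list(merged)[:limit]


top_three = fastest_laps(3)
```

Asking each shard for only `limit` rows keeps the gather step small: at most `limit` rows per shard cross the network.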
- SQL Azure Federations, does sharding for you and live splits, works for some problems, the central gateway becomes the choke point.
- Consider rush hour in a region; consider using capacity in a region that is currently quiet.
- Memcached clients are aware of servers and keys.
- Windows Azure cache knows Azure, cache is deployed as a worker role.
- Partitioning is driven by server, has high avail option and perf monitor counters.
- Can add instances, auto handles it, but cannot remove easily.
- Dual write so reliable with small overhead. Does your app care, need cache hits?
- The importance of designing for insight, instrumentation, performance and reliability.
- Design for failure, part of the system being offline, ignore or queue, retry, backlog.
- Putting trace or logging config in a config file won't work in the cloud, need to design a remote config system.
- There is a good chance of long periods (minutes) of downtime per month while still being within SLA.
- Deal with it.
- More components, more chance of something being down.
- Hiccups, retry a few times, then mark as down.
- Node down, service down, entire region hit by act of God.
- CloudFX library, has retry policy, then throws a transient.
- RETRIES MUST HAVE RANDOM DELAY
- Retries should be coordinated with other retries stacking up: only one call retrying, and the others either queuing or failing completely without even trying.
- Semaphore around the retry resource, object.
- Load needs to be spread over regions.
- Route away from failures.
- Press Association deployed to 8 datacentres.
- Traffic Manager can route around poor performance: it gets the closest DC by IP, but routes elsewhere when that DC is performing badly.
- Location is not the same as IP latency, use IP latency.
- Traffic Manager has custom health probing in SDK.
- Queues duped in different regions, processors local, sucking from local and dupe queues.
- How quickly should I react to new insight?
- Do I know the question or am I exploring data?
- KPI, time series, scalar stat, trending, ratios.
- How much data is required to gain insight?
- Perf stats against app stats, like total users, active users.
- How much of the source signal do I need for insight?
- Local computation vs. global system computation?
- Requests queued is your most important metric.
- New Relic works on Azure by agent.
- OpsTera
- PagerDuty
- WAD Windows Azure Diagnostics
- WAD has challenges, won't give 3rd party diag, perf data is written to table storage with 60 time based partition key, and so IOPS is bottlenecked when monitoring many servers, have to turn down the sampling.
- Queue based means alerts can be slow to propagate.
- Stores are not very queryable, table store!
- Stores performance counter and application log data.
- General max throughput is 1000 entities per partition per table per account.
- Same cap on the out.
- Split data by history and realtime, push to a logging service that splits.
- High value: filter, aggregate, publish; anything written is actionable; alerts, dashboards, operational intelligence.
- High volume: batch, partition, archive; trends, root cause, mining.
- WAD is very configurable; verbose written to file and then forwarded to blob storage. Blob storage can sustain this sort of load up to 1000 instances per storage account.
- Keep storage accounts separate for instrumentation data.
- Create a custom data source in WAD, monitoring a folder, if I put the file here, you put the file there.
- Log4Net: Rolling files is all you need, do all async writes.
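
The notes refer to Log4Net; as a rough analogue only, the same "rolling files plus async writes" setup looks like this with Python's standard `logging` module. The logger name, file sizes and temp-directory path are assumptions. Callers put records on an in-memory queue and a background listener thread does the file I/O.

```python
import logging
import logging.handlers
import os
import queue
import tempfile


def build_async_rolling_logger(path, max_bytes=1_000_000, backups=5):
    """Return (logger, listener): records are queued, a thread writes them."""
    file_handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    log_queue = queue.Queue(-1)  # unbounded, so callers never block on I/O
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("app.instrumentation")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener


log_path = os.path.join(tempfile.mkdtemp(), "app.log")
logger, listener = build_async_rolling_logger(log_path)
logger.info("request queued")
listener.stop()  # drains the queue and joins the writer thread
```

The rolled files are then what a folder-monitoring data source would pick up and forward to blob storage.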
- Affinity Groups, under settings; group resources as objects that work together, Azure provisions them to work together.
- Availability Sets ensure resources do not get shut down together (updates, outages per rack).
- Virtual Network; vNet and subnets and DNS servers, your playground, can have same addressing space because they're separate. Hard to change later/impossible.
- Can link to premises via perm VPN via hardware, or can put replicated AD or new AD and trust, can also point-site via Windows client.
- Can set AD server as DNS server for your vLan, though VMs must be DHCP assigned by Azure, though lease is like infinite.
- Can configure network infra in Azure via XML files.
- Don't put TempDB on local Azure disk anymore, Azure practices change fast.
- SQL Server Gallery Images have licensing implications; for Windows Server, your license is inclusive of time up. For MSSQL, this is the same.
- License mobility lets you move on-prem license to Azure, so use a vanilla Windows gallery image and load on.
- Can upload a VHD, even use SysPrep.
- Backup to cloud (from on-prem):
- Remove unused endpoints on the VM.
- Use virtual networks instead of public RDP ports to administer your VMs.
- Use VPN tunnel to connect to database servers.
- Carefully plan virtual networks to avoid re-configuration; have to tear down and rebuild everything if the network needs resizing.
- Use Availability Sets and Affinity Groups with VMs.
- Use mixed mode authentication when not in a domain; Windows mode is default but not always best idea.
- Add new port endpoint and add load balancing to it via the portal.
- Not sure if balancer is aware of downed node.
- Make sure Windows Update times are staggered to avoid downtime, even if in same Availability Group.
- Enable database connection encryption, not default.
- Run ALTER SERVICE MASTER KEY REGENERATE because gallery uses same image.
- Queues, part of Azure Messaging services.
- Topics, pub/sub event aggregator
- Relays
- Notifications
- With Azure queues, if the content of the message is not XML-safe, then it must be Base64 encoded. If you Base64-encode the message, the user payload can be up to 48 KB, instead of 64 KB.
- Each message is comprised of a header and a body. Cannot exceed 256 KB.
- Max concurrent TCP connections to a single queue 100 shared between senders and receivers, limit not imposed using REST.
- Queue size between 1 and 80 GB.
- Azure queues and Service Bus queues: 2,000 msg/s with 1KB.
- Azure queues: 10ms latency with no nagling.
- SB queues: 20-25ms.
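
The 48 KB versus 64 KB figure for Base64-encoded Azure queue messages follows directly from the encoding: every 3 raw bytes become 4 encoded characters. A quick check (the constant and function names are just for illustration):

```python
import base64

QUEUE_MESSAGE_LIMIT = 64 * 1024  # bytes, the Azure queue limit noted above


def max_raw_payload(encoded_limit):
    """Largest raw payload whose Base64 form fits in `encoded_limit` bytes."""
    # Base64 emits 4 output characters for every 3 input bytes.
    return encoded_limit // 4 * 3


raw_limit = max_raw_payload(QUEUE_MESSAGE_LIMIT)
encoded = base64.b64encode(b"\x00" * raw_limit)
```

Here `raw_limit` works out to 49,152 bytes (48 KB), and encoding a payload of exactly that size yields 65,536 bytes (64 KB).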
- For decoupling, load leveling, scale out.
- Topics allow for:
- Broadcast and partition
- Content based routing
- Messaging Patterns
- Message Lock Renewal, for slow processing.
- Entity queries, in C# and REST, see code example below.
- Forward Messages between entities, trees of queues composed together for supporting 1000s topics, topic forwards to 100 topics, each forwards to 100 etc.
- Batch APIs
- Browse sessions
- Updating entities, enable/disable
- ConnectionString config file key based setup supported.
- Scalable, cross platform, push notification.
- Shared Access Secrets (SAS key), namespace and entity level, via C# or Azure portal, regen/revoke keys.
- Auto-delete Idle Entities, clean up idle topic, idle sub clients auto clean, good for auto scale down cleaning up subs not used, or test debris.
- Event-Driven Model, to remove hardship of writing correct receive loop, now SDK can have observers for receive, exception.
- Task-based Async API
- Browsing Messages
- AMQP, JP Morgan standardised messaging protocol.
- Paired Namespaces
- Max 12 rules per entity.
- Before this, needed to use "Users" or AD federation.
- For querying when you have many queues, topics etc.
- Use case: filter for unused queues.
- Efficient, binary.
- Reliable, fire forget, exactly once delivery
- Portable data representation
- Flexible, client-client, client-broker, broker-broker
- Broker-model independent
- See also Blobs, Drives, Azure Queues, Files.
- Primary and secondary access keys (also now supports direct REST access)
- Data items called 'entities'
- Fixed PartitionKey, RowKey and Timestamp properties
- 252 additional properties of any name, schemaless.
- PK and RK form clustered index.
- AtomPub REST and .NET APIs
- .NET uses the concept of a context and changes are made to the context and saved, can thus be batched/transaction. Similar entity change tracking to EF.
- Null values are ignored by storage engine.
- Queries are begun using context.CreateQuery and look like EF Linq queries.
- Scanning a part or range of parts done using .CompareTo("Key") >= 0
- Use a new context for each op, context object is not thread safe.
- Can use IgnoreResourceNotFoundException and use null return to avoid exception overhead on empty lookup 404.
- Scans depend on row size, not just rows in partition, rows in where set.
- Research whether best to run a single query spanning range of parts, vs. running concurrent queries on each part?
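
The `.CompareTo("Key") >= 0` idiom is an ordinal string range: lower bound inclusive, upper bound exclusive. A sketch of the same semantics over an in-memory sorted list (the entities and the month-style partition keys are invented; the real query runs server-side):

```python
from bisect import bisect_left

# Hypothetical entities as (PartitionKey, RowKey), kept sorted like the
# clustered PartitionKey + RowKey index.
ENTITIES = sorted([
    ("2014-01", "a"),
    ("2014-02", "b"),
    ("2014-02", "c"),
    ("2014-03", "d"),
    ("2015-01", "e"),
])


def scan_partitions(lower, upper):
    """Return entities with lower <= PartitionKey < upper (ordinal compare)."""
    keys = [partition_key for partition_key, _ in ENTITIES]
    return ENTITIES[bisect_left(keys, lower):bisect_left(keys, upper)]


spring = scan_partitions("2014-02", "2014-04")
all_2014 = scan_partitions("2014", "2015")
```

Because the comparison is on strings, a prefix such as "2014" works as a lower bound for every key that starts with it, which is why zero-padded, sortable key formats matter.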
- Partitions served from single server.
- Avoid hot partitions, unbalanced schemes.
- See "Lessons Learned" above for tips on shard key mapping algos.
- Row size: 1MB
- 200TB per table
- 1,000 rows per query response, use continuation token, no snapshot consistency.
- 500TB per storage account.
- 20,000 entities or messages/second per account.
- Bandwidth: 10 Gbit/s in, 20 Gbit/s out for geo-redundant; 20 Gbit/s in, 30 Gbit/s out for locally redundant.
- 2,000 entities/second per partition.
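
The 1,000-rows-per-response limit means any full read is a loop over continuation tokens. A minimal sketch with an in-memory list standing in for the table; `query_page` is a hypothetical stand-in for one service call, and its integer token is an assumption (real tokens are opaque).

```python
PAGE_SIZE = 1000  # rows per query response, from the limit above


def query_page(rows, continuation=None):
    """Stand-in for one query call: returns (page, next_token_or_None)."""
    start = continuation or 0
    page = rows[start:start + PAGE_SIZE]
    has_more = start + PAGE_SIZE < len(rows)
    return page, (start + PAGE_SIZE if has_more else None)


def query_all(rows):
    """Follow continuation tokens until the service stops returning one."""
    token = None
    while True:
        page, token = query_page(rows, token)
        yield from page
        if token is None:
            break


table = list(range(2500))
everything = list(query_all(table))
```

The loop keys off the token rather than the page length. With no snapshot consistency, rows inserted or deleted between pages can be missed or seen out of order.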