Throughout the week, I read a lot of blog-posts, articles, etc., that have to do with things that interest me:
- data science
- data in general
- distributed computing
- SQL Server
- transactions (both db as well as non db)
- and other “stuff”
This is the “roundup” of the posts that have been most interesting to me during the week just gone by.
SQL Server
- Unsung SQLOS: The SOS_UnfairMutexPair behind CMEMTHREAD waits. Ewald has fired up WinDbg to look at the synchronization object behind memory allocations: SOS_UnfairMutexPair. Very, very interesting if you are into SQL Server’s plumbing.
- Creating Self Building SQL Server Data Tools Pipelines Using Jenkins and GIT. Interesting article about how to create a self-building pipeline for SQL Server or, more accurately, for SQL Server Data Tools projects.
- SQLskills SQL101: Query plans based on what’s in memory. Paul Randal from SQLskills explains that the SQL Server query optimizer doesn’t know or care what’s in the buffer pool when creating a query plan.
Streaming
- Corfu: A distributed shared log. Adrian Colyer covers Corfu, a global log that clients can append to and read from over a network. It seems like a cross between Kafka and an in-memory database.
Distributed Computing
- In-memory Caching: Curb Tail Latency with Pelikan. A presentation about Pelikan - a framework to implement distributed caches such as Memcached and Redis.
Data Science
- End-to-End Scenarios Enabled by the Data Science Virtual Machine: Webinar Video. Blog-post pointing to a webinar about end-to-end scenarios you can do with the Microsoft Data Science Virtual Machine (DSVM). I really need to find some time and do some work with DSVM.
- rxExecBy Insights on RxSpark Compute Context. rxExecBy is a function introduced in Microsoft R Server 9.1. It can be used to partition an input data source by keys and apply a user-defined function to the individual partitions. It comes in very handy when you have huge datasets: you can partition the dataset into many small partitions and train models on each partition. The blog-post shows examples of how it can be used on Spark.
- Performance: rxExecBy vs gapply on Spark. Above I pointed to a blog-post about rxExecBy. This post compares the performance of rxExecBy and gapply.
- Real-time scoring with Microsoft R Server 9.1. A blog-post by Revolution Analytics, discussing how you can do real-time scoring using Microsoft R Server 9.1.
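To make the partition-and-apply idea behind rxExecBy a bit more concrete, here is a minimal sketch in plain Python. It is not the rxExecBy API and it does not run on Spark; the names exec_by and fit_mean_model, as well as the tiny flights dataset, are made up purely for illustration. The "model" is just a per-partition mean, standing in for whatever you would really train.

```python
from collections import defaultdict
from statistics import mean


def exec_by(rows, key, func):
    """Partition rows by the value of `key`, then apply `func` to each partition.

    Hypothetical helper that only illustrates the pattern; it is not rxExecBy.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    # One result per key, computed independently of the other partitions.
    return {k: func(part) for k, part in partitions.items()}


def fit_mean_model(partition):
    """Stand-in for model training: record the row count and the mean delay."""
    return {"n": len(partition), "mean_delay": mean(r["delay"] for r in partition)}


# Made-up sample data, keyed by carrier.
flights = [
    {"carrier": "AA", "delay": 10.0},
    {"carrier": "AA", "delay": 20.0},
    {"carrier": "DL", "delay": 5.0},
]

models = exec_by(flights, "carrier", fit_mean_model)
for carrier, model in sorted(models.items()):
    print(carrier, model)
```

Since each partition is handled on its own, the per-key calls could in principle be farmed out to separate workers, which is roughly what makes the approach attractive for huge datasets.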
Shameless Self Promotion
- Microsoft SQL Server R Services - Internals V. I finally managed to finish the Internals - V post. In this post I cover how parallelism works in SQL Server R Services.
That’s all for this week. I hope you enjoy what I did put together. If you have ideas for what to cover, please comment on this post or ping me.