Understanding Memory Replays and In-Flight Requests


Tag: caching,cuda

I'm trying to understand how a matrix transpose can be faster when reading naively by columns rather than by rows (the example is from Professional CUDA C Programming). The matrix is stored in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2)

__global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) {
    unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < nx && iy < ny) {
        out[iy*nx + ix] = in[ix*ny + iy];    // read by columns (strided load), write by rows
        // out[ix*ny + iy] = in[iy*nx + ix]; // transposeNaiveRow: read by rows, write by columns
    }
}

This is what I don't understand: the load throughput for transposeNaiveCol() is 642.33 GB/s, and for transposeNaiveRow() it is 129.05 GB/s. The author says:

The results show that the highest load throughput is obtained with cached, strided reads. In the case of cached reads, each memory request is serviced with a 128-byte cache line. Reading data by columns causes each memory request in a warp to replay 32 times (because the stride is 2048 data elements), resulting in good latency hiding from many in-flight global memory reads and then excellent L1 cache hit ratios once bytes are pre-fetched into L1 cache.

My question: I thought that aligned/coalesced reads were ideal, but here it seems that strided reads improve performance.

  1. Why is reading a cache line conducive to reduced performance in this case?
  2. Aren't replays in general a bad thing? It mentions here that it results in "good latency hiding".


Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will typically report a lower effective load throughput than the equivalent non-coalesced kernel, because the non-coalesced version moves many more bytes to deliver the same data. That alone says nothing about its execution time: in the end, the one metric that really matters is the wall clock time that your kernel takes to complete, of which the authors make no mention.

That being said, kernels usually fall into two categories: compute bound and I/O (memory) bound.

Matrix transpose has very low compute intensity, so it is I/O bound, and to get better performance you should try to increase bandwidth usage.

Why is the column transpose better at maximizing bandwidth usage?

In the case of the row transpose, reads are coalesced: a single 128-byte transaction is served per warp, that is, 4 bytes per thread. Those 128 bytes are put in cache but are never reused, so the cache is effectively of no use in this case.

In the case of the column transpose, reads are not coalesced: each warp is served 32 transactions of 128 bytes, all of which go into L1 and are reused for the next 31 replays (assuming they are not evicted from the cache in the meantime). That means very low load efficiency, but very high effective load throughput and maximal cache usage.

You could of course get the same effect in the row transpose by simply requesting more data per thread (for example by loading 32 floats or 8 float4s per thread) or by using CUDA's prefetch capabilities.

