JNI Marshalling Performance: Single Round Trip

An Empirical Taste of the Cost of Kotlin to C++ JNI Communication on Android.



Much of my development experience has been Kotlin/Java on ART and C/C++ on Windows. With the Java Native Interface (JNI), there exists an opportunity to merge these two worlds. But Kotlin is a fun language and has answers to nearly all of a typical Android application's needs. It has the advantage of full access to the Android SDK. When creating a native UI that can easily be updated to the latest Material guidelines, Kotlin with Jetpack Compose is the obvious answer. When the need for a database arises, utilizing Android’s Room persistence library in Kotlin is a great choice. After all, Room is already an example of the best of both worlds, as it is essentially a very nice wrapper around a C++ SQLite database library. So...

Why C++?

Why might we want to use C++ at all? Some (intersecting) reasons are...

  • Potential performance gains.

    • Generally reduced overhead across the board.

      • No just-in-time compiler (JIT), no garbage collection, faster array element access, etc.

    • Access to SIMD/Vector operations.

      • ARM's Neon being the most relevant for Android development.

    • Memory management for better cache performance.

      • Despite ART being a memory-managed environment, you can still intelligently manage memory usage in Kotlin. However, not as easily, fully, or explicitly.

  • Existing C/C++ libraries.

    • Libraries with no functional equivalent available to Kotlin.

    • Libraries with functional equivalents available in Kotlin but which lack in performance.

  • (Greater) Access to low-level system components.

    • Digital Signal Processor (DSP)

    • Graphics Processor (GPU)

    • Hardware specific instruction set extensions (ex: Neon/SSE/AVX)

    • Direct memory access

But Using C In Android Isn’t Free

This article focuses on one question related to potential performance gains. When researching the JNI, you will quickly be warned that the cost of marshalling data across the JNI is non-negligible. Essentially, communication between the native side (C++) and the managed side (Kotlin) is expensive. The programmer should not only reduce *how much* data they send to one another but should also reduce *how often* they send data to one another. My question here is the simplest I could think of that would provide helpful insight into the cost of marshalling data across the JNI:

The Experiment

To understand the cost of a JNI call, we will measure the performance of a single round trip through the JNI: beginning in Kotlin on ART, calling out through the JNI to a native C++ function, and returning back through the JNI to Kotlin. The cost of this round trip will be measured across various integer array payload sizes (of both the arguments and return values).

Disclaimer

To the benefit of the author and reader, in this article… “C” and “C++” are interchangeable. And "Kotlin" can often be replaced with specifically "Kotlin/Java 8 running on the Android Runtime (ART)".

Refresher

Milliseconds (ms) = 10⁻³ seconds
Microseconds (µs) = 10⁻⁶ seconds
Nanoseconds (ns) = 10⁻⁹ seconds

Testing Environment

Iteration tests are utilized to acquire performance data. Each test runs until it goes through 100 consecutive iterations with no new minimum or maximum recorded measurement. Reported measurements are the minimums recorded by these iteration tests. Although imperfect, the justification for using minimum values is to ignore iterations where the system interferes with the execution speed of the tests. The experiments were handled with care, but the results are still flawed and require looking at the overarching trends of the data over any individual measurement.
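As a rough illustration, the iteration-test loop looks something like the following sketch (the name and structure here are illustrative, not the repository's actual harness):

// Kotlin: illustrative iteration-test harness (not the actual repository code)
inline fun measureMinNanos(action: () -> Unit): Long {
  var min = Long.MAX_VALUE
  var max = Long.MIN_VALUE
  var stableIterations = 0
  while (stableIterations < 100) {
    val start = System.nanoTime()
    action()
    val elapsed = System.nanoTime() - start
    if (elapsed < min || elapsed > max) {
      min = minOf(min, elapsed)
      max = maxOf(max, elapsed)
      stableIterations = 0 // new extreme recorded; reset the counter
    } else {
      stableIterations++
    }
  }
  return min // the reported measurement
}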

Three Android devices are used:



And these are the compiler settings used:

  • Android

    • Min SDK: 26

    • Target SDK: 34

    • NDK: 26.2.11394342

    • AGP: 8.4.0-alpha13

    • Build Type: Release

      • Minify enabled

      • Modified ProGuard to maintain native methods/classes

  • Kotlin

    • Compiler: 1.9.22

    • KTX: 1.12.0

  • C

    • Compiler: Clang 17.0.2 (included with NDK version listed above)

    • Flags: -O3

    • ABI: arm64-v8a

Before testing was performed, the devices were restarted, put on airplane mode, set up with a fresh install of the test app, disconnected from the computer, and plugged into a charger. Results are pulled from a .csv file that is written to external storage after all testing is performed.

For profiling, it makes little sense to use different timing methods on the C++ and Kotlin sides. So, for consistent timing, the most obvious option for a high-resolution timer is System.nanoTime() from Kotlin. A second option is creating a JNI wrapper around clock_gettime() from C++. An honest attempt at using clock_gettime() (no arguments/return values, with the @CriticalNative annotation) proved that System.nanoTime() had a smaller and more consistent overhead. This is no fault of clock_gettime(); the cost of the JNI communication itself simply diminished any potential gains. (Already learning the costs before any test results came in!)

The C++ library is loaded in the init{} of the custom Application class of the project to ensure it is ready at the earliest possible moment. Tests are run inside of a Kotlin coroutine, launched with Dispatchers.Default for CPU-bound work.
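For reference, that setup looks roughly like this (the class name, library name, and test entry point are assumptions, not the project's actual identifiers):

// Kotlin: illustrative setup (names are assumptions)
class TestApplication : Application() {
  init {
    System.loadLibrary("jnitests") // loaded at the earliest possible moment
  }
}

// Launched from an appropriate CoroutineScope
scope.launch(Dispatchers.Default) { // Dispatchers.Default for CPU-bound work
  runAllIterationTests()
}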

Iterative Performance Tests

System.nanoTime() Timer Overhead

The overhead of the timing method is measured with two consecutive calls to Java's System.nanoTime() and is simply the difference between the two values.
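In code, the measurement is as simple as it sounds:

// Kotlin: timer overhead is the delta of two back-to-back calls
val first = System.nanoTime()
val second = System.nanoTime()
val timerOverhead = second - first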



After looking at these times, it's clear they are rough measurements, but they are certainly good enough to be useful. As you may notice in later results, for the Samsung devices, System.nanoTime() can only provide time in roughly ~52ns increments. Hence why it was able to measure an elapsed time of zero nanoseconds.

No-Op / No-Argument / No-Return-Value JNI Call Overhead

What is the cost of a JNI round trip that carries no payload and does no work?

To answer this question completely, there are three performance-related variations in need of testing. Kotlin on Android offers two special annotations for JNI call performance: @FastNative and @CriticalNative. In short, @FastNative provides “faster JNI transitions from managed code to the native code and back” and “functions that access the managed heap or call managed code also have faster internal transitions”. The cost is that “garbage collection cannot suspend the thread for essential work and may become blocked”. @CriticalNative is similar but with even faster JNI transitions and an inability to access/return managed objects (primitives and arrays of primitives are still accessible).

This is how the functions look:

// C
void nopNormalC(JNIEnv* env, jclass){}
void nopFastC(JNIEnv* env, jclass){}
void nopCriticalC(){}
// Kotlin JNI Glue
@JvmStatic
external fun nopNormalC()

@FastNative
@JvmStatic
external fun nopFastC()

@CriticalNative
@JvmStatic
external fun nopCriticalC()

As you can see, nopNormalC() & nopFastC() are not truly without arguments, but this is as close as one can get. In Android, JNI functions must accept a JNIEnv* and a jclass argument unless they are tagged as @CriticalNative.

And here is the measured overhead of such calls:



Keeping in mind that this data is still very close to the limits of our timer overhead, and thus prone to significant precision errors, the results are as expected. Using @FastNative may shave upwards of 100ns off of the JNI transition, and moving from there to @CriticalNative may shave off upwards of another 60ns.

More accurate measurements may be available in the Android developers documentation, where they reportedly recorded transition overheads of 115ns for a normal JNI call, 35ns for @FastNative, and 25ns for @CriticalNative, all measured on some unspecified device in 2016.

None of these measurements are significant for a single JNI function called one time in an application, but they can become non-negligible if the JNI function is called many times within a critical loop. It’s also worth re-stating that @FastNative doesn’t only improve this measured transition speed but also improves performance when accessing the managed heap or managed code from the native side. So the advantages of @FastNative measured here are only the floor of its potential savings over a non-annotated JNI function.

No-Op With Int Array Parameter

What is the cost of a JNI round trip that receives an int array argument but which does no work?

Here is how the code looks:

// C
void nopIntArrayC(JNIEnv* env, jclass, jintArray javaIntArrayHandle){
  jsize size = env->GetArrayLength(javaIntArrayHandle);
  jint* body = env->GetIntArrayElements(javaIntArrayHandle, NULL);
  env->ReleaseIntArrayElements(javaIntArrayHandle, body, 0);
}
// Kotlin JNI Glue
@FastNative
@JvmStatic
external fun nopIntArrayC(nums: IntArray)

And the results:



Technically, the C code isn’t entirely a no-op, but these function calls are the bare minimum required to use an integer array argument. Compared to the no-op & no-argument JNI calls, there is certainly a cost to simply acquiring an integer array as a parameter. And this cost seems to rise as the integer array gets bigger. In some comparisons, the cost of this no-op convincingly triples.

But, on every device, the additional cost drops off when the array parameters grow to ten thousand elements or more. It seems it is cheaper to send one million integers than it is to send one. No matter the size of the array, the same three functions are being called. So why does the JNI cost grow until a steep drop off at ten thousand elements?

Integer Array Parameter isCopy?

In the last example, we sent NULL as the second argument to GetIntArrayElements(), which is perfectly valid and a common use case. This second argument is jboolean* isCopy and can serve as a second return value from the function. This boolean indicates whether ART has returned a pointer to a copy of the array or whether the returned jint* points to the actual memory that ART uses to store the underlying data of the Kotlin IntArray. Investigating how this boolean changes with respect to different-sized arrays sheds light on the unexpected results seen above.

By slightly modifying the code from above, we can return that boolean for analysis.

// C
jboolean arrayArgIsCopyC(JNIEnv* env, jclass, jintArray javaIntArrayHandle){
  jboolean isCopy = false;
  jint* body = env->GetIntArrayElements(javaIntArrayHandle, &isCopy);
  env->ReleaseIntArrayElements(javaIntArrayHandle, body, 0);
  return isCopy;
}
// Kotlin JNI Glue
@FastNative
@JvmStatic
external fun intArrayArgIsCopyC(nums: IntArray): Boolean

And the results:



As shown, there exists a threshold in the size of an integer array argument (between 1,000 & 10,000) where native code accessing that array via GetIntArrayElements() transitions from receiving a pointer to a copy of the integer array to receiving a pointer to the actual underlying data of the IntArray in managed memory land. Since copying an array takes more time than not copying an array, the JNI overhead for the no-op function became more costly when receiving one thousand integers than when receiving one million.
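With the glue defined above, probing any given device for this threshold is straightforward; a minimal sketch:

// Kotlin: probe where GetIntArrayElements() stops handing back a copy
for (size in intArrayOf(1, 10, 100, 1_000, 10_000, 100_000, 1_000_000)) {
  println("size=$size isCopy=${intArrayArgIsCopyC(IntArray(size))}")
}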

UPDATE: It seems the answer to this question has to do with how the ART created the IntArray at inception on the Kotlin side. The array is not, on the fly, determined to be too large for a copy. The “GetPrimitiveArray()” code in the Android project can be viewed here. The most relevant code being:

if (Runtime::Current()->GetHeap()->IsMovableObject(array)) {
  if (is_copy != nullptr) {
    *is_copy = JNI_TRUE;
  }
  // make and return copy
} else {
  if (is_copy != nullptr) {
    *is_copy = JNI_FALSE;
  }
  // return pointer to underlying array data
}

Here, IsMovableObject(array) is determined by whether the Space (memory allocated for managed objects) in which the integer array lives CanMoveObjects(). And that was ultimately determined when ART allocated the IntArray into some specific Space (maybe a LargeObjectSpace? One would have to do more digging).

Sum of an Int Array

With these baseline measurements out of the way, we can start comparing the performance of work accomplished via a round trip to native through the JNI and pure Kotlin.

We’ve measured doing no work, now we want to actually interact with the array in some way. Keeping things simple, how might the performance compare when acquiring a sum?

The code:

// C
jint sumC(JNIEnv* env, jclass, jintArray javaIntArrayHandle){
  jsize size = env->GetArrayLength(javaIntArrayHandle);
  jint* body = env->GetIntArrayElements(javaIntArrayHandle, NULL);
  jint sum = 0;
  for(jsize i = 0; i < size; i++) {
    sum += body[i];
  }
  return sum;
}
// Kotlin JNI Glue
@FastNative
@JvmStatic
external fun sumC(nums: IntArray): Int
// Kotlin: C-Style
var sum = 0
for(i in 0..intArray.lastIndex) { sum += intArray[i] }
// Kotlin: IntArray.sum()
val sum = intArray.sum()

Results:



Individual performance readings of this test seem to shine a light on the imperfections of the testing strategy, but as a whole, I think it tells a very convincing story.

It seems that for just about any size of integer array, the native code is either significantly faster or just about breaking even. We’ve gone over what might be slowing it down at the small end (extra array copy & JNI transition cost). But why does performance also seem to drop at the large end? There are a couple of things we need to investigate first. For starters, why was C faster at all?

To answer that, we’ll dive into some assembly!

// Simplified C Code
int sumC(int* body, int size){
  int sum = 0;
  for(int i = 0; i < size; i++) {
    sum += body[i];
  }
  return sum;
}
// ARM Assembly generated with Clang 17.0.1 (-O3)
.LBB0_5: // loop jump label

// ldp = load a pair of registers
// q2, q3 = into two quadword (128-bit) registers q2 & q3
// from the address in register x8 minus 16
ldp q2, q3, [x8, #-16]

// add 32 bytes (128-bit times two) to address register x8
add x8, x8, #32

// subtract 8 from loop array index stored in register x11
subs x11, x11, #8

// add register values in v2.4s and v0.4s and store in v0.4s
// where v2.4s is our loaded quad word register q2 interpreted
// as 4 signed 32-bit numbers. And v0.4s is a running sum.
add v0.4s, v2.4s, v0.4s

// [see comments above]
add v1.4s, v3.4s, v1.4s

// branch to beginning of loop if subs didn’t produce 0
b.ne .LBB0_5

// add rolling sum register v1.4s into rolling sum v0.4s
add v0.4s, v1.4s, v0.4s

// fold by adding rolling sum quadword into single 32-bit register s0
addv s0, v0.4s

… // individually add remaining values, if array was not divisible by 8

You can check it out yourself in Compiler Explorer (Godbolt) too!

In summary, the assembly code above is performing vector additions of 8 values per loop iteration as opposed to just the one addition that was originally written. This is the reason performance skyrockets so quickly for compiled C in this summation test.

So, if C is capable of performing 8 additions per loop iteration, how does Kotlin ever catch up?

SIMD Waiting on the D

The iterative performance tests tend to leave the cache in a favorable state for whatever is being tested. For example, if a test calculates the sum of one thousand numbers for the first time, the data backing the array might not be cached, it might be in a slower cache, or it could be in a mix of the two. But if that data is immediately used afterward to calculate the sum again, those numbers will almost surely be primed and ready in the fastest cache available to the CPU core running the test. But that isn’t true for every quantity of data.

Once the number of integers, or the size of the backing data, reaches a certain threshold, it can no longer fit in the fastest cache available to the core. As the amount gets larger, eventually the data doesn’t fit in even the slowest cache and can only be fetched from main memory (RAM). As far as I know, this is valid for any system aside from maybe an embedded system in an automatic cat feeder. (However, there is a uniqueness to how Android handles memory too large for even the system’s RAM. You can read more about zRAM, memory-mapped files, and other memory strategies used by Android in the Android developers documentation here.)

So another thing to ask is what are the cache thresholds and how do they compare to the size of our integer arrays?

For simplicity, let’s only analyze the cache sizes of the Samsung Galaxy S10+ and its Snapdragon 855’s CPU cores.

Integers in Kotlin are 4 bytes in length. Below is the size of the underlying data of the integer arrays in bytes, kibibytes, and mebibytes.
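The arithmetic is simply the element count times four bytes; a quick sketch before the table (sizes ignore the array object's header):

// Kotlin: underlying data size of an IntArray (Int.SIZE_BYTES == 4)
fun intArrayDataBytes(elementCount: Int): Long = elementCount.toLong() * Int.SIZE_BYTES
// 1_000 ints = 4_000 B ≈ 3.9 KiB
// 100_000 ints = 400_000 B ≈ 390.6 KiB
// 1_000_000 ints = 4_000_000 B ≈ 3.81 MiB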



And the cache sizes on the Samsung Galaxy S10+…



The cache on this device isn't unusual for a heterogeneous multi-core processor. The L1 cache is per-core. The L2 cache is per-core as well, but there are microarchitectures where the L2 cache is shared across an entire CPU cluster. The L2 cache sizes also vary across clusters. The L3 cache is shared among all CPU clusters/cores. And main memory (RAM) is shared amongst the whole system (CPU, GPU, DSP, other processors on SoC).

So what percentage of our caches is the integer array consuming?



As shown above, our L1 data cache serves us perfectly well all the way up to ten thousand integers. At 100,000 integers the L1 cache can no longer store the array. Nor can the L2 on one of the lower-end cores. And by one million integers, not even our L3 cache is big enough.

By one million integers, our iterative tests are not priming our caches at all. In fact, due to the Least Recently Used (LRU) nature of caches, each iteration test is flooding all cache levels with data that will never have a chance to be utilized. This flooding of the cache levels is felt on both sides of the JNI barrier.

So if C is performing faster than Kotlin at 100 thousand integers, and both are slowed down at 1 million integers, how does Kotlin manage to catch up? I mean, regardless of how hot the cache is, compiled C is still crunching 8 integer adds for every 1 integer add in Kotlin, right?

Well… effectively, no. Regardless of the cache, C is still adding 8 integers per loop, that is true. And Kotlin is still only adding 1 integer per loop, that is also true. But the data for these additions, the eight integers for C and the one integer for Kotlin, has to come from main memory. And, compared to the speed at which a CPU can perform integer addition, acquiring data from main memory is terribly slow.

To explain why this slows both C++ and Kotlin down to the same speed, let me paint a picture (with bogus numbers) to better describe the essence of the problem.

Memory flows from RAM to registers, and from registers to RAM, in fixed-size chunks called “cache lines”. A typical size for a cache line is 64 bytes.

Let’s say we can transfer one full cache line from RAM to the registers for use every 200 cycles. A cache line, 64 bytes, is equivalent to sixteen 32-bit integers. So every 200 cycles, we have access to 16 integers. Now, let’s say C takes only 10 cycles to do 8 integer adds. That means after each cache line is received, it takes C 20 cycles to process. Let’s also say Kotlin takes 10 cycles to do just 1 integer add. That means after each cache line is received, it takes Kotlin a whopping 160 cycles to process. If only processing 64 bytes, C takes a total of 220 cycles and Kotlin takes a total of 360 cycles. But the additions and the fetching of the next cache line can actually be done in parallel.

So, if processing 128 bytes (2 cache lines) in C, C must first wait 200 cycles to receive the first cache line. Then it simultaneously requests the next cache line and starts processing the first cache line. After 20 cycles, it finishes processing the first cache line. After another 180 cycles, it receives the second cache line. Finally, after another 20 cycles, it finishes processing the second cache line, for a total of 420 cycles. Now let's say we want to process 64,000 bytes (1,000 cache lines) of data. Given the logic described previously, C would complete in 200,020 cycles and Kotlin would complete in 200,160 cycles, with C being a mere 0.07% faster. The speed of acquiring the data from RAM to CPU registers has suddenly become a choke point.
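Generalizing the toy model: when fetching and processing overlap, each cache line effectively costs the larger of the two, plus the smaller one once for the final line. A sketch with the same bogus numbers:

// Kotlin: toy model of pipelined cache-line fetching + processing
fun totalCycles(lines: Long, fetchPerLine: Long, processPerLine: Long): Long =
  maxOf(fetchPerLine, processPerLine) * lines + minOf(fetchPerLine, processPerLine)

val cCycles = totalCycles(1_000, 200, 20)        // 200,020 cycles
val kotlinCycles = totalCycles(1_000, 200, 160)  // 200,160 cycles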



It does not matter how many additions compiled C code can do per loop iteration when it does not have integers to add. At one million integers, both Kotlin and native C are suffering from this same problem and this problem is reflected in the summation performance test results.

Copy Int Array

So far, we have only looked at single return values from the native to the managed environment. So what about returning various sizes of data?

Code:

// C
jintArray copyIntArrayC(JNIEnv* env, jclass, jintArray javaIntArrayHandle){
  jsize size = env->GetArrayLength(javaIntArrayHandle);
  jint* body = env->GetIntArrayElements(javaIntArrayHandle, NULL);
  jintArray result = env->NewIntArray(size);
  env->SetIntArrayRegion(result, 0, size, body);
  env->ReleaseIntArrayElements(javaIntArrayHandle, body, 0);
  return result;
}
// Kotlin JNI Glue
@FastNative
@JvmStatic
external fun copyIntArrayC(nums: IntArray): IntArray
// Kotlin
val copy = intArray.copyOf()

Results:



Most of this is as expected. The smaller array size measurements are being dominated by the overhead of JNI communication (transition + extra copy). But at 10,000 or more elements, it seems that C breaks even or performs slightly better on all devices.

Since this test is about making a second copy, the working memory for the test has doubled. This means that if the compiled C code utilizes vector operations like the summation test did, it will be hindered by a colder cache at smaller-sized array arguments: the L1D$ is already flooded by 10,000 integers, and the L2 cache is guaranteed to be flooded by 100,000 integers.

Incrementing Int Array Elements

What is the performance of modifying the integer array argument?

// C
void plusOneC(JNIEnv *env, jclass, jintArray javaIntArrayHandle) {
  jsize size = env->GetArrayLength(javaIntArrayHandle);
  jint *body = env->GetIntArrayElements(javaIntArrayHandle, NULL);
  for (jsize i = 0; i < size; i++) {
    body[i] += 1;
  }
  env->ReleaseIntArrayElements(javaIntArrayHandle, body, 0);
}
// Kotlin JNI Glue
@FastNative
@JvmStatic
external fun plusOneC(nums: IntArray)
// Kotlin: In-Place
for(i in intArray.indices) {
   intArray[i] += 1
}
// Kotlin: IntArray.map()
// UPDATE: Explicit return type
val incrementedIntArray: List<Int> = intArray.map { it + 1 }

Results:



Before anyone is upset, IntArray.map() was not measured as a functional equivalent to the other methods and, as such, should not be held against Kotlin. But it is worth investigating how map() compares, as it is not an uncommonly used method. And it seemed worthy of presenting. For some reason, it takes nearly a whole frame of a 60fps application to create an incremented array of a million integers. We saw earlier that copying an array isn’t nearly that expensive on either side of the JNI barrier. And we see here that simply adding 1 to all elements does not take that long either. My guess is that the time is lost on the overhead of calling map()’s transform lambda a million times.

UPDATE: The transform lambda argument sent to IntArray.map() should be inlined, so we shouldn’t have to pay a penalty for it. The biggest false equivalency with comparing map() is that it actually returns a List<Int> (specifically an ArrayList<Int>) and not an IntArray. Creating and updating an ArrayList<Int> has different (and more expensive) costs than an IntArray, as each element must be boxed into an Int object.
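If one wanted a closer functional-style equivalent, the result can be built directly as an IntArray, avoiding both the ArrayList and the boxing (a sketch, not one of the measured variants):

// Kotlin: returns an IntArray rather than a boxed List<Int>
val incrementedIntArray: IntArray = IntArray(intArray.size) { i -> intArray[i] + 1 }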

That aside, the results are interesting. For starters, the tests at the small end run so fast that they are at the limits of the timer and are essentially garbage. They also further demonstrate that the return values of System.nanoTime() on the Samsung devices can only provide timestamps in ~52 nanosecond increments.

Overall, this information is very reminiscent of the summation performance test results. As soon as extra copies are no longer created by GetIntArrayElements(), it is only a benefit to use C to increment an entire array.

Sorting an Array

This last experiment is not a perfect apples-to-apples comparison but one that shows the potential of comparing real work done on a varying-sized integer array. I have chosen to use the default sorting algorithms available on each platform: specifically, std::sort() in C++ and IntArray.sort() in Kotlin. std::sort() uses a hybrid sorting algorithm called Introsort, which is a fusion of quicksort, heapsort, and insertion sort, whereas IntArray.sort() is defined to use Dual-Pivot Quicksort according to the documentation.

The code:

// C
void sortC(JNIEnv* env, jclass, jintArray javaIntArrayHandle){
  jsize size = env->GetArrayLength(javaIntArrayHandle);
  jint* body = env->GetIntArrayElements(javaIntArrayHandle, NULL);
  std::sort(body, body + size);
  env->ReleaseIntArrayElements(javaIntArrayHandle, body, 0);
}
// Kotlin JNI Glue
@JvmStatic
external fun sortC(nums: IntArray)
// Kotlin
numbers.sort()

Results:



This test is far from perfect but it is another useful data point amongst the others. It is an example of the performance benefits one may acquire utilizing C for heavier workloads on large amounts of data. None of the other test results maintain as strong of a performance multiplier at the larger end of the spectrum as these sort performance test results.

Conclusion

One appealing aspect of using C in Android is the potential performance gains. However, communication between Kotlin and C through the JNI is not free. A rough number for the base cost of a JNI call is estimated anywhere from ~0 to ~104ns on modern devices. Sending an integer array can tack on a base cost upwards of 300ns on modern devices. An integer array argument smaller than 10,000 elements may be received in C as a copy, which has additional cycle and memory costs. Clang may compile your C code to use vector operations that can produce incredible performance gains. However, vector operations may struggle to provide benefits when the cache is not primed and ready, or when it is utilized poorly due to the structure of a specific solution.

These iterative tests, though highly flawed, have aided in analyzing the cost of sending various sizes of integer array data across the JNI, improving our understanding of the impact JNI communication has on potential performance gains. The results seem to indicate that the larger the integer arrays, the less chance there is of any significant performance impact from the JNI communication. The data collected suggests that processing an integer array larger than 1,000 elements has great odds of benefiting from the potential performance improvements compiled native code has to offer.

Imperfections of the Experiment

  • GetPrimitiveArrayCritical() is an alternative to GetIntArrayElements() that will make stronger attempts at avoiding the copying of array data sent through JNI to C.

  • “Marshalling” may be too strong a word for what is tested in this article, as sometimes a pointer to the actual backing integer array data on ART is simply handed over to C.

  • Direct byte buffers would probably be the route you want to take when efficiently sending data across the JNI barrier (see the sketch after this list). At that point, the cost of marshalling the data is put more directly into the hands of the developer, not the ART.

  • Each device tested has eight CPU cores, not all of which are created equal even on the same device, with individual frequencies that fluctuate over time depending on the circumstances of the device. Testing based on nanoseconds is highly flawed for this reason, as we don’t know which core a test ran on or what the clock rate was at the time of the test.

  • The iterative performance testing approach will have the cache in an ideal state for most tests performed in this article. Cache misses will surely be much higher when run alongside the rest of a codebase and may greatly impact real-world results.

  • All tests are single-threaded. Multi-threading could boost the performance of both native and managed code.

  • For some number crunching, it could be a waste of time to go any route other than GPU compute with Vulkan, OpenGL ES, OpenCL, etc. Some of which may already have accessible wrappers for Kotlin.

  • For other problems, like image, audio, and video processing, it could be a terrible idea to go with any solution that doesn’t utilize a device’s Digital Signal Processing hardware. However, using C is often required to access this additional specialized hardware.
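As a taste of the direct byte buffer route mentioned above: a direct ByteBuffer is allocated outside the managed heap and can be handed to native code without an ART-managed copy. A minimal sketch, where processDirectBufferC() is a hypothetical native function (on the C side, the address would be retrieved with GetDirectBufferAddress()):

// Kotlin: direct byte buffer sketch (processDirectBufferC is hypothetical)
@JvmStatic
external fun processDirectBufferC(buffer: ByteBuffer) // java.nio.ByteBuffer

val count = 1_000_000
val buffer = ByteBuffer.allocateDirect(count * Int.SIZE_BYTES)
  .order(ByteOrder.nativeOrder()) // match the device's endianness
for (i in 0 until count) buffer.putInt(i)
processDirectBufferC(buffer) // native side reads the memory directly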

Future Experiments

  • Explore JNI communication costs beyond primitive array data, like class members & methods, or Strings.

  • Performance testing of equivalent C/Kotlin implementations of heavier workloads on the same data.

  • Java Microbenchmark Harness (JMH) or Android's Microbenchmark library could be valuable as microbenchmarking tools.

  • Kotlin Native uses an LLVM-based back-end, just like the Clang C compiler, implying that it may have the potential to offer the same or similar results as compiled C/C++. However, Kotlin Native does not circumvent the JNI. It would still have to communicate with managed code through the JNI just as C/C++ does, paying the same marshalling and overhead costs.

  • Intelligently using SIMD through Neon/SSE/AVX extensions may give better results for C++. However, as shown, the Clang compiler already utilizes ARM's vector instructions quite nicely.

  • This experiment only dealt with the cost of a single round trip of various payload sizes. It may be worth exploring the impacts of many small payloads being sent across the JNI boundary. However, that information may be inferred simply from the measured JNI overhead costs.

  • Exploring the performance of JNI communication using direct byte buffers.

  • Use Google’s FlatBuffers to test the performance of serializing complex objects and sending them through the JNI as direct byte buffers.

  • Determine the exact threshold in the size of an integer array argument in which the ART switches from providing a copy to providing the actual underlying data.

  • Java 22 introduced a Foreign Function & Memory API (FFM) which may be worth testing independently if curious about JNI performance on other platforms or if FFM ever becomes available on Android.

References

The Code

https://github.com/Lucodivo/JNIMarshalingPerformanceTesting/tree/1.0

  • Note: Some good and simple concepts like DRY are completely ignored in favor of removing any additional overhead from the tests. This repository should not be interpreted as a model for good programming methodologies.
