
CudaMiner release February 28th 2014 - Speedup release
-------------------------------------------------------------

***************************************************************
If you find this tool useful and like to support its continued 
          development, then consider a donation.

   LTC donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm
   BTC donation address: 16hJF5mceSojnTD3ZTUDqdRhDyPJzoRakM
   YAC donation address: Y87sptDEcpLkLeAuex6qZioDbvy1qXZEj4
   VTC donation address: VrjeFzMgvteCGarLw85KivBzmsiH9fqp4a
   MAX donation address: mHrhQP9EFArechWxTFJ97s9D3jvcCvEEnt
  DOGE donation address: DT9ghsGmez6ojVdEZgvaZbT2Z3TruXG6yP
 PANDA donation address: PvgtxJ2ZKaudRogCXfUMLXVaWUMcKQgRed
   MRC donation address: 1Lxc4JPDpQRJB8BN4YwhmSQ3Rcu8gjj2Kd
***************************************************************

>>> Introduction <<<

This is a CUDA accelerated mining application for most of your
AltCoin mining needs. Your nVidia cards can be very efficient
miners - don't believe the AMD fanboy myth that nVidia cards
suck at mining! We've recently had a watercooled 780Ti break
900 kHash/s at scrypt (N=1024) mining.

This application currently supports
1) scrypt mining with N=1024 (LiteCoin and many, many clones)
2) scrypt-jane mining (Yacoin and several clones)
3) scrypt mining with larger N (VertCoin)
4) NEW: MaxCoin mining (SHA-3 i.e. Keccak256)  <<< SEE MAXCOIN SPECIFICS SECTION

You should see a notable speed-up compared to OpenCL based miners.

We're not supporting Quark, ProtoShares (Momentum) or any other
highly specialized "CPU-only" coin. And certainly no BitCoin: 
This train has left the station quite some time ago!


>>> Command Line Interface <<<

This code is based on the pooler cpuminer 2.3.2 release and inherits
its command line interface and options.

  -a, --algo=ALGO       specify the algorithm to use (default is scrypt)
                          scrypt       scrypt Salsa20/8(1024, 1, 1), PBKDF2(SHA2)
                          scrypt:N     scrypt Salsa20/8(N, 1, 1), PBKDF2(SHA2)
                          scrypt-jane  scrypt Chacha20/8(N, 1, 1), PBKDF2(Keccak)
                          scrypt-jane:Coin
                                       Coin must be one of the supported coins.
                          scrypt-jane:Nfactor
                                       scrypt-chacha20/8(2*2^Nfactor, 1, 1)
                          scrypt-jane:StartTime,Nfmin,Nfmax
                                       like above, but Nfactor is derived from Unix time.
                          sha256d      SHA-256d (don't use this! No GPU acceleration)
                          keccak       Keccak256 as used in MaxCoin
  -o, --url=URL         URL of mining server (default: " DEF_RPC_URL ")
  -O, --userpass=U:P    username:password pair for mining server
  -u, --user=USERNAME   username for mining server
  -p, --pass=PASSWORD   password for mining server
      --cert=FILE       certificate for mining server using SSL
  -x, --proxy=[PROTOCOL://]HOST[:PORT]  connect through a proxy
  -t, --threads=N       number of miner threads (default: number of processors)
  -r, --retries=N       number of times to retry if a network call fails
                          (default: retry indefinitely)
  -R, --retry-pause=N   time to pause between retries, in seconds (default: 15)
  -T, --timeout=N       network timeout, in seconds (default: 270)
  -s, --scantime=N      upper bound on time spent scanning current work when
                        long polling is unavailable, in seconds (default: 5)
      --no-longpoll     disable X-Long-Polling support
      --no-stratum      disable X-Stratum support
  -q, --quiet           disable per-thread hashmeter output
  -D, --debug           enable debug output
  -P, --protocol-dump   verbose dump of protocol-level activities
  -B, --background      run the miner in the background
      --benchmark       run in offline benchmark mode
  -c, --config=FILE     load a JSON-format configuration file
  -V, --version         display version information and exit
  -h, --help            display this help text and exit


Additional cudaminer specific command line options are:

--no-autotune    disables the built-in autotuning feature for
                 maximizing CUDA kernel efficiency and uses some
                 heuristic guesswork, which might not be optimal.

--devices        [-d] gives a comma separated list of CUDA device IDs
                 to operate on. Device IDs start counting from 0!
                 Alternatively give string names of your card like
                 gtx780ti or gt640#2 (matching 2nd gt640 in the PC).

--launch-config  [-l] specify the kernel launch configuration per device.
                 This replaces autotune or heuristic selection. You can
                 pass the strings "auto" or just a kernel prefix like
                 F or K or T to autotune for a specific card generation
                 or a kernel prefix plus a launch configuration like F28x8
                 if you know what kernel runs best (from a previous
                 autotune).

--interactive    [-i] list of flags (0 or 1) to enable interactive
                 desktop performance on individual cards. Use this
                 to remove lag at the cost of some hashing performance.
                 Do not use large launch configs for devices that shall
                 run in interactive mode - it's best to use autotune!

--batchsize      [-b] comma separated list of max. scrypt iterations that
                 are run in one kernel invocation. Default is 1024. Best to
                 use powers of 2 here. Increase for better performance in
                 scrypt-jane with high N-factors. Lower for more interactivity
                 of your video display especially when using the interactive
                 mode.

--texture-cache  [-C] list of flags (0 or 1 or 2) to enable use of the 
                 texture cache for reading from the scrypt scratchpad.
                 1 uses a 1D cache, whereas 2 uses a 2D texture layout.
                 Cached operation has proven to be slightly faster than
                 noncached operation on most GPUs.

--single-memory  [-m] list of flags (0 or 1) to make the devices
                 allocate their scrypt scratchpad in a single,
                 consecutive memory block. On Windows Vista, 7/8
                 this may lead to a smaller memory size being used.
                 When using the texture cache this option is implied.

--hash-parallel  [-H] scrypt also has a small SHA256 or Keccak component:
                      0 hashes this single threaded on the CPU.
                      1 to enable multithreaded hashing on the CPU.
                      2 offloads everything to the GPU (default)

--lookup-gap     [-L] values > 1 enable a tradeoff between memory
                 savings and extra computation effort, in order to
                 improve efficiency with high N-factor scrypt-jane
                 coins. Defaults to 1.

--time-limit     Exit the miner after the given number of seconds of mining.
                 Useful for round robin pool (or worker) hopping from an
                 external control script.
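
A control script for such pool hopping could look like the following
Python sketch. It is only an illustration: the pool URLs and worker
credentials are placeholders, the helper names are made up here, and it
assumes a cudaminer executable can be found on the PATH.

```python
import itertools
import subprocess

# Placeholder pools and credentials (assumptions, replace with your own).
POOLS = [
    ("stratum+tcp://pool-a.example.com:3333", "workername:password"),
    ("stratum+tcp://pool-b.example.com:3333", "workername:password"),
]

def build_command(url, userpass, seconds, exe="cudaminer"):
    """Build a cudaminer command line that exits after 'seconds' of mining."""
    return [exe, "-o", url, "-O", userpass, "--time-limit=%d" % seconds]

def hop(pools, seconds, rounds):
    """Run the miner against each pool in turn, 'rounds' launches in total."""
    for url, userpass in itertools.islice(itertools.cycle(pools), rounds):
        # Blocks until the miner exits, i.e. until the time limit expires.
        subprocess.run(build_command(url, userpass, seconds))
```

Calling hop(POOLS, seconds=600, rounds=6) would then mine ten minutes
on each pool, three times over.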


>>> Examples <<<


Example for Litecoin Mining on coinotron pool with GTX 660 Ti

cudaminer -d gtx660ti -l K28x32 -C 2 -i 0 -o stratum+tcp://coinotron.com:3334 -O workername:password


Example for Yacoin Mining on yac.coinmine.pl pool with GTX 780

cudaminer -s 10 --algo=scrypt-jane -d gtx780 -L 3 -l T9x21 -b 4096 -C 0 -i 1 -m 0 -o stratum+tcp://yac.coinmine.pl:9088 -O workername:password


Example for VertCoin mining on vtcpool.co.uk with GTX 660 Ti (assuming N=2048, changes over time)

cudaminer --algo=scrypt:2048 -d gtx660ti -l K14x32 -C 2 -i 0 -o stratum+tcp://vtcpool.co.uk:3333 -O workername:password


You will have to adjust the -d parameter. In most cases a -d 0 will work
for you. Specifying video cards by name is best when you often swap your
video cards. The device IDs tend to change a lot, whereas the names 
are much more consistent.

If you are not sure what configuration your video card might need, then
leave out the -l option and let cudaminer autotune.

For scrypt-jane coins with a high N-factor, using a lookup gap greater
than 1 will likely boost your performance. Best to try -L 1 first and
work your way up.

For solo-mining you typically use -o 127.0.0.1:xxxx, where xxxx is the
RPC port number specified in your wallet's .conf file. You have to pass
the same username and password with -O as specified in that .conf file.
The wallet must also be started with the -server option and the server
flag in the wallet's .conf file set to 1.



>>> About CUDA Kernels <<<

CUDA kernels do all the computation. Which one we select and in which
configuration it is run greatly affects performance. The CUDA kernel
launch configurations are given as a character string, e.g. F27x3

             Prefix    Blocks   x   Warps per block

Available kernel prefixes are:
F or f - Fermi and Legacy cards (Compute 1.x and 2.x)
K or k - Kepler cards (Compute 3.0)
T or t - Titan, GTX 780, GK208 and GM107 based cards (Compute 3.5 or later)

Upper case kernel prefixes mean high register count kernels.
Lower case kernel prefixes mean low register count kernels.

so F27x3 means: use Fermi kernel with high register count
                run 27 blocks in total
                each block consisting of 3 warps or 96 threads
                (a warp is a group of 32 threads)
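
The string layout described above can be decoded mechanically. The
following Python sketch is only an illustration (parse_launch_config is
a name made up here, not part of cudaminer). It handles the F, K and T
prefixes in either case and derives the thread counts from the
32 threads per warp mentioned above.

```python
import re

WARP_SIZE = 32  # threads per warp, as stated above

def parse_launch_config(config):
    """Split a string like 'F27x3' into its parts (hypothetical helper)."""
    match = re.fullmatch(r"([FKTfkt])(\d+)x(\d+)", config)
    if match is None:
        raise ValueError("expected <prefix><blocks>x<warps>, got %r" % config)
    prefix = match.group(1)
    blocks = int(match.group(2))
    warps = int(match.group(3))
    return {
        "prefix": prefix,
        "high_register_count": prefix.isupper(),  # upper case = high register count
        "blocks": blocks,
        "warps_per_block": warps,
        "threads_per_block": warps * WARP_SIZE,
        "total_threads": blocks * warps * WARP_SIZE,
    }

print(parse_launch_config("F27x3"))
```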

You will want to pick lower case letters for scrypt-jane based coins
with a high N-factor (12 and above) because the performance can be
much better.

If you do not specify a kernel to use, autotune will pick a kernel
that might be best for your hardware and selected algorithm.

You can also override the autotune's automatic kernel selection,
e.g. pass

-l F
or
-l K
or
-l T

in order to autotune the Fermi, Kepler or Titan kernels overriding
the automatic selection.



>>> Table of CUDA Kernels <<<

Different CUDA kernels are identified by their Prefix letter. In some cases
an alias letter is also accepted, to ensure the existing launch configs
from the December 2013/January 2014 development versions are still running.

Prefix  Alias  Compute Req.  Registers   use for
F       L      1.0           64          scrypt & low N-factor scrypt-jane
K       Y      3.0           63          scrypt & low N-factor scrypt-jane
T       Z      3.5           80          scrypt & low N-factor scrypt-jane

f       X      1.0           32          high N-factor scrypt-jane
k              3.0           32          high N-factor scrypt-jane
t              3.5           32          high N-factor scrypt-jane

The old "Legacy" kernel has been replaced with the F kernel, which will also
be faster on Compute 1.0 legacy devices in many cases. Therefore the F kernel
has been compiled to require only Compute 1.0 capability.



>>> scrypt-jane Specifics <<<

scrypt-jane coins are designed to become more memory-hard over time. Some
of the older coins like Yacoin already approach very high N-factors. Yacoin
at the time of writing requires 4 MB per hash at N-factor 14. This means that
a card with 4 GB video RAM can only compute 1024 hashes in parallel. That
is far too few to fully occupy all the card's computational resources.
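
The figures above can be reproduced with a little arithmetic. The
Python sketch below assumes that each of the N scratchpad entries
takes 128 bytes (the usual size for scrypt with r = 1) and that the
whole video RAM is available for scratchpads, which is optimistic.
The function names are made up for this illustration.

```python
MIB = 1024 * 1024

def scratchpad_bytes(nfactor):
    """Approximate scratchpad size of one scrypt-jane(N, 1, 1) hash."""
    n = 2 * 2 ** nfactor   # N = 2*2^Nfactor, as given in the --algo help text
    return 128 * n         # assumption: 128 bytes per entry for r = 1

def max_parallel_hashes(vram_bytes, nfactor):
    """Upper bound on hashes whose scratchpads fit into video RAM."""
    return vram_bytes // scratchpad_bytes(nfactor)

print(scratchpad_bytes(14) // MIB)          # 4 (MB per hash)
print(max_parallel_hashes(4096 * MIB, 14))  # 1024
```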

Hence it is best on these cards to use kernels that run 4 threads per hash.
These are the low register count kernels, all starting with lower case
prefix letters (f, k, t).

Additionally GPUs with enough computational reserves benefit from enabling
the lookup-gap feature with the -L option and passing values > 1. This
cuts the memory use per hash and allows us to run more hashes (threads)
simultaneously. However additional computations have to be made to compensate
for the reduced lookup tables: any missing intermediate values have to be
recomputed "on the fly".
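
The idea behind the lookup gap can be shown with a toy example. The
Python sketch below is not the cudaminer implementation: mix() is an
arbitrary stand-in for the real mixing function, and the table sizes
are made up. It only demonstrates that keeping every gap-th entry and
recomputing the skipped ones yields the same values with less memory.

```python
MASK = 0xFFFFFFFF

def mix(x):
    """Toy stand-in for the real mixing function (NOT Salsa20/8 or ChaCha20/8)."""
    return (x * 1664525 + 1013904223) & MASK

def build_table(seed, n, gap):
    """Store only every gap-th value of the sequence seed, mix(seed), ..."""
    table = []
    x = seed
    for i in range(n):
        if i % gap == 0:
            table.append(x)
        x = mix(x)
    return table

def lookup(table, gap, j):
    """Return entry j, recomputing the j % gap skipped steps on the fly."""
    x = table[j // gap]
    for _ in range(j % gap):
        x = mix(x)
    return x

full = build_table(seed=12345, n=1024, gap=1)   # like -L 1
small = build_table(seed=12345, n=1024, gap=4)  # like -L 4, a quarter of the memory
assert len(small) == len(full) // 4
assert all(lookup(small, 4, j) == full[j] for j in range(1024))
```

With a gap of 4, each lookup costs up to 3 extra mix() calls, which is
the additional computation mentioned above.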

Use -H 2 with any low N-factor (below 12) scrypt-jane coins, otherwise your
CPU performance may be seriously limiting your hash rates.


The following coin parameters have been hardcoded into cudaminer and can
be given as a coin specifier with the --algo=scrypt-jane option

[YBC] YBCoin, [ZZC] ZZCoin, [FEC] FreeCoin, [ONC] OneCoin, [QQC] QQCoin,
[GPL] GoldPressedLatinum, [MRC] MicroCoin, [APC] AppleCoin, [CPR] Copperbars,
[CACH] CacheCoin, [UTC] UltraCoin, [VEL] VelocityCoin, [ITC] InternetCoin,
[RAD] RadioactiveCoin

e.g. --algo=scrypt-jane:YBC or --algo=scrypt-jane:YBCoin

To mine new coins with different chain start time and minimum and maximum
N-factors, you can pass the parameters to the --algo option like this:

--algo=scrypt-jane:1389196388,4,30
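
To make the different forms of the scrypt-jane specifier easier to
tell apart, here is a small Python sketch that classifies them. The
function name is made up, and telling a coin name from an N-factor by
checking for digits is an assumption of this sketch, not necessarily
what cudaminer does internally. The time-based N-factor schedule itself
is not reproduced here.

```python
def parse_scrypt_jane_spec(algo):
    """Classify the part after 'scrypt-jane:' (hypothetical helper)."""
    name, _, spec = algo.partition(":")
    if name != "scrypt-jane":
        raise ValueError("not a scrypt-jane algorithm string")
    if not spec:
        return {"kind": "default"}
    parts = spec.split(",")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        # StartTime,Nfmin,Nfmax: N-factor is derived from Unix time
        start, nfmin, nfmax = (int(p) for p in parts)
        return {"kind": "schedule", "start_time": start,
                "nfmin": nfmin, "nfmax": nfmax}
    if len(parts) == 1 and spec.isdigit():
        # Fixed N-factor: N = 2*2^Nfactor per the --algo help text
        nfactor = int(spec)
        return {"kind": "fixed", "nfactor": nfactor, "N": 2 * 2 ** nfactor}
    return {"kind": "coin", "coin": spec}

print(parse_scrypt_jane_spec("scrypt-jane:1389196388,4,30"))
print(parse_scrypt_jane_spec("scrypt-jane:YBC"))
```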



>>> MaxCoin Specifics <<<

MaxCoin support was possible on short notice only because the coin's launch
was delayed by 24 hours and a cpuminer source code was posted to bitcointalk
prematurely. The keccak hashing feature is not yet complete or user
friendly, but it should be good enough to get you instamining AT COIN
LAUNCH with several dozen megahashes at least.

GTX 780 devices break 200 MHash/s. Yay!
A GTX 660Ti can do nearly 100 MHash/s.

The CudaMiner Windows binary release is made around 19:00 GMT on February 6th.
You have about 30 minutes to find suitable launch configurations for all your
cards before the coin actually launches.

The MaxCoin wallet will be available from 19:30 GMT onwards on
http://maxcoin.co.uk

The following problems remain with the keccak algorithm:
-The scrypt scratchpad is still allocated on the GPU. You must use a large -L
 (lookup gap) value in the order of 64 or 128 to allow the scratchpad to fit into
 your GPU memory when using large launch configurations.
-The K kernel works better on Compute 3.5 devices than the T kernel. Duh?

The F, K, T kernels support keccak mining. Maybe the K kernels run faster on
Compute 3.5 devices. Best to try. Launch configs should look somewhat like this
(do not exceed the warp figures given here)

-l F1000x16
-l K1000x32
-l T1000x24   << now fastest on my GTX 780

Best to replace the 1000 blocks with the number of your card's CUDA
cores, or even twice that value. It seems fastest that way.
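
The rule of thumb above can be written down as a short Python sketch.
The helper name is made up, the warp limits are simply the ones from
the sample launch configs above, and the CUDA core count is something
you have to look up for your own card.

```python
# Warp limits taken from the sample keccak launch configs above.
MAX_WARPS = {"F": 16, "K": 32, "T": 24}

def keccak_launch_config(prefix, cuda_cores, double_blocks=False):
    """Build a keccak launch config string (hypothetical helper)."""
    warps = MAX_WARPS[prefix]
    blocks = cuda_cores * 2 if double_blocks else cuda_cores
    return "%s%dx%d" % (prefix, blocks, warps)

# A card with 2304 CUDA cores, as assumed for this example:
print(keccak_launch_config("T", 2304))  # T2304x24
```

Treat the result as a starting point only and compare a few
neighbouring values on your own hardware.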

Pick a pool or solo-mine. Good luck!


Example session on GTX 780:

cudaminer.exe --algo=keccak -d gtx780 -i 0 -m 1 -l T2304x24 -o stratum+tcp://maxpool.1gh.com:17333 -u WALLETADDRESS -p x

           *** CudaMiner for nVidia GPUs by Christian Buchner ***
                     This is version 2014-02-18 (beta)
        based on pooler-cpuminer 2.3.2 (c) 2010 Jeff Garzik, 2012 pooler
            Cuda additions Copyright 2013,2014 Christian Buchner
          LTC donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm
          BTC donation address: 16hJF5mceSojnTD3ZTUDqdRhDyPJzoRakM
          YAC donation address: Y87sptDEcpLkLeAuex6qZioDbvy1qXZEj4
[2014-02-28 22:56:00] Starting Stratum on stratum+tcp://maxpool.1gh.com:17333
[2014-02-28 22:56:00] 1 miner threads started, using 'keccak' algorithm.
[2014-02-28 22:56:01] GPU #0: GeForce GTX 780 with compute capability 3.5
[2014-02-28 22:56:01] GPU #0: interactive: 0, tex-cache: 0 , single-alloc: 1
[2014-02-28 22:56:01] GPU #0: 32 hashes / 4.0 MB per warp.
[2014-02-28 22:56:01] GPU #0: using launch configuration T2304x24
[2014-02-28 22:56:02] GPU #0: GeForce GTX 780, 196998 khash/s
[2014-02-28 22:56:02] accepted: 1/1 (100.00%), 196998 khash/s (yay!!!)
[2014-02-28 22:56:04] GPU #0: GeForce GTX 780, 200200 khash/s
...
[2014-02-28 22:56:05] GPU #0: GeForce GTX 780, 202637 khash/s
[2014-02-28 22:56:05] accepted: 4/4 (100.00%), 202637 khash/s (yay!!!)
[2014-02-28 22:56:07] GPU #0: GeForce GTX 780, 203136 khash/s
[2014-02-28 22:56:07] accepted: 5/5 (100.00%), 203136 khash/s (yay!!!)


>>> Additional Notes <<<

This code should be running on nVidia GPUs ranging from compute capability
1.0 up to compute capability 3.5. Just don't expect any hashing miracles
from your old clunkers.

Compute 1.0 through 1.3 devices seem to run faster on Windows XP or Linux
because these operating systems use a more efficient driver model.

Some coins mine a bit faster with the 32 bit cudaminer version, others mine
faster with the 64 bit cudaminer version. If your computer runs a 64 bit OS,
try running both versions and compare the mining speeds!

To see what autotuning does, enable the debug switch (-D).
You will get a table of kHash/s for a variety of launch configurations.
You may only want to do this when running on a single GPU, otherwise
the autotuning output of multiple cards will get all mixed up.

Note that mining through N-factor changes is risky and might fail. Typically
a new N factor requires a different kernel launch configuration. I am trying
to compensate for an N-factor increase by doubling the current lookup-gap
value. This will allow the miner to keep working with the same memory buffers,
but the additional lookup-gap requires more computations to be made. It is
best to re-tune your kernel configuration after every N-factor change.
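
Why doubling the lookup gap keeps the memory buffers usable can be
checked with the Python sketch below. It assumes 128 bytes per
scratchpad entry and that a lookup gap of L keeps only every L-th
entry. The function name is made up for this illustration.

```python
def scratchpad_bytes(nfactor, lookup_gap=1):
    """Approximate scratchpad size per hash, assuming 128-byte entries."""
    n = 2 * 2 ** nfactor            # N = 2*2^Nfactor, as in the --algo help
    return 128 * n // lookup_gap    # only every lookup_gap-th entry is kept

gap = 1
for nfactor in range(14, 18):
    mib = scratchpad_bytes(nfactor, gap) / (1024 * 1024)
    print("N-factor %d, -L %d: %.1f MiB per hash, up to %d recomputed steps"
          % (nfactor, gap, mib, gap - 1))
    gap *= 2   # compensate for the next N-factor increase
```

The memory per hash stays constant while the recomputation effort per
lookup grows, which is why re-tuning is still recommended.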



>>> RELEASE HISTORY <<<

  The February 28th release adds speed increases on Yacoin (high N-factor
  scrypt-jane) and on keccak for Kepler and Maxwell devices. That is all ;)

  The February 18th release fixes a crash when encountering newer
  compute capabilities such as 3.7 or 5.0. As 5.0 is reported by Maxwell
  chips, this will improve the user experience. This release also contains
  some minor fixes, for example after a CTRL-C or after expiry of the time
  limit the program will terminate faster. I also made autotune work with
  Keccak, even though its usefulness is kind of limited.

  The February 9th release adds performance improvements and a fix
  for stratum rejects in maxcoin (duplicate submitted shares).

  The February 7th release adds fixes for stratum pool mining.

  The February 6th release added initial but working keccak algorithm
  support to instamine at the MaxCoin launch. The miner worked, and we
  basically owned the coin.

  The February 4th release fixes a problem with obviously incorrect
  autotune measurements and it also repairs the multi-GPU support. So you
  can again use a single cudaminer to drive all your GPUs. It wasn't the
  driver's fault after all, I was sloppy about some initialization of
  constant memory on the GPUs.

  The February 2nd 2014 release supports scrypt-jane for the first time
  and includes faster scrypt kernels kindly submitted by nVidia.
  Most Dave Andersen and nVidia derived kernels now support -C 1 and -C 2
  texture caching options providing a speed benefit in some cases.

  The December 18th 2013 milestone transitions cudaminer to CUDA 5.5, which
  makes it require newer nVidia drivers unfortunately. However users of
  Kepler devices will see a significant speed boost of 30% for Compute 3.0
  devices and around 10% for Compute 3.5 devices. This was made possible
  by David Andersen who posted his more efficient miner code under Apache
  license. This release marks a first step of integrating his work.

   ... some history removed ...

  April 4th 2013 initial release.



>>> TODO <<<

Usability Improvements:
- better CUDA error handling and recovery from errors
- add failover support between different pools
- smarter autotune algorithm
- temperature and GPU utilization control features
- an external API for system monitoring like that of CGMiner

Further Optimization:
- fix a problem in CUDA issue order that prevents overlapping of
  memory transfers and kernel launches (this could bring 5% more
  speed when fixed!)
- check the hashes against target on the GPU, saving memory
  transfers over the PCIe bus
- reduce the thread divergence in the lookup-gap feature by
  sorting threads by loop trip count (requires a swapping of
  the thread state in the entire thread block)
- allow for more optimized Keccak or SHA2 implementations for
  specific hardware (like Compute 3.5 using the funnel shifter)
- make a direct port of the nv_kernel.cu to Fermi, using warp
  shuffle emulation with shared memory.


>>> AUTHORS <<<

Notable contributors to this application are:

Christian Buchner (Germany): original CUDA implementation

Alexis Provos (Greece): submitted a faster Salsa 20/8 round function.

David Andersen (USA, Carnegie Mellon University): designed a low
                        register count scrypt Kepler kernel with
                        improved memory access. Currently best
                        performing with high N-factor scrypt-jane.

Alexey Panteleev (Moscow, nVidia): submitted a kernel with improved
                        memory access functions for Kepler devices 
                        providing the fastest scrypt performance
                        and provided further patches for other kernels.

and also many thanks to anyone else who contributed to the original
cpuminer application (Jeff Garzik, pooler) !


Source code is included to satisfy GNU GPL V2 requirements.


With kind regards,

   Christian Buchner ( Christian.Buchner@gmail.com )
