# 9. TN Quantum-Circuit Simulation¶

Here we show how to simulate quantum circuits using the tensor network capabilities of quimb. We’ll do this by taking one of the Google random circuits from https://github.com/sboixo/GRCS for 7x7 = 49 qubits with a gate depth of 1 + 30 + 1, and exactly computing a single output amplitude.

The main relevant object is Circuit. Under the hood this uses gate_TN_1D(), which applies an operator on some number of sites to any notionally 1D tensor network (not just an MPS), whilst maintaining the outer indices (e.g. 'k0', 'k1', 'k2', ...). The various options for applying the operator and propagating tags to it (if not contracted in) can be found in gate_TN_1D(). Note that the ‘1D’ nature of the TN is just for indexing; gates can be applied to arbitrary combinations of sites within this ‘register’.

[1]:
%config InlineBackend.figure_formats = ['svg']
import quimb as qu
import quimb.tensor as qtn

## 9.1. Basic Usage¶

First of all, here is a basic example of using the Circuit object with nearest-neighbour gates only:

[2]:
# 10 qubits and tag the initial wavefunction tensors
circ = qtn.Circuit(N=10, tags='PSI0')

for i in range(10):
    circ.apply_gate('H', i, gate_round=0)

# 8 rounds of entangling gates
for r in range(1, 9):

    # even pairs
    for i in range(0, 10, 2):
        circ.apply_gate('CNOT', i, i + 1, gate_round=r)

    # Y-rotations
    for i in range(10):
        circ.apply_gate('RY', 1.234, i, gate_round=r)

    # odd pairs
    for i in range(1, 9, 2):
        circ.apply_gate('CZ', i, i + 1, gate_round=r)

# final layer of hadamards
for i in range(10):
    circ.apply_gate('H', i, gate_round=r + 1)

circ
[2]:
<Circuit(n=10, n_gates=172, gate_opts={'contract': 'auto-split-gate', 'propagate_tags': 'register'})>

The tensor network representing the state is stored in the .psi attribute, which we can then visualize:

[3]:
circ.psi.graph(color=['PSI0', 'H', 'CNOT', 'RY', 'CZ'])

Note that by default the CNOT gates have been split into two parts acting on each site separately. We can also graph the default site tagging scheme (propagate_tags='register') applied to the operators:

[4]:
circ.psi.graph(color=[f'I{i}' for i in range(10)])

Or, since we supplied gate_round as a keyword (which is optional), the tensors are also tagged by round:

[5]:
circ.psi.graph(color=['PSI0'] + [f'ROUND_{i}' for i in range(10)])

[6]:
# select the subnetwork of tensors with *all* following tags
circ.psi.select(['CNOT', 'I3', 'ROUND_3'], which='all')
[6]:
<TensorNetwork(tensors=1, indices=3, structure='I{}', nsites=10)>

The full list of currently implemented gates is here:

[7]:
list(qtn.circuit.GATE_FUNCTIONS)
[7]:
['H',
'X',
'Y',
'Z',
'S',
'T',
'X_1_2',
'Y_1_2',
'Z_1_2',
'W_1_2',
'HZ_1_2',
'CNOT',
'CX',
'CY',
'CZ',
'IS',
'ISWAP',
'IDEN',
'SWAP',
'RX',
'RY',
'RZ',
'U3',
'FS',
'FSIM']

To check the gates were entangling, we can contract the whole network into a dense representation of the state:

[8]:
psi_dense = circ.psi ^ all

# check the half chain entanglement
psi_dense.entropy(left_inds=['k0', 'k1', 'k2', 'k3', 'k4'])
[8]:
2.660737743656919

## 9.2. QASM Format and Advanced Manipulation¶

A common format for specifying quantum circuits is the QASM format. An example from https://github.com/sboixo/GRCS can be found in quimb/docs/examples/inst_7x7_31_0.txt for a 7x7 qubit circuit of depth 1 + 30 + 1 (the 0th and 31st rounds of gates being un-entangling). We will exactly compute a single amplitude for this circuit.
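These instance files have a simple plain-text layout: the first line gives the number of qubits, and each subsequent line gives the gate round, the gate name, and the qubit(s) acted on. A few illustrative (not verbatim) lines might look like:

```
49
0 h 0
0 h 1
1 cz 0 1
2 t 0
```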

A Circuit can be instantiated directly from a QASM file or URL, which automatically applies all the gates:

[9]:
circ = qtn.Circuit.from_qasm_file('inst_7x7_31_0.txt', tags='PSI_i')

We’ll compute the output amplitude of a random bitstring:

[10]:
import random
random.seed(42)

bitstring = "".join(random.choice('01') for _ in range(49))
print(bitstring)

# the squeeze removes all size 1 bonds
psi_sample = qtn.MPS_computational_state(bitstring, tags='PSI_f').squeeze()
0010000010000000101100111001001011101010110000100

The amplitude tensor network is then given by:

[11]:
c_tn = circ.psi & psi_sample
c_tn
[11]:
<TensorNetwork(tensors=1402, indices=1669, structure='I{}', nsites=49)>

(Note if we instead left some indices open we could compute slices of amplitudes.)

We can now view the same three tagging schemes as earlier to visualize the network with:

[12]:
gate_tags = ['PSI_i', 'H', 'CZ', 'T', 'X_1_2', 'Y_1_2', 'PSI_f']
site_tags = [c_tn.site_tag(i) for i in range(c_tn.nsites)]
round_tags = ['PSI_i'] + [f"ROUND_{i}" for i in range(31)]
[13]:
c_tn.graph(color=gate_tags)  # could be slow
[14]:
c_tn.graph(color=site_tags)  # could be slow

Clearly this is a complex network (1402 tensors) for which it will be difficult to find a good contraction scheme! One thing we might try is contracting all the rank-1 and rank-2 tensors. Doing this never increases the complexity, and it is in this sense that single qubit operations are ‘free’. This can be done with the .rank_simplify method:

[15]:
c_tn_derank = c_tn.rank_simplify()
c_tn_derank
[15]:
<TensorNetwork(tensors=534, indices=801, structure='I{}', nsites=49)>
[16]:
c_tn_derank.graph(color=site_tags)
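The reason such contractions are ‘free’ can be seen in a toy numpy example (not quimb API): absorbing a rank-2 tensor, e.g. a single-qubit gate, into a neighbouring tensor leaves that tensor’s rank and shape unchanged:

```python
import numpy as np

# a rank-3 tensor in the network, and a rank-2 'gate' on its first index
T = np.random.rand(2, 3, 4)
M = np.random.rand(2, 2)

# contract the gate into the tensor - the result is still rank-3,
# so the network's geometry (and contraction cost) doesn't grow
T2 = np.einsum('ab,bcd->acd', M, T)
print(T2.shape)  # (2, 3, 4)
```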

Every tensor is now rank-3 or more. We can check $$\log_2$$ of the size of the largest tensor produced in the contraction by calling:

[17]:
c_tn_derank.contraction_width(optimize='random-greedy-128')
[17]:
44.0

(Note that for this particular contraction the 'random-greedy' approach is quite suboptimal, and one could use more advanced methods - see cotengra).

However this corresponds to far too much memory to naively contract so instead we will give the contraction some manual help, following the rough ideas set out in https://arxiv.org/abs/1811.09599. This involves several steps:

1. Contract vertical slices for each site, yielding a simple flat network

2. ‘Cutting’ some bonds of the network to trade memory for time

We’ll also first take a copy of the network in single precision:

[18]:
c_tn_flat = c_tn.astype('complex64')

# inplace contract every tensor tagged with each site tag
for i in range(c_tn.nsites):
    c_tn_flat ^= i

# inplace fuse multi-bonds between tensors
c_tn_flat.squeeze_(fuse=True)

# compress every bond to about 'complex64' precision
c_tn_flat.compress_all(cutoff_mode='rel', cutoff=1e-6)
[18]:
<TensorNetwork(tensors=49, indices=84, structure='I{}', nsites=49)>

We should now have a ‘flat’ tensor network of 49 tensors joined in a grid:

[19]:
# we can now easily specify grid positions as well
fix = {
    c_tn_flat.site_tag(i): (i // 7, i % 7)
    for i in range(c_tn_flat.nsites)
}

c_tn_flat.graph(site_tags, fix=fix)

Note that the new bonds are larger (and variable) but the sizes of the tensors are still manageable:

[20]:
{qu.log2(tensor.size) for tensor in c_tn_flat}
[20]:
{7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0}

And we have reduced the contraction width as well (it’s now small and planar enough that we can use the near-optimal 'dp' contraction path finder):

[21]:
c_tn_flat.contraction_width(optimize='dp')
[21]:
27.0

Although a single tensor with dtype='complex64' of this size takes only about 1GB of memory ($$2^{27}$$ entries at 8 bytes each), taking overheads into account we probably still want to reduce this memory requirement a bit - to do this we will ‘cut’ some bonds.

### 9.2.1. Cutting bonds¶

We can think of any bond in a tensor network as a sum, within the full contraction, over the values that index takes. In other words, we can enumerate and then sum over all the different values of that bond, turning the network into a sum of simpler networks (worthwhile as long as the tensor network with that bond removed has a better contraction width).
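In plain numpy terms (a toy illustration, not the quimb API), cutting a bond of size d replaces one contraction with a sum of d smaller ones:

```python
import numpy as np

A = np.random.rand(3, 4)   # tensor with indices (i, k)
B = np.random.rand(4, 5)   # tensor with indices (k, j)

# contract the shared bond k directly
full = A @ B

# 'cut' the bond: fix k to each of its 4 values and sum the
# resulting simpler (here rank-1) networks
cut_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(4))

np.allclose(full, cut_sum)  # True
```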

For example, let’s make two incisions at the sides (an educated guess to make the ‘cut’ network simpler):

[22]:
# get bonds between 14-21 and 45-46
bnds = (
    qtn.bonds(c_tn_flat[14], c_tn_flat[21]) +
    qtn.bonds(c_tn_flat[45], c_tn_flat[46])
)
bnds
[22]:
('_7a0b7800003C9', '_7a0b7800003A9')

We can view what it would look like to cut these bonds by ‘selecting’ values for them, here 0:

[23]:
selector = {ix: 0 for ix in bnds}
c_tn_cut = c_tn_flat.isel(selector)

c_tn_cut.graph(color=site_tags, fix=fix)

The product of the sizes of the bonds is the number of tensor networks we will need to sum over:

[24]:
d_cut = qu.prod(c_tn_flat.ind_size(ix) for ix in bnds)
d_cut
[24]:
64

And we can check this ‘cut’ network has better contraction width:

[25]:
c_tn_cut.contraction_width(optimize='dp')
[25]:
21.0

In fact, this cut is optimal in the sense that the reduction in contraction complexity ($$2^{27} / 2^{21} = 64$$) exactly matches the number of independent tensor networks produced, d_cut.

To exactly compute the amplitude $$c = \langle \psi_i | \psi_f \rangle$$ we need to sum over every combination of values the bonds can take. The cut_iter() method automatically generates all the tensor networks we need:

[26]:
gen_cuts = c_tn_flat.cut_iter(*bnds)

# use tqdm to show progress / time (can comment this out)
from tqdm import tqdm
gen_cuts = tqdm(gen_cuts, total=d_cut)

# sum over each value the bond can take
c = sum([
    c_tn_cut.contract(all, optimize='dp')
    for c_tn_cut in gen_cuts
])
100%|██████████| 64/64 [00:06<00:00,  9.44it/s]
[27]:
c
[27]:
(-5.52523513053238e-10-2.4078453356168872e-08j)

### 9.2.2. Parallel Contract¶

Since each contraction is independent, we can also trivially parallelize the work. The above calculation was already multi-threaded, so we don’t expect much speedup, but we could for example distribute the work across many computers.

Here, we’ll just use a loky pool (rather than the mpi4py pool from quimb.get_mpi_pool, which can’t pickle interactively defined, local functions):

[28]:
from joblib.externals import loky

# make sure we single-thread BLAS on each process
def initializer():
    import os
    os.environ['OMP_NUM_THREADS'] = '1'

pool = loky.get_reusable_executor(initializer=initializer)
[29]:
%%time

c = sum(pool.map(
    lambda c_tn_cut: c_tn_cut.contract(all, optimize='dp'),
    c_tn_flat.cut_iter(*bnds)
))
CPU times: user 2.47 s, sys: 173 ms, total: 2.64 s
Wall time: 7.37 s
[30]:
c
[30]:
(-5.525352259061478e-10-2.4078468149890675e-08j)

This, as expected, matches (to single precision) the non-parallel version, although it is in fact a bit slower.

Another convenient way to parallelize our computation and reduce its memory footprint is to use dask. Instead of many independent parallel runs, we then get a single task graph, meaning that no bits of the calculation are ever redundantly computed.

The memory reduction comes from dask’s ability to slice, or ‘chunk’, arrays across specified dimensions such that the whole array is never needed in memory at once. Which dimensions should we chunk? Well, we know that the indices we chose to ‘cut’ earlier are critical indices that probably appear on the largest intermediate tensors, so those are a wise choice.
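As a toy illustration with plain dask (these arrays are not from quimb): chunking along a dimension turns the contraction into many small block-tasks that are computed and reduced separately, so the full intermediate never has to exist in memory at once:

```python
import numpy as np
import dask.array as da

xn = np.random.rand(6, 50)
yn = np.random.rand(50, 6)

# chunk each array along one dimension
x = da.from_array(xn, chunks=(1, 50))
y = da.from_array(yn, chunks=(50, 1))

# lazy - this only builds a task graph of small block products
z = (x @ y).sum()

# actually perform the chunked computation
z.compute()
```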

Finally, how dask stores its blocks of arrays is also customizable, and one possibility here, if the blocks are not too big, is to keep them on the GPU. Here we’ll make use of cupy arrays, which should make all the numeric operations much faster! (You can comment out the relevant lines if you have no GPU.)

[31]:
import cupy as cp
import dask.array as da

# copy the *uncut* network to modify each tensor
dc_tn_flat = c_tn_flat.copy()

for t in dc_tn_flat:
    # get the current tensor data
    data = t.data

    # convert data to GPU
    data = cp.asarray(data)

    # convert to dask array - only chunk along 'cut' bonds
    chunks = [1 if ix in bnds else None for ix in t.inds]
    data = da.from_array(data, chunks, asarray=False)

    # modify the tensor's data inplace
    t.modify(data=data)

We now have a tensor network of tensors backed by dask arrays, each of which is itself backed by possibly many cupy arrays.

To perform the full contraction however, we just need to specify 'dask' as the backend to opt_einsum:

[32]:
c_graph = dc_tn_flat.contract(all, optimize='dp', backend='dask')

You will notice this was essentially instant: so far c_graph is just a lazily represented computation graph with unknown value:

[33]:
c_graph
[33]:
<dask.array: shape=(), chunksize=(), dtype=complex64, chunktype=cupy.ndarray, 1521 tasks, 1 chunk>

In this way, dask abstracts the computational graph away from the method used to compute it, meaning we can choose to actually perform the calculation in a number of ways (for example on a distributed server). Here we’ll just use the simplest, in-process, scheduler:

[34]:
%%time
c = c_graph.compute(scheduler='sync')
CPU times: user 436 ms, sys: 36 ms, total: 472 ms
Wall time: 471 ms
[35]:
c
[35]:
array(-5.5261307e-10-2.407851e-08j, dtype=complex64)

So about 15x faster than the parallel contract!