9. TN QuantumCircuit Simulation¶
Here we show how to simulate quantum circuits using the tensor network capabilities of quimb. We'll do this by taking one of the google random circuits from here for 7x7 = 49 qubits with a gate depth of 1 + 30 + 1, and exactly computing a single output amplitude. The main relevant object is Circuit.
Under the hood this uses gate_TN_1D(), which applies an operator on some number of sites to any notionally 1D tensor network (not just an MPS), whilst maintaining the outer indices (e.g. 'k0', 'k1', 'k2', ...). The various options for applying the operator and propagating tags to it (if not contracted in) can be found in gate_TN_1D(). Note that the '1D' nature of the TN is just for indexing; gates can be applied to arbitrary combinations of sites within this 'register'.
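To make concrete what 'applying a gate' means numerically, here is a minimal numpy sketch (not quimb's implementation, which works lazily on tensor networks) of contracting a two-qubit CNOT onto a dense three-qubit state with einsum:

```python
import numpy as np

# CNOT as a rank-4 tensor: indices (out0, out1, in0, in1)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex).reshape(2, 2, 2, 2)

# dense 3-qubit state |100> -- qubit 0 set, others zero
psi = np.zeros((2, 2, 2), dtype=complex)
psi[1, 0, 0] = 1.0

# contract the gate's input legs with qubits 0 and 1 of the state;
# the third qubit index 'k' is carried through untouched
psi_out = np.einsum('abAB,ABk->abk', CNOT, psi)

print(np.argwhere(psi_out).ravel())  # -> [1 1 0], i.e. |110>
```

The same contraction is what a tensor network performs, except the gate tensor is simply joined to the network rather than contracted in immediately.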
[1]:
%config InlineBackend.figure_formats = ['svg']
import quimb as qu
import quimb.tensor as qtn
9.1. Basic Usage¶
First of all, here is a basic example of using the Circuit object with nearest-neighbour gates only:
[2]:
# 10 qubits and tag the initial wavefunction tensors
circ = qtn.Circuit(N=10, tags='PSI0')

# initial layer of hadamards
for i in range(10):
    circ.apply_gate('H', i, gate_round=0)

# 8 rounds of entangling gates
for r in range(1, 9):

    # even pairs
    for i in range(0, 10, 2):
        circ.apply_gate('CNOT', i, i + 1, gate_round=r)

    # Y-rotations
    for i in range(10):
        circ.apply_gate('RY', 1.234, i, gate_round=r)

    # odd pairs
    for i in range(1, 9, 2):
        circ.apply_gate('CZ', i, i + 1, gate_round=r)

# final layer of hadamards
for i in range(10):
    circ.apply_gate('H', i, gate_round=r + 1)

circ
[2]:
<Circuit(n=10, n_gates=172, gate_opts={'contract': 'auto-split-gate', 'propagate_tags': 'register'})>
The tensor network representing the state is stored in the .psi
attribute, which we can then visualize:
[3]:
circ.psi.graph(color=['PSI0', 'H', 'CNOT', 'RY', 'CZ'])
Note that by default the CNOT gates have been split into two parts acting on each site separately. We can also graph the default (propagate_tags='register') method for adding site tags to the applied operators:
[4]:
circ.psi.graph(color=[f'I{i}' for i in range(10)])
Or since we supplied gate_round as a keyword (which is optional), the tensors are also tagged in that way:
[5]:
circ.psi.graph(color=['PSI0'] + [f'ROUND_{i}' for i in range(10)])
All of these are helpful when addressing only certain tensors:
[6]:
# select the subnetwork of tensors with *all* following tags
circ.psi.select(['CNOT', 'I3', 'ROUND_3'], which='all')
[6]:
<TensorNetwork(tensors=1, indices=3, structure='I{}', nsites=10)>
The full list of currently implemented gates is here:
[7]:
list(qtn.circuit.GATE_FUNCTIONS)
[7]:
['H',
'X',
'Y',
'Z',
'S',
'T',
'X_1_2',
'Y_1_2',
'Z_1_2',
'W_1_2',
'HZ_1_2',
'CNOT',
'CX',
'CY',
'CZ',
'IS',
'ISWAP',
'IDEN',
'SWAP',
'RX',
'RY',
'RZ',
'U3',
'FS',
'FSIM']
To check the gates were entangling, we can contract the whole network into a dense representation of the state:
[8]:
psi_dense = circ.psi ^ all
# check the half chain entanglement
psi_dense.entropy(left_inds=['k0', 'k1', 'k2', 'k3', 'k4'])
[8]:
2.660737743656919
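As a sketch of what this half-chain entropy computes (plain numpy, assuming qubit sites and base-2 logs), one can reshape the dense state into a matrix across the cut and use its singular values:

```python
import numpy as np

def half_chain_entropy(psi, n_left):
    # reshape the dense state into a (left, right) matrix across the cut
    mat = np.asarray(psi).reshape(2**n_left, -1)
    # squared singular values are the reduced density matrix eigenvalues
    p = np.linalg.svd(mat, compute_uv=False)**2
    p = p[p > 1e-12]
    # von Neumann entropy in bits
    return -np.sum(p * np.log2(p))

# two-qubit Bell state: maximally entangled, so entropy = 1 bit
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
print(half_chain_entropy(bell, 1))  # -> 1.0
```

A value well above zero, as found for the circuit state above, confirms the gates were indeed entangling.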
9.2. QASM Format and Advanced Manipulation¶
A common format for specifying quantum circuits is the QASM format. An example from
https://github.com/sboixo/GRCS can be found in quimb/docs/examples/inst_7x7_31_0.txt
for a 7x7 qubit circuit of depth 1 + 30 + 1 (the 0th and 31st rounds of gates
being non-entangling). We will exactly compute a single amplitude for this circuit.
A Circuit
can be instantiated directly from a QASM
file or URL, which automatically applies all the gates:
[9]:
circ = qtn.Circuit.from_qasm_file('inst_7x7_31_0.txt', tags='PSI_i')
We’ll compute the output amplitude of a random bitstring:
[10]:
import random
random.seed(42)
bitstring = "".join(random.choice('01') for _ in range(49))
print(bitstring)
# the squeeze removes all size 1 bonds
psi_sample = qtn.MPS_computational_state(bitstring, tags='PSI_f').squeeze()
0010000010000000101100111001001011101010110000100
The amplitude tensor network is then given by:
[11]:
c_tn = circ.psi & psi_sample
c_tn
[11]:
<TensorNetwork(tensors=1402, indices=1669, structure='I{}', nsites=49)>
(Note if we instead left some indices open we could compute slices of amplitudes.)
We can now visualize the network using the same three tagging schemes as earlier:
[12]:
gate_tags = ['PSI_i', 'H', 'CZ', 'T', 'X_1_2', 'Y_1_2', 'PSI_f']
site_tags = [c_tn.site_tag(i) for i in range(c_tn.nsites)]
round_tags = ['PSI_i'] + [f"ROUND_{i}" for i in range(31)]
[13]:
c_tn.graph(color=gate_tags) # could be slow
[14]:
c_tn.graph(color=site_tags) # could be slow
Clearly this is a complex network (1402 tensors) for which it will be difficult to find a contraction scheme! One thing we might try is contracting all the rank-1 and rank-2 tensors. Doing this never increases the complexity, and it is in this sense that single qubit operations are 'free'. This can be done with the .rank_simplify
method:
[15]:
c_tn_derank = c_tn.rank_simplify()
c_tn_derank
[15]:
<TensorNetwork(tensors=534, indices=801, structure='I{}', nsites=49)>
[16]:
c_tn_derank.graph(color=site_tags)
Every tensor is now rank-3 or higher. We can check \(\log_2\) of the size of the largest tensor produced in the contraction by calling:
[17]:
c_tn_derank.contraction_width(optimize='random-greedy-128')
[17]:
44.0
(Note that for this particular contraction the 'random-greedy' approach is quite suboptimal, and one could use more advanced methods - see cotengra.)
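The idea of contraction width can be illustrated with numpy's built-in einsum_path, which reports the largest intermediate produced for a given contraction order (a toy four-matrix chain here, nothing to do with the circuit network itself):

```python
import numpy as np

# a small chain contraction: ((A @ B) @ C) @ D in some order chosen
# by the 'greedy' path finder -- the width is log2 of the largest
# intermediate tensor the chosen order produces
d = 8
tensors = [np.random.rand(d, d) for _ in range(4)]

path, info = np.einsum_path('ab,bc,cd,de->ae', *tensors,
                            optimize='greedy')
print(info)  # report includes the largest intermediate and FLOP count
```

quimb's contraction_width() plays the same role for an arbitrary tensor network, delegating path finding to opt_einsum.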
However this corresponds to far too much memory to naively contract, so instead we will give the contraction some manual help, following the rough ideas set out in https://arxiv.org/abs/1811.09599. This involves several steps:
1. Contracting vertical slices for each site, yielding a simple flat network
2. 'Cutting' some bonds of the network to trade memory for time
We'll also initially take a copy of the network in single precision:
[18]:
c_tn_flat = c_tn.astype('complex64')

# inplace contract every tensor tagged with each site tag
for i in range(c_tn.nsites):
    c_tn_flat ^= i

# inplace fuse multibonds between tensors
c_tn_flat.squeeze_(fuse=True)

# compress every bond to about 'complex64' precision
c_tn_flat.compress_all(cutoff_mode='rel', cutoff=1e-6)
[18]:
<TensorNetwork(tensors=49, indices=84, structure='I{}', nsites=49)>
We should now have a ‘flat’ tensor network of 49 tensors joined in a grid:
[19]:
# we can now easily specify grid positions as well
fix = {
    c_tn_flat.site_tag(i): (i // 7, i % 7)
    for i in range(c_tn_flat.nsites)
}

c_tn_flat.graph(site_tags, fix=fix)
Note that the new bonds are larger (and variable) but the sizes of the tensors are still manageable:
[20]:
{qu.log2(tensor.size) for tensor in c_tn_flat}
[20]:
{7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0}
And we have reduced the contraction width as well (it's now small and planar enough that we can use the near-optimal 'dp' contraction path finder):
[21]:
c_tn_flat.contraction_width(optimize='dp')
[21]:
27.0
Although a single tensor with dtype='complex64' of this size only takes about 1GB of memory, taking overheads into account we probably still want to reduce this memory requirement a bit - to do this we will 'cut' some bonds.
9.2.1. Cutting bonds¶
We can think of any bond in a tensor network as a trace over the rest of the network. In other words, we can enumerate and then sum over all the different values that bond can take, turning the network into a sum of simpler networks (as long as the tensor network with that bond removed has better contraction width).
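This claim can be verified on a tiny example: a plain matrix product is exactly the sum of outer products with the shared index fixed to each of its values (a numpy sketch, unrelated to the specific circuit bonds):

```python
import numpy as np

# a single bond of dimension d shared between two tensors
d = 4
A = np.random.rand(3, d)
B = np.random.rand(d, 5)

# contract the bond directly
full = A @ B

# 'cut' the bond: fix it to each value k and sum the d simpler
# (here rank-1) contractions that result
cut = sum(np.outer(A[:, k], B[k, :]) for k in range(d))

print(np.allclose(full, cut))  # -> True
```

Each term in the sum involves strictly smaller tensors than the original contraction, which is the memory-for-time trade being exploited below.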
For example, let's make two incisions at the sides (an educated guess to make the 'cut' network simpler):
[22]:
# get bonds between sites 14-21 and 45-46
bnds = (
    qtn.bonds(c_tn_flat[14], c_tn_flat[21]) +
    qtn.bonds(c_tn_flat[45], c_tn_flat[46])
)

bnds
bnds
[22]:
('_7a0b7800003C9', '_7a0b7800003A9')
We can view what it would look like to cut these bonds by ‘selecting’ values for them, here 0:
[23]:
selector = {ix: 0 for ix in bnds}
c_tn_cut = c_tn_flat.isel(selector)
c_tn_cut.graph(color=site_tags, fix=fix)
The product of the sizes of the bonds is the number of tensor networks we will need to sum over:
[24]:
d_cut = qu.prod(c_tn_flat.ind_size(ix) for ix in bnds)
d_cut
[24]:
64
And we can check this ‘cut’ network has better contraction width:
[25]:
c_tn_cut.contraction_width(optimize='dp')
[25]:
21.0
In fact, this cut is optimal in the sense that the reduction in contraction
complexity (\(2^{27} / 2^{21} = 64\)) exactly matches the number of independent tensor networks
produced, d_cut.
To exactly compute the amplitude \(c = \langle \psi_i | \psi_f \rangle\) we need to sum over
every combination of values the bonds can take. The cut_iter()
method automatically generates all the tensor networks we need:
[26]:
gen_cuts = c_tn_flat.cut_iter(*bnds)

# use tqdm to show progress / time (can comment this out)
from tqdm import tqdm
gen_cuts = tqdm(gen_cuts, total=d_cut)

# sum over each value the bond can take
c = sum([
    c_tn_cut.contract(all, optimize='dp')
    for c_tn_cut in gen_cuts
])
100%|██████████| 64/64 [00:06<00:00, 9.44it/s]
[27]:
c
[27]:
(5.52523513053238e-10-2.4078453356168872e-08j)
9.2.2. Parallel Contract¶
Since each contraction is independent, we can also trivially parallelize the work. The above calculation would have already been multithreaded, so we don't expect much speedup, but we could for example distribute the work across many computers.
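A minimal standard-library sketch of this embarrassingly parallel structure (toy matrices standing in for the cut networks, and ThreadPoolExecutor standing in for the loky pool used below):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# each fixed value of the cut bond gives an independent contraction,
# so the partial results can be computed by any pool and summed at
# the end -- here on a tiny matrix product as a stand-in
d = 4
A = np.random.rand(3, d)
B = np.random.rand(d, 5)

def contract_slice(k):
    # one term of the sum: the bond index fixed to value k
    return np.outer(A[:, k], B[k, :])

with ThreadPoolExecutor() as pool:
    total = sum(pool.map(contract_slice, range(d)))

print(np.allclose(total, A @ B))  # -> True
```

The same map-then-sum pattern carries over directly to any executor with a .map method, including process pools and distributed ones.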
Here, we'll just use a loky pool (rather than the mpi4py pool from quimb.get_mpi_pool, which can't pickle interactively defined, local functions):
[28]:
from joblib.externals import loky
# make sure we single-thread BLAS on each process
def initializer():
    import os
    os.environ['OMP_NUM_THREADS'] = '1'

pool = loky.get_reusable_executor(initializer=initializer)
[29]:
%%time
c = sum(pool.map(
    lambda c_tn_cut: c_tn_cut.contract(all, optimize='dp'),
    c_tn_flat.cut_iter(*bnds)
))
CPU times: user 2.47 s, sys: 173 ms, total: 2.64 s
Wall time: 7.37 s
[30]:
c
[30]:
(5.525352259061478e-10-2.4078468149890675e-08j)
Which, as expected, matches the non-parallel version (to single precision), although it is in fact a bit slower.
9.2.3. Dask Distributed Contract¶
Another convenient way to parallelize our computation and reduce its memory footprint is to use dask. Instead of many independent parallel runs, we then get a single task graph, meaning that no bits of the calculation are ever redundantly computed.
The memory reduction comes from dask’s ability to slice, or ‘chunk’, arrays across specified dimensions such that the whole array is never needed in memory at once. Which dimensions should we chunk? Well, we know that the indices we chose to ‘cut’ earlier are critical indices that probably appear on the largest intermediate tensors, so those are a wise choice.
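The memory saving from chunking can be sketched in plain numpy (dask automates exactly this kind of blocked accumulation, here a toy matrix product along a hypothetical 'cut' dimension):

```python
import numpy as np

# compute C = A @ B without ever materializing a full intermediate,
# accumulating 128-wide blocks of the shared dimension one at a time
d = 1024
A = np.random.rand(8, d)
B = np.random.rand(d, 8)

C = np.zeros((8, 8))
for start in range(0, d, 128):
    C += A[:, start:start+128] @ B[start:start+128, :]

print(np.allclose(C, A @ B))  # -> True
```

At any moment only one block of each operand is needed, which is why chunking the critical 'cut' indices bounds the peak memory of the contraction.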
Finally, how dask stores its blocks of arrays is also customizable, and one possibility here, if the blocks are not too big, is to use the GPU. Here we'll make use of cupy arrays, which should make all the numeric operations much faster! (You can comment out the relevant lines if you have no GPU.)
[31]:
import dask.array as da
import cupy as cp
# copy the *uncut* network to modify each tensor
dc_tn_flat = c_tn_flat.copy()

for t in dc_tn_flat:
    # get the current tensor data
    data = t.data

    # convert data to GPU
    data = cp.asarray(data)

    # convert to dask array - only chunk along 'cut' bonds
    chunks = [1 if ix in bnds else None for ix in t.inds]
    data = da.from_array(data, chunks, asarray=False)

    # modify the tensor's data inplace
    t.modify(data=data)
We now have a tensor network of tensors backed by dask
arrays, each of which is itself backed by possibly many cupy
arrays.
To perform the full contraction however, we just need to specify 'dask'
as the backend to opt_einsum
:
[32]:
c_graph = dc_tn_flat.contract(all, optimize='dp', backend='dask')
You will notice this was essentially instant - so far c_graph is just a lazily represented computation graph with unknown value:
[33]:
c_graph
[33]:

In this way, dask abstracts the computational graph from the method used to compute it, meaning we can choose to actually perform the calculation in a number of ways (for example on a distributed server). Here we'll just use the simplest, in-process scheduler:
[34]:
%%time
c = c_graph.compute(scheduler='sync')
CPU times: user 436 ms, sys: 36 ms, total: 472 ms
Wall time: 471 ms
[35]:
c
[35]:
array(5.5261307e-10-2.407851e-08j, dtype=complex64)
So about 15x faster than the parallel contract!