Accessing the Internet from ARC-TS compute nodes

Normally, compute nodes on ARC-TS clusters cannot directly access the Internet because they have private IP addresses. This increases cluster security while reducing the costs (IPv4 addresses are limited, and ARC-TS clusters do not currently support IPv6). However, this also means that jobs cannot install software, download files, or access databases on servers located outside of University of Michigan networks: the private IP addresses used by the cluster are routable on-campus but not off-campus.

If your work requires these tasks, there are three ways to allow jobs running on ARC-TS clusters to access the Internet, described below. The best method to use depends to a large extent on the software you are using. If your software supports HTTP proxying, that is the best method. If not, SOCKS proxying or SSH tunneling may be suitable.

HTTP proxying

HTTP proxying, sometimes called “HTTP forward proxying”, is the simplest and most robust way to access the Internet from ARC-TS clusters. However, there are two main limitations:

  • Some software packages do not support HTTP proxying.
  • HTTP proxying only supports HTTP, HTTPS and FTP protocols.

If either of these conditions applies (for example, if your software needs a database protocol such as MySQL), you should explore SOCKS proxying or SSH tunneling, described below.

Some popular software packages that support HTTP proxying include:

HTTP proxying is set up automatically when you log in to ARC-TS clusters, and any software that supports HTTP proxying should use it without any special action on your part.

Here is an example that shows installing the Python package pyvcf from within an interactive job running on a Flux compute node:

[markmont@flux-login1 ~]$ module load anaconda2/latest
[markmont@flux-login1 ~]$ qsub -I -V -A example_flux -q flux -l nodes=1:ppn=2,pmem=3800mb,walltime=04:00:00,qos=flux
qsub: waiting for job 18927162.nyx.arc-ts.umich.edu to start
qsub: job 18927162.nyx.arc-ts.umich.edu ready

[markmont@nyx5792 ~]$ pip install --user pyvcf
Collecting pyvcf
Downloading PyVCF-0.6.7.tar.gz
Collecting distribute (from pyvcf)
Downloading distribute-0.7.3.zip (145kB)
100% |████████████████████████████████| 147kB 115kB/s
Requirement already satisfied (use --upgrade to upgrade): setuptools>=0.7 in
/usr/cac/rhel6/lsa/anaconda2/latest/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg (from distribute->pyvcf)
Building wheels for collected packages: pyvcf, distribute
Running setup.py bdist_wheel for pyvcf ... done
Stored in directory: /home/markmont/.cache/pip/wheels/68/93/6c/fb55ca4381dbf51fb37553cee72c62703fd9b856eee8e7febf
Running setup.py bdist_wheel for distribute ... done
Stored in directory: /home/markmont/.cache/pip/wheels/b2/3c/64/772be880a32a0c41e64b56b13c25450ff31cf363670d3bc576
Successfully built pyvcf distribute
Installing collected packages: distribute, pyvcf
Successfully installed distribute pyvcf
[markmont@nyx5792 ~]$

If pip did not support HTTP proxying (or if proxying were otherwise not working), you would be unable to reach the Internet to install the pyvcf package, and you would receive “Connection timed out”, “No route to host”, or “Connection failed” error messages when you tried to install it.

Information for advanced users

HTTP proxying is controlled by the following environment variables which are automatically set on each compute node:

export http_proxy="http://proxy.arc-ts.umich.edu:3128/"
export https_proxy="http://proxy.arc-ts.umich.edu:3128/"
export ftp_proxy="http://proxy.arc-ts.umich.edu:3128/"
export no_proxy="localhost,,.localdomain,.umich.edu"
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export FTP_PROXY="${ftp_proxy}"
export NO_PROXY="${no_proxy}"

Once these are set in your environment, you can access the Internet from compute nodes; for example, you can install Python and R libraries from compute nodes. Unlike the SOCKS proxying and SSH tunneling solutions described below, there is no need to start any daemons. The HTTP proxy server proxy.arc-ts.umich.edu supports HTTPS but does not terminate the TLS session at the proxy: traffic is encrypted by the software you run and is not decrypted until it reaches the destination server on the Internet.
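To confirm that the proxy settings listed above are actually present in a job's environment before relying on them, a small check can help. The following is a minimal sketch; the function name show_proxy_settings is hypothetical and not part of any ARC-TS tooling, and it assumes the printenv command is available (as it is on typical Linux systems).

```shell
# Hypothetical helper (not an ARC-TS tool): report which of the proxy
# variables listed above are exported in the current environment.
# Returns 0 if all four are set and non-empty, 1 otherwise.
show_proxy_settings() {
    missing=0
    for name in http_proxy https_proxy ftp_proxy no_proxy; do
        # printenv only sees exported variables, which is what child
        # processes such as pip or curl will see as well.
        if value=$(printenv "$name") && [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is not set or is empty\n' "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running “show_proxy_settings || echo "proxy not fully configured"” at the start of a job script gives an early warning before a download fails later on.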

To prevent software from using HTTP proxying, run the following command:

unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY

The above command will only affect software started from the current shell.  If you start a new shell (for example, if you open a new window or log in again) you’ll need to re-run the command above each time.  To permanently disable HTTP proxying for all software, add the command above to the end of your ~/.bashrc file.
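If only one command needs to bypass the proxy, the variables can be removed for that command alone rather than for the whole shell. A minimal sketch, assuming GNU env (which provides the -u option); the wrapper name without_proxy is hypothetical:

```shell
# Hypothetical wrapper: run a single command with all proxy variables
# removed from its environment. The calling shell's settings are untouched.
without_proxy() {
    env -u http_proxy -u https_proxy -u ftp_proxy -u no_proxy \
        -u HTTP_PROXY -u HTTPS_PROXY -u FTP_PROXY -u NO_PROXY "$@"
}
```

For example, “without_proxy pip install --user pyvcf” would run pip with proxying disabled, while commands started afterwards from the same shell would still use the proxy.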

Finally, note that HTTP proxying (which is forward proxying) should not be confused with reverse proxying. Reverse proxying, which is done by the ARC Connect service, allows researchers to start web applications (including Jupyter notebooks, RStudio sessions, and Bokeh apps) on compute nodes and then access those web applications through ARC Connect.


SOCKS proxying

A second solution is available for any software that either supports the SOCKS protocol or can be “made to work” with SOCKS. Most software does not support SOCKS, but here is an example that uses curl (which has built-in SOCKS support) to download a file from the Internet from inside an interactive job running on a Flux compute node. We use “ssh -D” to set up a “quick and dirty” SOCKS proxy server for curl to use:

[markmont@flux-login1 ~]$ qsub -I -V -A example_flux -q flux -l nodes=1:ppn=2,mem=8000mb,walltime=04:00:00,qos=flux
qsub: waiting for job 18927190.nyx.arc-ts.umich.edu to start
qsub: job 18927190.nyx.arc-ts.umich.edu ready

[markmont@nyx5441 ~]$ ssh -f -N -D 1080 flux-xfer.arc-ts.umich.edu
[markmont@nyx5441 ~]$ curl --socks localhost -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  272k  100  272k    0     0   368k      0 --:--:-- --:--:-- --:--:-- 1789k
[markmont@nyx5441 ~]$ ls -l bc-1.06.tar.gz
-rw-r--r-- 1 markmont lsa 278926 Feb 10 17:11 bc-1.06.tar.gz
[markmont@nyx5441 ~]$

A limitation of “ssh -D” is that it only handles TCP traffic, not UDP traffic (including DNS lookups, which happen over UDP). However, if you have a real SOCKS proxy accessible to you elsewhere on the U-M network (such as on a server in your lab), you can specify its hostname instead of “localhost” above and omit the ssh command in order to have UDP traffic handled.
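Name resolution is one place where the TCP-only limitation can matter. curl's --socks5-hostname option passes the hostname to the SOCKS proxy so that it is resolved on the far side of the tunnel rather than on the compute node. A minimal sketch, assuming the “ssh -D 1080” proxy from the example above is still running; the wrapper name socks_curl and the SOCKS_PORT variable are hypothetical:

```shell
# Hypothetical wrapper: send a curl request through the local "ssh -D"
# SOCKS proxy and let the proxy side resolve the hostname.
# SOCKS_PORT is an illustrative variable; it defaults to 1080, the port
# used in the "ssh -D 1080" example.
socks_curl() {
    curl --socks5-hostname "localhost:${SOCKS_PORT:-1080}" "$@"
}
```

With this in place, “socks_curl -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz” would perform the same download as above. Whether this is needed depends on whether the compute node can already resolve the hostname itself.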

For software that does not have built-in support for SOCKS, it’s possible to wrap the software with a library that intercepts networking calls and routes the traffic via the “ssh -D” SOCKS proxy (or a real SOCKS proxy, if you have one accessible to you on the U-M network). This will allow most software running on compute nodes to access the Internet. ARC-TS clusters provide one such SOCKS wrapper, socksify, by default:

[markmont@nyx5441 ~]$ telnet towel.blinkenlights.nl 666  # this won't work...
telnet: connect to address No route to host
Trying 2a02:898:17:8000::42...
[markmont@nyx5441 ~]$ ssh -f -N -D 1080 flux-xfer.arc-ts.umich.edu # if it's not still running from above
[markmont@nyx5441 ~]$ socksify telnet towel.blinkenlights.nl 666

=== The BOFH Excuse Server ===
the real ttys became pseudo ttys and vice-versa.

Connection closed by foreign host.
[markmont@nyx5441 ~]$

You can even surf the web in text mode from a compute node:

[markmont@nyx5441 ~]$ socksify links http://xsede.org/

socksify is the client part of the Dante SOCKS server.

Local SSH tunneling (“ssh -L”)

A final option for accessing the Internet from an ARC-TS compute node is to set up a local SSH tunnel using the “ssh -L” command. This provides a local port on the compute node that processes can connect to in order to reach a single specific remote port on a single specific host on a non-UM network.

MongoDB example

Here is an example that shows how to use a local tunnel to access a MongoDB database hosted off-campus from inside a job running on a compute node. First, on a cluster login node, run the following command in order to get the keys for flux-xfer.arc-ts.umich.edu added to your ~/.ssh/known_hosts file. This needs to be done interactively so that you can respond to the prompt that the ssh command gives you:

[markmont@flux-login1 ~]$ ssh flux-xfer.arc-ts.umich.edu
The authenticity of host 'flux-xfer.arc-ts.umich.edu (' can't be established.
RSA key fingerprint is 6f:8c:67:df:43:4f:e0:fc:80:5b:49:1a:eb:81:cc:54.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'flux-xfer.arc-ts.umich.edu' (RSA) to the list of known hosts.
Advanced Research Computing - Technology Services
University of Michigan

This machine is intended for transferring data to and from the cluster
flux using scp and sftp only. Use flux-login.arc-ts.umich.edu for
interactive use.

For usage information, policies, and updates, please see:

Thank you for using U-M information technology resources responsibly.

^CConnection to flux-xfer.arc-ts.umich.edu closed.
[markmont@flux-login1 ~]$

After you see the login banner, the connection will hang, so press Control-C to terminate it and get your shell prompt back.

You can now run the following commands in a job, either interactively or in a PBS script, in order to access the MongoDB database at db.example.com from a compute node:

# Start the tunnel so that port 27017 on the compute node connects to port 27017 on db.example.com:
ssh -N -L 27017:db.example.com:27017 flux-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# You can now access the MongoDB database at db.example.com by connecting to localhost instead.
# For example, if you have the “mongo” command installed in your current directory, you could run the
# following command to view the collections available in the “admin” database:
./mongo --username MY_USERNAME --password "MY_PASSWORD" localhost/admin --eval 'db.getCollectionNames();'
# When you are all done using it, tear down the tunnel:
kill %1
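The fixed “sleep 5” in the example above is a guess at how long the tunnel takes to come up. One alternative is to poll the local port until it accepts connections. A minimal sketch, assuming bash is available (its /dev/tcp redirection is used for the connection test); the function name wait_for_port is hypothetical:

```shell
# Hypothetical helper: poll until HOST:PORT accepts a TCP connection,
# giving up after TIMEOUT seconds (default 30).
# Usage: wait_for_port HOST PORT [TIMEOUT]
wait_for_port() {
    host=$1 port=$2 timeout=${3:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
        # the child shell exits immediately, which closes the connection.
        if bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

In the MongoDB example, “sleep 5” could then be replaced by “wait_for_port localhost 27017 30 || exit 1”, so that the job stops early if the tunnel never comes up. Note that this only confirms that the local end of the tunnel is listening, not that the remote database is reachable.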

scp example

Here is an example that shows how to use a local tunnel to copy a file using scp from a remote system (residing on a non-UM network) named “far-away.example.com” onto an ARC-TS cluster from inside a job running on a compute node.

You should run the following commands inside an interactive PBS job the first time so that you can respond to prompts to accept various keys, as well as enter your password for far-away.example.com when prompted.

# Start the tunnel so that port 2222 on the compute node connects to port 22 on far-away.example.com:
ssh -N -L 2222:far-away.example.com:22 flux-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# Copy the file “my-data-set.csv” from far-away.example.com to the compute node:
# Replace “your-user-name” with the username by which far-away.example.com knows you.
# If you don’t have public key authentication set up from the cluster for far-away.example.com, you’ll
# be prompted for your far-away.example.com password
scp -P 2222 your-user-name@localhost:my-data-set.csv .
# When you are all done using it, tear down the tunnel:
kill %1

Once you have run these commands interactively from a compute node, they can be used in non-interactive PBS batch jobs, provided you have also set up public key authentication for far-away.example.com.
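In a batch job nobody is present to answer prompts or to clean up a tunnel that is left behind, so it can help to wrap the start, use, and teardown steps in one function. The following is a minimal sketch rather than an ARC-TS-provided tool: the function name with_tunnel and the TUNNEL_STARTUP_DELAY variable are hypothetical. It uses the OpenSSH options BatchMode (fail instead of prompting) and ExitOnForwardFailure (exit if the forwarding cannot be set up), and it assumes that ssh can authenticate to the transfer host without a prompt.

```shell
# Hypothetical wrapper: start an "ssh -L" tunnel, run a command, and
# always tear the tunnel down afterwards.
# Usage: with_tunnel LOCALPORT:REMOTEHOST:REMOTEPORT GATEWAY COMMAND [ARGS...]
with_tunnel() {
    forward=$1 gateway=$2
    shift 2
    ssh -N -o BatchMode=yes -o ExitOnForwardFailure=yes \
        -L "$forward" "$gateway" &
    tunnel_pid=$!
    # Give the tunnel time to start (5 seconds unless overridden).
    sleep "${TUNNEL_STARTUP_DELAY:-5}"
    status=0
    "$@" || status=$?
    # Stop the tunnel whether or not the command succeeded.
    kill "$tunnel_pid" 2>/dev/null || true
    wait "$tunnel_pid" 2>/dev/null || true
    return "$status"
}
```

The scp example above would then become “with_tunnel 2222:far-away.example.com:22 flux-xfer.arc-ts.umich.edu scp -P 2222 your-user-name@localhost:my-data-set.csv .”, and the function's exit status is that of the scp command.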

Data Science Platform (Hadoop)

The ARC-TS Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:

  • high-bandwidth data transfer to and from other campus data storage locations with no data transfer costs
  • very high-speed inter-node connections using 40Gb/s Ethernet

The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.3.0, and several additional data science tools.

Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:

  • Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
  • Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
  • Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
  • Rmr, an extension of the R Statistical Language to support distributed processing of large datasets stored in the Hadoop Distributed File System.
  • Spark, a general processing engine compatible with Hadoop data.
  • mrjob, a Python package that allows MapReduce jobs written in Python to run on Hadoop.

The software versions are as follows:

Software       Version
Hadoop         2.5.0
Hive           0.13.1
Sqoop          1.4.5
Pig            0.12.0
R/rhdfs/rmr    3.0.3
Spark          1.2.0
mrjob          0.4.3-dev, commit


If a cloud-based system is more suitable for your research, ARC-TS can support your use of Amazon cloud resources through MCloud, the UM-ITS cloud service.

For more information on the Hadoop cluster, please see this documentation or contact us at data-science-support@umich.edu.

A Flux account is required to access the Hadoop cluster. Visit the Establishing a Flux allocation page for more information.