Accessing the Internet from ARC-TS compute nodes

Normally, compute nodes on ARC-TS clusters cannot directly access the Internet because they have private IP addresses. This increases cluster security while reducing costs (IPv4 addresses are limited, and ARC-TS clusters do not currently support IPv6). However, it also means that jobs cannot install software, download files, or access databases on servers located outside University of Michigan networks: the private IP addresses used by the cluster are routable on campus but not off campus.

If your work requires these tasks, there are three ways to allow jobs running on ARC-TS clusters to access the Internet, described below. The best method depends largely on the software you are using. If your software supports HTTP proxying, use that. If not, SOCKS proxying or SSH tunneling may be suitable.

HTTP proxying

HTTP proxying, sometimes called “HTTP forward proxying”, is the simplest and most robust way to access the Internet from ARC-TS clusters. However, it has two main limitations:

  • Some software packages do not support HTTP proxying.
  • HTTP proxying supports only the HTTP, HTTPS, and FTP protocols.

If either limitation applies (for example, if your software needs a database protocol such as MySQL), explore SOCKS proxying or SSH tunneling, described below.

Some popular software packages that support HTTP proxying include:

HTTP proxying is set up automatically when you log in to ARC-TS clusters, and any software that supports HTTP proxying should use it without any special action on your part.

Here is an example that shows installing the Python package opencv-python from within an interactive job running on a Great Lakes compute node:

[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ pip install --user opencv-python
Collecting opencv-python
Downloading https://files.pythonhosted.org/packages/34/a3/403dbaef909fee9f9f6a8eaff51d44085a14e5bb1a1ff7257117d744986a/opencv_python-4.2.0.32-cp37-cp37m-manylinux1_x86_64.whl (28.2MB)
|████████████████████████████████| 28.2MB 3.2MB/s
Requirement already satisfied: numpy>=1.14.5 in /sw/arcts/centos7/python3.7-anaconda/2019.07/lib/python3.7/site-packages (from opencv-python) (1.16.4)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.2.0.32

If pip did not support HTTP proxying (or the proxy were otherwise not working), pip could not reach the Internet to install the opencv-python package, and you would see “Connection timed out”, “No route to host”, or “Connection failed” error messages when you tried to install it.

Information for advanced users

HTTP proxying is controlled by the following environment variables, which are set automatically on each compute node:

export http_proxy="http://proxy.arc-ts.umich.edu:3128/"
export https_proxy="http://proxy.arc-ts.umich.edu:3128/"
export ftp_proxy="http://proxy.arc-ts.umich.edu:3128/"
export no_proxy="localhost,127.0.0.1,.localdomain,.umich.edu"
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export FTP_PROXY="${ftp_proxy}"
export NO_PROXY="${no_proxy}"
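
To confirm what a job will actually inherit, the following minimal sketch lists the proxy-related variables exported in the current shell. The function name show_proxy_settings is hypothetical and not something ARC-TS provides; only exported variables are shown, because those are the ones child processes such as pip or curl can see.

```shell
#!/bin/bash
# Hypothetical helper (not provided by ARC-TS): list the proxy-related
# environment variables that are exported in the current shell.
show_proxy_settings() {
    # Match both lower-case and upper-case spellings, e.g. http_proxy and HTTP_PROXY.
    settings="$(env | grep -i -E '^(http|https|ftp|no)_proxy=' | sort)"
    if [ -n "$settings" ]; then
        printf '%s\n' "$settings"
    else
        echo "No proxy variables are exported in this shell."
    fi
}

show_proxy_settings
```

If the output is the "No proxy variables" message on a compute node, software started from that shell will not use the HTTP proxy.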

Once these are set in your environment, you can access the Internet from compute nodes; for example, you can install Python and R libraries. Unlike the SOCKS and SSH tunneling methods described below, HTTP proxying does not require you to start any daemons. The HTTP proxy server proxy.arc-ts.umich.edu supports HTTPS but does not terminate the TLS session at the proxy: traffic is encrypted by the software you run and is not decrypted until it reaches the destination server on the Internet.

To prevent software from using HTTP proxying, run the following command:

unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY

The above command affects only software started from the current shell. If you start a new shell (for example, by opening a new window or logging in again), you will need to re-run the command. To permanently disable HTTP proxying for all software, add the command above to the end of your ~/.bashrc file.
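
If only one program needs to bypass the proxy, a subshell keeps the change from leaking into the rest of your session. The sketch below assumes a hypothetical wrapper named without_proxy; it unsets the same variables as the command above, but only for the single command it runs.

```shell
#!/bin/bash
# Hypothetical wrapper: run one command with all proxy variables unset.
# The parentheses start a subshell, so the calling shell keeps its settings.
without_proxy() {
    (
        unset http_proxy https_proxy ftp_proxy no_proxy \
              HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY
        "$@"
    )
}

# Demonstration: the child process does not see http_proxy.
without_proxy sh -c 'echo "http_proxy inside wrapper: ${http_proxy:-<unset>}"'
```

After the wrapped command finishes, the proxy variables in your shell are unchanged, so other software continues to use the proxy.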

Finally, note that HTTP proxying (which is forward proxying) should not be confused with reverse proxying. Reverse proxying, which is done by the ARC Connect service, allows researchers to start web applications (including Jupyter notebooks, RStudio sessions, and Bokeh apps) on compute nodes and then access those web applications through ARC Connect.

SOCKS

A second solution is available for any software that either supports the SOCKS protocol or that can be “made to work” with SOCKS. Most software does not support SOCKS, but here is an example using curl (which does have built-in support for SOCKS) to download a file from the Internet from inside an interactive job running on a Great Lakes compute node. We use “ssh -D” to set up a “quick and dirty” SOCKS proxy server for curl to use:

[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ ssh -f -N -D 1080 greatlakes-xfer.arc-ts.umich.edu
[user@gl3288 ~]$ curl --socks5 localhost:1080 -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  272k  100  272k    0     0   375k      0 --:--:-- --:--:-- --:--:--  375k
[user@gl3288 ~]$ ls -l bc-1.06.tar.gz
-rw-rw-r-- 1 user user 278926 Feb 3 16:09 bc-1.06.tar.gz

A limitation of “ssh -D” is that it handles only TCP traffic, not UDP traffic (including DNS lookups, which normally happen over UDP). However, if you have a real SOCKS proxy accessible to you elsewhere on the U-M network (such as on a server in your lab), you can specify its hostname instead of “localhost” above and omit the ssh command so that UDP traffic is handled as well.
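
For curl specifically, one way to work around the DNS limitation is its --socks5-hostname option, which passes the hostname to the SOCKS proxy so that the name is resolved at the far end of the ssh connection rather than on the compute node. The sketch below is an illustration, not an ARC-TS recommendation: the SOCKS_PORT variable and the socks_curl function are hypothetical names, and it assumes “ssh -D” is already listening on that port.

```shell
#!/bin/bash
# Port on which "ssh -f -N -D <port> ..." is assumed to be listening already.
SOCKS_PORT="${SOCKS_PORT:-1080}"

# Hypothetical wrapper: run curl through the local SOCKS proxy and let the
# proxy resolve hostnames (--socks5-hostname) instead of resolving them locally.
socks_curl() {
    curl --socks5-hostname "localhost:${SOCKS_PORT}" "$@"
}

# Usage inside a job, after starting the proxy with ssh -D (not run here):
#   socks_curl -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
```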

Local SSH tunneling (“ssh -L”)

A final option for accessing the Internet from an ARC-TS compute node is to set up a local SSH tunnel using the “ssh -L” command. This provides a local port on the compute node that processes can connect to in order to reach a single specific remote port on a single specific host on a non-UM network.

MongoDB example

Here is an example that shows how to use a local tunnel to access a MongoDB database hosted off-campus from inside a job running on a compute node. First, on a cluster login node, run the following command to add the host keys for greatlakes-xfer.arc-ts.umich.edu to your ~/.ssh/known_hosts file. This needs to be done interactively so that you can respond to the prompt that the ssh command gives you:

[user@gl-login ~]$ ssh greatlakes-xfer.arc-ts.umich.edu
Warning: the ECDSA host key for 'greatlakes-xfer.arc-ts.umich.edu' differs from the key for the IP address '141.211.192.36'
Offending key for IP in /etc/ssh/ssh_known_hosts:78
Matching host key in /home/user/.ssh/known_hosts:3
Are you sure you want to continue connecting (yes/no)? yes
************************************************************************
* By your use of these resources, you agree to abide by Proper Use of *
* Information Resources, Information Technology, and Networks at the *
* University of Michigan (SPG 601.07), in addition to all relevant *
* state and federal laws. http://spg.umich.edu/policy/601.07 *
************************************************************************

After you see the login banner, the connection will hang, so press Control-C to terminate it and get your shell prompt back.

You can now run the following commands in a job, either interactively or in a batch script, in order to access the MongoDB database at db.example.com from a compute node:

# Start the tunnel so that port 27017 on the compute node connects to port 27017 on db.example.com:
ssh -N -L 27017:db.example.com:27017 greatlakes-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# You can now access the MongoDB database at db.example.com by connecting to localhost instead.
# For example, if you have the "mongo" command installed in your current directory, you could run the
# following command to view the collections available in the "admin" database:
./mongo --username MY_USERNAME --password "MY_PASSWORD" localhost/admin --eval 'db.getCollectionNames();'
# When you are all done using it, tear down the tunnel:
kill %1
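
A fixed five-second sleep can be too short on a slow connection and longer than needed on a fast one. As an alternative, the sketch below polls the local port until the tunnel accepts connections. The wait_for_port function is a hypothetical helper, and it relies on bash's built-in /dev/tcp redirection, so it assumes the job script runs under bash.

```shell
#!/bin/bash
# Hypothetical helper: wait until host:port accepts TCP connections.
# Arguments: host, port, number of one-second attempts (default 10).
# Returns 0 once a connection succeeds, 1 if every attempt fails.
wait_for_port() {
    local host="$1" port="$2" tries="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        # The subshell opens a test connection and closes it again on exit.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage inside a job (not run here):
#   ssh -N -L 27017:db.example.com:27017 greatlakes-xfer.arc-ts.umich.edu &
#   tunnel_pid=$!
#   wait_for_port localhost 27017 15 || { kill "$tunnel_pid"; exit 1; }
#   ... use the tunnel ...
#   kill "$tunnel_pid"
```

Recording the tunnel's process ID with $! and killing that ID is a small variation on “kill %1” that does not depend on the tunnel being the first background job in the script.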

scp example

Here is an example that shows how to use a local tunnel to copy a file using scp from a remote system (residing on a non-UM network) named “far-away.example.com” onto an ARC-TS cluster from inside a job running on a compute node.

You should run the following commands inside an interactive Slurm job the first time so that you can respond to prompts to accept various keys, as well as enter your password for far-away.example.com when prompted.

# Start the tunnel so that port 2222 on the compute node connects to port 22 on far-away.example.com:
ssh -N -L 2222:far-away.example.com:22 greatlakes-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# Copy the file "my-data-set.csv" from far-away.example.com to the compute node.
# Replace "your-user-name" with the username by which far-away.example.com knows you.
# If you don't have public key authentication set up from the cluster for far-away.example.com, you'll
# be prompted for your far-away.example.com password.
scp -P 2222 your-user-name@localhost:my-data-set.csv .
# When you are all done using it, tear down the tunnel:
kill %1

After you have run these commands once, interactively, from a compute node, you can use them in non-interactive Slurm batch jobs, provided that you have also set up public key authentication for far-away.example.com.
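
As a rough illustration of the batch case, the sketch below wraps the same steps in a function and calls it only when running under Slurm. The job name, time limit, and the copy_via_tunnel function are hypothetical. The ssh options BatchMode=yes (fail instead of prompting for a password or key confirmation) and ExitOnForwardFailure=yes (exit if the local port cannot be forwarded) are additions that the commands above do not use; they are suggested here so that a batch job fails quickly rather than hanging at a prompt.

```shell
#!/bin/bash
#SBATCH --job-name=tunnel-copy   # hypothetical job name
#SBATCH --account=test           # account used in the examples above
#SBATCH --time=00:10:00          # hypothetical time limit

# Hypothetical helper: copy one file from a remote host through a local tunnel.
# Arguments: remote user, remote host, remote file, local port (default 2222).
copy_via_tunnel() {
    local user="$1" host="$2" file="$3" port="${4:-2222}"
    local tunnel_pid status=0
    # Start the tunnel in the background and remember its process ID.
    ssh -N -o BatchMode=yes -o ExitOnForwardFailure=yes \
        -L "${port}:${host}:22" greatlakes-xfer.arc-ts.umich.edu &
    tunnel_pid=$!
    # Give the tunnel time to start up, as in the interactive example.
    sleep 5
    # Copy the file; remember scp's exit status so the job can report failure.
    scp -o BatchMode=yes -P "$port" "${user}@localhost:${file}" . || status=$?
    # Tear down the tunnel whether or not the copy succeeded.
    kill "$tunnel_pid" 2>/dev/null || true
    return "$status"
}

# Run the copy only when this script is executing as a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    copy_via_tunnel your-user-name far-away.example.com my-data-set.csv
fi
```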