
Implement WebHDFS to access remote HDFS storage #7828

Open
ptallada opened this issue May 7, 2024 · 4 comments

ptallada commented May 7, 2024

Hi,

I've been playing with the HDFS remote (with some problems; I've opened a bug report).
I'm wondering if there is any plan to add support for WebHDFS or HttpFS to allow access to remote Hadoop clusters (probably via Knox).

The current HDFS remote requires full network visibility of all the machines involved (namenode and datanodes), and this is not always available, for instance when transferring from the "outside".

ncw (Member) commented May 9, 2024

WebHDFS appears to be a REST API for HDFS. Is that commonly deployed? How much use would it get?

HttpFS appears to be compatible with WebHDFS, so one backend would cover both of them.
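Since WebHDFS is just HTTP plus JSON, a client call is easy to sketch. Below is a minimal illustration (not rclone code) that lists a directory with op=LISTSTATUS; the host namenode.example, port 9870 (the Hadoop 3.x default, 2.x typically uses 50070) and the demo user are assumptions:

```go
// Minimal sketch (not rclone code): list a directory via the WebHDFS REST API.
// Assumptions: namenode HTTP endpoint at namenode.example:9870 and simple
// authentication via the user.name query parameter.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// listStatusResponse mirrors the JSON returned by op=LISTSTATUS.
type listStatusResponse struct {
	FileStatuses struct {
		FileStatus []struct {
			PathSuffix string `json:"pathSuffix"`
			Type       string `json:"type"` // "FILE" or "DIRECTORY"
			Length     int64  `json:"length"`
		} `json:"FileStatus"`
	} `json:"FileStatuses"`
}

func main() {
	// GET /webhdfs/v1/<path>?op=LISTSTATUS is the standard listing call.
	url := "http://namenode.example:9870/webhdfs/v1/user/demo?op=LISTSTATUS&user.name=demo"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out listStatusResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	for _, f := range out.FileStatuses.FileStatus {
		fmt.Printf("%-9s %12d %s\n", f.Type, f.Length, f.PathSuffix)
	}
}
```

As far as I understand, for data operations (op=OPEN, op=CREATE) the namenode replies with a redirect to a datanode, which is why plain WebHDFS still needs datanode visibility; HttpFS (and Knox) proxy the data through a single endpoint instead.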

So I think this would need a new backend to access it. Maybe your organization would like to sponsor the development of such a thing?

> (probably via Knox).

I don't know what that means - can you explain?

ncw added the new backend label May 9, 2024
ptallada (Author) commented May 9, 2024

WebHDFS is integrated with the HDFS Namenode, so it is available with just a bit of configuration.
I am not sure how one could estimate the use of such a remote. I would say that if you want to export HDFS data to the outside, for instance to integrate it with another service, WebHDFS is the way to go. Otherwise, you have to have full network visibility/connectivity with all the datanodes, and they usually run inside a private network. I would feel very insecure exposing them to the internet.

So exposing just WebHDFS is much safer and more controlled. Also, Apache Knox is an HTTP proxy that provides enhanced authentication/authorization on top of Hadoop web services, such as WebHDFS. Using Knox, your HDFS namenode can sit in your internal network, and you just expose the Knox proxy (in your DMZ).
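As a rough illustration (knox.example, port 8443 and the "default" topology are assumptions, not from a real deployment), the same LISTSTATUS call from the sketch above would go through the Knox gateway like this; Knox conventionally exposes WebHDFS under /gateway/&lt;topology&gt;/webhdfs/v1 and handles authentication at the proxy:

```go
// Rough sketch: the same WebHDFS call routed through an Apache Knox gateway.
// knox.example, port 8443, the "default" topology and the demo credentials
// are assumptions; adjust them to match the actual Knox deployment.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	url := "https://knox.example:8443/gateway/default/webhdfs/v1/user/demo?op=LISTSTATUS"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	// Knox authenticates at the proxy (e.g. HTTP Basic against LDAP),
	// so the namenode never needs to be reachable from the client.
	req.SetBasicAuth("demo", "secret")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```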

About sponsoring such a development, yes, that would be a possibility. Could you give me a rough estimate of how much it would cost? (We can discuss this in private.)

ncw (Member) commented May 10, 2024

> WebHDFS is integrated with the HDFS Namenode, so it is available with just a bit of configuration.

That is good to know.

> I am not sure how one could estimate the use of such a remote. I would say that if you want to export HDFS data to the outside, for instance to integrate it with another service, WebHDFS is the way to go. Otherwise, you have to have full network visibility/connectivity with all the datanodes, and they usually run inside a private network. I would feel very insecure exposing them to the internet.

That is good to know as well. I have almost no practical experience with HDFS other than with the rclone backend - I don't have access to a real cluster, only the docker test image we use!

> So exposing just WebHDFS is much safer and more controlled. Also, Apache Knox is an HTTP proxy that provides enhanced authentication/authorization on top of Hadoop web services, such as WebHDFS. Using Knox, your HDFS namenode can sit in your internal network, and you just expose the Knox proxy (in your DMZ).

> About sponsoring such a development, yes, that would be a possibility. Could you give me a rough estimate of how much it would cost? (We can discuss this in private.)

Probably best to drop an email to sales@rclone.com and we can discuss. Thank you.

ptallada (Author) commented

> > WebHDFS is integrated with the HDFS Namenode, so it is available with just a bit of configuration.

> That is good to know.

> > I am not sure how one could estimate the use of such a remote. I would say that if you want to export HDFS data to the outside, for instance to integrate it with another service, WebHDFS is the way to go. Otherwise, you have to have full network visibility/connectivity with all the datanodes, and they usually run inside a private network. I would feel very insecure exposing them to the internet.

> That is good to know as well. I have almost no practical experience with HDFS other than with the rclone backend - I don't have access to a real cluster, only the docker test image we use!

If you need access to a real Hadoop cluster, just let me know :)
I work for an academic institution working with massive scientific data, and most of our services are free for individual users.

> > So exposing just WebHDFS is much safer and more controlled. Also, Apache Knox is an HTTP proxy that provides enhanced authentication/authorization on top of Hadoop web services, such as WebHDFS. Using Knox, your HDFS namenode can sit in your internal network, and you just expose the Knox proxy (in your DMZ).
> > About sponsoring such a development, yes, that would be a possibility. Could you give me a rough estimate of how much it would cost? (We can discuss this in private.)

> Probably best to drop an email to sales@rclone.com and we can discuss. Thank you.

Sure!
