Tuesday, January 20, 2015

C# WebHDFS Client Operations



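1. Most Hadoop code is written in Java because Hadoop itself is built on Java, which makes its libraries the most convenient to use. To work with HDFS from C#, you can instead use the RESTful interface that Hadoop exposes for HDFS, known as WebHDFS.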
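To make the REST interface concrete, here is a minimal sketch of what a WebHDFS request URL looks like. The helper name BuildUrl, the host namenode.example.com, the user hadoop, and the port 50070 are assumptions made for illustration; check the host and HTTP port your own NameNode uses.

```csharp
using System;

public static class WebHdfsUrlSketch
{
    // Builds a WebHDFS request URL of the form
    // http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>&user.name=<user>
    // BuildUrl is a hypothetical helper, not part of any SDK.
    public static string BuildUrl(string host, int port, string hdfsPath,
                                  string operation, string userName)
    {
        // HDFS paths are absolute; drop the leading slash so it is not doubled.
        string path = hdfsPath.TrimStart('/');
        return string.Format(
            "http://{0}:{1}/webhdfs/v1/{2}?op={3}&user.name={4}",
            host, port, path, operation, Uri.EscapeDataString(userName));
    }

    public static void Main()
    {
        // Hypothetical NameNode host, port, and user name.
        Console.WriteLine(
            BuildUrl("namenode.example.com", 50070, "/user/demo", "LISTSTATUS", "hadoop"));
    }
}
```

An HTTP GET on a URL like this returns the directory listing as JSON. A client library saves you from building such URLs and parsing the responses by hand.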


2. The development environment is Microsoft Visual Studio Express 2013, and the WebHDFS library is downloaded through NuGet.


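3. The library is part of the Microsoft .NET SDK For Hadoop: http://hadoopsdk.codeplex.com/. Only the WebClient package needs to be installed. Open the NuGet console built into Express 2013 and run the command that follows:

TOOLS -> NuGet Package Manager -> Package Manager Console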


install-package Microsoft.Hadoop.WebClient

4. Import the namespace

using Microsoft.Hadoop.WebHDFS;

5. Working with HDFS

      (1) First, create an HDFSOperation class to wrap the operations. Its constructor receives the WebHDFS URI and the user account from the caller.

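using System;                    // Uri, Console
using System.IO;                 // Directory, File, Stream
using System.Linq;               // ToList
using System.Text;               // StringBuilder
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;  // WebHDFSClient

namespace HDFSProject_CSharp
{
    class HDFSOperation
    {
        private WebHDFSClient myClient;

        // uriPath: address of the WebHDFS endpoint; userName: HDFS user account
        public HDFSOperation(string uriPath, string userName)
        {
            Uri myUri = new Uri(uriPath);
            myClient = new WebHDFSClient(myUri, userName);
        }
    }
}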

      (2) Write the upload, download, and directory-listing methods

        /**
        * Lists all files in an HDFS folder, one name per line
        * @param destFolderName folder path (HDFS)
        */
        public string GetDirectoryStatus(string destFolderName)
        {
            StringBuilder sbResult = new StringBuilder();
            // Non-blocking variant, kept for reference; the method could return before the names are appended:
            //myClient.GetDirectoryStatus(destFolderName).ContinueWith(ds => ds.Result.Files.ToList().ForEach(f => sbResult.Append(f.PathSuffix).Append("\n")));
            // Block on the task so the listing is complete before it is returned.
            var fileLists = myClient.GetDirectoryStatus(destFolderName).Result.Files.ToList();
            foreach (var f in fileLists)
            {
                sbResult.Append(f.PathSuffix).Append("\n");
            }
            return sbResult.ToString();
        }
        /**
        * Uploads a local file (srcPath) to the given path on the HDFS server (descPath)
        * @param srcPath source path (local); if it is a folder, every file in it is uploaded
        * @param descPath destination path (HDFS)
        */
        public void CopyFromLocal(string srcPath, string descPath)
        {
            Console.WriteLine("Start Upload....");
            if (Directory.Exists(srcPath))
            {
                var files = Directory.GetFiles(srcPath);
                // Path.GetFileName extracts the file name without manual separator handling.
                files.ToList().ForEach(file => myClient.CreateFile(file, descPath + "/" + Path.GetFileName(file)).Wait());
            }
            else
            {
                myClient.CreateFile(srcPath, descPath).Wait();
            }
            Console.WriteLine("End Upload....");
        }
        /**
        * Downloads a file from an HDFS path to a local path
        * @param srcPath source path (HDFS)
        * @param descPath destination path (local)
        */
        public void DownloadFromHDFS(string srcPath, string descPath)
        {
            Console.WriteLine("Start Download....");
            // Block on each task so "End Download" is printed only after the copy has finished.
            using (var response = myClient.OpenFile(srcPath).Result)
            using (Stream input = response.Content.ReadAsStreamAsync().Result)
            using (FileStream output = File.Create(descPath))
            {
                CopyStream(input, output);
            }
            Console.WriteLine("End Download....");
        }
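        /**
        * Copies the download through streams
        * @param input input stream from HDFS
        * @param output local output stream (closed when the copy finishes)
        */
        public static void CopyStream(Stream input, Stream output)
        {
            // Fixed-size buffer: input.Length is not supported on non-seekable streams,
            // and a buffer of that size would hold the whole file in memory at once.
            byte[] buffer = new byte[81920];
            int len;
            while ((len = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, len);
            }
            output.Flush();
            output.Close();
        }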
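
CopyFromLocal builds each destination path by joining the HDFS folder and the local file name with "/". The sketch below shows one way to make that join tolerant of a trailing slash on the folder and of either path separator in the local path. BuildDestPath and the sample paths are hypothetical names used only for illustration.

```csharp
using System;

public static class HdfsPathSketch
{
    // Joins an HDFS folder with the file-name part of a local path.
    // BuildDestPath is a hypothetical helper, not part of the SDK.
    public static string BuildDestPath(string destFolder, string localFile)
    {
        // Accept both Windows and Unix separators in the local path.
        int lastSep = localFile.LastIndexOfAny(new[] { '\\', '/' });
        string fileName = localFile.Substring(lastSep + 1);
        // Avoid a doubled slash when the folder already ends with one.
        return destFolder.TrimEnd('/') + "/" + fileName;
    }

    public static void Main()
    {
        // Hypothetical HDFS folder and local file.
        Console.WriteLine(BuildDestPath("/user/demo/", @"C:\data\report.csv"));
    }
}
```

Handling the separators by hand keeps the helper independent of the operating system it runs on, whereas Path.GetFileName only recognizes the separators of the current platform.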

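The manual read/write loop in CopyStream can also be written with the framework's built-in Stream.CopyTo, available since .NET Framework 4. This is a minimal sketch using in-memory streams in place of the HDFS response and the local file. The class name StreamCopySketch and the 81920-byte buffer size are choices made for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamCopySketch
{
    // Copies input to output in fixed-size chunks through Stream.CopyTo,
    // which does not need to know input.Length in advance.
    public static void Copy(Stream input, Stream output)
    {
        input.CopyTo(output, 81920);
        output.Flush();
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("hello hdfs");
        // MemoryStream stands in for the HDFS input and the local output file.
        using (var input = new MemoryStream(data))
        using (var output = new MemoryStream())
        {
            Copy(input, output);
            Console.WriteLine(Encoding.UTF8.GetString(output.ToArray()));
        }
    }
}
```

Unlike CopyStream, this variant leaves closing the output stream to the caller, which fits callers that wrap their streams in using blocks.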
3 comments:

  1. Hi, thanks for sharing this which is really useful! Now I meet a problem, do you know how to set up timeout for WebHDFSClient? It seems the default timeout is 60 seconds, I would like to set it to a much larger number.

  2. This comment has been removed by the author.

  3. Sorry for the delay in responding to your question.
    You can pass a TimeSpan parameter to the constructor when you create the client, such as new WebHDFSClient(uri, userName, timespan).
    If you are interested in this issue, you can follow it here:

    https://hadoopsdk.codeplex.com/SourceControl/network/forks/olivier1234/WebHDFSLargeFileFix/contribution/5762#!/tab/changes
