Flume TwitterAgent.conf and demo02

thedatasociety · Sep 13, 2019 · eda64c0 · eda64c0
1 parent 0471782
commit eda64c0
Show file tree

Hide file tree

Showing 2 changed files with 245 additions and 0 deletions.
diff --git a/labs/lab6-flume/lab6-flume.demo02.ipynb b/labs/lab6-flume/lab6-flume.demo02.ipynb
@@ -0,0 +1,200 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Criando uma App no Twitter\n",
+    "\n",
+    "## Acessar o endereço abaixo e criar uma App: https://apps.twitter.com/\n",
+    "\n",
+    "Criar login, senha e logar\n",
+    "\n",
+    "Criar uma nova App clicando em Create New App\n",
+    "\n",
+    "Definir os detalhes da aplicação: nome, descrição, website, etc\n",
+    "\n",
+    "**No menu \"Keys and Tokens\" gerar as chaves da App para usar na configuração do Flume e substituir no arquivo twitterAgent.conf abaixo.**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Configurando o agent, source, channel e sink\n",
+    "\n",
+    "**Agent**: Apenas um agente chamado *TwitterAgent*\n",
+    "\n",
+    "**Source**: Twitter\n",
+    "\n",
+    "**Channel**: Memória\n",
+    "\n",
+    "**Sink**: Registra os dados no HDFS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "#Nome dos componentes do agente\n",
+      "TwitterAgent.sources = Twitter\n",
+      "TwitterAgent.channels = MemChannel\n",
+      "TwitterAgent.sinks = HDFS\n",
+      "\n",
+      "#Configuração do Source\n",
+      "TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource\n",
+      "TwitterAgent.sources.Twitter.consumerKey = 6BWSmQX6AUfKhcNsMeav9zhi2\n",
+      "TwitterAgent.sources.Twitter.consumerSecret = DcYHM3EFR5oJR7VEq8cBVtjTPQxftI9PMrST71P7oXW0BlGiZv\n",
+      "TwitterAgent.sources.Twitter.accessToken = 1046705580-MTYNfMbLL6XSyQQgeL3Sah9RejwDRK5caBO9GRZ\n",
+      "TwitterAgent.sources.Twitter.accessTokenSecret = 2FRFyHQEdAIFVCFgTxyxCvl4zqoNtTEMZuwJHCfhXW2jk\n",
+      "TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata\n",
+      "\n",
+      "#Configuração do Sink\n",
+      "TwitterAgent.sinks.HDFS.type = hdfs\n",
+      "TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus\n",
+      "TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream\n",
+      "TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text\n",
+      "\n",
+      "#number of events written to file before it is flushed to HDFS\n",
+      "TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 \n",
+      "\n",
+      "#File size to trigger roll, in bytes (0: never roll based on file size)\n",
+      "TwitterAgent.sinks.HDFS.hdfs.rollSize = 0\n",
+      "\n",
+      "#Number of events written to file before it rolled (0 = never roll based on number of events)\n",
+      "TwitterAgent.sinks.HDFS.hdfs.rollCount = 50\n",
+      "\n",
+      "#Number of seconds to wait before rolling current file (0 = never roll based on time interval)\n",
+      "TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0\n",
+      "\n",
+      "#Configuração do Channel\n",
+      "TwitterAgent.channels.MemChannel.type = memory\n",
+      "#The maximum number of events stored in the channel\n",
+      "TwitterAgent.channels.MemChannel.capacity = 100\n",
+      "#The maximum number of events the channel will take from a source or give to a sink per transaction\n",
+      "TwitterAgent.channels.MemChannel.transactionCapacity = 100\n",
+      "\n",
+      "#Conectando Source, Sink, Channel\n",
+      "TwitterAgent.sources.Twitter.channels = MemChannel\n",
+      "TwitterAgent.sinks.HDFS.channel = MemChannel\n",
+      "\n",
+      "\n",
+      "#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json\n",
+      "#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object"
+     ]
+    }
+   ],
+   "source": [
+    "!cat /home/jovyan/labs/lab6-flume/twitterAgent.conf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Nota: \n",
+    "Usando as configurações descritas no site do Flume - **TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource** - na configuração do source, o arquivo é gerado com caracteres ilegíveis. Assim, iremos utilizar **TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource** e temos que copiar o arquivo flume-sources-1.0-SNAPSHOT.jar para a pasta do Flume para o seu correto funcionamento."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cp resources/flume-sources-1.0-SNAPSHOT.jar ~/resources/local/flume-${FLUME_VERSION}/lib"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Abrir um terminal e iniciar o FlumeAgent\n",
+    "\n",
+    "``` bash\n",
+    "flume-ng agent --conf conf --conf-file labs/lab6-flume/twitterAgent.conf --name TwitterAgent -Dflume.looger=INFO,console\n",
+    "``` "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TrendTopics\n",
+    "Criando um rank de palavras que contém #"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "No configs found; falling back on auto-configuration\n",
+      "No configs specified for hadoop runner\n",
+      "Looking for hadoop binary in /home/jovyan/resources/local/hadoop-2.9.2/bin...\n",
+      "Found hadoop binary: /home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop\n",
+      "Using Hadoop version 2.9.2\n",
+      "Creating temp directory /tmp/mrjob-ex-3.jovyan.20190913.175842.269968\n",
+      "uploading working dir files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd...\n",
+      "Copying other local files to hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/\n",
+      "Running step 1 of 2...\n",
+      "  packageJobJar: [/tmp/hadoop-unjar4483334449690778934/] [] /tmp/streamjob4300404582507966699.jar tmpDir=null\n",
+      "  Connecting to ResourceManager at /0.0.0.0:8032\n",
+      "  Connecting to ResourceManager at /0.0.0.0:8032\n",
+      "  Total input files to process : 4\n",
+      "  Cleaning up the staging area /tmp/hadoop-yarn/staging/jovyan/.staging/job_1568395552187_0004\n",
+      "  Error Launching job : Not a file: hdfs://localhost:9000/user/matheus/output/output8\n",
+      "  Streaming Command Failed!\n",
+      "Attempting to fetch counters from logs...\n",
+      "Can't fetch history log; missing job ID\n",
+      "No counters found\n",
+      "Scanning logs for probable cause of failure...\n",
+      "Can't fetch history log; missing job ID\n",
+      "Can't fetch task logs; missing application ID\n",
+      "Step 1 of 2 failed: Command '['/home/jovyan/resources/local/hadoop-2.9.2/bin/hadoop', 'jar', '/home/jovyan/resources/local/hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar', '-files', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob-ex-3.py#mrjob-ex-3.py,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///user/matheus/*', '-output', 'hdfs:///user/jovyan/tmp/mrjob/mrjob-ex-3.jovyan.20190913.175842.269968/step-output/0000', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 mrjob-ex-3.py --step-num=0 --reducer']' returned non-zero exit status 1280.\n"
+     ]
+    }
+   ],
+   "source": [
+    "!python resources/mrjob-ex-3.py -r hadoop --hadoop-streaming-jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar hdfs:///user/matheus/*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/labs/lab6-flume/twitterAgent.conf b/labs/lab6-flume/twitterAgent.conf
@@ -0,0 +1,45 @@
+#Nome dos componentes do agente
+TwitterAgent.sources = Twitter
+TwitterAgent.channels = MemChannel
+TwitterAgent.sinks = HDFS
+
+#Configuração do Source
+TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
+TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
+TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
+TwitterAgent.sources.Twitter.accessToken = <accessToken>
+TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
+TwitterAgent.sources.Twitter.keywords = #hadoop, #flume, #bigdata
+
+#Configuração do Sink
+TwitterAgent.sinks.HDFS.type = hdfs
+TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/matheus
+TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
+TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
+
+#number of events written to file before it is flushed to HDFS
+TwitterAgent.sinks.HDFS.hdfs.batchSize = 50 
+
+#File size to trigger roll, in bytes (0: never roll based on file size)
+TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
+
+#Number of events written to file before it rolled (0 = never roll based on number of events)
+TwitterAgent.sinks.HDFS.hdfs.rollCount = 50
+
+#Number of seconds to wait before rolling current file (0 = never roll based on time interval)
+TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
+
+#Configuração do Channel
+TwitterAgent.channels.MemChannel.type = memory
+#The maximum number of events stored in the channel
+TwitterAgent.channels.MemChannel.capacity = 100
+#The maximum number of events the channel will take from a source or give to a sink per transaction
+TwitterAgent.channels.MemChannel.transactionCapacity = 100
+
+#Conectando Source, Sink, Channel
+TwitterAgent.sources.Twitter.channels = MemChannel
+TwitterAgent.sinks.HDFS.channel = MemChannel
+
+
+#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
+#https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object