1. DataX for Greenplum installation
Download address
https://github.com/HashDataInc/DataX
Installation prerequisites
Install Maven
Download address 1: https://maven.apache.org/download.cgi
Package version 3.5.4; download the binary tarball and extract it to use.
Download address 2:
wget https://mirrors.cnnic.cn/apache/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz --no-check-certificate
2) Extract and install the Maven package
tar -xf apache-maven-3.5.4-bin.tar.gz
mv apache-maven-3.5.4 /usr/local/maven
ln -s /usr/local/maven/bin/mvn /usr/bin/mvn # when used together with Jenkins, Jenkins looks for the mvn command under /usr/bin/ and reports an error if it is missing
ll /usr/local/maven/
ll /usr/bin/mvn
3) Configure the environment variables
echo " ">>/etc/profile
echo "# Made for mvn env by zhaoshuai on $(date +%F)">>/etc/profile
echo 'export MAVEN_HOME=/usr/local/maven'>>/etc/profile
echo 'export PATH=$MAVEN_HOME/bin:$PATH'>>/etc/profile
tail -4 /etc/profile
source /etc/profile
echo $PATH
4) Check the installed mvn version
which mvn
mvn -version
Maven installation is now complete.
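Both the DataX build below and the later datax-web source build pull a large number of dependencies, so the first run over a slow network can take a long time. As an optional tweak (not part of the original steps), a nearby mirror can be added to Maven's settings.xml; the Aliyun public mirror below is only one example.
vi /usr/local/maven/conf/settings.xml
# inside the <mirrors> section add, for example:
#   <mirror>
#     <id>aliyunmaven</id>
#     <mirrorOf>*</mirrorOf>
#     <url>https://maven.aliyun.com/repository/public</url>
#   </mirror>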
Now build DataX itself from source.
Directory structure!!! This is the source tree, so the layout differs from the packaged release:
adswriter elasticsearchwriter hbase094xwriter hdfsreader mongodbreader odpswriter otsstreamreader postgresqlreader rpm txtfilereader
common ftpreader hbase11xreader hdfswriter mongodbwriter oraclereader otswriter postgresqlwriter sqlserverreader txtfilewriter
core ftpwriter hbase11xsqlwriter images mysqlreader oraclewriter package.xml rdbmsreader sqlserverwriter userGuid.md
datax-opensource-dingding.png gpdbjsonwriter hbase11xwriter introduction.md mysqlwriter ossreader plugin-rdbms-util rdbmswriter streamreader
drdsreader gpdbwriter hbasereader license.txt ocswriter osswriter plugin-unstructured-storage-util README streamwriter
drdswriter hbase094xreader hbasewriter mongodbjsonreader odpsreader otsreader pom.xml README.md transformer
Compile and package:
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
Final output:
[WARNING] Assembly file: /app/DataX/target/datax-v1.0.4-hashdata is not a regular file (it may be a directory). It cannot be attached to the project build for installation or deployment.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] datax-all v1.0.4-hashdata .......................... SUCCESS [ 11.003 s]
[INFO] datax-common ....................................... SUCCESS [01:33 min]
[INFO] datax-transformer .................................. SUCCESS [ 47.629 s]
[INFO] datax-core ......................................... SUCCESS [ 26.107 s]
[INFO] plugin-rdbms-util .................................. SUCCESS [ 8.208 s]
[INFO] mysqlreader ........................................ SUCCESS [ 0.990 s]
[INFO] sqlserverreader .................................... SUCCESS [ 3.124 s]
[INFO] streamreader ....................................... SUCCESS [ 5.794 s]
[INFO] mysqlwriter ........................................ SUCCESS [ 0.730 s]
[INFO] streamwriter ....................................... SUCCESS [ 0.582 s]
[INFO] sqlserverwriter .................................... SUCCESS [ 0.715 s]
[INFO] gpdbwriter ......................................... SUCCESS [ 2.225 s]
[INFO] plugin-unstructured-storage-util v1.0.4-hashdata ... SUCCESS [01:39 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:11 min
[INFO] Finished at: 2020-12-24T11:01:03+08:00
[INFO] ------------------------------------------------------------------------
Locate the output directory
After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax-v1.0.4-hashdata/datax/ , with the following structure:
This differs from the official documentation; the actual location is printed in the build output after packaging succeeds!! Look for it there!
[root@ares datax]# ls /app/DataX/target/datax-v1.0.4-hashdata
datax
[root@ares datax]# ls /app/DataX/target/datax-v1.0.4-hashdata/datax/
bin conf job lib plugin script tmp
Self-check script
Self-check script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
python /app/DataX/target/datax-v1.0.4-hashdata/datax/bin/datax.py /app/datax/job/job.json
The job.json I use here is the one that ships with the one-click install package, purely to smoke-test the build; the purpose of the job.json bundled with this source build is unclear (a sketch of the stream job actually run, reconstructed from the log, follows the output below, and the bundled json itself is shown after that).
2020-12-24 11:13:58.859 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-12-24 11:13:58.862 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-12-24 11:13:58.862 [main] INFO JobContainer - DataX jobContainer starts job.
2020-12-24 11:13:58.865 [main] INFO JobContainer - Set jobId = 0
2020-12-24 11:13:58.890 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-12-24 11:13:58.890 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2020-12-24 11:13:58.891 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-12-24 11:13:58.891 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-12-24 11:13:58.893 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
2020-12-24 11:13:58.894 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2020-12-24 11:13:58.895 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2020-12-24 11:13:58.924 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2020-12-24 11:13:58.930 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-12-24 11:13:58.933 [job-0] INFO JobContainer - Running by standalone Mode.
2020-12-24 11:13:58.944 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-12-24 11:13:58.950 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-12-24 11:13:58.950 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-12-24 11:13:58.966 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-12-24 11:13:59.067 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[102]ms
2020-12-24 11:13:59.068 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-12-24 11:14:08.958 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.051s | All Task WaitReaderTime 0.065s | Percentage 100.00%
2020-12-24 11:14:08.958 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2020-12-24 11:14:08.959 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2020-12-24 11:14:08.960 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2020-12-24 11:14:08.960 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2020-12-24 11:14:08.961 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /app/DataX/target/datax-v1.0.4-hashdata/datax/hook
2020-12-24 11:14:08.963 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2020-12-24 11:14:08.964 [job-0] INFO JobContainer - PerfTrace not enable!
2020-12-24 11:14:08.964 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.051s | All Task WaitReaderTime 0.065s | Percentage 100.00%
2020-12-24 11:14:08.965 [job-0] INFO JobContainer -
任务启动时刻 : 2020-12-24 11:13:58
任务结束时刻 : 2020-12-24 11:14:08
任务总计耗时 : 10s
任务平均流量 : 253.91KB/s
记录写入速度 : 10000rec/s
读出记录总数 : 100000
读写失败总数 : 0
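The job actually run above is the stream self-check shipped with the one-click package. Reconstructed from the log (streamreader → streamwriter, one channel, 100000 records, Max-Byte-Speed 10485760), it looks roughly like the sketch below; the column values are illustrative, and print is assumed to be false since no records appear in the console output.
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 100000,
                        "column": [
                            { "type": "string", "value": "DataX" },
                            { "type": "long", "value": 12345 }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": false,
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}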
The json bundled with the source build is as follows:
{
"job": {
"setting": {
"speed": {
"byte": 1048576
}
},
"content": [
{
"reader": {
"name": "sqlserverreader",
"parameter": {
// database connection username
"username": "ReadOnly01",
// database connection password
"password": "1qaz!QAZ",
"column": [
"id"
],
// "splitPk": "db_id",
"connection": [
{
"table": [
"table"
],
"jdbcUrl": [
"jdbc:sqlserver://192.168.0.65;DatabaseName=MyCost_Erp352"
]
}
]
}
},
"writer": {
"name": "sqlserverwriter",
"parameter": {
"username": "root",
"password": "root",
"column": [
"db_id",
"db_type",
"db_ip",
"db_port",
"db_role",
"db_name",
"db_username",
"db_password",
"db_modify_time",
"db_modify_user",
"db_description",
"db_tddl_info"
],
"connection": [
{
"table": [
"db_info_for_writer"
],
"jdbcUrl": "jdbc:sqlserver://[HOST_NAME]:PORT;DatabaseName=[DATABASE_NAME]"
}
],
"preSql": [
"delete from @table where db_id = -1;"
],
"postSql": [
"update @table set db_modify_time = now() where db_id = 1;"
]
}
}
}
]
}
}
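Since the whole point of this build is the gpdbwriter plugin, here is a hypothetical job sketch that writes into Greenplum instead of SQL Server. It assumes gpdbwriter accepts the same parameter layout as the other RDBMS writers (username/password/column/connection with a PostgreSQL-style jdbcUrl); the Greenplum host, database, table, and column names are placeholders, so check them against the plugin's own documentation before use.
{
    "job": {
        "setting": {
            "speed": {
                "byte": 1048576
            }
        },
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "username": "ReadOnly01",
                        "password": "1qaz!QAZ",
                        "column": ["id"],
                        "connection": [
                            {
                                "table": ["table"],
                                "jdbcUrl": ["jdbc:sqlserver://192.168.0.65;DatabaseName=MyCost_Erp352"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "gpdbwriter",
                    "parameter": {
                        "username": "gpadmin",
                        "password": "[GP_PASSWORD]",
                        "column": ["id"],
                        "connection": [
                            {
                                "table": ["public.target_table"],
                                "jdbcUrl": "jdbc:postgresql://[GP_MASTER_HOST]:5432/[DATABASE_NAME]"
                            }
                        ]
                    }
                }
            }
        ]
    }
}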
2. datax-web installation
Project address:
https://github.com/WeiYe-Jing/datax-web
Here I use the one-click deployment installation.
https://pan.baidu.com/s/13yoqhGpD00I82K4lOYtQhg extraction code: cpsk
The directory layout after extraction is as follows:
[root@ares app]# ls datax-web-2.1.2
bin modules packages README.md userGuid.md
The following is reproduced in full from the project documentation.
Start deployment
1) Extract the installation package
In the chosen installation directory, extract the package:
tar -zxvf datax-web-{VERSION}.tar.gz
2) Run the one-click install script
Enter the extracted directory and find install.sh under the bin directory. For an interactive installation, just run:
./bin/install.sh
In interactive mode, extracting each module's package archive and invoking its configure script both ask for user confirmation; follow the prompts to see whether each step succeeded, and retry if it did not. To skip the confirmations and avoid interactive mode, install with:
./bin/install.sh --force
3) Database initialization
If the mysql command is installed on your server, the install script will show the following prompt:
Scan out mysql command, so begin to initalize the database
Do you want to initalize database with sql: [{INSTALL_PATH}/bin/db/datax-web.sql]? (Y/N)y
Please input the db host(default: 127.0.0.1):
Please input the db port(default: 3306):
Please input the db username(default: root):
Please input the db password(default: ):
Please input the db name(default: exchangis)
Enter the database host, port, username, password, and database name as prompted; in most cases this completes the initialization quickly. If the mysql command is not installed on the server, you can run the bin/db/datax-web.sql script manually instead (a sketch is given at the end of this step), and afterwards edit the corresponding configuration file:
vi ./modules/datax-admin/conf/bootstrap.properties
#Database
#DB_HOST=
#DB_PORT=
#DB_USERNAME=
#DB_PASSWORD=
#DB_DATABASE=
Set the values according to your environment.
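For the manual route, a sketch along these lines works with the stock mysql client; the host, credentials, and the database name datax_web are only examples and should match what you put into bootstrap.properties:
mysql -h 127.0.0.1 -P 3306 -uroot -p -e "CREATE DATABASE IF NOT EXISTS datax_web DEFAULT CHARACTER SET utf8mb4;"
mysql -h 127.0.0.1 -P 3306 -uroot -p datax_web < ./bin/db/datax-web.sql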
4) Configuration
After installation is complete:
Configure the mail service in /modules/datax-admin/bin/env.properties under the project directory (this can be skipped):
MAIL_USERNAME=""
MAIL_PASSWORD=""
This file also contains some default configuration parameters, such as server.port; see the file for details.
In /modules/datax-executor/bin/env.properties under the project directory, set the PYTHON_PATH. This is very important!!!!
In my case it is the datax.py produced by the build above: /app/DataX/target/datax-v1.0.4-hashdata/datax/bin/datax.py
vim /app/datax-web-2.1.2/modules/datax-executor/bin/env.properties
vi ./modules/{module_name}/bin/env.properties
### path of the python script used to launch datax
PYTHON_PATH=
### keep this consistent with the datax-admin service port; the default is 9527, so it can be left unset if you did not change the datax-admin port
DATAX_ADMIN_PORT=
This file also contains some default configuration parameters, such as executor.port, json.path, data.path, etc.; see the file for details.
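For reference, a filled-in env.properties along the lines of this install could look like the following; the PYTHON_PATH value is the datax.py produced by the build above, and the admin port is the default:
### path of the python script used to launch datax
PYTHON_PATH=/app/DataX/target/datax-v1.0.4-hashdata/datax/bin/datax.py
### keep consistent with the datax-admin port; the default is 9527
DATAX_ADMIN_PORT=9527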
5) Start the services
- Start all services with one command:
./bin/start-all.sh
Some modules may fail to start or hang along the way; you can exit and run it again. To change the port of a particular module's service:
vi ./modules/{module_name}/bin/env.properties
Find the SERVER_PORT option and change its value. You can of course also start a single module by itself:
./bin/start.sh -m {module_name}
- Stop all services with one command:
./bin/stop-all.sh
You can of course also stop a single module by itself:
./bin/stop.sh -m {module_name}
6) Check the services (important!)
On Linux, use the jps command to check whether the DataXAdminApplication and DataXExecutorApplication processes are present; if both exist, the project started successfully.
If startup fails, check the startup logs: modules/datax-admin/bin/console.out or modules/datax-executor/bin/console.out
Tips: the scripts all use bash syntax; invoking them with sh may lead to unexpected errors.
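A quick sketch of that check, plus pulling up the startup logs when something looks wrong:
jps | grep -E 'DataXAdminApplication|DataXExecutorApplication'
tail -n 100 modules/datax-admin/bin/console.out
tail -n 100 modules/datax-executor/bin/console.out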
7) Run
After deployment, open http://ip:port/index.html in a browser to reach the main UI (ip is the server where datax-admin is deployed, port is the port configured for datax-admin).
Log in with username admin and password 123456 to access the system.
8) Runtime logs
After deployment, logs are written under modules/<module>/data/applogs (you can also point the logs elsewhere by changing the logpath in application.yml); use these logs to track how the project actually started.
If the executor starts before admin, its connection will fail and the log reports a "connection refused" error. In general start admin first and then the executor; the executor reconnects after 30 seconds, and if that succeeds you can ignore the exception.
When accessing datax-web, be sure to append /index.html:
http://172.18.1.25:9527/index.html
Without it, you get an error:
http://192.168.10.227:9527/
Whitelabel Error Page
This application has no explicit mapping for /error, so you are seeing this as a fallback.
Thu Dec 24 11:36:39 CST 2020
There was an unexpected error (type=Forbidden, status=403).
Access Denied
!!! Unfinished configuration: mail settings !!!
Installing datax-web from source (not the one-click deployment)
Directory listing:
[root@ares datax-web-master]# ls /app/datax-web-master
bin datax-admin datax-assembly datax-core datax-executor datax-rpc doc LICENSE pom.xml README.md userGuid.md
Run the packaging; it takes quite a long time, depending on network speed!
mvn clean install
[INFO] Building tar : /app/datax-web-master/build/datax-web-2.1.2.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] datax-web 2.1.2 .................................... SUCCESS [ 20.613 s]
[INFO] datax-rpc .......................................... SUCCESS [06:09 min]
[INFO] datax-core ......................................... SUCCESS [06:23 min]
[INFO] datax-admin ........................................ SUCCESS [44:46 min]
[INFO] datax-executor ..................................... SUCCESS [ 21.653 s]
[INFO] datax-assembly 2.1.2 ............................... SUCCESS [ 13.877 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 58:34 min
[INFO] Finished at: 2020-12-24T15:06:53+08:00
[INFO] ------------------------------------------------------------------------
1. Linux environment deployment
2. Development environment deployment (or refer to the Debug documentation)
2.1 Create the database
Run the datax_web.sql file under bin/db (note: in older versions the update statements specify a database name).
2.2 Modify the project configuration
1. Modify the resources/application.yml file under datax_admin
# data source
datasource:
  username: root
  password: root
  url: jdbc:mysql://localhost:3306/datax_web?serverTimezone=Asia/Shanghai&useLegacyDatetimeCode=false&useSSL=false&nullNamePatternMatchesAll=true&useUnicode=true&characterEncoding=UTF-8
  driver-class-name: com.mysql.jdbc.Driver
Modify the data source configuration; currently only MySQL is supported.
# mybatis-plus SQL log printing configuration
logging:
  level:
    com.wugui.datax.admin.mapper: error
  path: ./data/applogs/admin
Modify the log path.
# datax-web email
mail:
  host: smtp.qq.com
  port: 25
  username: xxx@qq.com
  password: xxx
  properties:
    mail:
      smtp:
        auth: true
        starttls:
          enable: true
          required: true
        socketFactory:
          class: javax.net.ssl.SSLSocketFactory
Modify the mail-sending configuration (no change needed if you don't use it).
2. Modify the resources/application.yml file under datax_executor
# log config
logging:
  config: classpath:logback.xml
  path: ./data/applogs/executor/jobhandler
Modify the log path.
datax:
job:
admin:
### datax-web admin address
addresses: http://127.0.0.1:8080
executor:
appname: datax-executor
ip:
port: 9999
### job log path
logpath: ./data/applogs/executor/jobhandler
### job log retention days
logretentiondays: 30
executor:
jsonpath: /Users/mac/data/applogs
pypath: /Users/mac/tools/datax/bin/datax.py
Modify the datax.job configuration:
- admin.addresses: address of the datax_admin deployment; if the scheduling center is deployed as a cluster with multiple addresses, separate them with commas. The executor uses this address for executor heartbeat registration and task result callbacks;
- executor.appname: the executor AppName, the unique identifier of each executor machine cluster and the grouping key for executor heartbeat registration;
- executor.ip: empty by default, meaning the IP is detected automatically; on hosts with multiple NICs a specific IP can be set manually. The IP is not bound to a host and is used only for communication; the address is used for executor registration and for the scheduling center to request and trigger jobs;
- executor.port: the executor server port, 9999 by default; when running multiple executors on one machine, be sure to give each a different port;
- executor.logpath: the disk path where executor run logs are stored; the process needs read/write permission on this path;
- executor.logretentiondays: number of days to keep executor log files; expired logs are cleaned up automatically. Only takes effect for values >= 3; otherwise (e.g. -1) automatic cleanup is disabled;
- executor.jsonpath: the path where temporary DataX json files are saved;
- pypath: the path of the DataX launch script, e.g. xxx/datax/bin/datax.py
If the DATAX_HOME environment variable is configured on the system, logpath, jsonpath, and pypath can be left unset; log files and temporary json files are then stored under that path.
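For example, pointing DATAX_HOME at the build from the first part of these notes (same /etc/profile style as the Maven setup above; adjust the path to your own build output, and make sure the shell that starts the executor actually picks it up):
echo 'export DATAX_HOME=/app/DataX/target/datax-v1.0.4-hashdata/datax' >> /etc/profile
source /etc/profile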
4. Start the project
1. Local IDEA development environment
- 1. Run DataXAdminApplication under datax_admin
- 2. Run DataXExecutorApplication under datax_executor
After admin starts successfully, the log prints three addresses: two API documentation addresses and one front-end page address.
5. Startup succeeded
After startup succeeds, open the page (default admin username: admin, password: 123456):
http://localhost:8080/index.html#/dashboard