How to persist data to disk with Ehcache so that it survives an application-server restart
1. How to persist to disk
Call cache.flush() after every write to the cache. Ehcache then writes the index (xxx.index) back to disk, so you no longer need to worry about losing cached data if the program exits abnormally.
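A minimal sketch of this write-then-flush pattern (the key and value are only placeholders, and the configuration file name is assumed to be ehcache.xml):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class FlushAfterWrite {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create("ehcache.xml"); // configuration shown below
        Cache cache = manager.getCache("submitProcessInst");
        cache.put(new Element("someKey", "someValue"));
        // Ask Ehcache to write the disk store and its index (xxx.index) back to disk,
        // so an abnormal exit does not lose what has been cached so far.
        cache.flush();
        manager.shutdown();
    }
}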
2. The accompanying configuration file changes:
<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="ehcache.xsd" name="ehcache">
<cacheManagerPeerProviderFactory
class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
properties="peerDiscovery=manual"/>
<diskStore path="d:/ehcache"/>
<cache name="submitProcessInst" maxElementsInMemory="1" eternal="true"
overflowToDisk="true" diskSpoolBufferSizeMB="10" maxElementsOnDisk="1000000"
diskPersistent="true" memoryStoreEvictionPolicy="LRU">
<cacheEventListenerFactory
class="net.sf.ehcache.distribution.RMICacheReplicatorFactory" />
<!-- this is the line added compared with a typical configuration -->
<bootstrapCacheLoaderFactory class="net.sf.ehcache.distribution.RMIBootstrapCacheLoaderFactory"/>
</cache>
</ehcache>
Note: when you do not need to keep data in memory, set maxElementsInMemory="1" rather than 0. With 0, Ehcache logs the following warning:
10:44:28,469 WARN net.sf.ehcache.config.CacheConfiguration.warnMaxEntriesLocalHeap(CacheConfiguration.java:1601) - Cache: submitProcessInst has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client. From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
3. At system initialization, add:
System.setProperty(net.sf.ehcache.CacheManager.ENABLE_SHUTDOWN_HOOK_PROPERTY, "true");
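For illustration, a short sketch (file name assumed) of setting this property before the CacheManager is created, so the JVM shutdown hook can flush and persist the caches on exit:

import net.sf.ehcache.CacheManager;

public class CacheBootstrap {
    public static void main(String[] args) {
        // Must be set before the CacheManager is instantiated to take effect.
        System.setProperty(CacheManager.ENABLE_SHUTDOWN_HOOK_PROPERTY, "true");
        CacheManager manager = CacheManager.create("ehcache.xml");
        // ... use manager.getCache("submitProcessInst") as usual ...
    }
}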
In addition, every object persisted to disk must be serializable. Handle this as follows:
a) If the class is your own, make it serializable.
b) If some fields of the class are types from a third-party jar, mark those fields transient (so they are not serialized).
c) If some fields come from a third-party jar but you need all of the state serialized anyway, consider converting those fields to JSON or a similar representation.
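A hedged illustration of points a) to c); the class and field names below are invented for the example:

import java.io.Serializable;

// a) Your own class: implement Serializable so Ehcache can spool it to disk.
public class ProcessInstance implements Serializable {
    private static final long serialVersionUID = 1L;

    private String id;

    // b) A field whose type comes from a third-party jar and is not serializable:
    //    mark it transient so it is skipped when the element is written to disk.
    private transient com.example.thirdparty.Handle handle;

    // c) If that third-party state must survive persistence, keep it as JSON text
    //    instead and rebuild the object from this string after the element is loaded.
    private String handleAsJson;
}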
Ehcache version: ehcache-core-2.5.2.jar
Using Ehcache in a cluster:
http://www.ibm.com/developerworks/cn/java/j-lo-ehcache/
http://www.cnblogs.com/yangy608/archive//2200669.html
What is the difference between the info, debug and error levels in log4j?
There are five levels in total: DEBUG, INFO, WARN, ERROR and FATAL. They are ordered: DEBUG < INFO < WARN < ERROR < FATAL. Understanding this matters, because Log4j applies one rule: if the configured level is P, a message is output only when its level Q is at or above P; otherwise it is suppressed.
DEBUG: the lowest level. When the system is actually running in production it is usually not output at all, so you can use it freely for anything that helps you understand the running system in more detail while debugging, such as the values of variables.
INFO: this should be used to report the system's current state back to the end user, so whatever is logged here must be meaningful to the end user; they should be able to understand it. In a sense, Info output can be regarded as part of the software product (like the text on the user interface), so treat it carefully rather than casually.
WARN, ERROR and FATAL: warning, error and fatal error. All three indicate that an abnormal state was detected at runtime, and telling them apart is genuinely not that simple. Roughly, I distinguish them like this:
A warning means that with some corrective work the system can still be brought back to a normal state and should be able to keep running.
An error means corrective work can be attempted, but there is no guarantee the system will keep working normally; at some later point this problem may well lead to an unrecoverable failure (for example a crash), although it may also run until shutdown without serious trouble.
Fatal is far more serious: the error is certainly unrecoverable, and if the system keeps running things will certainly get worse and worse. The best response is not to try to restore the system to a normal state, but to preserve as much valid data as possible and stop.
In other words, which of Warn, Error or Fatal to choose depends on the likely future impact of the current problem: if it will have essentially no impact later, log a warning; if it is certain to cause serious trouble later, log Fatal; if you cannot tell, log Error.
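A small sketch of the threshold rule in code (the logger and appender setup is arbitrary): with the level set to WARN, the debug() and info() calls below are suppressed and only warn(), error() and fatal() are printed.

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.SimpleLayout;

public class LevelDemo {
    private static final Logger log = Logger.getLogger(LevelDemo.class);

    public static void main(String[] args) {
        Logger.getRootLogger().addAppender(new ConsoleAppender(new SimpleLayout()));
        Logger.getRootLogger().setLevel(Level.WARN); // the configured level P

        log.debug("not printed"); // DEBUG is below WARN
        log.info("not printed");  // INFO is below WARN
        log.warn("printed");      // WARN is at the threshold
        log.error("printed");
        log.fatal("printed");
    }
}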
A Collection of Problems Encountered While Using Hadoop and HBase
This article collects some of the problems I ran into while using HBase, together with the corresponding solutions, summarized here so they are not forgotten later.

Adding new nodes to Hadoop
Three nodes were to be added to the Hadoop cluster. On each of the three hosts, the hosts entries, passwordless SSH trust with every machine in the cluster, JDK 1.7 and so on were configured, and all configuration and directories were kept identical to the rest of the cluster. After copying the installation files to the three hosts, the modified configuration files were distributed to the corresponding directories on them. The slaves file on the namenode was modified to add the three hosts and then distributed to all datanodes. On each of the three hosts run:
yarn-daemon.sh start nodemanager
hadoop-daemon.sh start datanode
to start the nodemanager and the datanode respectively. Then run the balancer, start-balancer.sh -threshold 5, to keep Hadoop's data evenly distributed across the nodes.

HBase troubleshooting
INFO client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, will automatically reconnect when needed.
The region server timeouts need to be set as follows, to prevent ZooKeeper timeouts caused by full GC pauses:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70
<property><name>hbase.regionserver.lease.period</name><value>240000</value></property>
<property><name>hbase.rpc.timeout</name><value>280000</value></property>
When the cluster appears to hang and the logs show ZooKeeper communication timeouts, ZooKeeper itself is not always at fault. The following two problems look like ZooKeeper timeouts but are actually HBase or Hadoop problems.
1. The heap allocated to the region server is too small for the amount of HBase data. Set this parameter larger in the system profile: export HBASE_HEAPSIZE=30720. In HBase's hbase-env.sh:
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000
2. ZooKeeper's maximum number of client connections is too small. The default is 300; in a cluster with a very large query volume and record count, the nodes may open so many connections to one another that new connections are refused.
<property><name>hbase.zookeeper.property.maxClientCnxns</name><value>15000</value></property>
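Relatedly, the hbase.rpc.timeout value above can also be raised on the client side; a hedged sketch with the older HTable API of this HBase generation (the table name is invented for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class PatientClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Same value as the server-side setting above, so client calls are not
        // cancelled while a region server sits in a long full-GC pause.
        conf.set("hbase.rpc.timeout", "280000");
        HTable table = new HTable(conf, "some_table"); // example table name
        try {
            // ... reads and writes that would otherwise hit the default timeout ...
        } finally {
            table.close();
        }
    }
}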
3. HRegionServer does not start properly. Running jps on the namenode shows whether HBase started normally; the processes are:
[root@master bin]# jps
26341 HMaster
26642 Jps
7840 ResourceManager
7524 NameNode
7699 SecondaryNameNode
So Hadoop has started normally, but HBase is missing a process; presumably the regionserver on some node failed to start. Log in to slave1 and check with jps:
[root@master bin]# ssh slave1
Last login: Thu Jul 17 17:29:11 2014 from master
[root@slave1 ~]# jps
4296 DataNode
11261 HRegionServer
11512 Jps
11184 QuorumPeerMain
Slave1 is fine. Log in to slave2 and check with jps:
[root@slave2 ~]# jps
3795 DataNode
11339 Jps
11080 QuorumPeerMain
OK, problem found: HRegionServer did not start. The HBase log shows:
09:28:19,392 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: Unhandled: org.apache.hadoop.hbase.ClockOutOfSyncException: Server slave2,0498057 Reported time is too far out of sync with master. Time difference of ms > max allowed of 30000ms
at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:314)
at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:215)
at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:1292)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5085)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2185)
at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1889)
According to the error log, the clocks of slave2 and the master differ too much. Checking the system times confirmed it; synchronizing them fixes the problem. Another approach is to configure hbase.master.maxclockskew in HBase's configuration file:
<property>
 <name>hbase.master.maxclockskew</name>
 <value>200000</value>
 <description>Time difference of regionserver from master</description>
</property>
(This approach is not recommended.)
4. ZooKeeper does not start properly. When starting HBase it keeps reporting that it cannot connect to ZooKeeper. The ZooKeeper log shows:
ClientCnxn$SendThread@966] - Opening socket connection to server slave1. Will not attempt to authenticate using SASL (unable to locate a login configuration).
A search showed that this is caused by the hosts file; vi /etc/hosts revealed that the IP configured for slave1 was wrong. Fortunately both HBase and ZooKeeper keep logs. After restarting ZooKeeper and HBase the problem was gone.
5. The list command fails in the HBase shell. Running list in the HBase shell reports an error (the full output is far too long); the key message is:
client.HConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
This indicates the client cannot connect to ZooKeeper. jps shows ZooKeeper running normally, and the ZooKeeper settings in hbase-site.xml are correct. Experience suggests the firewall was not turned off, so port 2181 is unreachable. Run service iptables stop to turn off the firewall and restart HBase. Enter the HBase shell and run list:
hbase(main):001:0> list
TABLE
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/hadoop/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
14:06:26,013 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0 row(s) in 1.0070 seconds
=> []
Everything is normal, problem solved.
6. Exceptions on insert/delete/update in the HBase shell. Any write operation in the HBase shell throws:
zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
This turned out to be an incompatibility between the Hadoop jars bundled with HBase and the Hadoop version in use. Solution: copy the hadoop-2.2.0 jars from the Hadoop installation into ${HBASE_HOME}/lib to replace the bundled ones.
7. All the regionservers under HBase drop offline. This is not a ZooKeeper problem but an HDFS problem ("does not have any open files"). Raise the maximum number of transfer threads between datanodes, dfs.datanode.max.transfer.threads.
EndOfStreamException: Unable to read additional data from client sessionid 0x4f6ce1baef1cc5, likely client has closed socket
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
	at java.lang.Thread.run(Thread.java:662)
12:00:04,636 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /172.16.0.175:39889 which had sessionid 0x4f6ce1baef1cc5
12:00:07,570 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x4f6ce1baef1d4f, likely client has closed socket
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
	at java.lang.Thread.run(Thread.java:662)
13:19:20,232 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /172.16.0.161:55772
13:19:20,274 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.ZooKeeperServer: Client attempting to establish new session at /172.16.0.161:55772
13:19:20,276 INFO [CommitProcessor:0] server.ZooKeeperServer: Established session 0x4f6ce1baef1f96 with negotiated timeout 90000 for client /172.16.0.161:55772
13:20:21,000 INFO [SessionTracker] server.ZooKeeperServer: Expiring session 0x24f6ce1bd0f207c, timeout of 90000ms exceeded
13:20:21,000 INFO [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Processed session termination for sessionid: 0x24f6ce1bd0f207c
13:24:21,037 WARN [QuorumPeer[myid=0]/0.0.0.0:2181] quorum.LearnerHandler: Closing connection to peer due to transaction timeout.
13:24:21,237 WARN [LearnerHandler-/192.168.40.35:56545] quorum.LearnerHandler: ****** GOODBYE /192.168.40.35:56545 ******
13:24:21,237 WARN [LearnerHandler-/192.168.40.35:56545] quorum.LearnerHandler: Ignoring unexpected exception
The second possible situation: after the Hadoop namenode is reformatted and HBase is restarted, its HMaster process disappears right after starting. After digging through a pile of logs, the ZooKeeper log finally shows:
Unable to read additional data from client sessionid 0x14e, likely client has closed socket
Solution: delete the directory configured by the entry below in HBase's hbase-site.xml, restart the ZooKeeper cluster, then restart HBase so that the directory is regenerated.
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/freeoa/zookeeper/data</value>
</property>
hbase.zookeeper.property.dataDir is the dataDir from the ZooKeeper configuration file zoo.cfg, i.e. where ZooKeeper stores its database snapshots.
8. With replication set below 3, a datanode failure makes HDFS writes fail. With a heavy workload and few cluster nodes (4 datanodes), the cluster kept running into problems, data collection produced errors, and data was lost. When data goes missing the first suspect is the collection layer, so look at the error log first:
Caused by: java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT. (Nodes: current=[10.0.2.163:5.2.164:50010], original=[10.0.2.163:5.2.164:50010])
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:817)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:877)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:983)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
The error, "Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT", says that adding a datanode failed and that the dfs.client.block.write.replace-datanode-on-failure.policy feature can be turned off. But I was not adding any node, so the problem is clearly not that simple. The official configuration documentation describes the parameters (parameter, default value, description):
dfs.client.block.write.replace-datanode-on-failure.enable / true / If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline. This is a site-wide property to enable/disable the feature. When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy
dfs.client.block.write.replace-datanode-on-failure.policy / DEFAULT / This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true. ALWAYS: always add a new datanode when an existing datanode is removed. NEVER: never add a new datanode. DEFAULT: Let r be the replication number. Let n be the number of existing datanodes. Add a new datanode only if r is greater than or equal to 3 and either (1) floor(r/2) is greater than or equal to n, or (2) r is greater than n and the block is hflushed/appended.
Looking for the relevant source in DFSClient shows that this happens when the client writes a block through the pipeline, and points to the same two parameters:
dfs.client.block.write.replace-datanode-on-failure.enable
dfs.client.block.write.replace-datanode-on-failure.policy
The former controls whether the client uses the replacement strategy at all when a write fails; the default of true is fine. The latter is the detail of that strategy, default DEFAULT: with 3 or more replicas it will try to replace the failed datanode, while with two replicas it does not replace the datanode and simply continues writing. Because I only have 4 nodes, when the cluster load gets too high and two or more datanodes fail to respond at the same time, HDFS writes fail. On a small cluster this feature can simply be turned off.
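A hedged sketch of turning the feature off from client code on such a small cluster (the same keys can instead go into hdfs-site.xml; the file path below is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallClusterWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Never look for a replacement datanode when one fails mid-pipeline;
        // with only 3-4 datanodes there is usually no spare node to pick anyway.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
        // Or switch the whole feature off:
        // conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", false);
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat")); // example path
        out.writeUTF("test");
        out.close();
        fs.close();
    }
}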
9. Troubleshooting the HBase startup process. During a disaster test, one of the HDFS/region server machines was simply rebooted; how do you bring its previous services back up once the machine has started?
Start HDFS: sbin/hadoop-daemon.sh start datanode
Start the HBase regionserver: bin/hbase-daemon.sh start regionserver
But jps showed no HQuorumPeer process. Searching online suggested the server clock might have drifted significantly; after checking, it was indeed 30s fast. After re-synchronizing the time and restarting the HBase service the process still did not come up, so the only option left was to restart the whole HBase cluster.
hadoop@xdm:/usr/local/hbase$ bin/stop-hbase.sh
stopping hbase...................
xd1: no zookeeper to stop because no pid file /tmp/hbase-hadoop-zookeeper.pid
xd0: stopping zookeeper.
xd2: stopping zookeeper.
So this process belongs to ZooKeeper: bash bin/hbase-daemon.sh start zookeeper
If the HQuorumPeer process does not start, the cause is that the directory configured for hbase.zookeeper.property.dataDir in hbase/conf/hbase-site.xml does not exist. Solution: if the value of hbase.zookeeper.property.dataDir is /home/grid/zookeeper, create the /home/grid/zookeeper directory locally.
Hadoop normally talks to the nodes in the cluster by hostname, so write the IP-to-hostname mapping of every node into /etc/hosts to keep communication working. If you used IPs in the configuration files and things misbehave, change them to the corresponding hostnames.
There is also the Hadoop/HBase kernel version incompatibility problem: the Hadoop jars under HBase's lib directory were newer than the 0.20.2 version I had installed and had to be replaced with the 0.20.2 jars, as the official documentation explains. If you use Cloudera's customized Hadoop and HBase you can skip the jar replacement, since Cloudera has already resolved all the compatibility issues.
10. Problems encountered when adding a new datanode to Hadoop.
Add the new hostname to the /etc/hosts file on all machines.
Copy the master node's ssh key to the new machine: ssh-copy-id 192.168.0.83
Add the new node's hostname to slaves (a new line): vim /usr/local/hadoop/etc/hadoop/slaves
Re-sync the configuration files to all nodes: for i in {80..83};do rsync -av /usr/local/hadoop/etc/ 192.168.0.$i:/usr/local/hadoop/etc/;done
Note: none of the existing nodes needs a restart. Just start the datanode on the new node, and once it is up start the balancer script (/sbin/start-balancer.sh) so that it pulls data over from the other nodes, whose used space will shrink accordingly. Also note that the default balancing bandwidth between datanodes is only 1 MB/s, which is far too little for a cluster with lots of data; it can be raised to 10 MB/s on the new node:
<property><name>dfs.balance.bandwidthPerSec</name><value></value></property>
After starting the datanode, the master log shows the following:
15:26:31,738 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk__1202 on node 192.168.0.83:50010 size 60
15:26:31,738 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk__1203 on node 192.168.0.83:50010 size 60
15:26:31,739 INFO BlockStateChange: BLOCK* processReport: from storage DS-14d-4f53-db668729d node DatanodeRegistration(192.168.0.83, datanodeUuid=c82b8d07-5ffb-46d5-81f7-47c06e673384, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c06612c4-aa83--e4faffa08074;nsid=;c=0), blocks: 48, hasStaleStorage: false, processing time: 5 msecs
There were a lot of these warnings, and when repeatedly refreshing the page on port 50070, xd0 and xd3 kept appearing alternately. It turned out I had cloned the xd0 machine to create xd3 without cleaning out the datanode directories, so the only fix was to stop it, clean them out and start it again. The master node recorded the following:
15:38:51,415 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 170 Total time for transactions(ms): 13 Number of transactions batched in Syncs: 2 Number of syncs: 131 SyncTimes(ms): 2621
15:38:51,433 INFO BlockStateChange: BLOCK* addToInvalidates: blk__.0.81:8.0.82:8.0.80:50010
15:38:51,446 INFO BlockStateChange: BLOCK* addToInvalidates: blk__.0.82:8.0.81:8.0.80:50010
15:38:53,844 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.82:50010 to delete [blk__2164, blk__2165]
15:38:56,844 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.81:50010 to delete [blk__2164, blk__2165]
15:38:56,848 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
15:38:56,848 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
15:38:57,220 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.0.83, datanodeUuid=b5-4d0e-85ad-7ca6332181af, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c06612c4-aa83--e4faffa08074;nsid=;c=0) storage b5-4d0e-85ad-7ca6332181af
15:38:57,220 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
15:38:57,220 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.0.83:50010
15:38:57,273 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
15:38:57,273 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a for DN 192.168.0.83:50010
15:38:57,291 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Processing first storage report for DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a from datanode b5-4d0e-85ad-7ca6332181af
15:38:57,291 INFO BlockStateChange: BLOCK* processReport: from storage DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a node DatanodeRegistration(192.168.0.83, datanodeUuid=b5-4d0e-85ad-7ca6332181af, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c06612c4-aa83--e4faffa08074;nsid=;c=0), blocks: 0, hasStaleStorage: false, processing time: 0 msecs
15:38:59,844 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.80:50010 to delete [blk__2164, blk__2165]
15:39:26,848 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
15:39:26,848 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
The local node's log showed no errors either. But this affects the data on the original machine xd0: its data later turned out to be corrupted, so the data under its datanode directories had to be deleted before the datanode was started again. Start the balancing script:
hadoop@xd3:/usr/local/hadoop$ sbin/start-balancer.sh
The log records the following:
hadoop@xd3:/usr/local/hadoop$ more logs/hadoop-hadoop-balancer-xd3.log
15:43:44,875 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 172 Total time for transactions(ms): 13 Number of transactions batched in Syncs: 2 Number of syncs: 133 SyncTimes(ms): 2650
15:43:44,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /system/balancer.id. BP-2.168.0.9-7 blk__2173{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a:NORMAL:192.168.0.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-f7b-445a-f908d0a94:NORMAL:192.168.0.81:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6b3-4fce-8d46-fe:NORMAL:192.168.0.82:50010|RBW]]} 15:43:45,226 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /system/balancer.id for DFSClient_NONMAPREDUCE_ 15:43:45,285 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.0.82:50010 is added to blk__2173{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a:NORMAL:192.168.0.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-f7b-445a-f908d0a94:NORMAL:192.168.0.81:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6b3-4fce-8d46-fe:NORMAL:192.168.0.82:50010|RBW]]} size 3 15:43:45,287 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.0.81:50010 is added to blk__2173{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-40fd35ff-cb26-47e8-aa50-654dbdfbbd4a:NORMAL:192.168.0.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-f7b-445a-f908d0a94:NORMAL:192.168.0.81:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6b3-4fce-8d46-fe:NORMAL:192.168.0.82:50010|RBW]]} size 3 15:43:45,308 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.0.83:50010 is added to blk__2173 size 3 15:43:45,309 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /system/balancer.id is closed by DFSClient_NONMAPREDUCE_ 15:43:45,331 INFO BlockStateChange: BLOCK* addToInvalidates: blk__.0.81:8.0.82:8.0.83:50010
15:43:47,861 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.81:50010 to delete [blk__2173]
15:43:47,862 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.82:50010 to delete [blk__2173]
15:43:50,862 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.0.83:50010 to delete [blk__2173]
11. Decommissioning a datanode hangs
When decommissioning one datanode, tens of thousands of blocks were copied off it, but two blocks stayed behind, so the node could not go offline even after several hours. The Hadoop UI showed 2 blocks under Under Replicated Blocks that never went away.
Under Replicated Blocks 2
Under Replicated Blocks In Files Under Construction 2
The namenode log kept scrolling entries like this:
15:04:47,978 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Block: blk_41120, Expected Replicas: 3, live replicas: 2, corrupt replicas: 0, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.11.12.13:.12.14:.12.15:50010 , Current Datanode: 10.11.12.13:50010, Is current datanode decommissioning: true
After a lot of googling, this appears to be a Hadoop bug: https://issues.apache.org/jira/browse/HDFS-5579
The NameNode sees that the block has too few replicas (3 expected, only 2 present); apparently it considers the data incomplete and stubbornly refuses to let the DataNode go. In the end the following worked: set replication to 2,
hadoop fs -setrep -R 2 /
and shortly after running it the node went offline. Magic replications.
Removing any datanode can leave some files unable to meet the minimum replica factor. When you then try to remove a datanode it stays in the Decommissioning state forever, because it cannot find another machine to migrate its data to. This tends to happen on small clusters. One workaround is to temporarily lower the replica factor of the affected files. Use the following command to see the replica factor of every file in HDFS:
hdfs fsck / -files -blocks
In its output, repl=1 means that block of that file has a replica factor of 1; this is how you find the files whose replica factor is relatively high. When adjusting a file's replica factor, note that the replica factor is a property of the file, not of the cluster, so files in the same cluster can have different replica factors and it has to be changed per file. The command is:
hdfs dfs -setrep [-R] [-w] <rep> <path>
-R means recursive, so the replica factor can be set for a directory and its subdirectories; <rep> is the replica factor to set; <path> is the file or directory to change; -w waits for the replication to complete, which may take a long time.
12. HBase fails to list tables because Hadoop is in safe mode
Listing tables reports an error:
hbase(main):001:0> list
TABLE
ERROR: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
The log file contains errors like:
10:33:14,576 INFO [master:hadoop1:60000] catalog.CatalogTracker: Failed verification of hbase:meta,,1 at address=hadoop3,0257576, exception=org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: 6
Investigation showed that the Hadoop cluster was in safe mode. Run:
hdfs dfsadmin -safemode leave
then restart HBase and the problem is resolved. Related parameter: safemode enter|leave|get|wait, the safe mode maintenance command. Safe mode is a Namenode state in which: 1. changes to the namespace are not accepted (read-only); 2. blocks are neither replicated nor deleted. The Namenode enters safe mode automatically at startup and leaves it automatically once the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it must also be left manually.
13. Testing block recovery on a datanode
Test: empty one of the data storage directories (out of at least two) on one datanode machine.
hadoop@htcom:/usr/local/hadoop$ bin/hdfs fsck /
17/03/17 15:57:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://htcom:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.0.9 for path / at Fri Mar 17 15:57:24 CST 2017
..../hbase/data/default/wz/7c8b18620febde82c7ec60de1d26e362/.regioninfo: Under replicated BP--192.168.0.9-0:blk__1284. Target Replicas is 3 but found 2 replica(s)
../hbase/data/default/wz/7c8b18620febde82c7ec60de1d26e362/cf/efe459c273cc6edbb907d: Under replicated BP--192.168.0.9-0:blk__1288. Target Replicas is 3 but found 2 replica(s)
...../hbase/data/default/wz/9be511a6dcafcf95b32419/cf/22b36ccb7ccb: Under replicated BP--192.168.0.9-0:blk__1232. Target Replicas is 3 but found 2 replica(s)
....../hbase/data/default/wz/d2b01de90b3c07efdb72/.regioninfo: Under replicated BP--192.168.0.9-0:blk__1198. Target Replicas is 3 but found 2 replica(s)
../hbase/data/default/wz/d2b01de90b3c07efdb72/cf/ac84af97d3ed2: Under replicated BP--192.168.0.9-0:blk__1202. Target Replicas is 3 but found 2 replica(s)
....../hbase/data/default/wz/f639a4fddc8be893bf4bb67/.regioninfo: Under replicated BP--192.168.0.9-0:blk__1312. Target Replicas is 3 but found 2 replica(s)
.............
Status: HEALTHY
 Total size:	 B
 Total dirs:	85
 Total files:	57
 Total symlinks:	0 (Files currently being written: 3)
 Total blocks (validated):	58 (avg. block size  B) (Total open file blocks (not validated): 3)
 Minimally replicated blocks:	58 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	11 (18.965517 %)
 Mis-replicated blocks:	0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.6724138
 Corrupt blocks:	0
 Missing replicas:	11 (6.626506 %)
 Number of data-nodes:	3
 Number of racks:	1
FSCK ended at Fri Mar 17 15:57:24 CST 2017 in 21 milliseconds
The filesystem under path '/' is HEALTHY
The namenode log keeps reporting errors:
16:02:40,150 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
16:02:40,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types are unavailable: unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
16:02:40,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
16:02:40,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
16:02:40,150 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
16:02:40,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types are unavailable: unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
The datanode log reports:
15:44:16,301 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: xd1:50010:DataXceiver error processing READ_BLOCK operation src: /192.168.0.81:47294 dst: /192.168.0.81:50010
java.io.IOException: Block BP--192.168.0.9-0:blk__1288 is not valid. Expected block file at /opt/edfs/current/BP--192.168.0.9-0/current/finalized/subdir0/subdir1/blk_ does not exist.
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockInputStream(FsDatasetImpl.java:585)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:375)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:514)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
bin/hdfs dfsadmin -report still shows the datanode as healthy. 'fsck /' reports no errors at first, then slowly starts showing the errors above, yet the filesystem is still "HEALTHY". The emptied directory also slowly fills with newly assigned blocks. After a restart the datanode synchronizes the blocks over quickly; it is not clear whether, given enough time and without a restart, the node would finish re-synchronizing its blocks on its own.
Next, half of the directories (/opt/edfs) on another datanode, xd0, were emptied, to check again after a few hours. A few hours later the logs still showed errors; by noon the next day the emptied directories had grown back to their previous size, and the datanode and namenode logs no longer contained "block does not exist" warnings. The web UI (port 50070) also showed the Blocks and Block pool used figures fairly even across nodes, and the row count of the test table matched the previous check.
After restarting the corresponding datanode, it quickly pulls the blocks missing from its directories over from the other machines:
16:02:48,598 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1052 src: /192.168.0.83:51588 dest: /192.168.0.81:50010 of size 73
16:02:48,643 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1040 src: /192.168.0.83:51589 dest: /192.168.0.81:50010
16:02:51,373 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1009 src: /192.168.0.80:52879 dest: /192.168.0.81:50010
16:02:51,383 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1009 src: /192.168.0.80:52879 dest: /192.168.0.81:50010 of size 1745
16:02:51,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1022 src: /192.168.0.80:52878 dest: /192.168.0.81:50010
16:02:51,527 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1022 src: /192.168.0.80:52878 dest: /192.168.0.81:50010 of size 1045
16:02:51,728 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1036 src: /192.168.0.83:51590 dest: /192.168.0.81:50010
16:02:51,751 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1036 src: /192.168.0.83:51590 dest: /192.168.0.81:50010 of size 54
16:02:53,841 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1306 src: /192.168.0.83:51586 dest: /192.168.0.81:50010 of size
16:02:55,065 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1070 src: /192.168.0.80:52875 dest: /192.168.0.81:50010 of size
16:02:55,289 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP--192.168.0.9-0:blk__1056 src: /192.168.0.80:52877 dest: /192.168.0.81:50010 of size
16:07:40,683 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1329 src: /192.168.0.81:50162 dest: /192.168.0.81:50010 16:07:40,717 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.81:50162, dest: /192.168.0.81:50010, bytes: 27729, op: HDFS_WRITE, cliID: DFSClient_hb_rs_xd1,, offset: 0, srvID: 06fc-4c43-884e-7be0d65a6aed, blockid: BP--192.168.0.9-0:blk__1329, duration:
16:07:40,718 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP--192.168.0.9-0:blk__1329, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
16:09:04,905 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP--192.168.0.9-0:blk__1330 src: /192.168.0.81:50168 dest: /192.168.0.81:50010
16:09:16,552 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP--192.168.0.9-0:blk__1170
16:10:32,560 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP--192.168.0.9-0:blk__1306
The namenode log:
16:03:16,065 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
16:03:46,064 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
16:03:46,064 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
16:04:14,715 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 102
16:04:14,734 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/WALs/xd1,0648396/xd1%2C.4.meta. BP--192.168.0.9-0 blk__1328{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-032e61b4-2a55-4b0f-b497-0856d1aef4ee:NORMAL:192.168.0.81:50010|RBW], ReplicaUnderConstruction[[DISK]DS-68cfba4b--864d-fc:NORMAL:192.168.0.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-79a15dc0-0c3e-4cf3-e6f25788:NORMAL:192.168.0.80:50010|RBW]]} 16:04:14,974 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/xd1,0648396/xd1%2C.4.meta for DFSClient_hb_rs_xd1, 16:04:16,064 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds 16:04:16,065 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s). 16:09:04,998 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 19 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 10 SyncTimes(ms): 199
16:09:05,022 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/WALs/xd1,0648396/xd1%2C.7. BP--192.168.0.9-0 blk__1330{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-493b36f7-1dbf-4d2f-83d4-:NORMAL:192.168.0.81:50010|RBW], ReplicaUnderConstruction[[DISK]DS-79a15dc0-0c3e-4cf3-e6f25788:NORMAL:192.168.0.80:50010|RBW], ReplicaUnderConstruction[[DISK]DS-60259b9b-c57d-46d8-841e-100fcbf9ff18:NORMAL:192.168.0.83:50010|RBW]]}
16:09:05,052 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/xd1,0648396/xd1%2C.7 for DFSClient_hb_rs_xd1,
On the "Datanode Information" page of the web UI, the "Block pool used" figure again shows this node using roughly as much space as the other nodes.
If you do not want to see these warnings in the log files, it can be handled in the configuration:
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/freeoa/pdb,/freeoa/pdc,/freeoa/pdd,</value>
</property>
These directories are the mount points of separate physical disks; when one of those disks has an I/O error, the errors above are reported. If you do not want Hadoop to report them, set the dfs.datanode.failed.volumes.tolerated option to 1 (the default is 0): "The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown."
<property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
</property>
14. Repairing HDFS corruption after a forced datanode shutdown followed by balancing
A colleague force-stopped one datanode of a Hadoop v2.2.0 cluster (because it would not finish decommissioning for a long time) and then started balancer.sh on another, newly added datanode, which produced the following warnings.
....................................................................................................
Status: CORRUPT
 Total size:	36 B (Total open files size: 4 B)
 Total dirs:	18291
 Total files:	16499
 Total symlinks:	0 (Files currently being written: 9)
 Total blocks (validated):	146222 (avg. block size  B) (Total open file blocks (not validated): 11939)
 ********************************
 CORRUPT FILES:	2
 MISSING BLOCKS:	70
 MISSING SIZE:	 B
 CORRUPT BLOCKS:	70
 ********************************
 Minimally replicated blocks:	 (.952126 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:	0 (0.0 %)
 Default replication factor:	2
 Average block replication:	1.9990425
 Corrupt blocks:	70
 Missing replicas:	0 (0.0 %)
 Number of data-nodes:	13
 Number of racks:	1
FSCK ended at Mon Apr 17 10:31:04 CST 2017 in 1357 milliseconds
The filesystem under path '/' is CORRUPT
The official FAQ says, roughly, that a few blocks have no copy stored on any of the existing DataNodes, yet still exist in the NameNode's metadata. The HBase master web UI home page also shows a large number of warnings. How to solve this? First, a look at Hadoop's health-check command, fsck. The fsck tool verifies whether the files in HDFS are usable; it can detect whether file blocks are missing from DataNodes and whether files are under- or over-replicated. Usage is as follows (note: only the account that started Hadoop HDFS is allowed to run this check):
Usage: DFSck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
 <path>	the starting directory of the check
 -move	move corrupted files to /lost+found
 -delete	delete corrupted files
 -files	print out all files being checked
 -openforwrite	print out files opened for write
 -list-corruptfileblocks	print out list of missing blocks and files they belong to
 -blocks	print out the block report
 -locations	print out the location of every block
 -racks	print out the data-node network topology
By default fsck ignores files that are being written; use -openforwrite to report them as well. Finally, note that when the namenode holds a lot of file metadata this command noticeably affects Hadoop performance, so use it sparingly; typically run it once during an idle period to check the overall health of the HDFS replicas.
Restarting HBase or Hadoop does not fix this problem. One option is to bring the force-stopped datanode back into the cluster, stop the balancing script, and let Hadoop recover by itself, removing that datanode again only once fsck reports no errors. If that machine cannot be recovered, the following approach can be considered (data will be lost, but the Hadoop system as a whole comes back up).
Trying hbase hbck -fix and hbase hbck -repair to repair the damage failed. Instead, hdfs fsck / -delete bad_region_file was used to remove the corrupt HBase blocks directly; after restarting the HBase cluster, all regions came online and the problem was solved. To look at the block files:
hdfs fsck / -files -blocks
hdfs fsck / | egrep -v '^\.+$' | grep -i "corrupt blockpool" | awk '{print $1}' | sort | uniq | sed -e 's/://g' > corrupted.flst
hdfs dfs -rm /path/to/corrupted.flst
hdfs dfs -rm -skipTrash /path/to/corrupted.flst
How would I repair a corrupted file if it was not easy to replace? This might or might not be possible, but the first step would be to gather information on the file's location and blocks:
hdfs fsck /path/to/filename/fileextension -locations -blocks -files
hdfs fsck hdfs://ip.or.hostname.of.namenode:50070/path/to/filename/fileextension -locations -blocks -files
Note: deleting bad HDFS blocks with hdfs fsck / -delete causes data loss.
15. Handling exceptions when creating an HBase snapshot
Creating a snapshot in HBase ran into the following error:
> snapshot 'freeoa', 'freeoa-snapshot-'
ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot ...
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@69db0cb4:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:2, End:2, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
	at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:320)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:332)
	… 10 more
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:2, End:2, diff:60000, max:60000 ms
	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
This problem is caused by a communication timeout with the server, so the default values of the following two parameters need to be adjusted:
1. hbase.snapshot.region.timeout
2. hbase.snapshot.master.timeoutMillis
Both default to 60000; the unit is milliseconds, i.e. 1 min. If the communication takes longer than that, the error above is reported.
A snapshot is a set of metadata info that lets an admin roll a table back to an earlier state. Operations:
Take a snapshot: create a snapshot of a given table; it may fail while the table is being balanced, split or compacted.
Clone a snapshot: create a new table from a snapshot, with the same schema and data as the original; operations on the new table do not affect the original.
Restore a snapshot: roll a table back to a snapshot state.
Delete a snapshot: delete a snapshot and free the space, without affecting cloned tables or other snapshots.
Export a snapshot: copy a snapshot's metadata and data to another cluster; this is an HDFS-level operation and does not affect the region servers.
The concrete commands:
hbase> snapshot 'tableName', 'snapshotName'
hbase> clone_snapshot 'snapshotName', 'newTableName'
hbase> delete_snapshot 'snapshotName'
hbase> restore_snapshot 'snapshotName'
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot SnapshotName -copy-to hdfs:///srv2:8082/hbase
Some limitations: when regions involved in a snapshot are merged, data is lost in the snapshot and in tables cloned from it; and when a table with replication enabled is restored to a snapshot state, its replica in the other cluster is not restored.
Below is a self-written HBase snapshot script, invoked from crontab, which takes a snapshot of every table and at the same time deletes the snapshots taken 7 days earlier.
use v5.12;
use utf8;
use Encode;
use Mojo::Log;
use Time::Piece;
use Data::Dumper;
use Time::Seconds;
use HBase::JSONRest;
use Cwd qw(abs_path realpath);
use File::Basename qw(dirname);

binmode(STDIN, ":encoding(utf8)");
binmode(STDOUT, ":encoding(utf8)");

my $mydir=dirname(abs_path($0));
chdir($mydir);

my $cdt=localtime;
my $pdt7=$cdt - 7 * ONE_DAY;

my $log = Mojo::Log->new(path => 'log/hbase.snapshot.log');
$log = $log->format(sub {
  my ($time, $level, @lines) = @_;
  #my $idt=strftime("%Y-%m-%d %H:%M:%S",localtime);
  my $idt=localtime->strftime("%Y-%m-%d %H:%M:%S");
  return qq{[$idt] [$level] @lines \n};
});

#my $ymd=$cdt->ymd;                  # current year-month-day
my $ymd=$cdt->strftime("%Y%m%d");    # same, but without '-'
my $ymd7=$pdt7->strftime("%Y%m%d");  # same, for 7 days ago

#Hbase
my ($hostname,$hbdir)=('192.168.1.120:8080','/usr/local/hbase');
my $hbase=HBase::JSONRest->new(host=>$hostname);

# get the names of all tables
my $hbtabs=$hbase->list;
foreach my $tab (@$hbtabs){
	#say 'Take SnapShot for table:'.$tab->{name};
	make_snap_shot($tab->{name});
	purge_snap_shot($tab->{name});
}

# take a snapshot
sub make_snap_shot{
	my $tab=shift;
	my $cmd=qq[echo "snapshot '$tab', 'snapshot_$tab\_$ymd'" | $hbdir/bin/hbase shell];
	my ($rs,$henv)=(system($cmd));
	#$henv.="$_:$ENV{$_}\n" foreach (keys %ENV);
	$log->info("Take snapshot on $tab,return code is:$rs.");
}

# delete the snapshot from 7 days ago
sub purge_snap_shot{
	my $tab=shift;
	my $cmd=qq[echo "delete_snapshot 'snapshot_$tab\_$ymd7'" | $hbdir/bin/hbase shell];
	my $rs=system($cmd);
	$log->info("Delete snapshot for table:$tab by snapshot name:snapshot_$tab\_$ymd7,return code is:$rs.");
}
16. Snapshot timeout too short: snapshots cannot complete and the log fills with "temporary file not found" errors
hbase(main):004:0> snapshot 'table','snapshot_freeoa'
ERROR: Snapshot 'snapshot_freeoa' wasn't completed in expectedTime:60000 ms
Here is some help for this command:
Take a snapshot of specified table. Examples:
  hbase> snapshot 'sourceTable', 'snapshotName'
  hbase> snapshot 'namespace:sourceTable', 'snapshotName', {SKIP_FLUSH => true}
The default 60s is not enough on a table with many regions and needs to be lengthened. hbase-site.xml:
<property>
    <name>hbase.snapshot.enabled</name>
    <value>true</value>
</property>
<property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <value>1800000</value>
</property>
<property>
    <name>hbase.snapshot.region.timeout</name>
    <value>1800000</value>
</property>
17. Heavy writes slow the whole system down: delaying flush up to 90000ms
12:35:21,506 WARN [B.DefaultRpcServer.handler=2969,queue=469,port=60020] hdfs.DFSClient: Failed to connect to /192.168.20.50:50010 for file /hbase/data/default/wz_content/aa010d3b1f1d325063edb83d9c971057/content/bf6454cd2bcd49c890cea6afaa5fe99b for block BP--192.168.20.125-5:blk__:java.io.IOException: Connection reset by peer
12:35:21,506 WARN [B.DefaultRpcServer.handler=2969,queue=469,port=60020] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 049063 msec.
12:35:21,521 WARN [B.DefaultRpcServer.handler=1790,queue=290,port=60020] hdfs.DFSClient: Failed to connect to /192.168.20.50:50010 for file /hbase/data/default/wz_content/aa010d3b1f1d325063edb83d9c971057/content/bf6454cd2bcd49c890cea6afaa5fe99b for block BP--192.168.20.125-5:blk__:java.net.ConnectException: Connection timed out
12:35:21,981 WARN [B.DefaultRpcServer.handler=3370,queue=370,port=60020] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 229255 msec.
12:35:22,091 WARN [B.DefaultRpcServer.handler=2525,queue=25,port=60020] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 661871 msec.
12:39:59,789 WARN [regionserver60020.logRoller] regionserver.ReplicationSource: Queue size: 1394 exceeds value of replication.source.log.queue.warn: 2
12:47:58,802 WARN [regionserver60020.logRoller] regionserver.ReplicationSource: Queue size: 1395 exceeds value of replication.source.log.queue.warn: 2
12:48:34,586 WARN [MemStoreFlusher.1] regionserver.MemStoreFlusher: Region filter_content,c.198fe43f10a1c7d980f38a4ff661957c. has delaying flush up to 90000ms
12:48:49,947 WARN [B.DefaultRpcServer.handler=1031,queue=31,port=60020] hdfs.DFSClient: Failed to connect to /192.168.20.50:50010 for file /hbase/data/default/filter_content/cf69a9be117f48f8c084d61bf9c71290/content/c20b398c292d4adebc410cda249c29c1 for block BP--192.168.20.125-5:blk__:java.net.ConnectException: Connection timed out
12:48:49,947 WARN [B.DefaultRpcServer.handler=1031,queue=31,port=60020] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 658.6 msec.
18. Compression cannot be enabled on some RegionServers
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
[regionserver60020-smallCompactions-5] compress.CodecPool: Got brand-new decompressor [.gz]
[B.DefaultRpcServer.handler=3900,queue=400,port=60020] compress.CodecPool: Got brand-new decompressor [.gz]
Symptom: in the logs of the two newly added RegionServers, 'compress.CodecPool' messages like these scroll past non-stop. Analysis pointed to the compression feature, since quite a few of my tables have compression enabled. Could compression support on these two RSes be broken? Test it:
$ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
15:11:38,717 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Native library checking:
hadoop: false
zlib:   false
snappy: false
lz4:    false
bzip2:  false
15:11:38,863 INFO [main] util.ExitUtil: Exiting with status 1
The old RSes support at least zlib and lz4, so something was clearly not installed properly. Under '/usr/local' only the hbase directory was present; there was no hadoop directory next to it, whereas the three older RSes have one. That was the problem: copy the hadoop directory from an old RS into '/usr/local', set the hadoop path variables (/etc/profile.d/hadoop.sh), and the problem is solved.
$ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
15:16:59,739 WARN [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(73)) - Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
15:16:29,744 INFO [main] zlib.ZlibFactory (ZlibFactory.java:<clinit>(48)) - Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /usr/local/hadoop/lib/native/libhadoop.so.1.0.0
zlib:   true /lib64/libz.so.1
snappy: false
lz4:    true revision:43
bzip2:  false
Root cause: HBase is built on top of Hadoop, and some of its advanced features still call into Hadoop's interfaces, so a compatible Hadoop environment must be available in its runtime environment.
19. A datanode exits unexpectedly or refuses to start when one of its data disks fails
The log looks like this:
...
10:42:03,361 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=;bpid=BP--192.168.20.125-5;lv=-47;nsInfo=lv=-47;cid=CID-f0a-4034-a71d-e7c7ebcb13nsid=;c=0;bpid=BP--192.168.20.125-5
10:42:03,380 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP--192.168.20.125-5 (storage id DS--192.168.20.51-5973516) service to nameNode.hadoop1/192.168.20.125:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 5, volumes configured: 6, volumes failed: 1, volume failures tolerated: 0
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:201)
...
10:42:05,482 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
10:42:05,485 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
10:42:05,488 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at dataNode11.hadoop1/192.168.20.51
************************************************************/
This datanode has 6 disks, one of which failed, which made the process abort and refuse to start. After editing hdfs-site.xml to remove that disk's mount point from the configuration, the node could be started again, though its capacity shrank by that disk.
Why it refuses to start: dfs.datanode.failed.volumes.tolerated is set to 0 by default in hdfs-site.xml. Its meaning: "The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown." In other words, it is the number of damaged disks a datanode can tolerate, 0 by default. Disk failures happen all the time in a Hadoop cluster. At startup the datanode uses the directories configured under dfs.datanode.data.dir to store blocks; if some of them are unusable and their number is greater than the configured value, the DataNode fails to start. Here one volume had failed while the tolerated count was 0, so the check in the code kicked in and the datanode refused to start.
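As a small sketch (assumptions of mine, not from the original article) of confirming the reduced capacity after such a disk is removed from the configuration, the filesystem totals can be read through the HDFS client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus();
        // Total capacity shrinks by the removed mount point once the datanode re-registers.
        System.out.println("capacity  = " + status.getCapacity());
        System.out.println("used      = " + status.getUsed());
        System.out.println("remaining = " + status.getRemaining());
        fs.close();
    }
}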
